
Summarising Features With Multiple Values In Python For Machine Learning Model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so: PregnancyID MotherID gestat

Solution 1:

You can try this. It is a somewhat complicated expression, but it seems to work:

(df.groupby(['MotherID', 'PregnancyID'])
    # the inner chain runs once per pregnancy: label trimesters, then take the max of each
    .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                      .groupby('tm')['abdomCirc']
                      .max())
    .unstack()
)

produces


tm                        1      2      3
MotherID  PregnancyID
0         0             NaN  200.0    NaN
1         1             NaN  294.0  350.0
2         2           180.0    NaN    NaN

Let's unpick this a bit. First we group by MotherID and PregnancyID. Then we apply a function to each grouped DataFrame (d).

For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by 'tm' and take the max. For each sub-DataFrame d we thus obtain a Series that maps each tm to max(abdomCirc).
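To check the trimester arithmetic, here is a minimal sketch on a few hypothetical gestational ages (the values are invented to hit the bin edges). The expression is a ceiling division by 13, so weeks 1 to 13 map to 1, 14 to 26 map to 2, and 27 to 39 map to 3, but week 40 lands in a fourth bin. Capping with clip is one possible adjustment and is an assumption, not part of the original answer.

```python
import pandas as pd

# Hypothetical gestational ages in weeks, chosen to sit on the bin edges.
weeks = pd.Series([1, 13, 14, 26, 27, 39, 40])

# (w + 13 - 1) // 13 is ceil(w / 13) for positive integer weeks.
tm = (weeks + 13 - 1) // 13
print(tm.tolist())  # [1, 1, 2, 2, 3, 3, 4]

# Assumed adjustment: fold week 40 and later into the third trimester.
tm_capped = tm.clip(upper=3)
print(tm_capped.tolist())  # [1, 1, 2, 2, 3, 3, 3]
```

Whether late measurements belong in the third bin or should be dropped depends on the data, so treat the cap as a choice to make rather than a fix.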

Then we call unstack(), which moves tm from the index to the column names.

You may want to rename these columns later, but I did not bother.
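If you do want to rename the columns, something along these lines might work. The frame below is a hand-built stand-in for the unstacked result shown earlier, and the names such as abdomCirc_tm1 are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the unstacked result (an assumption for illustration).
wide = pd.DataFrame(
    [[np.nan, 200.0, np.nan], [np.nan, 294.0, 350.0], [180.0, np.nan, np.nan]],
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (1, 1), (2, 2)], names=["MotherID", "PregnancyID"]
    ),
    columns=pd.Index([1, 2, 3], name="tm"),
)

# Turn each trimester number into a descriptive feature name (names are hypothetical),
# then drop the leftover "tm" label on the column axis.
renamed = wide.rename(columns=lambda tm: f"abdomCirc_tm{tm}").rename_axis(columns=None)

# Move MotherID and PregnancyID back into ordinary columns.
flat = renamed.reset_index()
print(flat.columns.tolist())
```

A flat frame like this, one row per pregnancy, is usually easier to feed into a model than one with a MultiIndex.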

Solution 1, simplified:

Come to think of it you can simplify the above a bit:

(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
    .drop(columns='gestationalAgeInWeeks')
    .groupby(['MotherID', 'PregnancyID', 'tm'])
    .agg('max')
    .unstack()
)

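Similar idea and the same values, although here every remaining column is aggregated, so the columns carry an extra level naming the measurement (abdomCirc).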
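For a version you can run end to end, here is a sketch on a small made-up DataFrame. The rows are invented so that the result matches the table shown above; they are not the asker's data. Selecting abdomCirc before aggregating keeps the columns flat.

```python
import pandas as pd

# Invented sample rows (not the original data), chosen to reproduce the table above.
df = pd.DataFrame({
    "PregnancyID": [0, 0, 1, 1, 1, 2],
    "MotherID": [0, 0, 1, 1, 1, 2],
    "gestationalAgeInWeeks": [14, 21, 20, 25, 30, 11],
    "abdomCirc": [150, 200, 294, 280, 350, 180],
})

summary = (
    df.assign(tm=(df["gestationalAgeInWeeks"] + 13 - 1) // 13)
      .groupby(["MotherID", "PregnancyID", "tm"])["abdomCirc"]
      .max()          # largest abdomCirc per pregnancy and trimester
      .unstack()      # one column per trimester, NaN where nothing was measured
)
print(summary)
```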


Solution 2:

pandas DataFrames have a method called query. This should do the job for now:

abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()

abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()

abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()

If you want something more automatic (rather than manually changing the values of MotherID and PregnancyID for each group of rows), you have to combine it with groupby (as you did on your own).
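One way to automate this, sketched under assumptions: pd.cut with bin edges that mirror the three query conditions, followed by a groupby. The sample rows are invented, and the labels simply reuse the variable names from the queries above.

```python
import pandas as pd

# Invented sample rows for illustration only.
df = pd.DataFrame({
    "PregnancyID": [0, 0, 1, 1, 1, 2],
    "MotherID": [0, 0, 1, 1, 1, 2],
    "gestationalAgeInWeeks": [14, 21, 20, 25, 30, 11],
    "abdomCirc": [150, 200, 294, 280, 350, 180],
})

# Right-closed bins (0, 13], (13, 26], (26, 40] match the <= 13, 14-26 and 27-40 ranges.
# Weeks outside 1-40 become NaN and are left out of the groupby.
trimester = pd.cut(
    df["gestationalAgeInWeeks"],
    bins=[0, 13, 26, 40],
    labels=["abdomCirc1st", "abdomCirc2nd", "abdomCirc3rd"],
)

summary = (
    df.assign(trimester=trimester)
      .groupby(["MotherID", "PregnancyID", "trimester"], observed=True)["abdomCirc"]
      .max()
      .unstack()
)
print(summary)
```

Unlike the hand-written queries, this covers every MotherID and PregnancyID pair in one pass, and changing the bin edges only requires editing the bins list.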

Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

