
Split Nested Array Values From Pandas Dataframe Cell Over Multiple Rows

I have a Pandas DataFrame with one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain, each cell contains an array of values.

Solution 1:

You can run .apply(pd.Series) for each of your columns, then stack and concatenate the results.

For a series

import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

s
Out[103]: 
2011       [0, 1]
2012    [2, 3, 4]
dtype: object

it works as follows

s.apply(pd.Series).stack()
Out[104]: 
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64

The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate result of apply(pd.Series), i.e. before stack, is a frame containing a NaN value where the shorter list was padded; stack then dropped it.
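To see where the NaN comes from, it can help to inspect the intermediate frame directly. The sketch below builds it with pd.DataFrame(s.tolist(), index=s.index), which I am assuming gives the same padded layout as s.apply(pd.Series) for list-valued cells. The NaN handling of stack() has changed across pandas versions, so dropna() is spelled out here rather than relied on implicitly.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# One row per original element, one column per list position.
# The shorter list (2011) is padded with NaN in the last column.
wide = pd.DataFrame(s.tolist(), index=s.index)
n_missing = int(wide.isna().sum().sum())

# stack() moves the column labels into an inner index level;
# dropna() removes the padding explicitly, whatever the pandas version.
long = wide.stack().dropna()

print(wide)
print(long)
```

The wide frame has exactly one missing cell, and the stacked result has one row per original list element.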

Now, let's take a frame:

a = list(range(14))
b = list(range(20, 34))

df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

df
Out[108]: 
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012

Then we can run:

# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)

and get:

result
Out[115]: 
                 A     B
ID    Year              
11111 2011 0   0.0  20.0
           1   1.0  21.0
           2   2.0  22.0
      2012 0   3.0  23.0
           1   4.0  24.0
           2   5.0  25.0
           3   6.0  26.0
11112 2011 0   7.0  27.0
           1   8.0  28.0
           2   9.0  29.0
      2012 0  10.0  30.0
           1  11.0  31.0
           2  12.0  32.0
           3  13.0  33.0
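As an alternative to the approach above (not the method this answer uses), newer pandas versions offer DataFrame.explode, which accepts a list of columns from pandas 1.3 onward. The sketch below assumes that the lists in A and B have the same length within each row, which explode requires. The level name 'Day' is a hypothetical label chosen for illustration.

```python
import pandas as pd

a = list(range(14))
b = list(range(20, 34))
df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

# Explode both list columns together; the original row label is repeated
# once per list element (needs pandas >= 1.3 for a list of columns).
long = df.explode(['A', 'B'])

# Position of each element within its original row: 0, 1, 2, ...
# 'Day' is an illustrative name, not something from the original frame.
long['Day'] = long.groupby(level=0).cumcount().to_numpy()

# Exploded columns have object dtype, so cast back to numbers.
long = long.set_index(['ID', 'Year', 'Day']).astype(float)

print(long)
```

This should produce the same three-level layout as result, with the element position as the innermost index level.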

The rest (building a datetime index) is more or less straightforward. For example:

# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str))
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in pandas), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')
# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'

new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])
result.index = new_index

result
Out[130]: 
                     A     B
ID    Date                  
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
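The date arithmetic above adds a zero-based day offset to January 1st of each year, so the same offset can land on different calendar days in leap and non-leap years. A minimal sketch with made-up inputs (the years and the offset 59 are illustrative, and format='%Y' is added here only to make the parsing explicit):

```python
import pandas as pd

years = pd.Index([2011, 2012])
offsets = pd.Index([59, 59])  # zero-based day offsets within each year

# Each year string is parsed as January 1st of that year.
year_starts = pd.to_datetime(years.astype(str), format='%Y')
day_offsets = pd.to_timedelta(offsets, unit='D')

# DatetimeIndex + TimedeltaIndex is again a DatetimeIndex.
dates = year_starts + day_offsets

print(dates)
```

Offset 59 falls on March 1st in 2011 but on February 29th in 2012, which is why the per-year arrays differ in length.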
