Split Nested Array Values From Pandas Dataframe Cell Over Multiple Rows
I have a Pandas DataFrame of the following form. There is one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain, each cell contains an array of values.
Solution 1:
You can run .apply(pd.Series) for each of your columns, then stack and concatenate the results.
For a series
import pandas as pd
s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])
s
Out[103]:
2011       [0, 1]
2012    [2, 3, 4]
dtype: object
it works as follows:
s.apply(pd.Series).stack()
Out[104]:
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64
The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate frame, i.e. the result of .apply(pd.Series) before stack, had a NaN value that was later dropped.
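To make the intermediate step visible, here is a minimal sketch that reuses the series from above. The names wide, n_missing and long are illustrative only. The explicit dropna() is an addition, not part of the original answer; it is there so the result does not depend on how the installed pandas version treats NaN in stack.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# One column per list position; the shorter list is padded with NaN.
wide = s.apply(pd.Series)

# Count the padding cells (only the 2011 row is short by one value).
n_missing = int(wide.isna().sum().sum())

# Long format: the inner index level is the position inside the list.
# dropna() makes the removal of the padding explicit.
long = wide.stack().dropna()
```

Printing wide shows the padded cell in the 2011 row, and long matches the stacked output shown above.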
Now, let's take a frame:
a = list(range(14))
b = list(range(20, 34))
df = pd.DataFrame({
    'ID': [11111, 11111, 11112, 11112],
    'Year': [2011, 2012, 2011, 2012],
    'A': [a[:3], a[3:7], a[7:10], a[10:14]],
    'B': [b[:3], b[3:7], b[7:10], b[10:14]]})
df
Out[108]:
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012
Then we can run:
# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)
and get:
result
Out[115]:
                 A     B
ID    Year
11111 2011 0   0.0  20.0
           1   1.0  21.0
           2   2.0  22.0
      2012 0   3.0  23.0
           1   4.0  24.0
           2   5.0  25.0
           3   6.0  26.0
11112 2011 0   7.0  27.0
           1   8.0  28.0
           2   9.0  29.0
      2012 0  10.0  30.0
           1  11.0  31.0
           2  12.0  32.0
           3  13.0  33.0
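As a side note, the same long layout can probably be reached with DataFrame.explode. This is a different technique from the one in this answer. The sketch below assumes pandas 1.3 or newer (needed to explode several columns at once) and assumes the lists in A and B have the same length within each row. The names long, pos and alt are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [11111, 11111, 11112, 11112],
    'Year': [2011, 2012, 2011, 2012],
    'A': [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9], [10, 11, 12, 13]],
    'B': [[20, 21, 22], [23, 24, 25, 26], [27, 28, 29], [30, 31, 32, 33]],
})

# Explode both list columns in lockstep; ignore_index gives a clean
# RangeIndex instead of repeated row labels.
long = df.explode(['A', 'B'], ignore_index=True)

# Rebuild each value's position inside its original list.
long['pos'] = long.groupby(['ID', 'Year']).cumcount()

# explode leaves object dtype, so cast the value columns explicitly.
alt = long.set_index(['ID', 'Year', 'pos']).astype(float)
```

Unlike the apply/stack route, this never builds the NaN-padded intermediate frame, so there is nothing to drop afterwards.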
The rest (building a datetime index) is more or less straightforward. For example:
# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str))
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in pandas), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')
# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'
new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])
result.index = new_index
result
Out[130]:
                     A     B
ID    Date
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
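To see why the leap year matters for this date arithmetic, here is a small self-contained sketch. The (year, offset) pairs are hypothetical and not taken from the data above, and passing format='%Y' is an assumption added to make the parsing explicit. The same zero-based day offset lands on different calendar dates in 2011 and 2012.

```python
import pandas as pd

# Hypothetical years and zero-based day offsets, for illustration only.
years = pd.Index([2011, 2012])
offsets = pd.Index([59, 59])

# Parsing a bare year gives 1 January of that year.
starts = pd.to_datetime(years.astype(str), format='%Y')

# Adding the offset as a number of days gives the calendar date.
dates = starts + pd.to_timedelta(offsets, unit='D')

# Offset 59 is 1 March in 2011 but 29 February in 2012.
labels = list(dates.strftime('%Y-%m-%d'))
```

This is also why the per-year arrays differ in length: a full leap year holds 366 daily values instead of 365.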