How To Correlate Scalar Values Of Two Pandas Dataframes
Solution 1:
I've simulated a setup that I think mimics yours: three dataframes with countries across rows and years across columns. I then concatenate the different sets of data into a single dataframe and show how to compute the correlation between them. Let me know if something about this example doesn't match your setup.
import pandas as pd

set1 = pd.DataFrame({1980: [4, 11, 0], 1981: [5, 10, 2], 1982: [0, 3, 1]},
                    index=pd.Index(['USA', 'UK', 'Iran'], name='country'))
set1.columns.name = 'year'
set1

year     1980  1981  1982
country
USA         4     5     0
UK         11    10     3
Iran        0     2     1
set2 = pd.DataFrame({1981: [2, 1, 10], 1982: [15, 1, 12], 1983: [10, 13, 1]},
                    index=pd.Index(['USA', 'UK', 'Turkey'], name='country'))
set2.columns.name = 'year'
set2

year     1981  1982  1983
country
USA         2    15    10
UK          1     1    13
Turkey     10    12     1
Notice that, as in your setup, some countries and years are present in one dataset but not another.
set3 = pd.DataFrame({1980: [12, 11, 4], 1982: [9, 8, 11]},
                    index=pd.Index(['USA', 'UK', 'Turkey'], name='country'))
set3.columns.name = 'year'
We can turn these into multi-indexed series by stacking along year, and then concatenate them across columns using pd.concat.
df = pd.concat([set1.stack('year'), set2.stack('year'), set3.stack('year')],
keys=['set1', 'set2', 'set3'], names=['set'], axis=1)
df
set             set1  set2  set3
country year
Iran    1980     0.0   NaN   NaN
        1981     2.0   NaN   NaN
        1982     1.0   NaN   NaN
Turkey  1980     NaN   NaN   4.0
        1981     NaN  10.0   NaN
        1982     NaN  12.0  11.0
        1983     NaN   1.0   NaN
UK      1980    11.0   NaN  11.0
        1981    10.0   1.0   NaN
        1982     3.0   1.0   8.0
        1983     NaN  13.0   NaN
USA     1980     4.0   NaN  12.0
        1981     5.0   2.0   NaN
        1982     0.0  15.0   9.0
        1983     NaN  10.0   NaN
And we can compute a 3x3 correlation matrix across the three different sets.
df.corr()
set       set1      set2      set3
set
set1  1.000000 -0.723632  0.509902
set2 -0.723632  1.000000  0.606891
set3  0.509902  0.606891  1.000000
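If you only need the correlation between two of the sets, as the question's title suggests, you can correlate the two columns directly; Series.corr drops the rows where either value is missing, which is the same pairwise-complete behavior df.corr() uses above:

# Pairwise correlation between set1 and set2 only; country/year rows where
# either column is NaN are excluded automatically.
df['set1'].corr(df['set2'])
# -0.723632...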
Solution 2:
Here's what I did, but it's still not as slick as a built-in pandas feature or package would be.
Because I ultimately wanted to do this with more than two tables, I put the tables (dataframes) into a dictionary.
Then, I changed each table into a one-column table whose MultiIndex is built from the original column names and index values; the single column holds the original column values strung end to end.
Then, I merged these new tables into one with a full outer join on the MultiIndex. Now I can correlate any two of the original tables by correlating their respective columns in the final table.
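Here's a minimal sketch of that reshaping step on a hypothetical toy table (the country names and values are placeholders, not the real data):

import pandas as pd

# Hypothetical toy table: countries down the rows, years across the columns.
toy = pd.DataFrame({2006: [0.5, 0.2], 2007: [0.6, 0.3]},
                   index=pd.Index(['Albania', 'Algeria'], name='country'))
toy.columns.name = 'year'

# stack() turns the 2d table into a Series with a (country, year) MultiIndex;
# wrapping it in a DataFrame gives the one-column table described above.
one_col = pd.DataFrame(toy.stack(), columns=['toy'])
print(one_col)
#               toy
# country year
# Albania 2006  0.5
#         2007  0.6
# Algeria 2006  0.2
#         2007  0.3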
import pandas as pd

gvtx_eiu_df = pd.read_csv('gvtx_eiu.csv', index_col=0,
                          skip_blank_lines=False)
gvtx_eiu_df.columns.name = 'year'
polpartix_eiu_df = pd.read_csv('polpartix_eiu.csv', index_col=0,
                               skip_blank_lines=False)
polpartix_eiu_df.columns.name = 'year'
clean_elec_idea_df = pd.read_csv('clean_elec_idea.csv', index_col=0,
                                 skip_blank_lines=False)
clean_elec_idea_df.columns.name = 'year'

test_table_dict = {'gvtx_eiu': gvtx_eiu_df,
                   'polpartix_eiu': polpartix_eiu_df,
                   'clean_elec_idea': clean_elec_idea_df}
'''
# Updated to not use this anymore. Using stack now, thanks to @jtorca. So it
# fits more neatly into one function.

# Serialize df columns into MultiIndex df, index=(year, country), one column
def df_to_multidx_df(df: pd.DataFrame, cols_idx1_name: str = 'Previous Columns',
                     idx_idx2_name: str = 'Previous Index',
                     val_col_name: str = 'Values') -> pd.DataFrame:
    # Takes a 2d dataframe (df) with a single-level index and one or more
    # single-level columns. All df values must be the same type.
    # Parameters:
    #   df: 2d dataframe with single-level index and columns.
    #   cols_idx1_name: 1st index title for returned dataframe; index is df
    #     column names.
    #   idx_idx2_name: 2nd index title for returned dataframe; index is df
    #     index.
    # Returns: a 2d dataframe with a MultiIndex constructed of df column
    #   names and index values, and a single column with field values that
    #   are all df columns strung end to end.

    # Create MultiIndex from product of column names and index values.
    mult_idx = pd.MultiIndex.from_product([df.columns, df.index],
                                          names=[cols_idx1_name, idx_idx2_name])
    # 1D list of table values in same order as MultiIndex.
    val_list = [val for col in df for val in df[col]]
    return pd.DataFrame(val_list, index=mult_idx, columns=[val_col_name])
'''
def df_dict_to_multidx_df(df_dict: dict) -> pd.DataFrame:
    '''Converts a dictionary (df_dict) of 2d dataframes, each with single-level
    indices and columns, into a 2d dataframe (multidx_df) with each column
    containing the values of one of df_dict's dataframes. The index of
    multidx_df is a MultiIndex of the input dataframes' column names and index
    values. Dataframes are joined in a full outer join on the MultiIndex.

    NOTE: each input dataframe's index and columns must be named beforehand
    in order to name the levels of the MultiIndex and join on it.

    Parameters:
        df_dict: dictionary of 2d dataframes, each with single-level
            indices and columns.
    Returns:
        multidx_df: MultiIndex dataframe.'''
    df_dict_copy = df_dict.copy()
    # Full outer join each table to multidx_df on the MultiIndex.
    # Start with the first indicator to have a left df to merge onto.
    first_key = next(iter(df_dict_copy))
    multidx_df = pd.DataFrame(df_dict_copy.pop(first_key).stack(),
                              columns=[first_key])
    for key, df in df_dict_copy.items():
        df = pd.DataFrame(df.stack(), columns=[key])
        multidx_df = multidx_df.merge(right=df, how='outer',
                                      on=multidx_df.index.names[:2])
        # concat is twice as fast as merge:
        # multidx_df = pd.concat([multidx_df, df], names=['indicator'], axis=1)
    return multidx_df
### Test Code
print(gvtx_eiu_df)
#               2006    2007   2008   2009   2010   2011   2012   2013   2014  \
# country
# Afghanistan    NaN  0.0395  0.079  0.079  0.079  0.079  0.079  0.079  0.114
# Albania      0.507  0.5070  0.507  0.507  0.507  0.471  0.400  0.400  0.400
# Algeria      0.221  0.2210  0.221  0.221  0.221  0.221  0.221  0.221  0.221
# Angola       0.214  0.2680  0.321  0.321  0.321  0.321  0.321  0.321  0.321
# Argentina    0.500  0.5000  0.500  0.535  0.571  0.571  0.571  0.571  0.571
# ...            ...     ...    ...    ...    ...    ...    ...    ...    ...
# Venezuela    0.364  0.3960  0.429  0.411  0.393  0.393  0.429  0.429  0.429
# Vietnam      0.429  0.4290  0.429  0.429  0.429  0.429  0.393  0.393  0.393
# Yemen        0.271  0.2610  0.250  0.214  0.179  0.036  0.143  0.143  0.143
# Zambia       0.464  0.4640  0.464  0.500  0.536  0.500  0.536  0.536  0.536
# Zimbabwe     0.079  0.0790  0.079  0.104  0.129  0.129  0.129  0.129  0.129
#
#               2015   2016   2017   2018
# country
# Afghanistan  0.114  0.114  0.114  0.114
# Albania      0.436  0.436  0.471  0.471
# Algeria      0.221  0.221  0.221  0.221
# Angola       0.321  0.321  0.286  0.286
# Argentina    0.500  0.500  0.500  0.536
# ...            ...    ...    ...    ...
# Venezuela    0.393  0.250  0.286  0.179
# Vietnam      0.393  0.321  0.321  0.321
# Yemen        0.036    NaN    NaN    NaN
# Zambia       0.536  0.536  0.500  0.464
# Zimbabwe     0.200  0.200  0.200  0.200
#
# [164 rows x 13 columns]
# Exercises the retired df_to_multidx_df helper (commented out above), kept
# here for reference:
# test_serialized = df_to_multidx_df(df=gvtx_eiu_df, cols_idx1_name='Year',
#                                    idx_idx2_name='Country',
#                                    val_col_name='gvtx_eiu')
# print(test_serialized)
#                   gvtx_eiu
# Year Country
# 2006 Afghanistan       NaN
#      Albania         0.507
#      Algeria         0.221
#      Angola          0.214
#      Argentina       0.500
# ...                    ...
# 2018 Venezuela       0.179
#      Vietnam         0.321
#      Yemen             NaN
#      Zambia          0.464
#      Zimbabwe        0.200
# [2132 rows x 1 columns]
test_multidx_df = df_dict_to_multidx_df(test_table_dict)
print(test_multidx_df)
#                   gvtx_eiu  polpartix_eiu  clean_elec_idea
# Year Country
# 2006 Afghanistan       NaN          0.222            0.475
#      Albania          0.507         0.444            0.541
#      Algeria          0.221         0.222            0.399
#      Angola           0.214         0.111              NaN
#      Argentina        0.500         0.556            0.778
# ...                     ...            ...              ...
# 2017 Somalia           NaN            NaN            0.394
#      South Sudan       NaN            NaN              NaN
# 2018 Georgia           NaN            NaN            0.605
#      Somalia           NaN            NaN              NaN
#      South Sudan       NaN            NaN              NaN
# [6976 rows x 3 columns]
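And, per the goal stated above, correlating any two of the original tables is now just a matter of correlating two columns of the combined table (a sketch using the names built above; Series.corr ignores rows where either value is NaN):

# Correlation between the gvtx_eiu and polpartix_eiu indicators across all
# country-year pairs where both are present.
test_multidx_df['gvtx_eiu'].corr(test_multidx_df['polpartix_eiu'])

# Or all pairwise correlations at once:
test_multidx_df.corr()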
from pandas_profiling import ProfileReport  # needs the pandas-profiling package

test_multidx_profile = ProfileReport(test_multidx_df, title='Test MultIdx Profile')
The output is exactly what I was going for, but in addition to wishing for a one- or two-statement solution, I'm not completely happy with iterating through an input dictionary of dataframes. I tried to make the input a dataframe of dataframes so I could use apply() with a lambda, which I think would save some memory, but I couldn't get apply() to work right, and it's time to move on.
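For what it's worth, the concat comment inside the loop above points toward a near one-statement version: stack each table and let pd.concat do the outer alignment. A sketch, assuming the same test_table_dict as above:

# Each stack() produces a Series indexed by (country, year); pd.concat with
# axis=1 aligns them on that shared MultiIndex, which is exactly the full
# outer join the merge loop performs.
multidx_df = pd.concat({key: df.stack() for key, df in test_table_dict.items()},
                       axis=1)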