Skip to content Skip to sidebar Skip to footer

How To Correlate Scalar Values Of Two Pandas Dataframes

How do I correlate two pandas dataframes, find a single r value for all values? I don't want to correlate columns or rows, but all scalar values. One dataframe is the x axis, and t

Solution 1:

I've simulated a setup that I think mimics yours--three dataframes with countries across rows and years across columns. I then concatenate the different sets of data into a single dataframe. And show how to compute the correlation between them. Let me know if something about this example doesn't match your setup.

import pandas as pdset1= pd.DataFrame({1980:[4, 11, 0], 1981:[5, 10, 2], 1982:[0, 3, 1]},
    index=pd.Index(['USA', 'UK', 'Iran'], name='country'))
set1.columns.name = 'year'
set1
year1980  1981  1982countryUSA450UK11103Iran021
set2 = pd.DataFrame({1981:[2, 1, 10], 1982:[15, 1, 12], 1983:[10, 13, 1]},
    index=pd.Index(['USA', 'UK', 'Turkey'], name='country'))
set2.columns.name = 'year'
set2
year1981  1982  1983countryUSA21510UK1113Turkey10121

Notice that, like your setup, some countries/years are not present in different datasets.

set3 = pd.DataFrame({1980:[12, 11, 4], 1982:[9, 8, 11]},
    index=pd.Index(['USA', 'UK', 'Turkey'], name='country'))
set3.columns.name = 'year'

We can turns these into multi-indexed series by stacking along year and then concatenate these across columns using pd.concat.

df = pd.concat([set1.stack('year'), set2.stack('year'), set3.stack('year')],
    keys=['set1', 'set2', 'set3'], names=['set'], axis=1)
df
setset1set2set3countryyearIran1980   0.0NaNNaN1981   2.0NaNNaN1982   1.0NaNNaNTurkey1980   NaNNaN4.01981   NaN10.0NaN1982   NaN12.011.01983   NaN1.0NaNUK1980  11.0NaN11.01981  10.01.0NaN1982   3.01.08.01983   NaN13.0NaNUSA1980   4.0NaN12.01981   5.02.0NaN1982   0.015.09.01983   NaN10.0NaN

And we can compute a 3x3 correlation matrix across the three different sets.

df.corr()
set       set1      set2      set3
set                               
set1  1.000000 -0.723632  0.509902
set2 -0.723632  1.000000  0.606891
set3  0.509902  0.606891  1.000000

Solution 2:

Here's what I did, but it's still not as slick as if I had found a built-in pandas feature or package.

Because I ultimately wanted to do this with more than two tables, I put the tables (dataframes) into a dictionary.

Then, I changed each table into a one-column table that has a MultiIndex representing the original column names and index values. The field values are the original column values strung end to end.

Then, I merged these new tables into one full outer join on the MultiIndex. Now I can correlate any two of the original tables by correlating their respective columns in the final table.

import pandas as pd

gvtx_eiu_df = pd.read_csv('gvtx_eiu.csv',index_col=0,
                          skip_blank_lines=False)
gvtx_eiu_df.columns.name = 'year'
polpartix_eiu_df = pd.read_csv('polpartix_eiu.csv',index_col=0,
                               skip_blank_lines=False)
polpartix_eiu_df.columns.name = 'year'
clean_elec_idea_df = pd.read_csv('clean_elec_idea.csv', index_col=0,
                                 skip_blank_lines=False)
clean_elec_idea_df.columns.name = 'year'

test_table_dict = {'gvtx_eiu': gvtx_eiu_df,
                   'polpartix_eiu': polpartix_eiu_df,
                   'clean_elec_idea': clean_elec_idea_df}
'''
# Updated to not use this anymore. Using stack now, thanks to @jtorca. So it# fits more neatly into one function.# Serialize df columns into MultiIndex df, index=(year, country), one column
def df_to_multidx_df(df: pd.DataFrame, cols_idx1_name: str = 'Previous Columns',
                     idx_idx2_name: str = 'Previous Index',
                     val_col_name: str = 'Values') -> pd.DataFrame:
    #Takes 2d dataframe (df) with a single-level index and one or more#single-level columns. All df values must be the same type.#Parameters:#    df: 2d dataframe with single-level index and one or more#        single-level columns. All df values must be the same type.#    cols_idx1_name: 1st index title for returned dataframe; index is df#        column names.#    idx_idx2_name: 2nd index title for returned dataframe; index is df#        index.#Returns:#    a 2d dataframe with a MultiIndex constructed of table_df column#    names and index values. Has a single column with field values that are#    all df columns strung end to end.# Create MultiIndex from product of index values and column names.
    mult_idx = pd.MultiIndex.from_product([df.columns, df.index],
                                          names=[cols_idx1_name, idx_idx2_name])
    # 1D list of table values in same order as MultiIndex.
    val_list = [val for col in df for val in df[col]]
    
    return pd.DataFrame(val_list, index=mult_idx, columns=[val_col_name])
'''

def df_dict_to_multidx_df(df_dict: dict) -> pd.DataFrame:
#     , cols_idx1_name: str = 'idx1',#     idx_idx2_name: str = 'idx2') -> pd.DataFrame:
    '''Converts a dictionary (df_dict) of 2d dataframes, each with single-level
    indices and columns, into a 2d dataframe (multidx_df) with each column
    containing the the values of one of df_dict's dataframes. The index of
    multidx_df is a MultiIndex of the input dataframes' column names and index
    values. Dataframes are joined in full outer join on the MultiIndex.
        NOTE: each input dataframe's index and columns row must be named
        beforehand in order to name the columns in the multiindex and join on it.
    Parameters:
        df_dict: dictionary of 2d dataframes, each with single-level
            indices and columns.
    Returns:
        multidx_df = MultiIndex dataframe.'''
    
    df_dict_copy = df_dict.copy()
        
    # Full outer join each table to multidx_df on MultiIndex.# Start with first indicator to have a left df to merge.
    first_key = next(iter(df_dict_copy))
    multidx_df = pd.DataFrame(df_dict_copy.pop(first_key).stack(),
                                     columns=[first_key])
    for key, df in df_dict_copy.items():
        df = pd.DataFrame(df.stack(), columns=[key])
        multidx_df = multidx_df.merge(right=df, how='outer',
                                     on=multidx_df.index.names[:2])

        # concat twice as fast as merge#         multidx_df = pd.concat([multidx_df, df], names=['indicator'], axis=1)
    
    return multidx_df

###Test Code

print(gvtx_eiu_df)

#               2006    2007   2008   2009   2010   2011   2012   2013   2014  \# country                                                                       # Afghanistan    NaN  0.0395  0.079  0.079  0.079  0.079  0.079  0.079  0.114   # Albania      0.507  0.5070  0.507  0.507  0.507  0.471  0.400  0.400  0.400   # Algeria      0.221  0.2210  0.221  0.221  0.221  0.221  0.221  0.221  0.221   # Angola       0.214  0.2680  0.321  0.321  0.321  0.321  0.321  0.321  0.321   # Argentina    0.500  0.5000  0.500  0.535  0.571  0.571  0.571  0.571  0.571   # ...            ...     ...    ...    ...    ...    ...    ...    ...    ...   # Venezuela    0.364  0.3960  0.429  0.411  0.393  0.393  0.429  0.429  0.429   # Vietnam      0.429  0.4290  0.429  0.429  0.429  0.429  0.393  0.393  0.393   # Yemen        0.271  0.2610  0.250  0.214  0.179  0.036  0.143  0.143  0.143   # Zambia       0.464  0.4640  0.464  0.500  0.536  0.500  0.536  0.536  0.536   # Zimbabwe     0.079  0.0790  0.079  0.104  0.129  0.129  0.129  0.129  0.129   #               2015   2016   2017   2018  # country                                  # Afghanistan  0.114  0.114  0.114  0.114  # Albania      0.436  0.436  0.471  0.471  # Algeria      0.221  0.221  0.221  0.221  # Angola       0.321  0.321  0.286  0.286  # Argentina    0.500  0.500  0.500  0.536  # ...            ...    ...    ...    ...  # Venezuela    0.393  0.250  0.286  0.179  # Vietnam      0.393  0.321  0.321  0.321  # Yemen        0.036    NaN    NaN    NaN  # Zambia       0.536  0.536  0.500  0.464  # Zimbabwe     0.200  0.200  0.200  0.200  # [164 rows x 13 columns]


test_serialized = df_to_multidx_df(df=gvtx_eiu_df, cols_idx1_name='Year',
                                   idx_idx2_name='Country',
                                   val_col_name='gvtx_eiu')
print(test_serialized)

#                       gvtx_eiu# Year Country                  # 2006 Afghanistan           NaN#      Albania             0.507#      Algeria             0.221#      Angola              0.214#      Argentina           0.500# ...                        ...# 2018 Venezuela           0.179#      Vietnam             0.321#      Yemen                 NaN#      Zambia              0.464#      Zimbabwe            0.200# [2132 rows x 1 columns]


test_multidx_df = table_dict_to_multidx_df(test_table_dict, 'Year', 'Country')

print(test_multidx_df)

#                       gvtx_eiu       polpartix_eiu  clean_elec_idea# Year Country                                                       # 2006 Afghanistan           NaN               0.222            0.475#      Albania             0.507               0.444            0.541#      Algeria             0.221               0.222            0.399#      Angola              0.214               0.111              NaN#      Argentina           0.500               0.556            0.778# ...                        ...                 ...              ...# 2017 Somalia               NaN                 NaN            0.394#      South Sudan           NaN                 NaN              NaN# 2018 Georgia               NaN                 NaN            0.605#      Somalia               NaN                 NaN              NaN#      South Sudan           NaN                 NaN              NaN# [6976 rows x 3 columns]

test_multidx_profile = ProfileReport(test_multidx_df, title='Test MultIdx Profile')

The output is exactly what I was going for, but in addition to wishing for a one- or two-statement solution, I'm not completely happy with iterating through an input dictionary of dataframes. I tried to make the input a dataframe of dataframes so I could apply(lambda) to save some memory I think, but no dice getting apply() to work right, and it's time to move on.

Post a Comment for "How To Correlate Scalar Values Of Two Pandas Dataframes"