Finding The Most Correlated Item
I have restaurant sales details as below.
+----------+------------+---------+----------+
| Location | Units Sold | Revenue | Footfall |
+----------+------------+---------+-------
Solution 1:
If "most correlated" should mean the row with the minimal Euclidean distance, one solution is:
import numpy as np
import pandas as pd

# convert columns to numeric
df1['Revenue'] = df1['Revenue'].str.replace(',', '').astype(int)
df2['Revenue'] = df2['Revenue'].str.replace(',', '').astype(int)

# distance of every row of df2 from the first row of the first DataFrame
dist = np.sqrt((df2['Units Sold'] - df1.loc[0, 'Units Sold'])**2 +
               (df2['Revenue'] - df1.loc[0, 'Revenue'])**2 +
               (df2['Footfall'] - df1.loc[0, 'Footfall'])**2)
print (dist)
0    103.077641
1    160.390149
2     55.398556
3    115.991379
4     17.058722
5    115.542200
dtype: float64
# get index of minimal value and select row of second df
print (df2.loc[[dist.idxmin()]])
   Location  Units Sold  Revenue  Footfall
4  Loc - 06          89     1157        74
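The same idea can be written once for any number of numeric columns. A minimal sketch, assuming small made-up frames (the values below are illustrative, not the question's data) and two hypothetical helper names, `to_numeric_frame` and `nearest_row`:

```python
import numpy as np
import pandas as pd

# Illustrative data only; these values are assumptions, not the question's table.
df1 = pd.DataFrame({'Location': ['Loc - 01'],
                    'Units Sold': [100], 'Revenue': ['1,150'], 'Footfall': [85]})
df2 = pd.DataFrame({'Location': ['Loc - 02', 'Loc - 03', 'Loc - 04'],
                    'Units Sold': [100, 70, 98],
                    'Revenue': ['1,250', '900', '1,160'],
                    'Footfall': [60, 50, 80]})

cols = ['Units Sold', 'Revenue', 'Footfall']

def to_numeric_frame(df):
    # Strip thousands separators and convert every feature column to float.
    return df[cols].astype(str).replace(',', '', regex=True).astype(float)

def nearest_row(target, candidates):
    # Hypothetical helper: Euclidean distance from each candidate row
    # to the first row of the target frame.
    diff = to_numeric_frame(candidates) - to_numeric_frame(target).iloc[0]
    dist = np.sqrt((diff ** 2).sum(axis=1))
    return candidates.loc[[dist.idxmin()]], dist

match, dist = nearest_row(df1, df2)
print(match)
```

Note that Revenue is on a much larger scale than the other two columns, so it contributes most of the distance; whether that is what you want depends on your data.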
Solution 2:
There might be a better way to do this, but I think this works. It's quite verbose, so I've tried to keep the code clean and readable:
First, let's use a user-defined NumPy function from this post.
import numpy as np
import pandas as pd
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return array[idx]
Then, using the columns of your second DataFrame, pass in each value from the first row of your first DataFrame to find the closest match per column.
us = find_nearest(df2['Units Sold'],df['Units Sold'][0])
ff = find_nearest(df2['Footfall'],df['Footfall'][0])
rev = find_nearest(df2['Revenue'],df['Revenue'][0])
print(us,ff,rev,sep=',')
100,87,1157
Then return a DataFrame with the rows that match any of the three conditions:
new_df = df2.loc[
    (df2['Units Sold'] == us) |
    (df2['Footfall'] == ff) |
    (df2['Revenue'] == rev)]
print(new_df)
which gives us:
   Location  Units Sold  Revenue  Footfall
0  Loc - 02         100     1250        60
3  Loc - 05         115     1035        87
4  Loc - 06          89     1157        74
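Because each column is matched on its own and the masks are combined with `|`, this approach can return several rows, one per column that happens to be closest. A small sketch with made-up numbers (the frame and the target values 92 and 85 are assumptions for illustration) shows the difference between "any column matches" and "all columns match":

```python
import numpy as np
import pandas as pd

def find_nearest(array, value):
    # Return the element of `array` closest to `value` (first one on ties).
    array = np.asarray(array)
    idx = np.abs(array - value).argmin()
    return array[idx]

# Made-up candidate rows, not the question's data.
df2 = pd.DataFrame({'Units Sold': [100, 115, 89],
                    'Footfall': [60, 87, 74]})

us = find_nearest(df2['Units Sold'], 92)   # closest Units Sold value
ff = find_nearest(df2['Footfall'], 85)     # closest Footfall value

# Rows where at least one column holds its closest value.
either = df2[(df2['Units Sold'] == us) | (df2['Footfall'] == ff)]
# Rows where every column holds its closest value (may be empty).
both = df2[(df2['Units Sold'] == us) & (df2['Footfall'] == ff)]

print(either)
print(both)
```

Here the closest Units Sold and the closest Footfall sit in different rows, so `either` has two rows and `both` has none. If you need exactly one row, a combined distance over all columns is probably the safer choice.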
Solution 3:
Fix Data
This converts every column that can be numeric. I probably generalized this too much. I also set the index to be the 'Location' column.
def fix(d):
    d.update(
        d.astype(str).replace(',', '', regex=True)
        .apply(pd.to_numeric, errors='ignore')
    )
    d.set_index('Location', inplace=True)

fix(df1)
fix(df2)
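As far as I know, `errors='ignore'` in `pd.to_numeric` is deprecated in recent pandas releases, so here is a sketch of the same cleaning step without it. It assumes every column other than 'Location' is meant to be numeric; `fix_copy` is a hypothetical name, and unlike `fix` it returns a new frame instead of modifying its argument:

```python
import pandas as pd

def fix_copy(d, index_col='Location'):
    # Hypothetical variant of fix(): move the label column to the index,
    # then strip thousands separators and convert each remaining column.
    out = d.set_index(index_col)
    for col in out.columns:
        cleaned = out[col].astype(str).str.replace(',', '', regex=False)
        out[col] = pd.to_numeric(cleaned)  # raises if a value is not numeric
    return out

# Made-up single-row frame for illustration.
df = pd.DataFrame({'Location': ['Loc - 01'], 'Units Sold': [100],
                   'Revenue': ['1,150'], 'Footfall': [85]})
clean = fix_copy(df)
print(clean.dtypes)
```

Letting `pd.to_numeric` raise on bad values is a design choice here: it surfaces dirty data early rather than silently leaving a column as strings.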
Manhattan Distance
df2.loc[[df2.sub(df1.loc['Loc - 01']).abs().sum(1).idxmin()]]
          Units Sold  Revenue  Footfall
Location
Loc - 06          89     1157        74
Euclidean Distance
df2.loc[[df2.sub(df1.loc['Loc - 01']).pow(2).sum(1).pow(.5).idxmin()]]
          Units Sold  Revenue  Footfall
Location
Loc - 06          89     1157        74
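Both one-liners differ only in the norm applied to the row differences, so they can share one helper. A minimal sketch, assuming already-cleaned numeric frames indexed by 'Location' (the data is made up and `closest` is a hypothetical name), using `np.linalg.norm` with `ord=1` for Manhattan and `ord=2` for Euclidean:

```python
import numpy as np
import pandas as pd

# Made-up, already numeric candidates indexed by Location.
candidates = pd.DataFrame(
    {'Units Sold': [100, 70, 98],
     'Revenue': [1250, 900, 1160],
     'Footfall': [60, 50, 80]},
    index=pd.Index(['Loc - 02', 'Loc - 03', 'Loc - 04'], name='Location'))

# Made-up target row (plays the role of df1.loc['Loc - 01']).
target = pd.Series({'Units Sold': 100, 'Revenue': 1150, 'Footfall': 85})

def closest(candidates, target, order):
    # Row-wise differences, then a vector norm along each row:
    # order=1 is the Manhattan distance, order=2 the Euclidean distance.
    diff = candidates.sub(target).to_numpy(dtype=float)
    dist = np.linalg.norm(diff, ord=order, axis=1)
    return candidates.index[dist.argmin()]

print(closest(candidates, target, 1))
print(closest(candidates, target, 2))
```

On this toy data both metrics pick the same location, as they do in the answer above, but in general the two can disagree.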