Pandas: Merge_asof() Sum Multiple Rows / Don't Duplicate
Solution 1:
You are asking for the rows from B that are between the previous and current row of A. I can get the first and last index pretty easily with this:
# get the previous dates from A:
prev_dates = np.roll(df_a.date, 1)
prev_dates[0] = pd.to_datetime(0)
# get the first and last index of B:
start = np.searchsorted(df_b.date, prev_dates)
stop = np.searchsorted(df_b.date, df_a.date, side='right') - 1
And now I can use a little list comprehension to get my results:
>>> [df_b.num.values[begin:end+1].sum() forbegin, endin zip(start, stop)]
[1, 0, 110, 11000]
Solution 2:
Thanks for posting this question. It prompted me to spend an educational couple of hours studying the merge_asof
source. I do not think that your solution can be improved considerably, but I would offer a couple of tweaks to speed it up a few percent.
# if we concat the original date vector, we will only need to merge once
df_ax = pd.concat([df_a, df_a.rename(columns={'date':'date1'})], axis=1)
# do the outer merge
df_m = pd.merge(df_ax, df_b, on='date', how='outer').sort_values(by='date')
# do a single rename, inplace
df_m.rename(columns={'date': 'datex', 'date1': 'date'}, inplace=True)
# fill the gaps to allow the groupby and sum
df_m['num'].fillna(0, inplace=True)
df_m['date'].fillna(method='bfill', inplace=True)
# roll up the results.
x = df_m.groupby('date').num.sum().reset_index()
Solution 3:
Ok, answered my own question, but it seems a little hackish and I'd be interested in hearing other answers. Also, this does not rely on merge_asof()
.
Using the same DataFrames as above:
df_m = pd.merge(df_a, df_b, on = 'date', how = 'outer').sort_values(by = 'date')
df_a = df_a.rename(columns = {'date':'date1'})
df_m = pd.merge(df_m, df_a, left_on = 'date', right_on = 'date1', how = 'outer')
df_m['num'].fillna(0, inplace = True)
df_m['date1'].fillna(method = 'bfill', inplace = True)
x = df_m.groupby('date1').num.sum().reset_index().rename(columns = {'date1':'date'})
Solution 4:
We can use conditional_join from pyjanitor to get the multiple rows, before aggregating:
# pip install pyjanitorimport pandas as pd
import janitor
(df_a.assign(start_date = df_a.date.shift().fillna(pd.to_datetime(0)))
.conditional_join(df_b.rename(columns={'date':'date_r'}),
# variable args of conditions
('date', 'date_r', '>='),
('start_date', 'date_r', '<='),
how = 'left')
.groupby('date', as_index=False)
.num
.sum(numeric_only=True)
)
date num
02016-01-151.012016-03-150.022016-05-15110.032016-07-1511000.0
Solution 5:
Here's a solution which still uses merge_asof()
along with duplicated()
. I'll use the OP's example as a starting point
df_a = pd.DataFrame({'date':pd.to_datetime(['1/15/2016','3/15/2016','5/15/2016','7/15/2016'])})
df_b = pd.DataFrame({'date':pd.to_datetime(['1/1/2016','4/1/2016','5/1/2016','6/1/2016','7/1/2016']), 'num':[1,10,100,1000,10000]})
df_x = pd.merge_asof(df_a, df_b, on = 'date')
# We have duplicate values in num column. Set all but first match to 0 (or None, or...)# Be careful to avoid nulls since Pandas sees those as values
df_x.loc[(df_x['num'].notnull()) & (df_x.duplicated('num')==True), 'num'] = 0
df_x
now has the value:
datenum02016-01-15 112016-03-15 022016-05-15 10032016-07-15 10000
Post a Comment for "Pandas: Merge_asof() Sum Multiple Rows / Don't Duplicate"