Pandas: Select Df Rows Based On Another Df
Solution 1:
Using np.searchsorted:
Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.
# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
df1['date'].values <= s1['end_date'].values,
df1['date_end'].values <= s2['end_date'].values,
s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through in that scenario, but it may still work.
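For reference, here is a minimal sketch of the searchsorted lookup used above (the dates are made up for illustration, not taken from the sample data):
import numpy as np
starts = np.array(['2016-11-23T23:55:32', '2016-11-24T01:03:18'], dtype='datetime64[s]')
# side='right' minus 1 yields the index of the last interval whose start is <= the query date:
idx = np.searchsorted(starts, np.datetime64('2016-11-24T00:05:00'), side='right') - 1
print(idx)  # 0 -> the query date can only overlap the interval beginning at starts[0]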
Using an Interval Tree
Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2, and querying it against your intervals in df1 to find the ones that overlap.
The intervaltree package on PyPI seems to have good performance and easy-to-use syntax.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down a bit.
Also, note that intervals in the intervaltree package do not include the end point, although the start point is included. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also the reason why it's fine for me to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
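A quick sketch of that end-point behavior, using toy integer intervals rather than the data above:
from intervaltree import IntervalTree
tree = IntervalTree.from_tuples([(0, 10)])
print(tree.overlaps(9, 20))   # True: 9 falls inside [0, 10)
print(tree.overlaps(10, 20))  # False: the end point 10 is excluded from the interval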
The resulting output for either method:
                 date        value
0 2016-11-24 00:00:00  1759.199951
1 2016-11-24 00:10:00   992.400024
6 2016-11-24 01:00:00    82.499999
7 2016-11-24 01:10:00    37.400003
8 2016-11-24 01:20:00   159.899994
Timings
Using the following setup to produce larger sample data:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
Which yields the following for df1 and df2:
df1
                      date     value
0      2016-11-24 00:00:00  0.444939
1      2016-11-24 00:10:00  0.407554
2      2016-11-24 00:20:00  0.460148
3      2016-11-24 00:30:00  0.465239
4      2016-11-24 00:40:00  0.462691
...                    ...       ...
54995  2017-12-10 21:50:00  0.754123
54996  2017-12-10 22:00:00  0.401820
54997  2017-12-10 22:10:00  0.146284
54998  2017-12-10 22:20:00  0.394759
54999  2017-12-10 22:30:00  0.907233
df2
              start_date             end_date
0    2016-11-24 00:00:19  2016-11-24 00:41:24
1    2016-11-24 18:22:44  2016-11-24 18:36:44
2    2016-11-25 12:44:44  2016-11-25 13:03:13
3    2016-11-26 07:07:05  2016-11-26 07:49:29
4    2016-11-27 01:29:31  2016-11-27 01:34:32
..                   ...                  ...
495  2017-12-07 21:36:04  2017-12-07 22:14:29
496  2017-12-08 15:58:14  2017-12-08 16:10:35
497  2017-12-09 10:20:21  2017-12-09 10:26:40
498  2017-12-10 04:42:41  2017-12-10 05:22:47
499  2017-12-10 23:04:42  2017-12-10 23:44:53
And using the following functions for timing purposes:
def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
    ]
    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)
def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')
    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values
    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])
    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)
               | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals (same conditions as in root_searchsorted).
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
I get the following timings:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops, best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops, best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops, best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop, best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop, best of 3: 8.96 s per loop
Solution 2:
This solution (I believe it works) uses pandas.Series.asof. Under the hood, it's some version of searchsorted - but for some reason it's comparable in speed with @root's function.
I assume that all date columns are in the pandas datetime format, sorted, and that df2 intervals are non-overlapping.
The code is pretty short but somewhat intricate (explanation below).
# The smallest amount of time - handy when using open intervals:
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# The main function (see explanation below):
def get_it(df1):
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
The advantage of this approach is twofold: sdate and edate are evaluated only once, and the main function can take chunks of df1 if df1 is very large.
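For example, a hypothetical chunked run (assuming get_it and df1 as defined above, with an arbitrary chunk size):
chunk = 10000  # arbitrary chunk size, purely for illustration
pieces = [get_it(df1.iloc[i:i + chunk]) for i in range(0, len(df1), chunk)]
result = pd.concat(pieces)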
Explanation
pandas.Series.asof returns the last valid row for a given index. It can take an array as an input and is quite fast.
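A tiny sketch of that behavior, with made-up dates:
import pandas as pd
s = pd.Series([0, 1, 2], index=pd.to_datetime(['2016-01-01', '2016-01-05', '2016-01-09']))
print(s.asof(pd.Timestamp('2016-01-06')))  # 1 -> value of the last row whose index <= the lookup date
print(s.asof(pd.Timestamp('2015-12-31')))  # nan -> no row at or before this date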
For the sake of this explanation, let s[j] = sdate.index[j] be the jth date in sdate and let x be some arbitrary date (timestamp).
There is always s[sdate.asof(x)] <= x (this is exactly how asof works), and it's not difficult to show that:
1. j <= sdate.asof(x) if and only if s[j] <= x
2. sdate.asof(x) < j if and only if x < s[j]
Similarly for edate. Unfortunately, we can't have the same inequalities (either weak or strict) in both 1. and 2.
Two intervals [a, b) and [x, y] intersect iff x < b and a <= y. (We may think of a, b as coming from sdate.index and edate.index - the interval [a, b) is chosen to be closed-open because of properties 1. and 2.)
In our case, x is a date from df1, y = x + 10min - epsilon, a = s[j], b = e[j] (note that epsilon has been added to edate), where j is some number.
So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is "edate.asof(x) < j and j <= sdate.asof(y) for some number j". And it roughly boils down to l < r inside the function get_it (modulo some technicalities).
Solution 3:
This is not exactly straightforward but you can do the following:
First get the relevant date columns from the two dataframes and concatenate them together so that one column is all the dates and the other two are columns representing the indexes from df2. (Note that df2 gets a multiindex after stacking)
dfm = pd.concat((df1['date'], df2.stack().reset_index())).sort_values(0)
print(dfm)
                    0  level_0     level_1
0 2016-11-23 23:55:32      0.0  start_date
0 2016-11-24 00:00:00      NaN         NaN
1 2016-11-24 00:10:00      NaN         NaN
1 2016-11-24 00:14:03      0.0    end_date
2 2016-11-24 00:20:00      NaN         NaN
3 2016-11-24 00:30:00      NaN         NaN
4 2016-11-24 00:40:00      NaN         NaN
5 2016-11-24 00:50:00      NaN         NaN
6 2016-11-24 01:00:00      NaN         NaN
2 2016-11-24 01:03:18      1.0  start_date
3 2016-11-24 01:07:12      1.0    end_date
7 2016-11-24 01:10:00      NaN         NaN
4 2016-11-24 01:11:32      2.0  start_date
8 2016-11-24 01:20:00      NaN         NaN
5 2016-11-24 02:00:00      2.0    end_date
You can see that the values from df1 have NaN in the right two columns, and since we have sorted the dates, these rows fall in between the start_date and end_date rows (from df2).
In order to indicate that the rows from df1 fall between the rows from df2, we can interpolate the level_0 column, which gives us:
dfm['level_0'] = dfm['level_0'].interpolate()
                    0   level_0     level_1
0 2016-11-23 23:55:32  0.000000  start_date
0 2016-11-24 00:00:00  0.000000         NaN
1 2016-11-24 00:10:00  0.000000         NaN
1 2016-11-24 00:14:03  0.000000    end_date
2 2016-11-24 00:20:00  0.166667         NaN
3 2016-11-24 00:30:00  0.333333         NaN
4 2016-11-24 00:40:00  0.500000         NaN
5 2016-11-24 00:50:00  0.666667         NaN
6 2016-11-24 01:00:00  0.833333         NaN
2 2016-11-24 01:03:18  1.000000  start_date
3 2016-11-24 01:07:12  1.000000    end_date
7 2016-11-24 01:10:00  1.500000         NaN
4 2016-11-24 01:11:32  2.000000  start_date
8 2016-11-24 01:20:00  2.000000         NaN
5 2016-11-24 02:00:00  2.000000    end_date
Notice that the level_0 column now contains integers (mathematically, not the data type) for the rows that fall between a start date and an end date (this assumes that an end date will not overlap the following start date).
Now we can just filter out the rows originally in df1:
df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']
And merge back with the original dataframe
df_final = pd.merge(df1, right=df_falls, on='date', how='outer')
which gives:
print(df_final)
                 date        value  falls_index
0 2016-11-24 00:00:00  1759.199951          0.0
1 2016-11-24 00:10:00   992.400024          0.0
2 2016-11-24 00:20:00  1404.800049          NaN
3 2016-11-24 00:30:00    45.799999          NaN
4 2016-11-24 00:40:00    24.299999          NaN
5 2016-11-24 00:50:00   159.899994          NaN
6 2016-11-24 01:00:00    82.499999          NaN
7 2016-11-24 01:10:00    37.400003          NaN
8 2016-11-24 01:20:00   159.899994          2.0
Which is the same as the original dataframe with the extra column falls_index, which indicates the index of the row in df2 that each row falls into.
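As a hypothetical follow-up (names as above), falls_index can also be used to pull the matching interval bounds back in:
matched = df_final.dropna(subset=['falls_index']).copy()
matched['falls_index'] = matched['falls_index'].astype(int)
# Attach the start_date/end_date of the df2 interval each row falls into.
matched = matched.join(df2, on='falls_index')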
Solution 4:
Consider a cross join merge that returns the cartesian product between both sets (all possible row pairings M x N). You can cross join using an all-1's key column in merge's on argument. Then, run a filter on the large returned set using pd.Series.between(). Specifically, between() keeps rows where the start date falls within the 9:59 range of date, or where date falls within the start and end times.
However, prior to the merge, create a df1['date'] column equal to the date index so it can be a retained column after the merge and used for date filtering. Additionally, create a df2['row'] column to be used as a row indicator at the end. For the demo, the following recreates the posted df1 and df2 dataframes:
from io import StringIO
import pandas as pd
import datetime as dt
data1 = '''
date value
"2016-11-24 00:00:00" 1759.199951
"2016-11-24 00:10:00" 992.400024
"2016-11-24 00:20:00" 1404.800049
"2016-11-24 00:30:00" 45.799999
"2016-11-24 00:40:00" 24.299999
"2016-11-24 00:50:00" 159.899994
"2016-11-24 01:00:00" 82.499999
"2016-11-24 01:10:00" 37.400003
"2016-11-24 01:20:00" 159.899994
'''
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values
data2 = '''
start_date end_date
"2016-11-23 23:55:32" "2016-11-24 00:14:03"
"2016-11-24 01:03:18" "2016-11-24 01:07:12"
"2016-11-24 01:11:32" "2016-11-24 02:00:00"
'''
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
df3 = df3[(df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)) |
          (df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True))].set_index('date')[['value', 'row']]
print(df3)
#                            value  row
# date
# 2016-11-24 00:00:00  1759.199951    0
# 2016-11-24 00:10:00   992.400024    0
# 2016-11-24 01:00:00    82.499999    1
# 2016-11-24 01:10:00    37.400003    2
# 2016-11-24 01:20:00   159.899994    2
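As an aside, newer pandas versions support a native cross join, so the dummy key column can be dropped; a sketch under that assumption:
# Requires pandas >= 1.2; the 'key' columns are no longer needed.
df3 = pd.merge(df1, df2, how='cross')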
Solution 5:
I tried to modify @root's code to use the experimental query pandas method (see the pandas docs).
It should be faster than the original implementation for very large DataFrames. For small DataFrames it will definitely be slower.
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals (same conditions as in root_searchsorted).
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)