Pandas: Select Df Rows Based On Another Df
Solution 1:
Using np.searchsorted:
Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.
# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
df1['date'].values <= s1['end_date'].values,
df1['date_end'].values <= s2['end_date'].values,
s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through in that scenario, but it may still work.
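For reference, here is a minimal sketch of the searchsorted lookup used above (the dates are made up for illustration, not taken from the sample data):
import numpy as np
starts = np.array(['2016-11-23T23:55:32', '2016-11-24T01:03:18'], dtype='datetime64[s]')
# side='right' minus 1 yields the index of the last interval whose start is <= the query date:
idx = np.searchsorted(starts, np.datetime64('2016-11-24T00:05:00'), side='right') - 1
print(idx)  # 0 -> the query date can only overlap the interval beginning at starts[0]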
Using an Interval Tree
Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2, and querying it against your intervals in df1 to find the ones that overlap.
The intervaltree package on PyPI seems to have good performance and easy-to-use syntax.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down a bit.
Also, note that intervals in the intervaltree package do not include the end point, although the start point is included. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also the reason why it's fine for me to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
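A quick sketch of that end-point behavior, using toy integer intervals rather than the data above:
from intervaltree import IntervalTree
tree = IntervalTree.from_tuples([(0, 10)])
print(tree.overlaps(9, 20))   # True: 9 falls inside [0, 10)
print(tree.overlaps(10, 20))  # False: the end point 10 is excluded from the interval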
The resulting output for either method:
                 date        value
0 2016-11-24 00:00:00  1759.199951
1 2016-11-24 00:10:00   992.400024
6 2016-11-24 01:00:00    82.499999
7 2016-11-24 01:10:00    37.400003
8 2016-11-24 01:20:00   159.899994
Timings
Using the following setup to produce larger sample data:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
Which yields the following for df1 and df2:
df1
                      date     value
0      2016-11-24 00:00:00  0.444939
1      2016-11-24 00:10:00  0.407554
2      2016-11-24 00:20:00  0.460148
3      2016-11-24 00:30:00  0.465239
4      2016-11-24 00:40:00  0.462691
...                    ...       ...
54995  2017-12-10 21:50:00  0.754123
54996  2017-12-10 22:00:00  0.401820
54997  2017-12-10 22:10:00  0.146284
54998  2017-12-10 22:20:00  0.394759
54999  2017-12-10 22:30:00  0.907233
df2
              start_date             end_date
0    2016-11-24 00:00:19  2016-11-24 00:41:24
1    2016-11-24 18:22:44  2016-11-24 18:36:44
2    2016-11-25 12:44:44  2016-11-25 13:03:13
3    2016-11-26 07:07:05  2016-11-26 07:49:29
4    2016-11-27 01:29:31  2016-11-27 01:34:32
..                   ...                  ...
495  2017-12-07 21:36:04  2017-12-07 22:14:29
496  2017-12-08 15:58:14  2017-12-08 16:10:35
497  2017-12-09 10:20:21  2017-12-09 10:26:40
498  2017-12-10 04:42:41  2017-12-10 05:22:47
499  2017-12-10 23:04:42  2017-12-10 23:44:53
And using the following functions for timing purposes:
def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
    ]
    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)
def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')
    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values
    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])
    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)
               | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals (same conditions as in root_searchsorted).
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
I get the following timings:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops, best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops, best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops, best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop, best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop, best of 3: 8.96 s per loop
Solution 2:
This solution (I believe it works) uses pandas.Series.asof. Under the hood, it's some version of searchsorted - but for some reason it's comparable in speed with @root's function.
I assume that all date columns are in the pandas datetime format, sorted, and that df2 intervals are non-overlapping.
The code is pretty short but somewhat intricate (explanation below).
# The smallest amount of time - handy when using open intervals:
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# The main function (see explanation below):
def get_it(df1):
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
The advantage of this approach is twofold: sdate and edate are evaluated only once, and the main function can take chunks of df1 if df1 is very large.
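For example, a hypothetical chunked run (assuming get_it and df1 as defined above, with an arbitrary chunk size):
chunk = 10000  # arbitrary chunk size, purely for illustration
pieces = [get_it(df1.iloc[i:i + chunk]) for i in range(0, len(df1), chunk)]
result = pd.concat(pieces)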
Explanation
pandas.Series.asof returns the last valid row for a given index. It can take an array as an input and is quite fast.
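A tiny sketch of that behavior, with made-up dates:
import pandas as pd
s = pd.Series([0, 1, 2], index=pd.to_datetime(['2016-01-01', '2016-01-05', '2016-01-09']))
print(s.asof(pd.Timestamp('2016-01-06')))  # 1 -> value of the last row whose index <= the lookup date
print(s.asof(pd.Timestamp('2015-12-31')))  # nan -> no row at or before this date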
For the sake of this explanation, let s[j] = sdate.index[j] be the jth date in sdate and let x be some arbitrary date (timestamp).
There is always s[sdate.asof(x)] <= x (this is exactly how asof works), and it's not difficult to show that:
1. j <= sdate.asof(x) if and only if s[j] <= x
2. sdate.asof(x) < j if and only if x < s[j]
Similarly for edate. Unfortunately, we can't have the same inequalities (either weak or strict) in both 1. and 2.
Two intervals [a, b) and [x, y] intersect iff x < b and a <= y. (We may think of a, b as coming from sdate.index and edate.index - the interval [a, b) is chosen to be closed-open because of properties 1. and 2.)
In our case, x is a date from df1, y = x + 10min - epsilon, a = s[j], b = e[j] (note that epsilon has been added to edate), where j is some number.
So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is "edate.asof(x) < j and j <= sdate.asof(y) for some number j". And it roughly boils down to l < r inside the function get_it (modulo some technicalities).
Solution 3:
This is not exactly straightforward but you can do the following:
First get the relevant date columns from the two dataframes and concatenate them together so that one column is all the dates and the other two are columns representing the indexes from df2. (Note that df2 gets a multiindex after stacking)
dfm = pd.concat((df1['date'], df2.stack().reset_index())).sort_values(0)
print(dfm)
                    0  level_0     level_1
0 2016-11-23 23:55:32      0.0  start_date
0 2016-11-24 00:00:00      NaN         NaN
1 2016-11-24 00:10:00      NaN         NaN
1 2016-11-24 00:14:03      0.0    end_date
2 2016-11-24 00:20:00      NaN         NaN
3 2016-11-24 00:30:00      NaN         NaN
4 2016-11-24 00:40:00      NaN         NaN
5 2016-11-24 00:50:00      NaN         NaN
6 2016-11-24 01:00:00      NaN         NaN
2 2016-11-24 01:03:18      1.0  start_date
3 2016-11-24 01:07:12      1.0    end_date
7 2016-11-24 01:10:00      NaN         NaN
4 2016-11-24 01:11:32      2.0  start_date
8 2016-11-24 01:20:00      NaN         NaN
5 2016-11-24 02:00:00      2.0    end_date
You can see that the values from df1 have NaN in the right two columns, and since we have sorted the dates, these rows fall in between the start_date and end_date rows (from df2).
In order to indicate that the rows from df1 fall between the rows from df2, we can interpolate the level_0 column, which gives us:
dfm['level_0'] = dfm['level_0'].interpolate()
                    0   level_0     level_1
0 2016-11-23 23:55:32  0.000000  start_date
0 2016-11-24 00:00:00  0.000000         NaN
1 2016-11-24 00:10:00  0.000000         NaN
1 2016-11-24 00:14:03  0.000000    end_date
2 2016-11-24 00:20:00  0.166667         NaN
3 2016-11-24 00:30:00  0.333333         NaN
4 2016-11-24 00:40:00  0.500000         NaN
5 2016-11-24 00:50:00  0.666667         NaN
6 2016-11-24 01:00:00  0.833333         NaN
2 2016-11-24 01:03:18  1.000000  start_date
3 2016-11-24 01:07:12  1.000000    end_date
7 2016-11-24 01:10:00  1.500000         NaN
4 2016-11-24 01:11:32  2.000000  start_date
8 2016-11-24 01:20:00  2.000000         NaN
5 2016-11-24 02:00:00  2.000000    end_date
Notice that the level_0 column now contains integers (mathematically, not the data type) for the rows that fall between a start date and an end date (this assumes that an end date will not overlap the following start date).
Now we can just filter out the rows originally in df1:
df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']
And merge back with the original dataframe
df_final = pd.merge(df1, right=df_falls, on='date', how='outer')
which gives:
print(df_final)
                 date        value  falls_index
0 2016-11-24 00:00:00  1759.199951          0.0
1 2016-11-24 00:10:00   992.400024          0.0
2 2016-11-24 00:20:00  1404.800049          NaN
3 2016-11-24 00:30:00    45.799999          NaN
4 2016-11-24 00:40:00    24.299999          NaN
5 2016-11-24 00:50:00   159.899994          NaN
6 2016-11-24 01:00:00    82.499999          NaN
7 2016-11-24 01:10:00    37.400003          NaN
8 2016-11-24 01:20:00   159.899994          2.0
Which is the same as the original dataframe with the extra column falls_index, which indicates the index of the row in df2 that each row falls into.
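As a hypothetical follow-up (names as above), falls_index can also be used to pull the matching interval bounds back in:
matched = df_final.dropna(subset=['falls_index']).copy()
matched['falls_index'] = matched['falls_index'].astype(int)
# Attach the start_date/end_date of the df2 interval each row falls into.
matched = matched.join(df2, on='falls_index')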
Solution 4:
Consider a cross join merge that returns the cartesian product between both sets (all possible row pairings M x N). You can cross join using an all-1's key column in merge's on argument. Then, run a filter on the large returned set using pd.Series.between(). Specifically, between() keeps rows where the start date falls within the 9:59 range of date, or where date falls within the start and end times.
However, prior to the merge, create a df1['date'] column equal to the date index so it can be a retained column after the merge and used for date filtering. Additionally, create a df2['row'] column to be used as a row indicator at the end. For the demo, the following recreates the posted df1 and df2 dataframes:
from io import StringIO
import pandas as pd
import datetime as dt
data1 = '''
date value
"2016-11-24 00:00:00" 1759.199951
"2016-11-24 00:10:00" 992.400024
"2016-11-24 00:20:00" 1404.800049
"2016-11-24 00:30:00" 45.799999
"2016-11-24 00:40:00" 24.299999
"2016-11-24 00:50:00" 159.899994
"2016-11-24 01:00:00" 82.499999
"2016-11-24 01:10:00" 37.400003
"2016-11-24 01:20:00" 159.899994
'''
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values
data2 = '''
start_date end_date
"2016-11-23 23:55:32" "2016-11-24 00:14:03"
"2016-11-24 01:03:18" "2016-11-24 01:07:12"
"2016-11-24 01:11:32" "2016-11-24 02:00:00"
'''
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
df3 = df3[(df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)) |
          (df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True))].set_index('date')[['value', 'row']]
print(df3)
#                            value  row
# date
# 2016-11-24 00:00:00  1759.199951    0
# 2016-11-24 00:10:00   992.400024    0
# 2016-11-24 01:00:00    82.499999    1
# 2016-11-24 01:10:00    37.400003    2
# 2016-11-24 01:20:00   159.899994    2
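As an aside, newer pandas versions support a native cross join, so the dummy key column can be dropped; a sketch under that assumption:
# Requires pandas >= 1.2; the 'key' columns are no longer needed.
df3 = pd.merge(df1, df2, how='cross')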
Solution 5:
I tried to modify @root's code to use the experimental query pandas method (see the pandas docs).
It should be faster than the original implementation for very large DataFrames. For small DataFrames it will definitely be slower.
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals (same conditions as in root_searchsorted).
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)