Most Pythonic Way To Concatenate Pandas Cells With Conditions
Solution 1:
What is the most pythonic way to do this?
It depends on the definition. If "most Pythonic" means the preferred, most common and fastest way, then the np.where solution is the most Pythonic here.
Use numpy.where, or one of the pandas alternatives below if you prefer pure pandas. All of these are vectorized, so they should be preferred over apply (which loops under the hood):
df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])
Pandas alternatives:
df['final_target'] = df['city'].mask(df['city'].eq('paris'),
                                     df['city'] + '_' + df['arr'].astype(str))
df['final_target'] = df['city'].where(df['city'].ne('paris'),
                                      df['city'] + '_' + df['arr'].astype(str))
print(df)
     city  arr final_target
0   paris   11     paris_11
1   paris   12     paris_12
2  dallas   22       dallas
3   miami   15        miami
4   paris   16     paris_16
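For anyone who wants to run the vectorized variants end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the printed output above, and the names is_paris, joined, via_numpy, via_mask and via_where are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the printed output (city and arr columns only).
df = pd.DataFrame({
    "city": ["paris", "paris", "dallas", "miami", "paris"],
    "arr": [11, 12, 22, 15, 16],
})

# Compute the condition and the concatenated candidate values once.
is_paris = df["city"].eq("paris")
joined = df["city"] + "_" + df["arr"].astype(str)

# np.where returns a NumPy array, so wrap it to keep the original index.
via_numpy = pd.Series(np.where(is_paris, joined, df["city"]), index=df.index)
# mask replaces values where the condition is True.
via_mask = df["city"].mask(is_paris, joined)
# where keeps values where the condition is True, hence the negation.
via_where = df["city"].where(~is_paris, joined)

df["final_target"] = via_numpy
print(df)
```

All three variants should produce the same column, so the choice between them is mostly a matter of taste and of whether you want a NumPy array or a Series back.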
Performance:
#50k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [157]: %%timeit
...: df['final_target'] = np.where(df['city'].eq('paris'),
...: df['city'] + '_' + df['arr'].astype(str),
...: df['city'])
...:
48.6 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %%timeit
...: df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
...:
...:
49.2 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [159]: %%timeit
...: df['final_target'] = df['city']
...: df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
...:
63.8 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [160]: %%timeit
...: df['final_target'] = df.apply(lambda x: x.city + '_' + str(x.arr) if x.city == 'paris' else x.city, axis = 1)
...:
...:
1.33 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
A one-liner does the trick:
df['final_target'] = df.apply(lambda x: x.city + '_' + str(x.arr) if x.city == 'paris' else x.city, axis = 1)
Solution 3:
Try this neat and short two-liner with loc:
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df.loc[df['city'] == 'paris', 'arr'].astype(str)
This solution first assigns df['city'] as the final_target column, then appends the arr column, separated by an underscore, on the rows where the city column is paris.
IMO this is probably the most Pythonic and neat way here.
print(df)
     city  arr final_target
0   paris   11     paris_11
1   paris   12     paris_12
2  dallas   22       dallas
3   miami   15        miami
4   paris   16     paris_16
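The two-liner evaluates the same condition twice. A small variation, sketched here under the assumption that the frame looks like the printed output, stores the boolean mask in a variable (is_paris is a hypothetical name) and reuses it on both sides of the assignment:

```python
import pandas as pd

# Sample frame rebuilt from the printed output (city and arr columns only).
df = pd.DataFrame({
    "city": ["paris", "paris", "dallas", "miami", "paris"],
    "arr": [11, 12, 22, 15, 16],
})

# Boolean mask selecting the rows to modify; computed once and reused.
is_paris = df["city"] == "paris"

# Start from a copy of the city column, then overwrite only the paris rows.
df["final_target"] = df["city"]
df.loc[is_paris, "final_target"] = (
    df.loc[is_paris, "city"] + "_" + df.loc[is_paris, "arr"].astype(str)
)
print(df)
```

Because both sides are restricted with the same mask, the right-hand side only builds strings for the rows that actually change.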
Solution 4:
Pretty self-explanatory, one line, looks Pythonic:
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
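The one-liner leans on a plain Python rule: bool is a subclass of int, so multiplying a string by True returns the string and multiplying it by False returns an empty string. The sketch below shows what happens per row in pure Python, assuming the columns hold ordinary Python str values; the lists cities and arrs are illustrative, not taken from the original answer.

```python
# Illustrative row values standing in for the city and arr columns.
cities = ["paris", "dallas", "paris"]
arrs = [11, 22, 16]

# True * s == s and False * s == "", so the suffix survives only for paris.
result = [
    city + (city == "paris") * ("_" + str(arr))
    for city, arr in zip(cities, arrs)
]
print(result)  # ['paris_11', 'dallas', 'paris_16']
```

The pandas expression applies the same multiplication element by element, which is why the non-paris rows end up with an empty suffix.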
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import io
import numpy as np  # needed for the np.where timing below
import pandas as pd
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df
Speeds
%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['final_target'] = np.where(df['city'].eq('paris'),
df['city'] + '_' + df['arr'].astype(str),
df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I am not sure why this example fails (update: it fails because of the sampling), but the memory error is still a mystery. See the related question: "Why memory error when using .loc in pandas with sampling instead of direct computing".
%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64
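One plausible reading of the "failing due to sampling" update, offered as an assumption rather than the author's diagnosis: sample(..., replace=True) keeps the original row labels, so the index contains many duplicates. When two Series with different, non-unique indexes are added, pandas aligns them by label and pairs every duplicate on one side with every duplicate on the other, which can grow the result far beyond the original length. The miniature frame below is hypothetical and only imitates that situation.

```python
import pandas as pd

# Hypothetical miniature of the sampled frame: index labels repeat,
# as they do after sample(..., replace=True).
df = pd.DataFrame(
    {"city": ["paris", "paris", "dallas", "paris"], "arr": [11, 11, 22, 12]},
    index=[0, 0, 2, 1],
)

suffix = "_" + df["arr"].astype(str)
paris_rows = df.loc[df["city"] == "paris", "city"]

# The indexes differ and contain duplicates, so alignment by label
# produces more rows than the frame has.
aligned = paris_rows + suffix
blown_up = len(aligned) > len(df)

# Giving every row a unique label avoids the blow-up.
clean = df.reset_index(drop=True)
clean["final_target"] = clean["city"]
mask = clean["city"] == "paris"
clean.loc[mask, "final_target"] += "_" + clean["arr"].astype(str)
print(clean)
```

If this is indeed the cause, building the benchmark frame with sample(...).reset_index(drop=True), or with ignore_index=True where available, should let the loc variant be timed alongside the others.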