
Are The Outcomes Of The Numpy.where Method On A Pandas Dataframe Calculated On The Full Array Or The Filtered Array?

I want to use numpy.where on a pandas dataframe to check for the existence of a certain string in a column. If the string is present apply a split-function and take the second list

Solution 1:

Python isn't a "lazy" language, so code is evaluated immediately. Generators and iterators do introduce some laziness, but that doesn't apply here.

If we split your line of code, we get the following expressions:

  1. df.A.str.contains('_')
  2. df.A.apply(lambda x: x.split('_')[1])
  3. df.A.str[0]

Python has to evaluate each of these before it can pass the results as arguments to np.where, so all three are computed over the whole column.
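A minimal sketch of that evaluation order, using a hypothetical helper `make_values` (not part of numpy or pandas) that records when it is called. It only illustrates that all three arguments are built in full before np.where receives them:

```python
import numpy as np

calls = []  # records which helper calls actually ran, in order

def make_values(label, values):
    # Hypothetical helper: log the call, then return a full array.
    calls.append(label)
    return np.array(values)

# All three arguments are evaluated left to right before np.where runs.
result = np.where(
    make_values("condition", [True, False, True]),
    make_values("if_true", [1, 2, 3]),
    make_values("if_false", [10, 20, 30]),
)

print(calls)   # every helper ran, regardless of the condition
print(result)  # values picked element-wise from the two full arrays
```

np.where itself only chooses between two arrays that already exist; it has no way to skip the work that produced them.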

To see all this happening, we can rewrite the above as little functions that print some output:

def fn_contains(x):
    print('contains', x)
    return '_' in x

def fn_split(x):
    s = x.split('_')
    print('split', x, s)
    # check for errors here
    if len(s) > 1:
        return s[1]

def fn_first(x):
    print('first', x)
    return x[0]

You can then run them on your data with:

import numpy as np
import pandas as pd

s = pd.Series(['a','a_1','b_','b_2_3'])
np.where(
  s.apply(fn_contains),
  s.apply(fn_split),
  s.apply(fn_first)
)

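You'll see everything being executed in order: every fn_contains call first, then every fn_split call, then every fn_first call, each over all four values. This is basically what happens "inside" numpy/pandas when you run your original line.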

Solution 2:

In my opinion numpy.where only selects values by condition, so the second and third arrays are computed for all rows, both the rows that match the filter and the rows that do not.

If you need to apply a function only to the filtered values:

mask = df.A.str.contains('_')
df.loc[mask, "B"] = df.loc[mask, "A"].str.split('_').str[1]
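As a rough check that the mask approach really touches only the filtered rows, here is a small sketch on the sample values from Solution 1. The function `second_part` and the list `seen` are hypothetical names introduced for illustration, and filling B with the first character beforehand is an assumption about the desired fallback:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a_1", "b_", "b_2_3"]})
seen = []  # values the split function was actually called with

def second_part(x):
    # Hypothetical helper: record the input, then take the second piece.
    seen.append(x)
    return x.split("_")[1]

mask = df["A"].str.contains("_")

# Assumed fallback: first character for every row, overwritten where the mask holds.
df["B"] = df["A"].str[0]
df.loc[mask, "B"] = df.loc[mask, "A"].apply(second_part)

print(seen)
print(df)
```

Because the function only ever sees rows that contain an underscore, the [1] lookup cannot fail on 'a'.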

There is an error in your solution, but the problem is not related to np.where. If a value contains no _, splitting it returns a one-element list, so selecting the second element with [1] raises an error:

print (df.A.apply(lambda x: x.split('_')))
0          [a]
1       [a, 1]
2        [b, ]
3    [b, 2, 3]
Name: A, dtype: object

print (df.A.apply(lambda x: x.split('_')[1]))
IndexError: list index out of range

So a pure pandas solution is possible here if performance is not important (string functions are slow), because .str[1] returns NaN instead of raising when the list is too short:

df["B"] = np.where(df.A.str.contains('_'), 
                   df.A.str.split('_').str[1],
                   df.A.str[0])
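A short sketch of why this version does not raise, again on the sample values from Solution 1. The intermediate name `parts` is introduced only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "a_1", "b_", "b_2_3"]})

# .str[1] on the split lists yields a missing value where no second element
# exists, rather than raising IndexError like a plain list lookup would.
parts = df["A"].str.split("_").str[1]
print(parts.isna().tolist())

# Both branches are still computed for every row; np.where only selects.
df["B"] = np.where(df["A"].str.contains("_"), parts, df["A"].str[0])
print(df["B"].tolist())
```

The missing value for 'a' never reaches column B, because the condition is False for that row and the first character is chosen instead.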
