Using Numpy.unique On Multiple Columns Of A Pandas.DataFrame
I am looking to use numpy.unique to obtain the reverse unique indexes of two columns of a pandas.DataFrame. I know how to use it on one column: u, rev = numpy.unique(df[col], retur
Solution 1:
Approach #1
Here's one NumPy approach: convert each row to a scalar by treating the row as one indexing tuple on an n-dimensional grid (two-dimensional for 2 columns of data) -
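import numpy as np

def unique_return_inverse_2D(a):  # a is a 2D integer array
    # Collapse each row to one scalar: its row-major linear index on the grid
    a1D = a.dot(np.append((a.max(0)+1)[:0:-1].cumprod()[::-1], 1))
    return np.unique(a1D, return_inverse=True)[1]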
If the data contains negative numbers, we need to use min too to get those scalars. In that case, use a.max(0) - a.min(0) + 1 in place of a.max(0) + 1.
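To make the negative-number case concrete, here is a minimal sketch of that variant. The function name unique_return_inverse_2D_signed and the sample array are made up for illustration. It also subtracts the column minima before the dot product, which only shifts every scalar by the same constant. Note that the product of the column extents has to fit in the integer dtype, otherwise the scalars can overflow.

```python
import numpy as np

def unique_return_inverse_2D_signed(a):
    # Hypothetical variant of unique_return_inverse_2D for data with negatives.
    a = np.asarray(a)
    # Size of the grid along each column: max - min + 1 slots.
    extents = a.max(0) - a.min(0) + 1
    # Row-major strides: product of the extents of all later columns, 1 for the last.
    strides = np.append(extents[:0:-1].cumprod()[::-1], 1)
    # Shift by the column minima so every coordinate is non-negative, then flatten.
    a1D = (a - a.min(0)).dot(strides)
    return np.unique(a1D, return_inverse=True)[1]

# Illustrative data: rows 0 and 2 are identical.
a = np.array([[-3, 5], [2, -1], [-3, 5], [0, 0]])
ids = unique_return_inverse_2D_signed(a)
print(ids)  # [0 2 0 1]
```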
Approach #2
Here's another solution, based on NumPy views and focused on performance, inspired by this smart solution by @Eric -
def unique_return_inverse_2D_viewbased(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    # View each row as a single opaque (void) item spanning all of its bytes
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    return np.unique(a.view(void_dt).ravel(), return_inverse=True)[1]
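To see what the view does, here is a small sketch on a made-up 3x2 array (the data and variable names are only for illustration). Each row becomes one void item covering the row's raw bytes, so np.unique compares whole rows at once. Because the comparison is bytewise, identical rows always share an ID, but the numbering of the IDs need not follow numeric order in general.

```python
import numpy as np

# Illustrative data: rows 0 and 2 are identical.
a = np.ascontiguousarray(np.array([[1, 2], [3, 4], [1, 2]], dtype=np.int64))

# One void item per row: itemsize = bytes per element * elements per row.
void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
v = a.view(void_dt)
print(v.shape)           # (3, 1): one opaque item per row
print(v.dtype.itemsize)  # 16 bytes = 2 columns * 8 bytes

ids = np.unique(v.ravel(), return_inverse=True)[1]
print(ids)               # identical rows get the same ID
```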
Sample runs -
In [209]: df
Out[209]:
    0   1   2   3
0  21   7  31  69
1  62  75  22  62  # ----|
2  16  46   9  31  #     |==> Identical rows, so must have same IDs
3  62  75  22  62  # ----|
4  24  12  88  15
In [210]: unique_return_inverse_2D(df.values)
Out[210]: array([1, 3, 0, 3, 2])
In [211]: unique_return_inverse_2D_viewbased(df.values)
Out[211]: array([1, 3, 0, 3, 2])
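As a cross-check, not part of either approach above, np.unique can also work on whole rows through its axis argument (this assumes NumPy 1.13 or later). A minimal sketch on the same sample data:

```python
import numpy as np
import pandas as pd

# Same data as the sample run above.
df = pd.DataFrame([[21, 7, 31, 69],
                   [62, 75, 22, 62],
                   [16, 46, 9, 31],
                   [62, 75, 22, 62],
                   [24, 12, 88, 15]])

# axis=0 treats each row as one item; the inverse maps every row to its unique-row ID.
u, rev = np.unique(df.values, axis=0, return_inverse=True)
# Flatten defensively: the shape of the inverse has varied across NumPy versions.
rev = rev.ravel()
print(rev)  # [1 3 0 3 2]
```

Indexing the unique rows with the inverse, u[rev], gives back the original array.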
Solution 2:
I think you can convert the columns to strings and then sum:
u, rev = np.unique(df.astype(str).values.sum(axis=1), return_inverse=True)
print (rev)
[0 1 2 2 3]
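As DSM pointed out (thank you), this is dangerous: concatenating strings can make distinct rows collide, e.g. (1, 23) and (12, 3) both become '123'.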
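A small sketch of the collision, using a made-up two-column frame (the column names and values are only for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data: rows 0 and 1 are different, rows 0 and 2 are identical.
df = pd.DataFrame({"a": [1, 12, 1], "b": [23, 3, 23]})

# String concatenation loses the column boundary: every row becomes '123'.
keys = df.astype(str).values.sum(axis=1)
u, rev = np.unique(keys, return_inverse=True)
print(list(keys))  # ['123', '123', '123']
print(rev)         # [0 0 0], so the distinct row 1 is wrongly merged with rows 0 and 2
```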
Another solution is to convert the rows to tuples:
u, rev = np.unique(df.apply(tuple, axis=1), return_inverse=True)
print (rev)
[0 1 2 2 3]