Average Of All Rows Corresponding To All Unique Rows

June 25, 2024

I have a NumPy array with two rows: A = [[1,1,1,2,3,1,2,3],[0.1,0.2,0.2,0.1,0.3,0.2,0.2,0.1]]. For each unique value in the first row, I want the average of the corresponding values in the second row.

Solution 1:

I think the following is the standard NumPy approach for this kind of computation. The call to np.unique can be skipped if the entries of A[0] are small non-negative integers (np.bincount can then use them directly as bin indices), but keeping it makes the whole operation more robust and independent of the actual data.

>>> import numpy as np
>>> A = [[1,1,1,2,3,1,2,3],[0.1,0.2,0.2,0.1,0.3,0.2,0.2,0.1]]
>>> unq, unq_idx = np.unique(A[0], return_inverse=True)
>>> unq_sum = np.bincount(unq_idx, weights=A[1])
>>> unq_counts = np.bincount(unq_idx)
>>> unq_avg = unq_sum / unq_counts
>>> unq
array([1, 2, 3])
>>> unq_avg
array([ 0.175,  0.15 ,  0.2  ])

You could of course then stack both arrays, although that will convert unq to float dtype:

>>> np.vstack((unq, unq_avg))
array([[ 1.   ,  2.   ,  3.   ],
       [ 0.175,  0.15 ,  0.2  ]])
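
The np.unique / np.bincount steps above can be wrapped into a small reusable helper. This is only a minimal sketch: the function name `group_average` and its signature are hypothetical and not part of NumPy.

```python
import numpy as np

def group_average(keys, values):
    """Return (unique_keys, averages) of `values` grouped by `keys`.

    Hypothetical helper for illustration; assumes 1-D inputs of equal length.
    """
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    # `inverse` maps every element of `keys` to the position of its unique key
    unq, inverse = np.unique(keys, return_inverse=True)
    sums = np.bincount(inverse, weights=values)   # per-key sum of values
    counts = np.bincount(inverse)                 # per-key number of samples
    return unq, sums / counts

A = [[1, 1, 1, 2, 3, 1, 2, 3], [0.1, 0.2, 0.2, 0.1, 0.3, 0.2, 0.2, 0.1]]
unq, avg = group_average(A[0], A[1])
print(unq, avg)
```

Returning the keys and averages as two separate arrays avoids the float conversion of the keys that np.vstack would cause.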

Solution 2:

One possible solution is:

In [37]: a=np.array([[1,1,1,2,3,1,2,3],[0.1,0.2,0.2,0.1,0.3,0.2,0.2,0.1]])
In [38]: np.array([list(set(a[0])), [np.average(np.compress(a[0]==i, a[1])) for i in set(a[0])]])
Out[38]:
array([[ 1.   ,  2.   ,  3.  ],
       [ 0.175,  0.15 ,  0.2 ]])
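
The iteration order of a set is not guaranteed to be sorted, so the column order of the result above may vary. A possible variant, sketched here with np.unique substituted for set (an assumption of this example, not part of the original answer), gives sorted keys:

```python
import numpy as np

a = np.array([[1, 1, 1, 2, 3, 1, 2, 3],
              [0.1, 0.2, 0.2, 0.1, 0.3, 0.2, 0.2, 0.1]])

keys = np.unique(a[0])  # sorted unique keys
# One boolean mask per key: simple to read, but the cost grows with
# (number of samples) * (number of unique keys).
means = np.array([a[1][a[0] == k].mean() for k in keys])
result = np.vstack((keys, means))
print(result)
```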

Solution 3:

You can probably do this more efficiently by using np.histogram to first get the sums of the values in A[1] corresponding to each unique index in A[0], and then the total number of occurrences of each unique index.

For example:

import numpy as np

A = np.array([[1,1,1,2,3,1,2,3],[0.1,0.2,0.2,0.1,0.3,0.2,0.2,0.1]])

# NB for n unique values in A[0] we want (n + 1) bin edges, such that
# A[0].max() < bin_edges[-1]
bin_edges = np.arange(A[0].min(), A[0].max()+2, dtype=int)

# the `weights` parameter means that the count for each bin is weighted
# by the corresponding value in A[1]
weighted_sums,_ = np.histogram(A[0], bins=bin_edges, weights=A[1])

# by calling `np.histogram` again without the `weights` parameter, we get
# the total number of occurrences of each unique index
index_counts,_ = np.histogram(A[0], bins=bin_edges)

# now just divide the weighted sums by the total occurrences
urow_avg = weighted_sums / index_counts

print(urow_avg)
# [ 0.175  0.15   0.2  ]
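
The histogram approach assumes integer indices, and it creates one bin per integer between the minimum and maximum. If some integers in that range never occur, their bins have a count of zero and the division yields NaN. A rough sketch of one way to handle this, using made-up data with a gap (the arrays `ids` and `vals` are illustrative and not from the question):

```python
import numpy as np

ids = np.array([1, 1, 4])           # note: 2 and 3 never occur
vals = np.array([0.2, 0.4, 1.0])

bin_edges = np.arange(ids.min(), ids.max() + 2)
sums, _ = np.histogram(ids, bins=bin_edges, weights=vals)
counts, _ = np.histogram(ids, bins=bin_edges)

# keep only the bins that actually contain samples
present = counts > 0
present_ids = bin_edges[:-1][present]
avgs = sums[present] / counts[present]
print(present_ids, avgs)
```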

Solution 4:

Yet another efficient NumPy-only solution, using reduceat:

import numpy as np

# in Python 3, zip returns an iterator, so it must be wrapped in list()
A = np.array(list(zip(*[[1,1,1,2,3,1,2,3],[0.1,0.2,0.2,0.1,0.3,0.2,0.2,0.1]])),
        dtype=[('id','int64'),('value','float64')])
A.sort(order='id')
unique_ids,idxs = np.unique(A['id'],return_index=True)
avgs = np.add.reduceat(A['value'],idxs)
#divide by the number of samples to obtain the actual averages.
avgs[:-1]/=np.diff(idxs)
avgs[-1]/=A.size-idxs[-1]
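
The same reduceat idea can be written without a structured array. The sketch below is a variant rather than the answer's exact code: it sorts with np.argsort and takes the group sizes from the `return_counts` option of np.unique, which avoids treating the last group separately.

```python
import numpy as np

ids = np.array([1, 1, 1, 2, 3, 1, 2, 3])
vals = np.array([0.1, 0.2, 0.2, 0.1, 0.3, 0.2, 0.2, 0.1])

# sort both arrays by id so that equal ids are contiguous
order = np.argsort(ids, kind="stable")
sorted_ids, sorted_vals = ids[order], vals[order]

# `starts` holds the first position of each group in the sorted arrays
unique_ids, starts, counts = np.unique(
    sorted_ids, return_index=True, return_counts=True)

# reduceat sums each slice [starts[i], starts[i+1]); the last runs to the end
avgs = np.add.reduceat(sorted_vals, starts) / counts
print(unique_ids, avgs)
```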

Solution 5:

You could approach this as follows:

values = {}

# get all values for each index
for index, value in zip(*A):
    if index not in values:
        values[index] = []
    values[index].append(value)

# create average for each index
for index in values:
    values[index] = sum(values[index]) / float(len(values[index]))

# in Python 3, zip returns an iterator, so wrap it in list() first
B = np.array(list(zip(*values.items())))

For your example, this gives me:

>>> B
array([[ 1.   ,  2.   ,  3.   ],
       [ 0.175,  0.15 ,  0.2  ]])

You could simplify slightly using collections.defaultdict:

from collections import defaultdict

values = defaultdict(list)

for index, value in zip(*A):
    values[index].append(value)
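
The defaultdict snippet only collects the values, so the averaging step still has to follow. A complete version might look like the sketch below; sorting the keys is an addition of this example, made so that the column order is deterministic.

```python
from collections import defaultdict

import numpy as np

A = [[1, 1, 1, 2, 3, 1, 2, 3], [0.1, 0.2, 0.2, 0.1, 0.3, 0.2, 0.2, 0.1]]

# collect all values for each index
values = defaultdict(list)
for index, value in zip(*A):
    values[index].append(value)

# average each group, visiting the keys in sorted order
keys = sorted(values)
averages = [sum(values[k]) / len(values[k]) for k in keys]

B = np.array([keys, averages])
print(B)
```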
