
When Turning A List Of Lists Of Tuples To An Array, How Can I Stop Tuples From Creating A 3rd Dimension?

I have a list of lists (each sublist of the same length) of tuples (each tuple of the same length, 2). Each sublist represents a sentence, and the tuples are bigrams of that senten

Solution 1:

To np.array, your list of lists of tuples isn't any different from a list of lists of lists. It's iterables all the way down. np.array tries to create as high a dimensional array as possible. In this case that is 3d.
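As a quick check of that claim, here is a small sketch. It assumes alist is the same nested list used in the sessions further down; the other variable names are only illustrative.

```python
import numpy as np

# Assumed sample data: 3 sentences, 2 bigrams each, every bigram a 2-tuple.
alist = [[(1, 2), (2, 3)], [(4, 5), (5, 6)], [(7, 8), (8, 9)]]

# Tuples at the innermost level...
as_tuples = np.array(alist)
# ...and the same data with lists at the innermost level.
as_lists = np.array([[list(t) for t in row] for row in alist])

print(as_tuples.shape)                      # (3, 2, 2)
print(np.array_equal(as_tuples, as_lists))  # True
```

Both calls give the same 3d integer array, which is why the tuples do not survive the conversion on their own.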

There are ways of sidestepping that and making a 2d array that contains objects, where those objects are things like tuples. But as noted in the comments, why would you want that?

In a recent SO question, I came up with this way of turning an n-d array into an object array of (n-m)-d shape (here alist is the nested list [[(1, 2), (2, 3)], [(4, 5), (5, 6)], [(7, 8), (8, 9)]]):

In [267]: res = np.empty((3,2),object)
In [268]: arr = np.array(alist)
In [269]: for ij in np.ndindex(res.shape):
     ...:     res[ij] = arr[ij]
     ...:     
In [270]: res
Out[270]: 
array([[array([1, 2]), array([2, 3])],
       [array([4, 5]), array([5, 6])],
       [array([7, 8]), array([8, 9])]], dtype=object)

But that's a 2d array of arrays, not of tuples.

In [271]: for ij in np.ndindex(res.shape):
     ...:     res[ij] = tuple(arr[ij].tolist())
     ...:     
In [272]: res
Out[272]: 
array([[(1, 2), (2, 3)],
       [(4, 5), (5, 6)],
       [(7, 8), (8, 9)]], dtype=object)

That's better (or is it?)

Or I could index the nested list directly:

In [274]: for i,j in np.ndindex(res.shape):
     ...:     res[i,j] = alist[i][j]
     ...:     
In [275]: res
Out[275]: 
array([[(1, 2), (2, 3)],
       [(4, 5), (5, 6)],
       [(7, 8), (8, 9)]], dtype=object)

I'm using ndindex to generate all the indices of a (3,2) array.
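For anyone who has not used it, this is a minimal illustration of what np.ndindex yields for that shape (the name indices is just for the example):

```python
import numpy as np

# np.ndindex walks every index tuple of the given shape, last axis fastest.
indices = list(np.ndindex(3, 2))
print(indices)
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```

Each tuple can be used directly as an array index (res[ij]) or unpacked into i, j to index the nested list.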

The structured array mentioned in the comments works because for a compound dtype, tuples are distinct from lists.

In [277]: np.array(alist, 'i,i')
Out[277]: 
array([[(1, 2), (2, 3)],
       [(4, 5), (5, 6)],
       [(7, 8), (8, 9)]], dtype=[('f0', '<i4'), ('f1', '<i4')])

Technically, though, that isn't an array of tuples. It just represents the elements (or records) of the array as tuples.

In the object dtype array, the elements of the array are pointers to the tuples in the list (at least in the Out[275] case). In the structured array case, the numbers are stored the same way as in a 3d array, as bytes in the array data buffer.
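A rough way to see the difference for yourself is sketched below. It assumes the same alist as above, and it spells the dtype as 'i4,i4' rather than 'i,i' so that the byte counts do not depend on the platform's C int size.

```python
import numpy as np

alist = [[(1, 2), (2, 3)], [(4, 5), (5, 6)], [(7, 8), (8, 9)]]

# Object array: each cell holds a reference to the original tuple.
res = np.empty((3, 2), dtype=object)
for i, j in np.ndindex(res.shape):
    res[i, j] = alist[i][j]
print(res[0, 0] is alist[0][0])        # True: same Python object, not a copy

# Structured array: the numbers are copied into the data buffer.
rec = np.array(alist, dtype='i4,i4')
print(rec.shape, rec.itemsize, rec.nbytes)   # (3, 2) 8 48
print(isinstance(rec[0, 0], tuple))    # False: an element is a np.void record
print(rec['f0'].tolist())              # [[1, 2], [4, 5], [7, 8]]
```

The field access rec['f0'] returns the first member of every bigram as an ordinary 2d integer array, something the object array cannot do without a Python-level loop.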


Solution 2:

Here are two more methods to complement @hpaulj's answer. One of them, the frompyfunc method, seems to scale a bit better than the others, although hpaulj's preallocation method is also not bad if we get rid of the loop. See the timings below:

import numpy as np
import itertools

bi_grams = [[(1, 2), (2, 3)], [(4, 5), (5, 6)], [(7, 8), (8, 9)]]

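def f_pp_1(bi_grams):
    # Flatten the tuples into one iterator; the zero-input ufunc then calls
    # its __next__ once for each cell of the preallocated output array.
    flat = itertools.chain.from_iterable(bi_grams)
    out = np.empty((len(bi_grams), len(bi_grams[0])), dtype=object)
    return np.frompyfunc(flat.__next__, 0, 1)(out)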

def f_pp_2(bi_grams):
    res = np.empty((len(bi_grams), len(bi_grams[0])), dtype=object)
    res[...] = bi_grams
    return res

def f_hpaulj(bi_grams):
    res = np.empty((len(bi_grams), len(bi_grams[0])), dtype=object)
    for i, j in np.ndindex(res.shape):
        res[i, j] = bi_grams[i][j]
    return res

print(np.all(f_pp_1(bi_grams) == f_pp_2(bi_grams)))
print(np.all(f_pp_1(bi_grams) == f_hpaulj(bi_grams)))

from timeit import timeit
kwds = dict(globals=globals(), number=1000)

print(timeit('f_pp_1(bi_grams)', **kwds))
print(timeit('f_pp_2(bi_grams)', **kwds))
print(timeit('f_hpaulj(bi_grams)', **kwds))

big = 10000 * bi_grams

print(timeit('f_pp_1(big)', **kwds))
print(timeit('f_pp_2(big)', **kwds))
print(timeit('f_hpaulj(big)', **kwds))

Sample output:

True                      <- same result for
True                      <- different methods
0.004281356999854324      <- frompyfunc          small input
0.002839841999957571      <- prealloc ellipsis   small input
0.02361366100012674       <- prealloc loop       small input
2.153144505               <- frompyfunc          large input
5.152567720999741         <- prealloc ellipsis   large input
33.13142323599959         <- prealloc loop       large input
