
Efficiently create arrays from the next n elements of an array

Short version:

I'm trying to efficiently create an array like x:

input = [0, 1, 2, 3, 4, 5, 6]

x = [ [0,1,2], [1,2,3], [2,3,4], [3,4,5], [4,5,6] ]

I've tried a simple for loop, and it takes too long for the real use case.

Long version:

(extends short version)

I've got a 400k-row dataframe that I need to partition into arrays holding the next n elements, starting from the element currently iterated over. Currently I group it as shown below in the process_data function.

A simple for-based iteration takes forever here (2.5 min on my hardware, to be specific). I've searched the itertools and pandas documentation, tried searching here too, and couldn't find a fitting solution.

My current, very time-consuming implementation:

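import numpy as np

class ModelInputParsing(object):
    def __init__(self, data):
        # Replace NaNs (e.g. the first "Changes" value) with 0
        self.parsed_dataframe = data.fillna(0)

    def process_data(self, lb=50):
        self.X, self.Y = [], []
        for i in range(len(self.parsed_dataframe) - lb):
            # lb consecutive values from the second-to-last column
            self.X.append(self.parsed_dataframe.iloc[i:(i + lb), -2])
            # target: the last-column value right after the window
            self.Y.append(self.parsed_dataframe.iloc[(i + lb), -1])
        return (np.array(self.X), np.array(self.Y))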

The input data looks like this (where Bid is the mentioned input):

    Bid     Changes     Expected
0   1.20102 NaN         0.000000
1   1.20102 0.000000    0.000000
2   1.20102 0.000000    0.000042
3   1.20102 0.000000    0.000017
4   1.20102 0.000000    0.000025
5   1.20102 0.000000    0.000025
6   1.20102 0.000000    0.000100
...

And the output should look like this:

array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          8.34465027e-06,  -8.34465027e-06,   0.00000000e+00],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
         -8.34465027e-06,   0.00000000e+00,   3.33786011e-05],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   3.33786011e-05,   0.00000000e+00],
       ..., 
       [  0.00000000e+00,   8.34465027e-06,   1.66893005e-05, ...,
         -8.34465027e-06,   0.00000000e+00,   0.00000000e+00],
       [  8.34465027e-06,   1.66893005e-05,  -8.34465027e-06, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.66893005e-05,  -8.34465027e-06,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.66893005e-05]], dtype=float32)
len(x)
399950

Below I've presented x[0] and x[1]. The key point is how the values move one position back in the next array. For example, the first non-zero value moves from position 7 to position 6 (0-based).

The first element:

x[0]
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -4.16040421e-05,   2.49147415e-05,
        -8.34465027e-06,   0.00000000e+00,  -7.49230385e-05,
         ...,
         2.50339508e-05,  -8.34465027e-06,   3.33786011e-05,
        -2.50339508e-05,  -8.34465027e-06,   8.34465027e-06,
        -8.34465027e-06,   0.00000000e+00], dtype=float32)
len(x[0])
50

The second element:

x[1]
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -4.16040421e-05,   2.49147415e-05,  -8.34465027e-06,
         0.00000000e+00,  -7.49230385e-05,  -1.58131123e-04,
         ....,
        -8.34465027e-06,   3.33786011e-05,  -2.50339508e-05,
        -8.34465027e-06,   8.34465027e-06,  -8.34465027e-06,
         0.00000000e+00,   3.33786011e-05], dtype=float32)
len(x[1])
50

I'm curious whether there is a way to get this done more efficiently, as I'm planning to parse datasets of 20M+ rows soon.

trust512 asked Apr 23 '18 18:04



3 Answers

zip() plus some slicing can do that:

>>> list(zip(input[0:], input[1:], input[2:]))
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]

If you need the inner elements to be lists, use this:

>>> list(map(list, zip(input[0:], input[1:], input[2:])))
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

In general, if you need n-tuples instead of triples, you can do:

>>> list(zip(*(input[i:] for i in range(3))))
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]

or

>>> list(map(list, zip(*(input[i:] for i in range(3)))))
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

Another way to do it:

>>> [input[i:i+3] for i in range(len(input)-3+1)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
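
The snippets above hard-code a window size of 3. As a minimal sketch, the slicing approach could be wrapped in a helper that takes the window size as a parameter. The name windows and its signature are made up for illustration and are not part of the answer.

```python
def windows(seq, n):
    """Return every run of n consecutive items of seq as a list of lists."""
    if n <= 0:
        raise ValueError("n must be positive")
    # len(seq) - n + 1 windows fit; range() is empty when seq is shorter than n
    return [list(seq[i:i + n]) for i in range(len(seq) - n + 1)]


data = [0, 1, 2, 3, 4, 5, 6]
print(windows(data, 3))
# [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

This still builds one Python list per window, so it is a convenience wrapper rather than a speed-up.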

Some benchmarks:

Setup:

import timeit

def ff1(input):
    return list(map(list, zip(input[0:], input[1:], input[2:])))

def ff2(input):
    return list(map(list, zip(*(input[i:] for i in range(3)))))

def ff3(input):
    return [input[i:i+3] for i in range(len(input)-3+1)]

def jg(input):
    for i in range(0, len(input) - 2):
        yield input[i:i+3]

def jg1(input):
    return list(jg(input))

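import itertools

def n(input, n=3):
    # n independent iterators over the same input
    i = list(itertools.tee(input, n))
    for p, it in enumerate(i):
        # advance the p-th iterator by p items (the itertools "consume" recipe)
        next(itertools.islice(it, p, p), None)
    return zip(*i)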

def n1(input, _n=3):
    return list(map(list, n(input, _n)))

import numpy as np
from numpy.lib.stride_tricks import as_strided

def strided_groupby(n, l=3):
    # n: 1-D NumPy array; returns a view of overlapping length-l windows (no copy)
    s = n.strides[0]
    return as_strided(n, shape=(n.size-l+1,l), strides=(s,s))

Results:

>>> input = list(range(10000))
>>> timeit.timeit(stmt='ff1(input)', globals=globals(), number=1000)
1.4750333260162733
>>> timeit.timeit(stmt='ff2(input)', globals=globals(), number=1000)
1.486136345018167
>>> timeit.timeit(stmt='ff3(input)', globals=globals(), number=1000)
1.6864491199958138
>>> timeit.timeit(stmt='jg1(input)', globals=globals(), number=1000)
2.300399674975779
>>> timeit.timeit(stmt='n1(input)', globals=globals(), number=1000)
2.2269885840360075
>>> input_arr = np.array(input)
>>> timeit.timeit(stmt='strided_groupby(input_arr)', globals=globals(), number=1000)
0.01855822204379365

Note that the inner list conversion wastes a significant number of CPU cycles. If you can afford tuples instead of lists as the innermost sequences (i.e. (0,1,2), (1,2,3), ...), that is going to perform better.

For fairness of comparison I applied the same list conversion to all the pure-Python algorithms; strided_groupby returns a NumPy view and does no list conversion.
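
To connect the strided result back to the question, here is a rough sketch of how process_data might be vectorized. It assumes NumPy 1.20 or newer, where numpy.lib.stride_tricks.sliding_window_view is available as a bounds-checked alternative to as_strided, and it works on plain arrays standing in for the Changes and Expected columns. The function name make_windows and the sample values are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(changes, expected, lb=50):
    """Build (X, Y) like the loop in the question, without a Python loop.

    X[i] holds changes[i:i+lb]; Y[i] is expected[i+lb].
    """
    changes = np.asarray(changes)
    expected = np.asarray(expected)
    # All length-lb windows; the last one has no following target, so drop it
    X = sliding_window_view(changes, lb)[:-1]
    Y = expected[lb:]
    return X, Y


# Tiny stand-in for the dataframe columns (made-up values)
changes = np.arange(7, dtype=np.float32)
expected = np.arange(7, dtype=np.float32) * 10
X, Y = make_windows(changes, expected, lb=3)
print(X.shape, Y.shape)  # (4, 3) (4,)
```

X is a read-only view into changes, so no window data is copied; X.copy() gives an independent array if later code needs to modify it. With a dataframe, the two inputs would presumably come from something like df.iloc[:, -2].to_numpy() and df.iloc[:, -1].to_numpy().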

fferri answered Sep 30 '22 07:09


If you are using numpy or pandas, you can use strides, as @miradulo suggested. You need to be really careful with them, though: vectorized operations on strided views can give very unexpected results. Still, miradulo is right that it should be incredibly fast.

Here is an example implementation:

from numpy.lib.stride_tricks import as_strided

def strided_groupby(n, l):
    s = n.strides[0]
    return as_strided(n, shape=(n.size-l+1,l), strides=(s,s))

Adapted from the documentation here: scipy-strides

The output looks like:

[[0 1 2]
 [1 2 3]
 [2 3 4]
 [3 4 5]
 [4 5 6]]
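
The warning about unexpected results can be made concrete with a small sketch. The rows returned by as_strided overlap in memory, so a write through one row shows up in its neighbours and in the original array. The variable names below are illustrative; the writeable argument is part of the as_strided signature in current NumPy releases.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

base = np.arange(7)
s = base.strides[0]
view = as_strided(base, shape=(5, 3), strides=(s, s))

# Writing one cell of the view changes the shared underlying element
view[0, 2] = 99
print(base)      # element 2 of base is now 99
print(view[:3])  # 99 appears in rows 0, 1 and 2

# A read-only view turns such accidental writes into an error
safe = as_strided(base, shape=(5, 3), strides=(s, s), writeable=False)
try:
    safe[0, 0] = -1
except ValueError as exc:
    print("write rejected:", exc)

# An independent copy can be modified freely, at the cost of extra memory
independent = view.copy()
independent[0, 0] = -1  # base is unaffected
```

Note also that as_strided does no bounds checking, so a shape or stride that does not match the array can read memory outside it.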

Edit: on my machine I got the following results:

>>> timeit.timeit(stmt='ff1(n)', globals=globals(), number=1000)
0.2299177199965925

>>> timeit.timeit(stmt='strided_groupby(n, 3)', globals=globals(), number=1000)
0.012110635001590708

That is roughly a 19x difference.

Grant Williams answered Sep 30 '22 07:09


Is this what you called inefficient?

def answer(data):
    return [[data[k], data[k+1], data[k+2]] for k in range(len(data)-2)]

Andrija Radica answered Sep 30 '22 06:09