
How can I construct a Pandas DataFrame from individual 1D Numpy arrays without copying

Unlike every other question I can find, I do not want to create a DataFrame from a homogeneous Numpy array, nor do I want to convert a structured array into a DataFrame.

What I want is to create a DataFrame from individual 1D Numpy arrays, one per column. I tried the obvious DataFrame({"col1": nparray1, "col2": nparray2}), but this shows up at the top of my profile, so it must be doing something really slow.

It is my understanding that Pandas DataFrames are implemented in pure Python, where each column is backed by a Numpy array, so I would think there is an efficient way to do it.

What I'm actually trying to do is to fill a DataFrame efficiently from Cython. Cython has memoryviews that allow efficient access to Numpy arrays. So my strategy is to allocate a Numpy array, fill it with data and then put it in a DataFrame.

The opposite direction works fine: creating a memoryview from a Pandas DataFrame column. So if there is a way to preallocate the entire DataFrame and then just pass the columns to Cython, that is also acceptable.

cdef int32_t[:] data_in = df['data_in'].to_numpy(dtype="int32")

A section of the profile of my code looks like this, where everything the code does is completely dwarfed by creating the DataFrame at the end.

         1100546 function calls (1086282 primitive calls) in 4.345 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.345    4.345 profile:0(<code object <module> at 0x7f4e693d1c90, file "test.py", line 1>)
    445/1    0.029    0.000    4.344    4.344 :0(exec)
        1    0.006    0.006    4.344    4.344 test.py:1(<module>)
     1000    0.029    0.000    2.678    0.003 :0(run_df)
     1001    0.017    0.000    2.551    0.003 frame.py:378(__init__)
     1001    0.018    0.000    2.522    0.003 construction.py:170(init_dict)

Corresponding code:

def run_df(self, df):
    cdef int arx_rows = len(df)
    cdef int arx_idx

    cdef int32_t[:] data_in = df['data_in'].to_numpy(dtype="int32")

    data_out_np = np.zeros(arx_rows, dtype="int32")
    cdef int32_t[:] data_out = data_out_np

    for arx_idx in range(arx_rows):
        self.cpp_sec_par.run(data_in[arx_idx], data_out[arx_idx])

    return pd.DataFrame({
        'data_out': data_out_np,
    })
Asked by Pepijn on Mar 04 '19 at 11:03


2 Answers

pandas.DataFrame({"col": nparray, "col": nparray})

This works if you pass list(nparray) instead. Here's a generic example:

import numpy as np
import pandas as pd

alpha = np.array([1, 2, 3])
beta = np.array([4, 5, 6])
gamma = np.array([7, 8, 9])

dikt = {"Alpha": list(alpha), "Beta": list(beta), "Gamma": list(gamma)}

data_frame = pd.DataFrame(dikt)
print(data_frame)
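Note that list(nparray) builds a new Python list, so this route copies the data rather than avoiding a copy. The sketch below is one way to check which construction shares memory with the source array. It assumes pandas 1.3 or later, where the DataFrame constructor accepts copy=False for dict input. Whether the copy=False result really shares memory can depend on the pandas version and on Copy-on-Write settings, so the sketch prints the answer instead of assuming it. The helper name shares is made up for illustration.

```python
import numpy as np
import pandas as pd

alpha = np.array([1, 2, 3], dtype="int32")
beta = np.array([4, 5, 6], dtype="int32")


def shares(df, column, source):
    """Return True if the column's data overlaps the source array in memory."""
    return np.shares_memory(df[column].to_numpy(), source)


# Via Python lists: the values are copied into newly allocated arrays.
df_list = pd.DataFrame({"Alpha": list(alpha), "Beta": list(beta)})

# Via the arrays directly, asking pandas not to copy (assumes pandas >= 1.3).
df_nocopy = pd.DataFrame({"Alpha": alpha, "Beta": beta}, copy=False)

print("list route shares memory:", shares(df_list, "Alpha", alpha))
print("copy=False shares memory:", shares(df_nocopy, "Alpha", alpha))
```

If the second line prints True on your installation, the dict-of-arrays form with copy=False is probably the closest match to what the question asks for.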
Answered by Athanasios Tsiaras on Sep 20 '22 at 20:09


I don't think this fully answers the question but it might help.

1- When you initialize a DataFrame directly from a 2D array, a copy is not made.

2- You don't have a 2D array, though; you have 1D arrays. I don't know how to get a 2D array from 1D arrays without making copies.

To illustrate the points, see below:

import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array((a, b))  # stacks a and b into a new 2D array
df = pd.DataFrame(c)

print(c)
[[1 2 3]
 [4 5 6]]

print(df)
   0  1  2
0  1  2  3
1  4  5  6

c[1,1]=10
print(df)
   0   1  2
0  1   2  3
1  4  10  6

So changing c does change df. However, changing a or b does not affect c (or df), because np.array((a, b)) copied their contents into c.
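One way to end up with 1D arrays that do stay connected to the DataFrame is to reverse the order: allocate the 2D array first, then take each column as a 1D view into it. The following is a minimal sketch of that idea, not a tested solution to the question. It assumes all columns share one dtype, uses Fortran order so that each column is contiguous, and passes copy=False as a request that the installed pandas may or may not honour. The names buf, data_in and data_out are illustrative.

```python
import numpy as np
import pandas as pd

n_rows = 4

# One 2D buffer for all columns; order="F" keeps each column contiguous.
buf = np.zeros((n_rows, 2), dtype="int32", order="F")

# 1D views into the buffer: writing to them writes into buf.
data_in = buf[:, 0]
data_out = buf[:, 1]

data_in[:] = np.arange(n_rows)
data_out[:] = data_in * 10  # stand-in for the loop that fills the output

# Wrap the buffer; copy=False asks pandas to reuse it rather than copy it.
df = pd.DataFrame(buf, columns=["data_in", "data_out"], copy=False)

print(df)
print("columns are views of buf:", np.shares_memory(data_out, buf))
print("df shares memory with buf:", np.shares_memory(df.to_numpy(), buf))
```

Each column view could then be handed to a Cython memoryview such as int32_t[:] in the same way data_out_np is in the question.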

Answered by user2677285 on Sep 19 '22 at 20:09