
Build pandas data frame from list of numpy arrays

I wonder if there is an easy way to build a pandas DataFrame from a list of numpy arrays, where each array becomes a column. The default behavior seems to treat the arrays as rows, and I don't understand why. Here is a quick example:

import numpy as np
import pandas as pd

names = ['data1', 'data2', 'data3']
data = [np.arange(10) for _ in names]
df = pd.DataFrame(data=data, columns=names)

This gives an error, indicating pandas expects 10 columns.

If I do

df = pd.DataFrame(data=data)

I get a DataFrame with 10 columns and 3 rows.
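A minimal sketch that reproduces both observations, assuming a reasonably recent pandas and NumPy. Each array in the list is treated as one row, so the frame has shape (3, 10), and three names cannot label ten columns. The exact exception type is an assumption, since it has differed between pandas versions, so the sketch catches both likely candidates.

```python
import numpy as np
import pandas as pd

names = ['data1', 'data2', 'data3']
data = [np.arange(10) for _ in names]

# Each array becomes one row: 3 arrays of length 10 give 3 rows and 10 columns.
df_rows = pd.DataFrame(data=data)
print(df_rows.shape)

# Three names cannot label ten columns, so construction fails.
# Assumption: the error is one of the two exception types caught here.
try:
    pd.DataFrame(data=data, columns=names)
    failed = False
except (ValueError, AssertionError) as exc:
    failed = True
    print(type(exc).__name__, exc)
```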

Given that it is generally much more difficult to append rows than columns to a DataFrame, this behavior puzzles me. For example, if I quickly want to put a fourth data array into the DataFrame, I want the data organized in columns so that I can do

df['data4'] = new_array

How can I quickly build the DataFrame I want?

Whir asked Mar 22 '17 13:03


2 Answers

As @MaxGhenis pointed out in the comments, from_items is deprecated as of version 0.23 (and was removed in pandas 1.0). The deprecation notice suggests using from_dict instead, so the old answer can be modified to:

pd.DataFrame.from_dict(dict(zip(names, data)))
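For completeness, here is the one-liner above as a self-contained sketch, reusing the names and data from the question. The printed shape and column list are only there to confirm that the arrays ended up as columns.

```python
import numpy as np
import pandas as pd

names = ['data1', 'data2', 'data3']
data = [np.arange(10) for _ in names]

# Pair each name with its array; each dict value becomes one column.
df = pd.DataFrame.from_dict(dict(zip(names, data)))

print(df.shape)          # (rows, columns)
print(list(df.columns))
```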

--------------------------------------------------OLD ANSWER-------------------------------------------------------------

I would use .from_items:

pd.DataFrame.from_items(zip(names, data))

which gives

   data1  data2  data3
0      0      0      0
1      1      1      1
2      2      2      2
3      3      3      3
4      4      4      4
5      5      5      5
6      6      6      6
7      7      7      7
8      8      8      8
9      9      9      9

That should also be faster than transposing:

%timeit pd.DataFrame.from_items(zip(names, data))

1000 loops, best of 3: 281 µs per loop

%timeit pd.DataFrame(data, index=names).T

1000 loops, best of 3: 730 µs per loop
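The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. Since from_items no longer exists in current pandas, this sketch times the dict-based constructor against the transpose instead; the helper function names and the repeat count are my own choices, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

names = ['data1', 'data2', 'data3']
data = [np.arange(10) for _ in names]

def build_from_dict():
    # Columns built directly from a name -> array mapping.
    return pd.DataFrame(dict(zip(names, data)))

def build_by_transposing():
    # Arrays become rows first, then the frame is transposed.
    return pd.DataFrame(data, index=names).T

# Sanity check: both approaches hold the same values.
same_values = np.array_equal(build_from_dict().to_numpy(),
                             build_by_transposing().to_numpy())

n = 200  # number of calls per measurement (arbitrary choice)
t_dict = timeit.timeit(build_from_dict, number=n) / n
t_transpose = timeit.timeit(build_by_transposing, number=n) / n
print(f"dict:      {t_dict * 1e6:.0f} us per call")
print(f"transpose: {t_transpose * 1e6:.0f} us per call")
```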

Adding a fourth column is then also fairly simple:

df['data4'] = range(1, 11)

which gives

   data1  data2  data3  data4
0      0      0      0      1
1      1      1      1      2
2      2      2      2      3
3      3      3      3      4
4      4      4      4      5
5      5      5      5      6
6      6      6      6      7
7      7      7      7      8
8      8      8      8      9
9      9      9      9     10
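One thing to keep in mind when adding columns this way is that the new values need one entry per existing row. A small sketch, assuming the same three-column frame as above (data5 is a made-up column name used only to show the failure case):

```python
import numpy as np
import pandas as pd

names = ['data1', 'data2', 'data3']
df = pd.DataFrame({name: np.arange(10) for name in names})

# A new column needs one value per existing row (10 here).
df['data4'] = np.arange(1, 11)

# A length mismatch is rejected rather than silently padded.
try:
    df['data5'] = np.arange(5)  # hypothetical column, wrong length on purpose
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(list(df.columns))
```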

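As mentioned by @jezrael in the comments, a third option is to pass a dict to the constructor. Beware that without columns=names the column order is not guaranteed on older Python and pandas versions, where dicts are unordered; passing columns=names fixes the order explicitly: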

pd.DataFrame(dict(zip(names, data)), columns=names)

Timing:

%timeit pd.DataFrame(dict(zip(names, data)))

1000 loops, best of 3: 281 µs per loop
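To illustrate what the columns argument does when the data is a dict, here is a small sketch. The arrays are given distinct values and the requested order (the variable wanted) is deliberately different from the dict's order; both are my own choices for the illustration.

```python
import numpy as np
import pandas as pd

names = ['data1', 'data2', 'data3']
# Distinct values per column so the ordering is visible.
data = [np.arange(10) * k for k in (1, 2, 3)]
mapping = dict(zip(names, data))

# columns= selects the dict keys and fixes their order explicitly.
wanted = ['data3', 'data1', 'data2']
df = pd.DataFrame(mapping, columns=wanted)

print(list(df.columns))
```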

Cleb answered Oct 04 '22 03:10


from_items is now deprecated. Use from_dict instead:

df = pd.DataFrame.from_dict({
  'data1': np.arange(10),
  'data2': np.arange(10),
  'data3': np.arange(10)
})

This returns:

   data1  data2  data3
0      0      0      0
1      1      1      1
2      2      2      2
3      3      3      3
4      4      4      4
5      5      5      5
6      6      6      6
7      7      7      7
8      8      8      8
9      9      9      9

Lak answered Oct 04 '22 02:10