Is there an easy way to build a pandas DataFrame from a list of numpy arrays so that each array becomes a column? By default the arrays become the rows, and I don't understand why. Here is a quick example:
names = ['data1', 'data2', 'data3']
data = [np.arange(10) for _ in names]
df = pd.DataFrame(data=data, columns=names)
This gives an error, indicating pandas expects 10 columns.
If I do
df = pd.DataFrame(data=data)
I get a DataFrame with 10 columns and 3 rows.
Since appending rows to a DataFrame is generally much harder than appending columns, this behavior puzzles me. For example, if I quickly want to add a fourth data array to the DataFrame, the data needs to be organized in columns so that I can do
df['data4'] = new_array
How can I quickly build the DataFrame I want?
As @MaxGhenis pointed out in the comments, from_items is deprecated as of version 0.23. The link suggests using from_dict instead, so the old answer can be modified to:
pd.DataFrame.from_dict(dict(zip(names, data)))
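For completeness, here is a self-contained sketch of the same call with the imports and the example data from the question. It assumes a reasonably recent pandas where from_dict is available; the shape and column checks are only there to show that each array ended up as a column.

```python
import numpy as np
import pandas as pd

names = ['data1', 'data2', 'data3']
data = [np.arange(10) for _ in names]

# Pair each name with its array; every dict value becomes one column.
df = pd.DataFrame.from_dict(dict(zip(names, data)))

print(df.shape)          # (10, 3)
print(list(df.columns))  # ['data1', 'data2', 'data3']
```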
--------------------------------------------------OLD ANSWER-------------------------------------------------------------
I would use .from_items:
pd.DataFrame.from_items(zip(names, data))
which gives
   data1  data2  data3
0      0      0      0
1      1      1      1
2      2      2      2
3      3      3      3
4      4      4      4
5      5      5      5
6      6      6      6
7      7      7      7
8      8      8      8
9      9      9      9
That should also be faster than transposing:
%timeit pd.DataFrame.from_items(zip(names, data))
1000 loops, best of 3: 281 µs per loop
%timeit pd.DataFrame(data, index=names).T
1000 loops, best of 3: 730 µs per loop
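The timings above come from IPython's %timeit and an old pandas version. A rough way to repeat the comparison outside IPython is sketched below with the standard timeit module. Since from_items is gone from newer pandas releases, this sketch swaps in the plain dict constructor; the helper names via_dict and via_transpose are made up for illustration, and the numbers will differ by machine and version.

```python
import timeit

import numpy as np
import pandas as pd

names = ['data1', 'data2', 'data3']
data = [np.arange(10) for _ in names]


def via_dict():
    # Build the frame column by column from a name -> array mapping.
    return pd.DataFrame(dict(zip(names, data)))


def via_transpose():
    # Build the frame with arrays as rows, then flip rows and columns.
    return pd.DataFrame(data, index=names).T


# Average seconds per call over 200 runs; treat the result as indicative only.
for func in (via_dict, via_transpose):
    seconds = timeit.timeit(func, number=200) / 200
    print(f"{func.__name__}: {seconds * 1e6:.0f} µs per call")
```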
Adding a fourth column is then also fairly simple:
df['data4'] = range(1, 11)
which gives
   data1  data2  data3  data4
0      0      0      0      1
1      1      1      1      2
2      2      2      2      3
3      3      3      3      4
4      4      4      4      5
5      5      5      5      6
6      6      6      6      7
7      7      7      7      8
8      8      8      8      9
9      9      9      9     10
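As mentioned by @jezrael in the comments, a third option is to pass a dict straight to the constructor. Beware that a plain dict does not guarantee the column order on older Python versions, which is why columns=names is passed as well: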
pd.DataFrame(dict(zip(names, data)), columns=names)
Timing:
%timeit pd.DataFrame(dict(zip(names, data)))
1000 loops, best of 3: 281 µs per loop
from_items is now deprecated. Use from_dict instead:
df = pd.DataFrame.from_dict({
    'data1': np.arange(10),
    'data2': np.arange(10),
    'data3': np.arange(10)
})
This returns:
   data1  data2  data3
0      0      0      0
1      1      1      1
2      2      2      2
3      3      3      3
4      4      4      4
5      5      5      5
6      6      6      6
7      7      7      7
8      8      8      8
9      9      9      9
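As a side note on the rows-versus-columns question, from_dict also takes an orient argument. The sketch below (variable names are my own) shows that the default treats each dict value as a column, while orient='index' treats each value as a row, which matches the layout the question started from.

```python
import numpy as np
import pandas as pd

arrays = {name: np.arange(10) for name in ['data1', 'data2', 'data3']}

# Default orient='columns': each dict key becomes a column label.
by_column = pd.DataFrame.from_dict(arrays)

# orient='index': each dict key becomes a row label instead.
by_row = pd.DataFrame.from_dict(arrays, orient='index')

print(by_column.shape)  # (10, 3)
print(by_row.shape)     # (3, 10)
```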