Sub-title: Dumb it down pandas, stop trying to be clever.
I have a list (res) of single-column pandas data frames, each containing the same kind of numeric data, but each with a different column name. The row indices have no meaning. I want to put them into a single, very long, single-column data frame.
When I do pd.concat(res) I get one column per input file (and loads and loads of NaN cells). I've tried various values for the parameters (*), but none of them do what I'm after.
Edit: Sample data:
res = [
pd.DataFrame({'A':[1,2,3]}),
pd.DataFrame({'B':[9,8,7,6,5,4]}),
pd.DataFrame({'C':[100,200,300,400]}),
]
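I have an ugly-hack solution: give every data frame the same column name, collect them in a new list, and concatenate that:
newList = []
for r in res:
    # Note: this renames the column in place on the original frame.
    r.columns = ["same"]
    newList.append(r)
pd.concat(newList, ignore_index=True)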
Surely that is not the best way to do it??
BTW, the question "pandas: concat data frame with different column name" is similar, but mine is even simpler, as I don't want the index maintained. (I also start with a list of N single-column data frames, not a single N-column data frame.)
*: E.g. axis=0 is the default behaviour. axis=1 gives an error. join="inner" is just silly (I only get the index). ignore_index=True renumbers the index, but I still get lots of columns and lots of NaNs.
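For readers wondering where the NaNs come from: pd.concat aligns on column labels, so three different names give three columns. Below is a minimal sketch of that, plus a variant of the rename hack that works on copies instead of touching the original frames. The name 'same' is an arbitrary placeholder, and set_axis returning a new frame is assumed (that is how recent pandas versions behave).

```python
import pandas as pd

res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': [100, 200, 300, 400]}),
]

# Default concat appends rows but keeps one column per distinct label,
# so each row has a value in only one of the three columns.
wide = pd.concat(res, ignore_index=True)
print(wide.shape)

# Relabel a copy of each frame so that the columns line up.
# 'same' is just a placeholder name chosen for this sketch.
renamed = [df.set_axis(['same'], axis=1) for df in res]
tall = pd.concat(renamed, ignore_index=True)
print(tall.shape)
```

The frames in res keep their original column names here, which may matter if they are reused later.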
UPDATE for empty lists
I was having problems (with all the given solutions) when the data had an empty list, something like:
res = [
pd.DataFrame({'A':[1,2,3]}),
pd.DataFrame({'B':[9,8,7,6,5,4]}),
pd.DataFrame({'C':[]}),
pd.DataFrame({'D':[100,200,300,400]}),
]
The trick was to force the type by adding .astype('float64'). E.g.
pd.Series(np.concatenate([df.values.ravel().astype('float64') for df in res]))
or:
pd.concat(res,axis=0).astype('float64').stack().reset_index(drop=True)
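A self-contained sketch of the first variant, for anyone who wants to try it. Which dtype the empty frame gets can depend on the pandas version, so the code prints the dtypes rather than assuming one. Using to_numpy() instead of .values is my substitution and is not part of the original fix.

```python
import numpy as np
import pandas as pd

res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': []}),  # the empty frame that caused trouble
    pd.DataFrame({'D': [100, 200, 300, 400]}),
]

# Inspect the column dtype of each frame; the empty one may differ.
print([str(df.dtypes.iloc[0]) for df in res])

# Cast every array to float64 before concatenating, so the result
# has a single numeric dtype regardless of the empty frame.
arrays = [df.to_numpy().ravel().astype('float64') for df in res]
flat = pd.Series(np.concatenate(arrays))
print(flat.dtype, len(flat))
```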
I think you need concat with stack:
print (pd.concat(res, axis=1))
     A  B      C
0  1.0  9  100.0
1  2.0  8  200.0
2  3.0  7  300.0
3  NaN  6  400.0
4  NaN  5    NaN
5  NaN  4    NaN
print (pd.concat(res, axis=1).stack().reset_index(drop=True))
0 1.0
1 9.0
2 100.0
3 2.0
4 8.0
5 200.0
6 3.0
7 7.0
8 300.0
9 6.0
10 400.0
11 5.0
12 4.0
dtype: float64
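Note that the output above is interleaved (1, 9, 100, 2, 8, 200, ...), because stack walks the wide frame row by row, and the integers have become floats because of the NaN padding. If the values should stay grouped by input frame, one possible alternative (not from the answer above, just a sketch) is to concatenate the single columns as Series, which are appended without any alignment on column names. The column name 'value' is a placeholder.

```python
import pandas as pd

res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': [100, 200, 300, 400]}),
]

# Take the only column of each frame as a Series and append them
# end to end; ignore_index=True renumbers the rows 0..n-1.
out = pd.concat([df.iloc[:, 0] for df in res], ignore_index=True)
print(out.tolist())

# Turn the Series back into a one-column data frame if needed.
frame = out.to_frame(name='value')
print(frame.shape)
```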
Another solution with numpy.ravel for flattening:
print (pd.Series(pd.concat(res, axis=1).values.ravel()).dropna())
0 1.0
1 9.0
2 100.0
3 2.0
4 8.0
5 200.0
6 3.0
7 7.0
8 300.0
10 6.0
11 400.0
13 5.0
16 4.0
dtype: float64
print (pd.DataFrame(pd.concat(res, axis=1).values.ravel(), columns=['col']).dropna())
col
0 1.0
1 9.0
2 100.0
3 2.0
4 8.0
5 200.0
6 3.0
7 7.0
8 300.0
10 6.0
11 400.0
13 5.0
16 4.0
Solution with list comprehension:
print (pd.Series(np.concatenate([df.values.ravel() for df in res])))
0 1
1 2
2 3
3 9
4 8
5 7
6 6
7 5
8 4
9 100
10 200
11 300
12 400
dtype: int64
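Since the question asks for a single-column data frame rather than a Series, the list-comprehension approach could be wrapped in a small helper along these lines. flatten_frames, its name parameter, and its dtype parameter are all hypothetical names made up for this sketch.

```python
import numpy as np
import pandas as pd

def flatten_frames(frames, name='value', dtype=None):
    """Concatenate the values of all frames into one single-column DataFrame.

    Hypothetical helper. Raises ValueError (from np.concatenate) if
    `frames` is an empty list.
    """
    values = np.concatenate([df.to_numpy().ravel() for df in frames])
    if dtype is not None:
        # Optional cast, e.g. 'float64' when some frames are empty.
        values = values.astype(dtype)
    return pd.DataFrame({name: values})

res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': [100, 200, 300, 400]}),
]

result = flatten_frames(res)
print(result.shape)
```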
I would use a list comprehension such as:
import pandas as pd
res = [
pd.DataFrame({'A':[1,2,3]}),
pd.DataFrame({'B':[9,8,7,6,5,4]}),
pd.DataFrame({'C':[100,200,300,400]}),
]
x = []
[x.extend(df.values.tolist()) for df in res]
pd.DataFrame(x)
Out[49]:
0
0 1
1 2
2 3
3 9
4 8
5 7
6 6
7 5
8 4
9 100
10 200
11 300
12 400
I tested speed for you.
%timeit x = []; [x.extend(df.values.tolist()) for df in res]; pd.DataFrame(x)
10000 loops, best of 3: 196 µs per loop
%timeit pd.Series(pd.concat(res, axis=1).values.ravel()).dropna()
1000 loops, best of 3: 920 µs per loop
%timeit pd.concat(res, axis=1).stack().reset_index(drop=True)
1000 loops, best of 3: 902 µs per loop
%timeit pd.DataFrame(pd.concat(res, axis=1).values.ravel(), columns=['col']).dropna()
1000 loops, best of 3: 1.07 ms per loop
%timeit pd.Series(np.concatenate([df.values.ravel() for df in res]))
10000 loops, best of 3: 70.2 µs per loop
It looks like pd.Series(np.concatenate([df.values.ravel() for df in res])) is the fastest.
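One caveat: with only 13 values in total, these timings mostly measure fixed per-call overhead. A rough sketch for repeating the comparison on larger inputs with the standard timeit module is below. The data size (20 frames of 10,000 rows), the column names, and the function names are arbitrary assumptions, and the numbers will differ between machines and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up test data: 20 single-column frames of 10,000 floats each.
rng = np.random.default_rng(0)
big = [pd.DataFrame({f'c{i}': rng.random(10_000)}) for i in range(20)]

def via_numpy():
    # The np.concatenate approach from above.
    return pd.Series(np.concatenate([df.values.ravel() for df in big]))

def via_stack():
    # The concat(axis=1) + stack approach from above.
    return pd.concat(big, axis=1).stack().reset_index(drop=True)

runs = 20
for fn in (via_numpy, via_stack):
    seconds = timeit.timeit(fn, number=runs)
    print(f'{fn.__name__}: {seconds / runs * 1e3:.3f} ms per call')
```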