I'm looking at creating a Dataframe that is the combination of two unrelated series.
If we take two dataframes:
A = ['a','b','c']
B = [1,2,3,4]
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
I'm looking for this output:
A B
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
One way could be to have loops on the lists direclty and create the DataFrame but there must be a better way. I'm sure I'm missing something from the pandas documentation.
result = []
for i in A:
for j in B:
result.append([i,j])
result_DF = pd.DataFrame(result,columns=['A','B'])
Ultimately I'm looking at combining months and UUID, I have something working but it takes ages to compute and relies too much on the index. A generic solution would clearly be better:
from datetime import datetime
start = datetime(year=2016,month=1,day=1)
end = datetime(year=2016,month=4,day=1)
months = pd.DatetimeIndex(start=start,end=end,freq="MS")
benefit = pd.DataFrame(index=months)
A = [UUID('d48259a6-80b5-43ca-906c-8405ab40f9a8'),
UUID('873a65d7-582c-470e-88b6-0d02df078c04'),
UUID('624c32a6-9998-49f4-92b6-70e712355073'),
UUID('7207ab0c-3c7f-477e-b5bc-fbb8059c1dec')]
dfA = pd.DataFrame(A)
result = pd.DataFrame(columns=['A','month'])
for i in dfA.index:
newdf = pd.DataFrame(index=benefit.index)
newdf['A'] = dfA.iloc[i,0]
newdf['month'] = newdf.index
result = pd.concat([result,newdf])
result
Series is a type of list in pandas which can take integer values, string values, double values and more. But in Pandas Series we return an object in the form of list, having index starting from 0 to n, Where n is the length of values in series.
You can use np.meshgrid
:
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
to get a roughly ~2000x
speedup on DataFrame
objects of length 300
and 400
, respectively:
A = ['a', 'b', 'c'] * 100
B = [1, 2, 3, 4] * 100
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
np.meshgrid
:
%%timeit
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
100 loops, best of 3: 8.45 ms per loop
vs cross
:
%timeit cross(dfA, dfB)
1 loop, best of 3: 16.3 s per loop
So if I understand your example correctly, you could:
A = ['a', 'b', 'c']
dfA = pd.DataFrame(A)
start = datetime(year=2016, month=1, day=1)
end = datetime(year=2016, month=4, day=1)
months = pd.DatetimeIndex(start=start, end=end, freq="MS")
dfB = pd.DataFrame(months.month)
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
to also get:
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
Using itertools.product
:
from itertools import product
result = pd.DataFrame(list(product(dfA.iloc[:,0], dfB.iloc[:,0])))
Not quite as efficient as np.meshgrid
, but it's more efficient than the other solutions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With