Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the most efficient way to create a DataFrame from two unrelated series?

I'm looking at creating a Dataframe that is the combination of two unrelated series.

If we take two dataframes:

A = ['a','b','c']
B = [1,2,3,4]

dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)

I'm looking for this output:

    A  B
0   a  1
1   a  2
2   a  3
3   a  4
4   b  1
5   b  2
6   b  3
7   b  4
8   c  1
9   c  2
10  c  3
11  c  4

One way could be to have loops on the lists direclty and create the DataFrame but there must be a better way. I'm sure I'm missing something from the pandas documentation.

result = []
for i in A:
    for j in B:
        result.append([i,j])

result_DF = pd.DataFrame(result,columns=['A','B'])

Ultimately I'm looking at combining months and UUID, I have something working but it takes ages to compute and relies too much on the index. A generic solution would clearly be better:

from datetime import datetime

start = datetime(year=2016,month=1,day=1)
end = datetime(year=2016,month=4,day=1)
months = pd.DatetimeIndex(start=start,end=end,freq="MS")
benefit = pd.DataFrame(index=months)

A = [UUID('d48259a6-80b5-43ca-906c-8405ab40f9a8'),
   UUID('873a65d7-582c-470e-88b6-0d02df078c04'),
   UUID('624c32a6-9998-49f4-92b6-70e712355073'),
   UUID('7207ab0c-3c7f-477e-b5bc-fbb8059c1dec')]
dfA = pd.DataFrame(A)

result = pd.DataFrame(columns=['A','month'])
for i in dfA.index:
    newdf = pd.DataFrame(index=benefit.index)
    newdf['A'] = dfA.iloc[i,0]
    newdf['month'] = newdf.index
    result = pd.concat([result,newdf])
result
like image 464
ludofet Avatar asked Jun 10 '16 17:06

ludofet


People also ask

Can Pandas DataFrame be created using series?

Series is a type of list in pandas which can take integer values, string values, double values and more. But in Pandas Series we return an object in the form of list, having index starting from 0 to n, Where n is the length of values in series.


2 Answers

You can use np.meshgrid:

pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))

    0  1
0   a  1
1   a  2
2   a  3
3   a  4
4   b  1
5   b  2
6   b  3
7   b  4
8   c  1
9   c  2
10  c  3
11  c  4

to get a roughly ~2000x speedup on DataFrame objects of length 300 and 400, respectively:

A = ['a', 'b', 'c'] * 100
B = [1, 2, 3, 4] * 100

dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)

np.meshgrid:

%%timeit
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
100 loops, best of 3: 8.45 ms per loop

vs cross:

%timeit cross(dfA, dfB)
1 loop, best of 3: 16.3 s per loop

So if I understand your example correctly, you could:

A = ['a', 'b', 'c']
dfA = pd.DataFrame(A)

start = datetime(year=2016, month=1, day=1)
end = datetime(year=2016, month=4, day=1)
months = pd.DatetimeIndex(start=start, end=end, freq="MS")
dfB = pd.DataFrame(months.month)

pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))

to also get:

    0  1
0   a  1
1   a  2
2   a  3
3   a  4
4   b  1
5   b  2
6   b  3
7   b  4
8   c  1
9   c  2
10  c  3
11  c  4
like image 142
Stefan Avatar answered Oct 20 '22 01:10

Stefan


Using itertools.product:

from itertools import product

result = pd.DataFrame(list(product(dfA.iloc[:,0], dfB.iloc[:,0])))

Not quite as efficient as np.meshgrid, but it's more efficient than the other solutions.

like image 26
root Avatar answered Oct 20 '22 00:10

root