What is the most efficient way to create a DataFrame from two unrelated series?

Tags:

I'm looking at creating a Dataframe that is the combination of two unrelated series.

If we take two dataframes:

A = ['a','b','c']
B = [1,2,3,4]

dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)

I'm looking for this output:

One way could be to have loops on the lists direclty and create the DataFrame but there must be a better way. I'm sure I'm missing something from the pandas documentation.

result = []
for i in A:
    for j in B:
        result.append([i,j])

result_DF = pd.DataFrame(result,columns=['A','B'])

Ultimately I'm looking at combining months and UUID, I have something working but it takes ages to compute and relies too much on the index. A generic solution would clearly be better:

from datetime import datetime

start = datetime(year=2016,month=1,day=1)
end = datetime(year=2016,month=4,day=1)
months = pd.DatetimeIndex(start=start,end=end,freq="MS")
benefit = pd.DataFrame(index=months)

A = [UUID('d48259a6-80b5-43ca-906c-8405ab40f9a8'),
   UUID('873a65d7-582c-470e-88b6-0d02df078c04'),
   UUID('624c32a6-9998-49f4-92b6-70e712355073'),
   UUID('7207ab0c-3c7f-477e-b5bc-fbb8059c1dec')]
dfA = pd.DataFrame(A)

result = pd.DataFrame(columns=['A','month'])
for i in dfA.index:
    newdf = pd.DataFrame(index=benefit.index)
    newdf['A'] = dfA.iloc[i,0]
    newdf['month'] = newdf.index
    result = pd.concat([result,newdf])
result

464

asked Jun 10 '16 17:06

ludofet

2 Answers

You can use np.meshgrid:

pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))

    0  1
0   a  1
1   a  2
2   a  3
3   a  4
4   b  1
5   b  2
6   b  3
7   b  4
8   c  1
9   c  2
10  c  3
11  c  4

to get a roughly ~2000x speedup on DataFrame objects of length 300 and 400, respectively:

A = ['a', 'b', 'c'] * 100
B = [1, 2, 3, 4] * 100

dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)

np.meshgrid:

%%timeit
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
100 loops, best of 3: 8.45 ms per loop

vs cross:

%timeit cross(dfA, dfB)
1 loop, best of 3: 16.3 s per loop

So if I understand your example correctly, you could:

A = ['a', 'b', 'c']
dfA = pd.DataFrame(A)

start = datetime(year=2016, month=1, day=1)
end = datetime(year=2016, month=4, day=1)
months = pd.DatetimeIndex(start=start, end=end, freq="MS")
dfB = pd.DataFrame(months.month)

pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))

to also get:

142

answered Oct 20 '22 01:10

Stefan

Using itertools.product:

from itertools import product

result = pd.DataFrame(list(product(dfA.iloc[:,0], dfB.iloc[:,0])))

Not quite as efficient as np.meshgrid, but it's more efficient than the other solutions.

answered Oct 20 '22 00:10

root

Related questions
                            
                                Docker container keeps growing
                            
                                # -*- coding: utf-8 -*- on python3 [duplicate]
                            
                                Why do my nested python class instances become tuples?
                            
                                How to speed up pandas groupby - apply function to be comparable to R's data.table
                            
                                efficiently read one file from a zip containing a lot of files in python
                            
                                Pybind11 Type Error
                            
                                BeagleBone Black OpenCV Python is too slow
                            
                                "SignatureError: Failed to verify signature" - Okta, pySAML2
                            
                                How to see full HTTPS URL in wireShark
                            
                                Bash pass string argument to python script
                            
                                Python xarray.concat then xarray.to_netcdf generates huge new file size
                            
                                Converting hard integral to lambda function with lambdify
                            
                                How do you run a python script from a C++ program?
                            
                                How can I shade an area under a curve between two lines in matplotlib / pandas?
                            
                                Add background image to 3d plot
                            
                                Where does a Python list hold its values?
                            
                                Deploy Django project on RedHat
                            
                                Access json content of http post request with Klein in python
                            
                                Sum the squared difference between 2 Numpy arrays [duplicate]
                            
                                How do I read a CSV file that's Gzipped from URL - Python [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What is the most efficient way to create a DataFrame from two unrelated series?

Tags:

python

pandas

numpy

ludofet

People also ask

2 Answers

Stefan

root

Recent Activity

Donate For Us