Best way to join two large datasets in Pandas

I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500 MB when I store it as CSV. Separately they fit into memory, but when I load both at once I sometimes get a memory error, and I definitely get into trouble when I try to merge them with pandas.

What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open source software on my computer if that helps. Ideally I would still like to solve it in pandas only, but I'm not sure if that is possible at all.

To clarify: by merging I mean an outer join. Each table has two columns: product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables. I do that with:

pd.merge(df1, df2, left_on=['product', 'version'], right_on=['product', 'version'], how='outer')
Asked Jun 10 '16 by Nickpick


2 Answers

This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.

import dask.dataframe as dd

# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])

# Write the output.
df.to_csv('file3.csv', index=False)
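
Since the goal is specifically to see which product/version pairs are in the left table only, the right table only, or both, the merge indicator flag may save a step. This is only a rough sketch under the same file and column names as above; pandas' merge accepts indicator=True and dask's merge mirrors that argument, but it is worth checking against your dask version:

import dask.dataframe as dd

df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# indicator=True adds a '_merge' column whose value is
# 'left_only', 'right_only' or 'both' for each row.
df = dd.merge(df1, df2, how='outer', on=['product', 'version'], indicator=True)

# Count how many product/version pairs fall into each category ...
print(df['_merge'].value_counts().compute())

# ... or keep just one category, e.g. the rows that exist only in the left table.
left_only = df[df['_merge'] == 'left_only']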

Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

df = dd.concat([df1, df2]).drop_duplicates()

I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
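
If the column-based merge does turn out to be slow, one common dask pattern (an assumption on my part, not something from this answer) is to build a single join key, set it as the index, and join on the index; set_index triggers a shuffle up front, but the index-aligned join afterwards is usually faster. A minimal sketch with the same assumed file and column names:

import dask.dataframe as dd

df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Combine the two key columns into a single string key and index on it.
df1['key'] = df1['product'].astype(str) + '|' + df1['version'].astype(str)
df2['key'] = df2['product'].astype(str) + '|' + df2['version'].astype(str)
df1 = df1.set_index('key')
df2 = df2.set_index('key')

# Outer join aligned on the shared index; suffixes avoid column name clashes.
df = df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')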

Answered Sep 28 '22 by root


I would recommend using an RDBMS like MySQL for that...

So you would need to load your CSV files into tables first.
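
MySQL's LOAD DATA INFILE is usually the fastest way to do this bulk load, but if you prefer to stay in Python, a chunked load with pandas and SQLAlchemy keeps memory usage low. This is just a sketch; the connection string, table names and chunk size below are assumptions, not part of the original answer:

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection details - adjust user, password, host and database name.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

def load_csv(path, table):
    # Read and insert the CSV in chunks so the whole file never sits in memory.
    for chunk in pd.read_csv(path, chunksize=100000):
        chunk.to_sql(table, engine, if_exists='append', index=False)

load_csv('file1.csv', 'table_a')
load_csv('file2.csv', 'table_b')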

After that you can perform your checks:

which products and versions are in the left table only

SELECT a.product, a.version
FROM table_a a
LEFT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE b.product IS NULL;

which products and versions are in the right table only

SELECT b.product, b.version
FROM table_a a
RIGHT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE a.product IS NULL;

in both

SELECT a.product, a.version
FROM table_a a
JOIN table_b b
ON a.product = b.product AND a.version = b.version;

Configure your MySQL server so that it uses at least 2 GB of RAM (for InnoDB tables this mostly means increasing innodb_buffer_pool_size).

You may also want to use the MyISAM engine for your tables; in that case, check the MyISAM-specific settings (e.g. key_buffer_size).

It might be slower than pandas, but you definitely won't have memory issues.

Other possible solutions:

  • increase your RAM
  • use Apache Spark SQL (distributed DataFrames) on multiple cluster nodes - though it will be much cheaper to simply increase your RAM (see the sketch below)
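
For completeness, a rough sketch of the Spark route in PySpark (the file and column names are carried over from the question, everything else is an assumption); Spark spills to disk instead of failing when the data doesn't fit in memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('csv-outer-join').getOrCreate()

# header=True uses the first line as column names; inferSchema guesses the types.
df1 = spark.read.csv('file1.csv', header=True, inferSchema=True)
df2 = spark.read.csv('file2.csv', header=True, inferSchema=True)

# Full outer join on the two key columns.
df = df1.join(df2, on=['product', 'version'], how='outer')

# Spark writes a directory of part files rather than a single CSV.
df.write.csv('file3_out', header=True)
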
Answered Sep 28 '22 by MaxU - stop WAR against UA