Best way to join two large datasets in Pandas

I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500 MB when I store it as CSV. Separately they fit into memory, but when I load both at once I sometimes get a memory error, and I definitely get into trouble when I try to merge them with pandas.

What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open source software on my computer if that helps. Ideally I would still like to solve it in pandas only, but I'm not sure if that is possible at all.

To clarify: by merging I mean an outer join. Each table has two columns: product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables. I do that with:

pd.merge(df1, df2, left_on=['product', 'version'], right_on=['product', 'version'], how='outer')
Asked Jun 10 '16 by Nickpick


2 Answers

This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.

import dask.dataframe as dd

# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])

# Write the output.
df.to_csv('file3.csv', index=False)
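
Since the goal is specifically to see which product/version pairs are in the left table only, the right table only, or both, the merge indicator flag may save a step. This is only a rough sketch under the same file and column names as above; pandas' merge accepts indicator=True and dask's merge mirrors that argument, but it is worth checking against your dask version:

import dask.dataframe as dd

df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# indicator=True adds a '_merge' column whose value is
# 'left_only', 'right_only' or 'both' for each row.
df = dd.merge(df1, df2, how='outer', on=['product', 'version'], indicator=True)

# Count how many product/version pairs fall into each category ...
print(df['_merge'].value_counts().compute())

# ... or keep just one category, e.g. the rows that exist only in the left table.
left_only = df[df['_merge'] == 'left_only']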

Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

df = dd.concat([df1, df2]).drop_duplicates()

I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
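
If the column-based merge does turn out to be slow, one common dask pattern (an assumption on my part, not something from this answer) is to build a single join key, set it as the index, and join on the index; set_index triggers a shuffle up front, but the index-aligned join afterwards is usually faster. A minimal sketch with the same assumed file and column names:

import dask.dataframe as dd

df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Combine the two key columns into a single string key and index on it.
df1['key'] = df1['product'].astype(str) + '|' + df1['version'].astype(str)
df2['key'] = df2['product'].astype(str) + '|' + df2['version'].astype(str)
df1 = df1.set_index('key')
df2 = df2.set_index('key')

# Outer join aligned on the shared index; suffixes avoid column name clashes.
df = df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')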

Answered Sep 28 '22 by root


I would recommend using an RDBMS like MySQL for that...

So you would need to load your CSV files into tables first.
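
MySQL's LOAD DATA INFILE is usually the fastest way to do this bulk load, but if you prefer to stay in Python, a chunked load with pandas and SQLAlchemy keeps memory usage low. This is just a sketch; the connection string, table names and chunk size below are assumptions, not part of the original answer:

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection details - adjust user, password, host and database name.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

def load_csv(path, table):
    # Read and insert the CSV in chunks so the whole file never sits in memory.
    for chunk in pd.read_csv(path, chunksize=100000):
        chunk.to_sql(table, engine, if_exists='append', index=False)

load_csv('file1.csv', 'table_a')
load_csv('file2.csv', 'table_b')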

After that you can perform your checks:

which products and versions are in the left table only

SELECT a.product, a.version
FROM table_a a
LEFT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE b.product IS NULL;

which products and versions are in the right table only

SELECT b.product, b.version
FROM table_a a
RIGHT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE a.product IS NULL;

in both

SELECT a.product, a.version
FROM table_a a
JOIN table_b b
ON a.product = b.product AND a.version = b.version;

Configure your MySQL server so that it uses at least 2 GB of RAM (for InnoDB tables this mostly means increasing innodb_buffer_pool_size).

You may also want to use the MyISAM engine for your tables; in that case, check the MyISAM-specific settings (e.g. key_buffer_size).

It might be slower than pandas, but you definitely won't have memory issues.

Other possible solutions:

  • increase your RAM
  • use Apache Spark SQL (distributed DataFrames) on multiple cluster nodes - though it will be much cheaper to simply increase your RAM (see the sketch below)
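
For completeness, a rough sketch of the Spark route in PySpark (the file and column names are carried over from the question, everything else is an assumption); Spark spills to disk instead of failing when the data doesn't fit in memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('csv-outer-join').getOrCreate()

# header=True uses the first line as column names; inferSchema guesses the types.
df1 = spark.read.csv('file1.csv', header=True, inferSchema=True)
df2 = spark.read.csv('file2.csv', header=True, inferSchema=True)

# Full outer join on the two key columns.
df = df1.join(df2, on=['product', 'version'], how='outer')

# Spark writes a directory of part files rather than a single CSV.
df.write.csv('file3_out', header=True)
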
Answered Sep 28 '22 by MaxU - stop WAR against UA