
Pandas DataFrame.merge MemoryError

Goal

My goal is to merge two DataFrames by their common column (gene names) so that I can take the product of the two gene scores in each merged row. I would then group by patient and cell and sum all the scores in each group. The final DataFrame should look like this:

    patient  cell 
    Pat_1    22RV1    12
             DU145    15
             LN18      9
    Pat_2    22RV1    12
             DU145    15
             LN18      9
    Pat_3    22RV1    12
             DU145    15
             LN18      9

That last part should work fine, but I have not been able to perform the first merge on gene names due to a MemoryError. Below are snippets of each DataFrame.

Data

cell_s =

    Description          Name                      level_2  0
0  LOC100009676  100009676_at  LN18_CENTRAL_NERVOUS_SYSTEM  1
1  LOC100009676  100009676_at               22RV1_PROSTATE  2
2  LOC100009676  100009676_at               DU145_PROSTATE  3
3          AKT3      10000_at  LN18_CENTRAL_NERVOUS_SYSTEM  4
4          AKT3      10000_at               22RV1_PROSTATE  5
5          AKT3      10000_at               DU145_PROSTATE  6
6          MED6      10001_at  LN18_CENTRAL_NERVOUS_SYSTEM  7
7          MED6      10001_at               22RV1_PROSTATE  8
8          MED6      10001_at               DU145_PROSTATE  9

cell_s is about 10,000,000 rows

patient_s =

             id level_1  0
0          MED6   Pat_1  1
1          MED6   Pat_2  1
2          MED6   Pat_3  1
3  LOC100009676   Pat_1  2
4  LOC100009676   Pat_2  2
5  LOC100009676   Pat_3  2
6          ABCD   Pat_1  3
7          ABCD   Pat_2  3
8          ABCD   Pat_3  3
    ....

patient_s is about 1,200,000 rows

Code

def get_score(cell, patient):
    # reshape cell scores to long format: one row per (gene, probe, cell line)
    cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
    cell_s.columns = ['Description', 'Name', 'cell', 's1']

    # reshape patient scores to long format: one row per (gene, patient)
    patient_s = patient.set_index('id').stack().reset_index()
    patient_s.columns = ['id', 'patient', 's2']

    # fails here with a MemoryError:
    merged = cell_s.merge(patient_s, left_on='Description', right_on='id')
    merged['score'] = merged.s1 * merged.s2

    scores = merged.groupby(['patient', 'cell'])['score'].sum()
    return scores

I was getting a MemoryError when initially reading these files with read_csv, but specifying the dtypes resolved that. I have also confirmed that my Python is 64-bit, which did not fix the merge issue. I haven't reached the limits of pandas, have I?
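For context, "specifying the dtypes" looked roughly like the following; the file path, column names, and chosen dtypes below are illustrative assumptions rather than the actual values from this post:

import numpy as np
import pandas as pd

# Illustrative sketch only: the path and column names are assumptions.
# Narrow numeric dtypes shrink the parsed frame, and heavily repeated
# string columns can be converted to categoricals after reading
# (older pandas versions do not accept 'category' directly in read_csv).
cell = pd.read_csv('cells.csv',
                   dtype={'LN18_CENTRAL_NERVOUS_SYSTEM': np.float32,
                          '22RV1_PROSTATE': np.float32,
                          'DU145_PROSTATE': np.float32})
cell['Description'] = cell['Description'].astype('category')
cell['Name'] = cell['Name'].astype('category')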

Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2

asked Aug 01 '15 by Thomas Matthew



2 Answers

Consider two workarounds:

CSV BY CHUNKS

Apparently, read_csv can suffer memory problems with very large files, so load them in iterated chunks and concatenate the pieces.

import pandas as pd

cellsfilepath = 'C:\\Path\\To\\Cells\\CSVFile.csv'
# read in 1,000-row chunks, then stitch the pieces back together
tp = pd.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
cell_s = pd.concat(tp, ignore_index=True)

patientsfilepath = 'C:\\Path\\To\\Patients\\CSVFile.csv'
tp = pd.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
patient_s = pd.concat(tp, ignore_index=True)
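Note that concatenating every chunk still materializes the full DataFrame in memory, so chunked reading mainly avoids the parser's own peak-memory spike; combining it with the explicit dtypes mentioned in the question keeps the overall footprint down.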

CSV VIA SQL

As a database guy, I always recommend handling large data loads and merges/joins with a SQL relational engine, which scales well for such processes. I have written many a comment to this effect on DataFrame merge Q&As, even in R. You can use any SQL database, including file-based dbs (Access, SQLite) or client-server dbs (MySQL, MSSQL, or others), possibly even the one your DataFrames derive from. Python ships with a built-in library for SQLite (for other databases, use ODBC or the appropriate driver), and DataFrames can be pushed into a database as tables with pandas' to_sql:

import pandas as pd
import sqlite3

dbfile = 'C:\\Path\\To\\SQlitedb.sqlite'
cxn = sqlite3.connect(dbfile)

# push both DataFrames into the SQLite database as tables
cell_s.to_sql(name='cell_s', con=cxn, if_exists='replace')
patient_s.to_sql(name='patient_s', con=cxn, if_exists='replace')

# let the database engine perform the join, then pull the result back
strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
# MIGHT HAVE TO ADJUST ABOVE FOR CELL AND PATIENT PARAMS IN DEFINED FUNCTION

merged = pd.read_sql(strSQL, cxn)
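From there, the score and groupby step from the question can run on the merged result, or the aggregation can be pushed into SQL as well, which avoids materializing the full joined table in pandas. A sketch, assuming the stacked frames from the question (with columns Description, cell, s1 and id, patient, s2) are what was written to the database:

strSQL = """
    SELECT p.patient, c.cell, SUM(c.s1 * p.s2) AS score
    FROM cell_s c
    INNER JOIN patient_s p ON c.Description = p.id
    GROUP BY p.patient, c.cell;
"""
scores = pd.read_sql(strSQL, cxn)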
answered Oct 18 '22 by Parfait


You may have to do the merge in pieces, or look into Blaze: http://blaze.pydata.org
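A minimal sketch of the "in pieces" idea, not taken from this answer: stream the large cell table in chunks, merge each chunk against the much smaller patient frame, aggregate per chunk, and combine the partial sums at the end. The column names follow the question's stacked frames; the file path and chunk size are assumptions.

import pandas as pd

partials = []
# hypothetical path to the long-format cell data; only one chunk is in memory at a time
for chunk in pd.read_csv('cell_s_long.csv', chunksize=500000):
    m = chunk.merge(patient_s, left_on='Description', right_on='id')
    m['score'] = m.s1 * m.s2
    partials.append(m.groupby(['patient', 'cell'])['score'].sum())

# add up the per-chunk sums to get the final per-(patient, cell) totals
scores = pd.concat(partials).groupby(level=['patient', 'cell']).sum()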

answered Oct 18 '22 by Skorpeo