
How to deal with modifying large pandas dataframes

I have a largish pandas dataframe (a 1.5 GB .csv on disk). I can load it into memory and query it. I want to create a new column that is the combined value of two other columns, and I tried this:

def combined(row):
    row['combined'] = row['col1'].join(str(row['col2']))
    return row

df = df.apply(combined, axis=1)

This results in my python process being killed, presumably because of memory issues.

A more iterative solution to the problem seems to be:

df['combined'] = ''
col_pos = list(df.columns).index('combined')
crs_pos = list(df.columns).index('col1')
sub_pos = list(df.columns).index('col2')

# range(len(df)) visits every row, including the last one
for row_pos in range(len(df)):
    df.iloc[row_pos, col_pos] = df.iloc[row_pos, crs_pos].join(str(df.iloc[row_pos, sub_pos]))

This, of course, seems very unpandas, and it is very slow.

Ideally I would like something like apply_chunk(), which would behave like apply but work on only a piece of the dataframe at a time. I thought dask might be an option for this, but dask dataframes seemed to have other issues when I used them. This has to be a common problem, though. Is there a design pattern I should be using for adding columns to large pandas dataframes?

Christopher asked Jul 22 '15 20:07



1 Answer

I would try using a list comprehension over the zipped columns:

import pandas as pd

df = pd.DataFrame({
    'a': ['ab'] * 200,
    'b': ['ffff'] * 200
})

# The built-in zip is lazy on Python 3; on Python 2, itertools.izip plays the same role.
[a.join(b) for (a, b) in zip(df.a, df.b)]

It might be "unpandas", but pandas doesn't seem to have a .str method that does this row-by-row join, and a list comprehension is hardly "unpythonic".

To create another column, just use:

df['c'] = [a.join(b) for (a, b) in zip(df.a, df.b)]
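
One thing worth checking is what join actually produces here: the left-hand string is used as a separator between the characters of the right-hand string, so the result is not a plain concatenation. If concatenation is what's really wanted (that is an assumption on my part, the question doesn't say), a vectorized sketch on the same sample columns might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['ab'] * 200,
    'b': ['ffff'] * 200
})

# str.join semantics: 'ab' is inserted between the characters of 'ffff'
joined = [a.join(b) for (a, b) in zip(df.a, df.b)]

# Plain concatenation, vectorized; astype(str) covers a non-string second column
concatenated = df['a'] + df['b'].astype(str)

print(joined[0])
print(concatenated.iloc[0])
```

Comparing the two printed values on a few rows of the real data should make it clear which behaviour is the intended one.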

Incidentally, if you'd like to play with parallelization, you can also get your chunking by slicing both columns:

[a.join(b) for (a, b) in zip(df.a[10:20], df.b[10:20])]

I would first try the unchunked version above, though, as list comprehensions over zipped columns are often surprisingly fast, and parallelization adds overhead that the speedup would need to outweigh.
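
The slicing idea can be wrapped in a small helper that builds the new column one chunk at a time. This is only a sketch: combine_in_chunks and chunk_size are names I made up for illustration, not part of pandas.

```python
import pandas as pd


def combine_in_chunks(df, left, right, chunk_size=50):
    """Hypothetical helper: build the combined values chunk by chunk."""
    parts = []
    for start in range(0, len(df), chunk_size):
        stop = start + chunk_size
        # Positional slices of both columns, like the [10:20] example
        a_chunk = df[left].iloc[start:stop]
        b_chunk = df[right].iloc[start:stop]
        parts.extend(a.join(str(b)) for a, b in zip(a_chunk, b_chunk))
    return parts


df = pd.DataFrame({
    'a': ['ab'] * 200,
    'b': ['ffff'] * 200
})
df['c'] = combine_in_chunks(df, 'a', 'b', chunk_size=64)
```

Each chunk is independent of the others, so the loop body is the piece you would hand to worker processes if you did want to experiment with parallelization.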

Ami Tavory answered Oct 11 '22 00:10
