I have a largish pandas dataframe (a 1.5 GB .csv on disk). I can load it into memory and query it. I want to create a new column that is the combined value of two other columns, and I tried this:
def combined(row):
    row['combined'] = row['col1'].join(str(row['col2']))
    return row

df = df.apply(combined, axis=1)
This results in my python process being killed, presumably because of memory issues.
A more iterative solution to the problem seems to be:
df['combined'] = ''
col_pos = list(df.columns).index('combined')
crs_pos = list(df.columns).index('col1')
sub_pos = list(df.columns).index('col2')
for row_pos in range(len(df)):
    df.iloc[row_pos, col_pos] = df.iloc[row_pos, sub_pos].join(str(df.iloc[row_pos, crs_pos]))
This of course seems very un-pandas, and it is very slow.
Ideally I would like something like apply_chunk(), which is the same as apply but only works on a piece of the dataframe at a time. I thought dask might be an option for this, but dask dataframes seemed to have other issues when I used them. This has to be a common problem, though: is there a design pattern I should be using for adding columns to large pandas dataframes?
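Something like this is what I am imagining (apply_chunk is a made-up helper, not an existing pandas method, and the chunk size is arbitrary):

import pandas as pd

def apply_chunk(df, func, chunk_size=100000):
    # hypothetical helper: run func over one slice of rows at a time,
    # so only one chunk's worth of intermediate objects is in memory
    pieces = []
    for start in range(0, len(df), chunk_size):
        pieces.append(df.iloc[start:start + chunk_size].apply(func, axis=1))
    return pd.concat(pieces)

df = apply_chunk(df, combined)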
I would try using a list comprehension + zip (itertools.izip on Python 2):
import pandas as pd

df = pd.DataFrame({
    'a': ['ab'] * 200,
    'b': ['ffff'] * 200
})

[a.join(b) for (a, b) in zip(df.a, df.b)]
It might be "unpandas", but pandas doesn't seem to have a .str method that helps you here, and it isn't "unpythonic".
To create another column, just use:
df['c'] = [a.join(b) for (a, b) in zip(df.a, df.b)]
Incidentally, you can also get your chunking using:
[a.join(b) for (a, b) in zip(df.a[10:20], df.b[10:20])]
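If it helps, here is roughly how that slicing idea could be extended to fill the whole column one chunk of rows at a time (chunk_size is just an illustrative value):

chunk_size = 100000
parts = []
for start in range(0, len(df), chunk_size):
    # build the combined values for one slice of rows at a time
    a_chunk = df.a[start:start + chunk_size]
    b_chunk = df.b[start:start + chunk_size]
    parts.extend(a.join(b) for (a, b) in zip(a_chunk, b_chunk))
df['c'] = parts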
If you'd like to play with parallelization, I would first try the above version, as list comprehensions and zip are often surprisingly fast, and parallelization carries an overhead that would need to be outweighed.
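If you do want to experiment, a minimal sketch using the standard library's ProcessPoolExecutor over chunks might look like the following; the chunk size and worker count are assumptions, and the cost of pickling chunks to send to worker processes may well eat the gains:

from concurrent.futures import ProcessPoolExecutor

def join_chunk(pairs):
    # pairs is a list of (a, b) tuples for one chunk of rows
    return [a.join(b) for (a, b) in pairs]

def combined_parallel(df, chunk_size=100000, workers=4):
    chunks = [list(zip(df.a[i:i + chunk_size], df.b[i:i + chunk_size]))
              for i in range(0, len(df), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(join_chunk, chunks)
    # flatten the per-chunk results back into one column
    return [value for chunk in results for value in chunk]

df['c'] = combined_parallel(df)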