I have a largish pandas dataframe (a 1.5 GB .csv on disk). I can load it into memory and query it. I want to create a new column that is the combined value of two other columns, and I tried this:
def combined(row):
    row['combined'] = row['col1'].join(str(row['col2']))
    return row

df = df.apply(combined, axis=1)
This results in my python process being killed, presumably because of memory issues.
A more iterative solution to the problem seems to be:
df['combined'] = ''
col_pos = list(df.columns).index('combined')
crs_pos = list(df.columns).index('col1')
sub_pos = list(df.columns).index('col2')
for row_pos in range(len(df)):
    df.iloc[row_pos, col_pos] = df.iloc[row_pos, sub_pos].join(str(df.iloc[row_pos, crs_pos]))
This of course seems very un-pandas, and it is very slow.
Ideally I would like something like apply_chunk(), which is the same as apply but only works on a piece of the dataframe at a time. I thought dask might be an option for this, but dask dataframes seemed to have other issues when I used them. This has to be a common problem, though: is there a design pattern I should be using for adding columns to large pandas dataframes?
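Something like this is what I am imagining (apply_chunk is a made-up helper, not an existing pandas method, and the chunk size is arbitrary):

import pandas as pd

def apply_chunk(df, func, chunk_size=100000):
    # hypothetical helper: run func over one slice of rows at a time,
    # so only one chunk's worth of intermediate objects is in memory
    pieces = []
    for start in range(0, len(df), chunk_size):
        pieces.append(df.iloc[start:start + chunk_size].apply(func, axis=1))
    return pd.concat(pieces)

df = apply_chunk(df, combined)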
I would try using a list comprehension + zip (itertools.izip on Python 2):
import pandas as pd

df = pd.DataFrame({
    'a': ['ab'] * 200,
    'b': ['ffff'] * 200
})

[a.join(b) for (a, b) in zip(df.a, df.b)]
It might be "unpandas", but pandas doesn't seem to have a .str method that helps you here, and it isn't "unpythonic".
To create another column, just use:
df['c'] = [a.join(b) for (a, b) in zip(df.a, df.b)]
Incidentally, you can also get your chunking using:
[a.join(b) for (a, b) in zip(df.a[10:20], df.b[10:20])]
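If it helps, here is roughly how that slicing idea could be extended to fill the whole column one chunk of rows at a time (chunk_size is just an illustrative value):

chunk_size = 100000
parts = []
for start in range(0, len(df), chunk_size):
    # build the combined values for one slice of rows at a time
    a_chunk = df.a[start:start + chunk_size]
    b_chunk = df.b[start:start + chunk_size]
    parts.extend(a.join(b) for (a, b) in zip(a_chunk, b_chunk))
df['c'] = parts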
If you'd like to play with parallelization, I would first try the above version, as list comprehensions and zip are often surprisingly fast, and parallelization carries an overhead that would need to be outweighed.
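If you do want to experiment, a minimal sketch using the standard library's ProcessPoolExecutor over chunks might look like the following; the chunk size and worker count are assumptions, and the cost of pickling chunks to send to worker processes may well eat the gains:

from concurrent.futures import ProcessPoolExecutor

def join_chunk(pairs):
    # pairs is a list of (a, b) tuples for one chunk of rows
    return [a.join(b) for (a, b) in pairs]

def combined_parallel(df, chunk_size=100000, workers=4):
    chunks = [list(zip(df.a[i:i + chunk_size], df.b[i:i + chunk_size]))
              for i in range(0, len(df), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(join_chunk, chunks)
    # flatten the per-chunk results back into one column
    return [value for chunk in results for value in chunk]

df['c'] = combined_parallel(df)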