I'm trying to merge two relatively small datasets together, but the merge raises a MemoryError. I have two datasets of aggregates of country trade data that I'm trying to merge on the keys year and country, so the rows need to be aligned precisely on those keys. This unfortunately rules out concat and its performance benefits, as explained in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
Importer code CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!
Dask DataFrame merge: you can join a Dask DataFrame to a small pandas DataFrame with the dask.dataframe.merge() method, which mirrors the pandas API. The exact same Python syntax also works for joining two large Dask DataFrames.
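As a minimal sketch of that approach, assuming df and i are the frames from the question (the npartitions=8 value is an arbitrary choice, not from the original post):

import dask.dataframe as dd

# Wrap the larger pandas frame in a Dask DataFrame so the merge runs
# partition by partition instead of materializing everything at once.
ddf = dd.from_pandas(df, npartitions=8)

# i is small, so Dask joins it against each partition in turn.
merged = dd.merge(ddf, i,
                  left_on=['year', 'ComTrade_CC'],
                  right_on=['Year', 'Partner Code'])

result = merged.compute()  # materialize back into a pandas DataFrame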
One strategy for this kind of problem is to decrease the amount of data by reducing the number of rows or columns in the dataset. In my case, however, I was only loading 20% of the available data, so this wasn't an option; it would have excluded too many important elements of my dataset.
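If trimming is an option for you, the easiest place to do it is at load time. A small sketch, with a hypothetical file name and column list:

import pandas as pd

# Load only the columns the merge actually needs, with compact dtypes.
df = pd.read_csv('trade.csv',  # hypothetical file name
                 usecols=['year', 'ComTrade_CC', 'Distance_miles'],
                 dtype={'year': 'int16', 'ComTrade_CC': 'int32'})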
To fix this, install the 64-bit version of Python. A 64-bit system can address 2⁶⁴ distinct memory locations, about 18 quintillion bytes, whereas a 32-bit Python process is limited to a few gigabytes of address space no matter how much RAM the machine has. If you have a 64-bit computer, use the 64-bit version of Python to exploit its full potential.
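You can check which build you are running with a couple of lines of standard-library Python:

import struct, sys
print(struct.calcsize('P') * 8)  # pointer size in bits: 64 on a 64-bit build
print(sys.maxsize > 2**32)       # True on a 64-bit build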
Sometimes we use the chunksize parameter while reading large datasets to divide them into chunks of rows. We specify the size of these chunks with the chunksize parameter. That way only one chunk of the file is in memory at a time, which keeps memory use bounded.
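A minimal sketch of that pattern applied to this merge, assuming the larger table comes from a CSV file (the file name and chunk size here are hypothetical) and i fits comfortably in memory:

import pandas as pd

pieces = []
# read_csv with chunksize returns an iterator of DataFrames,
# so only 100000 rows are held in memory at any given moment
for chunk in pd.read_csv('trade.csv', chunksize=100000):
    merged = chunk.merge(i, left_on=['year', 'ComTrade_CC'],
                         right_on=['Year', 'Partner Code'])
    pieces.append(merged)

df = pd.concat(pieces, ignore_index=True)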
In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.
UPDATE
Example:
df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()
df1
A B C
0 1 2 3
1 2 3 4
df2
A B D
0 1 2 7
1 2 3 8
both
A B C D
0 1 2 3 7
1 2 3 4 8
I haven't benchmarked the performance of this approach, but it didn't get the memory error and worked for my applications.