Pandas read csv out of memory

Tags: python, memory, csv

I am trying to manipulate a large CSV file using pandas. When I run this:

df = pd.read_csv(strFileName,sep='\t',delimiter='\t')

it raises "pandas.parser.CParserError: Error tokenizing data. C error: out of memory". wc -l indicates that the file has 13822117 lines. I need to aggregate over this CSV file as a data frame. Is there a way to handle this other than splitting the CSV into several files and writing code to merge the results? Any suggestions on how to do that? Thanks.

The input is like this:

columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
0:'3M' '2345' '2345' '2014-10-5',3000
1:'3M' '2958' '2152' '2015-3-22',5000
2:'GE' '2183' '2183' '2012-12-31',515
3:'3M' '2958' '2958' '2015-3-10',395
4:'GE' '2183' '2285' '2015-4-19',1925
5:'GE' '2598' '2598' '2015-3-17',1915

And the desired output is like this:

columns=[ka,kb,errorNum,errorRate,totalNum of records]
'3M','2345',0,0%,1
'3M','2958',1,50%,2
'GE','2183',1,50%,2
'GE','2598',0,0%,1

If the data set is small, the code below could be used, as provided by another:

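# Flag abnormal records (kb_1 differs from kb_2); cast to int so 'sum' counts them
df['isError'] = (df['kb_1'] != df['kb_2']).astype(int)

# Named aggregation (pandas >= 0.25); passing a renaming dict to agg() was removed in pandas 1.0
df2 = df.groupby(['ka','kb_1'])['isError'].agg(recordNum='count',
                                             errorNum='sum')

df2['errorRate'] = df2['errorNum'] / df2['recordNum']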

ka kb_1  recordNum  errorNum  errorRate

3M 2345          1         0        0.0
   2958          2         1        0.5
GE 2183          2         1        0.5
   2598          1         0        0.0

(Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as abnormal.)

asked May 14 '15 19:05

sunxd

2 Answers

You haven't stated what your intended aggregation would be, but if it's just sum and count, then you could aggregate in chunks:

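import pandas as pd

partials = []  # one partial result per chunk
reader = pd.read_table(strFileName, chunksize=16*1024)  # choose as appropriate
for chunk in reader:
    temp = chunk.agg(...)  # your logic here
    # collect in a list: DataFrame.append never worked in place and was removed in pandas 2.0
    partials.append(temp)
df = pd.concat(partials).agg(...)  # redo your logic here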
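
To make the placeholders above concrete, here is a minimal sketch of the chunked approach for the sum-and-count case in the question. It is only an illustration under some assumptions that the question does not confirm: the file is plain tab-separated text with no header row and no quoting, and the columns arrive in the order ka, kb_1, kb_2, timeofEvent, timeInterval. The function name aggregate_in_chunks and the default chunk size are made up for this example.

```python
import io

import pandas as pd


def aggregate_in_chunks(source, chunksize=100_000):
    """Return errorNum, recordNum and errorRate per (ka, kb_1) group."""
    names = ["ka", "kb_1", "kb_2", "timeofEvent", "timeInterval"]
    reader = pd.read_csv(
        source,
        sep="\t",
        header=None,                     # assumption: no header row
        names=names,
        usecols=["ka", "kb_1", "kb_2"],  # load only the columns the aggregation needs
        dtype=str,                       # keep keys as text, e.g. '2345' stays a string
        chunksize=chunksize,
    )
    partials = []
    for chunk in reader:
        # An error record is one where kb_1 differs from kb_2.
        chunk["isError"] = (chunk["kb_1"] != chunk["kb_2"]).astype(int)
        partials.append(
            chunk.groupby(["ka", "kb_1"])["isError"].agg(
                errorNum="sum", recordNum="count"
            )
        )
    # A group can span several chunks, so add up the partial sums and counts.
    total = pd.concat(partials).groupby(level=["ka", "kb_1"]).sum()
    total["errorRate"] = total["errorNum"] / total["recordNum"]
    return total
```

Calling aggregate_in_chunks(strFileName) should then keep only one chunk plus the small per-group partial results in memory at a time. Sums and counts combine cleanly across chunks; the rate has to be computed at the end from the combined totals, not averaged per chunk.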

answered Sep 20 '22 17:09

chrisaycock

Based on your snippet in "out of memory error when reading csv file in chunk", here is an approach that reads the file line by line.

I assume that kb_2 is the error indicator:

groups = {}
with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.split('\t')
        # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
        k = arr[0] + ',' + arr[1]
        if k not in groups:
            groups[k] = {'record_count': 0, 'error_sum': 0}
        groups[k]['record_count'] += 1
        groups[k]['error_sum'] += float(arr[2])
for k, v in groups.items():
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This code snippet stores all the groups in a dictionary, and calculates the error rate after reading the entire file.

It can still run out of memory if there are too many distinct group combinations, because memory use grows with the number of groups rather than with the number of lines.
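
If kb_2 is not a numeric indicator and an error is instead defined as kb_1 != kb_2 (as the question states), the same line-by-line idea could be sketched with the standard csv module as below. This is an illustration, not a tested drop-in: it assumes tab-separated rows without a header, and the function name error_rates is made up for the example.

```python
import csv
from collections import defaultdict


def error_rates(lines):
    """Stream tab-separated rows; return {(ka, kb_1): (errors, records, rate)}."""
    counts = defaultdict(lambda: [0, 0])  # (ka, kb_1) -> [errors, records]
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ka, kb_1, kb_2 = row[0], row[1], row[2]
        entry = counts[(ka, kb_1)]
        entry[0] += kb_1 != kb_2  # True counts as 1, False as 0
        entry[1] += 1
    return {key: (err, n, err / n) for key, (err, n) in counts.items()}
```

It can be used as: with open("data/petaJoined.csv", newline="") as f: rates = error_rates(f). Only one row and the per-group counters are held in memory at any time.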

answered Sep 18 '22 17:09

Uri Goren
