I am trying to manipulate a large CSV file using pandas. When I run this:
df = pd.read_csv(strFileName,sep='\t',delimiter='\t')
it raises "pandas.parser.CParserError: Error tokenizing data. C error: out of memory". wc -l indicates there are 13822117 lines. I need to aggregate over this data frame. Is there a way to handle this other than splitting the CSV into several files and writing code to merge the results? Any suggestions on how to do that? Thanks.
The input is like this:
columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
0:'3M' '2345' '2345' '2014-10-5',3000
1:'3M' '2958' '2152' '2015-3-22',5000
2:'GE' '2183' '2183' '2012-12-31',515
3:'3M' '2958' '2958' '2015-3-10',395
4:'GE' '2183' '2285' '2015-4-19',1925
5:'GE' '2598' '2598' '2015-3-17',1915
And the desired output is like this:
columns=[ka,kb,errorNum,errorRate,totalNum of records]
'3M','2345',0,0%,1
'3M','2958',1,50%,2
'GE','2183',1,50%,2
'GE','2598',0,0%,1
If the data set is small, the code below, provided by another user, could be used:
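# Flag error records: 1 when kb_1 != kb_2, otherwise 0
df['isError'] = (df['kb_1'] != df['kb_2']).astype(int)
# Named aggregation (the dict-renaming form of agg no longer works in recent pandas)
df2 = df.groupby(['ka','kb_1'])['isError'].agg(recordNum='count', errorNum='sum')
df2['errorRate'] = df2['errorNum'] / df2['recordNum']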
ka  kb_1  recordNum  errorNum  errorRate
3M  2345  1          0         0.0
    2958  2          1         0.5
GE  2183  2          1         0.5
    2598  1          0         0.0
(Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as abnormal.)
Using pandas: one way to process large files is to read them in chunks of a reasonable size, so that each chunk is loaded into memory and processed before the next one is read. The chunksize parameter of read_csv specifies the size of each chunk as a number of rows.
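As a minimal sketch of that mechanism: the sample string below is a hypothetical stand-in for the real file, and the chunk size of 2 is chosen only to make the behaviour visible.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for the real file.
sample = "ka\tkb_1\tkb_2\n3M\t2345\t2345\n3M\t2958\t2152\nGE\t2183\t2183\n"

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
chunk_sizes = []
for chunk in pd.read_csv(io.StringIO(sample), sep="\t", chunksize=2):
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # three data rows split into chunks of at most 2 rows
```

For a file on disk, the path would be passed in place of the StringIO object, with a much larger chunk size.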
You haven't stated what your intended aggregation would be, but if it's just sum and count, then you could aggregate in chunks:
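import pandas as pd

parts = []  # partial results, one per chunk
reader = pd.read_table(strFileName, chunksize=16*1024)  # choose as appropriate
for chunk in reader:
    temp = chunk.agg(...)  # your logic here
    parts.append(temp)  # DataFrame.append was not in-place and has been removed from pandas
dfs = pd.concat(parts)
df = dfs.agg(...)  # redo your logic here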
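For the aggregation described in the question, the chunked pattern might be filled in roughly as follows. This is a sketch under assumptions: the file is taken to be tab-separated with no header row, the columns are in the order ka, kb_1, kb_2, timeofEvent, timeInterval, and the helper name aggregate_in_chunks is made up for illustration.

```python
import io
import pandas as pd

def aggregate_in_chunks(source, chunksize=100_000):
    """Return per-(ka, kb_1) error and record counts, reading `source` in chunks."""
    cols = ["ka", "kb_1", "kb_2", "timeofEvent", "timeInterval"]  # assumed column order
    partials = []
    reader = pd.read_csv(source, sep="\t", header=None, names=cols,
                         usecols=["ka", "kb_1", "kb_2"], dtype=str,
                         chunksize=chunksize)
    for chunk in reader:
        # A record counts as an error when kb_1 and kb_2 differ.
        chunk["isError"] = (chunk["kb_1"] != chunk["kb_2"]).astype(int)
        partials.append(
            chunk.groupby(["ka", "kb_1"])["isError"]
                 .agg(errorNum="sum", recordNum="count")
        )
    # Sums and counts are additive, so the partial results can be summed per group.
    result = pd.concat(partials).groupby(level=["ka", "kb_1"]).sum()
    result["errorRate"] = result["errorNum"] / result["recordNum"]
    return result

# The six sample rows from the question, as tab-separated text.
sample = (
    "3M\t2345\t2345\t2014-10-5\t3000\n"
    "3M\t2958\t2152\t2015-3-22\t5000\n"
    "GE\t2183\t2183\t2012-12-31\t515\n"
    "3M\t2958\t2958\t2015-3-10\t395\n"
    "GE\t2183\t2285\t2015-4-19\t1925\n"
    "GE\t2598\t2598\t2015-3-17\t1915\n"
)
result = aggregate_in_chunks(io.StringIO(sample), chunksize=2)
print(result)
```

Reading only the three columns that are needed (usecols) and keeping them as strings also reduces the memory used per chunk. The error rate is computed once at the end, because rates from separate chunks cannot simply be averaged.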
Based on your snippet in "out of memory error when reading csv file in chunk", here is a version that reads the file line by line. I assume that kb_2 is the error indicator.
groups = {}
with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.split('\t')
        # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
        k = arr[0] + ',' + arr[1]
        if k not in groups:
            groups[k] = {'record_count': 0, 'error_sum': 0}
        groups[k]['record_count'] += 1
        # kb_2 is assumed to be a numeric error indicator
        groups[k]['error_sum'] += float(arr[2])
# report the error rate per group once the whole file has been read
for k, v in groups.items():
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))
This snippet stores all the groups in a dictionary and calculates the error rate after reading the entire file.
It can still run out of memory if there are too many distinct group combinations.
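If an error is instead defined as in the question (kb_1 != kb_2) rather than kb_2 being a numeric flag, the same streaming idea could be sketched with the standard csv module. The function name error_rates and the inline sample are illustrative assumptions, and the input is assumed to be tab-separated with no header row.

```python
import csv
import io
from collections import defaultdict

def error_rates(lines):
    """Stream tab-separated rows and return {(ka, kb_1): (errors, records, rate)}."""
    counts = defaultdict(lambda: [0, 0])  # [error count, record count] per group
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ka, kb_1, kb_2 = row[0], row[1], row[2]
        entry = counts[(ka, kb_1)]
        entry[0] += kb_1 != kb_2  # a mismatch adds 1, a match adds 0
        entry[1] += 1
    return {key: (err, n, err / n) for key, (err, n) in counts.items()}

# The six sample rows from the question, as tab-separated text.
sample = (
    "3M\t2345\t2345\t2014-10-5\t3000\n"
    "3M\t2958\t2152\t2015-3-22\t5000\n"
    "GE\t2183\t2183\t2012-12-31\t515\n"
    "3M\t2958\t2958\t2015-3-10\t395\n"
    "GE\t2183\t2285\t2015-4-19\t1925\n"
    "GE\t2598\t2598\t2015-3-17\t1915\n"
)
rates = error_rates(io.StringIO(sample))
for (ka, kb_1), (err, n, rate) in sorted(rates.items()):
    print(ka, kb_1, err, "{:.0%}".format(rate), n)
```

For the real file, an open file object (for example open(path, newline="")) would be passed instead of the StringIO sample. Only two integers are kept per group, so memory grows with the number of distinct (ka, kb_1) pairs rather than with the number of lines.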