I have the following code to analyze a huge dataframe file (22 GB, over 2 million rows and 3K columns). I tested the code on a smaller dataframe (head -1000 hugefile.txt) and it ran fine. However, when I ran it on the huge dataframe, it gave me a "segmentation fault" core dump and wrote out a core.number binary file.
I did some internet searching and tried low_memory=False, then reading the DataFrame with chunksize=1000, iterator=True and concatenating the chunks with pandas.concat, but this still gave me a memory problem (core dump). It wouldn't even finish reading the file before the core dump; I verified this by just reading the file and printing some text. Please help and let me know if there are solutions that would let me analyze this huge file.
Version
python version: 3.6.2
numpy version: 1.13.1
pandas version: 0.20.3
OS: Linux/Unix
Script
#!/usr/bin/python
import pandas as pd
import numpy as np

path = "/path/hugefile.txt"
data1 = pd.read_csv(path, sep='\t', low_memory=False, chunksize=1000, iterator=True)
data = pd.concat(data1, ignore_index=True)
#######
i = 0
marker_keep = 0
marker_remove = 0
while i < data.shape[0]:
    j = 5  # starts at the 6th column: genotype; score = j+1, coverage = j+2
    missing = 0
    NoNmiss = 0
    while j < (data.shape[1] - 2):
        if pd.isnull(data.iloc[i, j]):
            missing = missing + 1
            j = j + 3
        elif (data.iloc[i, j + 1] >= 10) & ((data.iloc[i, j + 1] / data.iloc[i, j + 2]) > 0.5):
            NoNmiss = NoNmiss + 1
            j = j + 3
        else:
            missing = missing + 1
            j = j + 3
    # keep the marker if at least half of the samples have a usable call
    if (NoNmiss / (missing + NoNmiss)) >= 0.5:
        marker_keep = marker_keep + 1
    else:
        marker_remove = marker_remove + 1
    i = i + 1

a = str(marker_keep)
b = str(marker_remove)
c = "marker keep: " + a + "; marker remove: " + b
result = open('PyCount_marker_result.txt', 'w')
result.write(c)
result.close()
Sample dataset:
Index Group Number1 Number2 DummyCol sample1.NA sample1.NA.score sample1.NA.coverage sample2.NA sample2.NA.score sample2.NA.coverage sample3.NA sample3.NA.score sample3.NA.coverage
1 group1 13247 13249 Marker CC 3 1 NA 0 0 NA 0 0
2 group1 13272 13274 Marker GG 7 6 GG 3 1 GG 3 1
4 group1 13301 13303 Marker CC 11 12 CC 5 4 CC 5 3
5 group1 13379 13381 Marker CC 6 5 CC 5 4 CC 5 3
7 group1 13417 13419 Marker GG 7 6 GG 4 2 GG 5 4
8 group1 13457 13459 Marker CC 13 15 CC 9 9 CC 11 13
9 group1 13493 13495 Marker AA 17 21 AA 11 12 AA 11 13
10 group1 13503 13505 Marker GG 14 17 GG 9 10 GG 13 15
11 group1 13549 13551 Marker GG 6 5 GG 4 2 GG 6 5
12 group1 13648 13650 Marker NA 0 0 NA 0 0 NA 0 0
13 group1 13759 13761 Marker NA 0 0 NA 0 0 NA 0 0
14 group1 13867 13869 Marker NA 0 0 NA 0 0 NA 0 0
15 group1 13895 13897 Marker CC 3 1 NA 0 0 NA 0 0
20 group1 14430 14432 Marker GG 15 18 NA 0 0 GG 5 3
21 group1 14435 14437 Marker GG 16 20 GG 3 1 GG 4 2
22 group1 14463 14465 Marker AT 0 24 AA 3 1 TT 4 6
23 group1 14468 14470 Marker CC 18 23 CC 3 1 CC 6 5
25 group1 14652 14654 Marker CC 3 8 NA 0 0 CC 3 1
26 group1 14670 14672 Marker GG 10 11 NA 0 0 NA 0 0
Error message:
Traceback (most recent call last):
File "test_script.py", line 8, in <module>
data = pd.concat(data1, ignore_index=True)
File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 206, in concat
copy=copy)
File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 236, in __init__
objs = list(objs)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 978, in __next__
return self.get_chunk()
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1042, in get_chunk
return self.read(nrows=size)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10885)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
/opt/gridengine/default/Federation/spool/execd/kcompute030/job_scripts/5883517: line 10: 29990 Segmentation fault (core dumped) python3.6 test_script.py
You aren't processing your data in chunks at all.
With data1 = pd.read_csv('...', chunksize=1000, iterator=True), data1 becomes a pandas.io.parsers.TextFileReader, an iterator that yields chunks of 1000 rows of your CSV data as DataFrames. But pd.concat then consumes this entire iterator, and so attempts to load the whole CSV into memory, defeating the purpose of using chunksize and iterator entirely.
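To make the failure mode concrete: concatenating everything the reader yields builds the same full DataFrame as a single plain read, so it needs at least as much memory and dies the same way. A minimal sketch (the path is just a placeholder):

import pandas as pd

path = "/path/hugefile.txt"  # placeholder path to the 22 GB file

# Chunked read followed by an immediate concat...
reader = pd.read_csv(path, sep='\t', chunksize=1000, iterator=True)
df_chunked = pd.concat(reader, ignore_index=True)

# ...materializes the same full DataFrame as a plain read,
# so the memory footprint is just as large.
df_full = pd.read_csv(path, sep='\t')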
Properly using chunksize and iterator
In order to process your data in chunks, you have to iterate over the actual DataFrame chunks yielded by the iterator that read_csv provides.
data1 = pd.read_csv(path, sep='\t', chunksize=1000, iterator=True)
for chunk in data1:
    # do my processing of DataFrame chunk of 1000 rows here
Suppose we have a CSV bigdata.txt
A1, A2
B1, B2
C1, C2
D1, D2
E1, E2
that we want to process 1 row at a time (for whatever reason).
Incorrect usage of chunksize and iterator
df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)
df = pd.concat(df_iter)
df
## 0 1
## 0 A1 A2
## 1 B1 B2
## 2 C1 C2
## 3 D1 D2
## 4 E1 E2
We can see that we've loaded the entire CSV into memory, despite having a chunksize of 1.
Correct usage
df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)
for iter_num, chunk in enumerate(df_iter, 1):
    print('Processing iteration {0}'.format(iter_num))
    print(chunk)
## Processing iteration 1
## 0 1
## 0 A1 A2
## Processing iteration 2
## 0 1
## 1 B1 B2
## Processing iteration 3
## 0 1
## 2 C1 C2
## Processing iteration 4
## 0 1
## 3 D1 D2
## Processing iteration 5
## 0 1
## 4 E1 E2
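Applied to your script, that means keeping the running counters outside the loop and running the per-row marker test on each chunk as it arrives, so only about 1000 rows are ever in memory at once. A rough sketch of that adaptation (untested on the full 22 GB file; it assumes the same column layout as your sample data, i.e. genotype/score/coverage triplets per sample starting at the 6th column):

import pandas as pd

path = "/path/hugefile.txt"
marker_keep = 0
marker_remove = 0

# Only one 1000-row chunk is held in memory at a time.
for chunk in pd.read_csv(path, sep='\t', chunksize=1000, iterator=True):
    for i in range(chunk.shape[0]):
        j = 5  # genotype column of the first sample; score = j+1, coverage = j+2
        missing = 0
        NoNmiss = 0
        while j < (chunk.shape[1] - 2):
            if pd.isnull(chunk.iloc[i, j]):
                missing += 1
            elif (chunk.iloc[i, j + 1] >= 10) and \
                 (chunk.iloc[i, j + 1] / chunk.iloc[i, j + 2] > 0.5):
                NoNmiss += 1
            else:
                missing += 1
            j += 3
        if NoNmiss / (missing + NoNmiss) >= 0.5:
            marker_keep += 1
        else:
            marker_remove += 1

with open('PyCount_marker_result.txt', 'w') as result:
    result.write("marker keep: {0}; marker remove: {1}".format(marker_keep, marker_remove))

Each chunk is an ordinary DataFrame, so the row-by-row logic is unchanged; the only structural difference is that the keep/remove totals are accumulated across chunks rather than computed on one giant DataFrame.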