 

Using pandas to analyze an over-20GB data frame: out of memory even when specifying chunksize

I have the following code to analyze a huge dataframe file (22GB, over 2 million rows and 3K columns). I tested the code on a smaller file (head -1000 hugefile.txt) and it ran OK. However, when I ran it on the huge file, it gave me a "segmentation fault" core dump and wrote a core.number binary file.

I did some internet searching and tried low_memory=False, then tried reading the DataFrame with chunksize=1000 and iterator=True and concatenating the chunks with pandas.concat, but this still gave me a memory problem (core dump). It doesn't even get through reading the file before the core dump; I checked by just reading the file and printing some text. Please help, and let me know if there is a way to analyze this huge file.

Version

python version: 3.6.2
numpy version: 1.13.1
pandas version: 0.20.3
OS: Linux/Unix

Script

#!/usr/bin/python
import pandas as pd
import numpy as np

path = "/path/hugefile.txt"
data1 = pd.read_csv(path, sep='\t', low_memory=False,chunksize=1000, iterator=True)
data = pd.concat(data1, ignore_index=True)

#######

i=0
marker_keep = 0
marker_remove = 0
# walk every row; the sample columns come in (genotype, score, coverage)
# triples starting at the 6th column (index 5)
while(i<(data.shape[0])):
    j=5 #starts at 6
    missing = 0
    NoNmiss = 0
    while (j < (data.shape[1]-2)):
        if pd.isnull(data.iloc[i,j]) == True:
            # genotype is NA -> count this sample as missing
            missing = missing +1
            j= j+3
        elif ((data.iloc[i,j+1] >=10) & (((data.iloc[i,j+1])/(data.iloc[i,j+2])) > 0.5)):
            # score >= 10 and score/coverage > 0.5 -> count as non-missing
            NoNmiss = NoNmiss +1
            j=j+3
        else:
            missing = missing +1
            j= j+3
    # keep the marker if at least half of the samples are non-missing
    if (NoNmiss/(missing+NoNmiss)) >= 0.5:
        marker_keep = marker_keep + 1
    else:
        marker_remove = marker_remove +1
    i=i+1


a = str(marker_keep)
b= str(marker_remove)
c = "marker keep: " + a + "; marker remove: " +b
result = open('PyCount_marker_result.txt', 'w')
result.write(c) 
result.close()

Sample dataset:

Index   Group   Number1 Number2 DummyCol    sample1.NA  sample1.NA.score    sample1.NA.coverage sample2.NA  sample2.NA.score    sample2.NA.coverage sample3.NA  sample3.NA.score    sample3.NA.coverage
1   group1  13247   13249   Marker  CC  3   1   NA  0   0   NA  0   0
2   group1  13272   13274   Marker  GG  7   6   GG  3   1   GG  3   1
4   group1  13301   13303   Marker  CC  11  12  CC  5   4   CC  5   3
5   group1  13379   13381   Marker  CC  6   5   CC  5   4   CC  5   3
7   group1  13417   13419   Marker  GG  7   6   GG  4   2   GG  5   4
8   group1  13457   13459   Marker  CC  13  15  CC  9   9   CC  11  13
9   group1  13493   13495   Marker  AA  17  21  AA  11  12  AA  11  13
10  group1  13503   13505   Marker  GG  14  17  GG  9   10  GG  13  15
11  group1  13549   13551   Marker  GG  6   5   GG  4   2   GG  6   5
12  group1  13648   13650   Marker  NA  0   0   NA  0   0   NA  0   0
13  group1  13759   13761   Marker  NA  0   0   NA  0   0   NA  0   0
14  group1  13867   13869   Marker  NA  0   0   NA  0   0   NA  0   0
15  group1  13895   13897   Marker  CC  3   1   NA  0   0   NA  0   0
20  group1  14430   14432   Marker  GG  15  18  NA  0   0   GG  5   3
21  group1  14435   14437   Marker  GG  16  20  GG  3   1   GG  4   2
22  group1  14463   14465   Marker  AT  0   24  AA  3   1   TT  4   6
23  group1  14468   14470   Marker  CC  18  23  CC  3   1   CC  6   5
25  group1  14652   14654   Marker  CC  3   8   NA  0   0   CC  3   1
26  group1  14670   14672   Marker  GG  10  11  NA  0   0   NA  0   0

Error message:

Traceback (most recent call last):
  File "test_script.py", line 8, in <module>
    data = pd.concat(data1, ignore_index=True)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 206, in concat
    copy=copy)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 236, in __init__
    objs = list(objs)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 978, in __next__
    return self.get_chunk()
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1042, in get_chunk
    return self.read(nrows=size)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10885)
  File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
  File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
  File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
/opt/gridengine/default/Federation/spool/execd/kcompute030/job_scripts/5883517: line 10: 29990 Segmentation fault      (core dumped) python3.6 test_script.py
asked Aug 04 '17 by user1687130


1 Answer

You aren't processing your data in chunks at all.

With data1 = pd.read_csv('...', chunksize=1000, iterator=True), data1 becomes a pandas.io.parsers.TextFileReader, an iterator that yields 1000-row chunks of your CSV data as DataFrames.

But then pd.concat consumes this entire iterator, and so attempts to load the whole CSV into memory, defeating the purpose of using chunksize and iterator entirely.

Properly using chunksize and iterator

In order to process your data in chunks, you have to iterate over the actual DataFrame chunks yielded by the iterator that read_csv returns.

data1 = pd.read_csv(path, sep='\t', chunksize=1000, iterator=True)

for chunk in data1:
    # do my processing of the 1000-row DataFrame chunk here
    pass
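
For your file, that per-chunk processing can be the marker counting itself, so only two running totals ever live in memory. Below is a rough sketch of that idea, not a drop-in replacement: it assumes the layout of your sample data (five leading columns, then a (genotype, score, coverage) triple per sample) and reuses the thresholds from your loop (score >= 10, score/coverage > 0.5, keep a marker when at least half of the samples are non-missing).

import numpy as np
import pandas as pd

path = "/path/hugefile.txt"

marker_keep = 0
marker_remove = 0

reader = pd.read_csv(path, sep='\t', chunksize=1000, iterator=True)

for chunk in reader:
    # assumed layout: columns 0-4 are Index/Group/Number1/Number2/DummyCol,
    # followed by repeating (genotype, score, coverage) triples
    geno = chunk.iloc[:, 5::3]
    score = chunk.iloc[:, 6::3].astype(float).values
    coverage = chunk.iloc[:, 7::3].astype(float).values

    # score/coverage, leaving 0 where coverage is 0 so we never divide by zero
    ratio = np.divide(score, coverage, out=np.zeros_like(score), where=coverage > 0)

    # a sample is non-missing when its genotype is present,
    # its score is >= 10 and its score/coverage ratio is > 0.5
    non_missing = geno.notnull().values & (score >= 10) & (ratio > 0.5)

    # keep the marker if at least half of the samples are non-missing
    keep = non_missing.mean(axis=1) >= 0.5
    marker_keep += int(keep.sum())
    marker_remove += int((~keep).sum())

with open('PyCount_marker_result.txt', 'w') as result:
    result.write("marker keep: {0}; marker remove: {1}".format(marker_keep, marker_remove))

Because each chunk is processed and then discarded, at most 1000 rows are in memory at any time, which is the whole point of the chunked read.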

Minimal Example

Suppose we have a CSV bigdata.txt

A1, A2
B1, B2
C1, C2
D1, D2
E1, E2

that we want to process 1 row at a time (for whatever reason).

Incorrect usage of chunksize and iterator

df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)

df = pd.concat(df_iter)
df
##     0    1
## 0  A1   A2
## 1  B1   B2
## 2  C1   C2
## 3  D1   D2
## 4  E1   E2

We can see that we've loaded the entire CSV into memory, despite having a chunksize of 1.

Correct usage

df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)

for iter_num, chunk in enumerate(df_iter, 1):
    print('Processing iteration {0}'.format(iter_num))
    print(chunk)

##  Processing iteration 1
##      0    1
##  0  A1   A2
##  Processing iteration 2
##      0    1
##  1  B1   B2
##  Processing iteration 3
##      0    1
##  2  C1   C2
##  Processing iteration 4
##      0    1
##  3  D1   D2
##  Processing iteration 5
##      0    1
##  4  E1   E2
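
One optional extra: if even a single 1000-row chunk of 3K columns is heavy, read_csv also takes usecols (and dtype), so each chunk only carries the columns you actually need. A small sketch, assuming the column names follow your sample header (note that dropping the five leading columns shifts the sample triples to start at position 0):

import pandas as pd

path = "/path/hugefile.txt"

# read only the header row to discover the column names
header = pd.read_csv(path, sep='\t', nrows=0)

# keep only the per-sample genotype/score/coverage columns;
# the five leading columns are not used by the counting logic
wanted = [c for c in header.columns if '.NA' in c]

df_iter = pd.read_csv(path, sep='\t', usecols=wanted, chunksize=1000)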
answered Nov 15 '22 by miradulo