 

How to stream in and manipulate a large data file in Python

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography AgeGroup Gender Race Count
County1   1        M      1    12
County1   2        M      1    3
County1   2        M      2    0

To:

Geography Count
County1   15
County2   23

This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?) Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write it out before loading the next 70 lines.

Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.

Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
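To make that concrete, here is roughly what I have in mind with the "standard file methods" approach (assuming the file is comma-delimited, as it would be for read_csv; the file names are placeholders):

import csv
from collections import defaultdict

totals = defaultdict(int)

# Assumed input: 'my_file.csv' with a header row and columns
# Geography, AgeGroup, Gender, Race, Count.
with open('my_file.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:  # one row at a time; the whole file is never in memory
        totals[row['Geography']] += int(row['Count'])

with open('my_output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Geography', 'Count'])
    for geography, count in totals.items():
        writer.writerow([geography, count])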

Asked Jul 05 '16 by HFBrowning



2 Answers

You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:

import dask.dataframe as dd

df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')

Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

import pandas as pd

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
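As a side note on the question's edit: the same chunked pattern works for any reduction that combines cleanly across chunks. A sketch (not part of the original answer) for a per-geography max, using the same assumed file names:

import pandas as pd

# Per-chunk maxima, then a max of the per-chunk maxima.
parts = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    parts.append(chunk.groupby('Geography', as_index=False)['Count'].max())

result = pd.concat(parts, ignore_index=True).groupby('Geography')['Count'].max().to_frame()
result.to_csv('my_max_output.csv')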
Answered by root


I do like @root's solution, but I would go a bit further in optimizing memory usage - keeping only the aggregated DF in memory and reading only those columns that you really need:

import pandas as pd

filename = 'my_file.csv'   # path to the large input file
cols = ['Geography','Count']
df = pd.DataFrame()

chunksize = 2   # adjust it! for example --> 10**5
for chunk in (pd.read_csv(filename,
                          usecols=cols,
                          chunksize=chunksize)
             ):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)

test data:

Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111

output.csv:

Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111

PS: using this approach you can process huge files.

PPS: the chunking approach should work unless you need to sort your data - in that case I would use classic UNIX tools like awk, sort, etc. to sort the data first.

I would also recommend using PyTables (HDF5 storage) instead of CSV files - it is very fast and allows you to read data conditionally (using the where parameter), so it's very handy, saves a lot of resources, and is usually much faster than CSV.
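For illustration, a rough sketch of that HDF5 route using pandas' HDFStore (which is built on PyTables); the file names, chunk size, and min_itemsize are just example values:

import pandas as pd

# One-off conversion: append the CSV to an HDF5 table in chunks.
store = pd.HDFStore('data.h5')
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    store.append('data', chunk,
                 data_columns=['Geography'],          # make Geography queryable via `where`
                 min_itemsize={'Geography': 30})      # reserve room for the longest string
store.close()

# Later: read only the rows/columns you actually need.
df = pd.read_hdf('data.h5', 'data',
                 where="Geography == 'County1'",
                 columns=['Geography', 'Count'])
print(df['Count'].sum())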

Answered by MaxU - stop WAR against UA