 

How to stream in and manipulate a large data file in Python

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography AgeGroup Gender Race Count
County1   1        M      1    12
County1   2        M      1    3
County1   2        M      2    0

To:

Geography Count
County1   15
County2   23

This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?) Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write it out before loading the next 70 lines.

Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.

Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
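To make that concrete, here is roughly what I have in mind with the "standard file methods" approach (assuming the file is comma-delimited, as it would be for read_csv; the file names are placeholders):

import csv
from collections import defaultdict

totals = defaultdict(int)

# Assumed input: 'my_file.csv' with a header row and columns
# Geography, AgeGroup, Gender, Race, Count.
with open('my_file.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:  # one row at a time; the whole file is never in memory
        totals[row['Geography']] += int(row['Count'])

with open('my_output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Geography', 'Count'])
    for geography, count in totals.items():
        writer.writerow([geography, count])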

Asked Jul 05 '16 by HFBrowning



2 Answers

You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:

import dask.dataframe as dd

df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')

Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

import pandas as pd

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
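As a side note on the question's edit: the same chunked pattern works for any reduction that combines cleanly across chunks. A sketch (not part of the original answer) for a per-geography max, using the same assumed file names:

import pandas as pd

# Per-chunk maxima, then a max of the per-chunk maxima.
parts = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    parts.append(chunk.groupby('Geography', as_index=False)['Count'].max())

result = pd.concat(parts, ignore_index=True).groupby('Geography')['Count'].max().to_frame()
result.to_csv('my_max_output.csv')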
Answered by root


I do like @root's solution, but I would go a bit further in optimizing memory usage - keeping only the aggregated DF in memory and reading only those columns that you really need:

import pandas as pd

filename = 'my_file.csv'   # path to the large input file
cols = ['Geography','Count']
df = pd.DataFrame()

chunksize = 2   # adjust it! for example --> 10**5
for chunk in (pd.read_csv(filename,
                          usecols=cols,
                          chunksize=chunksize)
             ):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)

test data:

Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111

output.csv:

Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111

PS: using this approach you can process huge files.

PPS: the chunking approach should work unless you need to sort your data - in that case I would use classic UNIX tools like awk, sort, etc. to sort the data first.

I would also recommend using PyTables (HDF5 storage) instead of CSV files - it is very fast and allows you to read data conditionally (using the where parameter), so it's very handy, saves a lot of resources, and is usually much faster than CSV.
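For illustration, a rough sketch of that HDF5 route using pandas' HDFStore (which is built on PyTables); the file names, chunk size, and min_itemsize are just example values:

import pandas as pd

# One-off conversion: append the CSV to an HDF5 table in chunks.
store = pd.HDFStore('data.h5')
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    store.append('data', chunk,
                 data_columns=['Geography'],          # make Geography queryable via `where`
                 min_itemsize={'Geography': 30})      # reserve room for the longest string
store.close()

# Later: read only the rows/columns you actually need.
df = pd.read_hdf('data.h5', 'data',
                 where="Geography == 'County1'",
                 columns=['Geography', 'Count'])
print(df['Count'].sum())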

Answered by MaxU - stop WAR against UA