Load pandas dataframe with chunksize determined by column variable

If I have a CSV file that's too large to load into memory with pandas (in this case 35 GB), I know it's possible to process the file in chunks with chunksize.

However, I want to know whether it's possible to vary the chunk size based on the values in a column.

I have an ID column, with several rows of information for each ID, like this:

ID,   Time,  x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
etc.

I don't want an ID to be split across different chunks. For example, with a chunk size of 4, the chunks would be processed as:

ID,   Time,  x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3 <-- this extra row is included in the chunk of 4 so that cgfb is not split

ID,   Time,  x, y
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
...

Is it possible?

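If not, perhaps I could use the csv library with a for loop along these lines:

import csv

x, curid = 0, None
with open("large.csv", newline="") as file:  # hypothetical file name
    for line in csv.reader(file):
        x += 1
        # stop after 1,000,000 rows, but only once the ID changes
        if x > 1000000 and curid != line[0]:
            break
        curid = line[0]
        # code to append line to a dataframe

Although I know this would only create one chunk, and for loops take a long time to run.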
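
For illustration, the loop idea above could be turned into a generator so that it produces every chunk rather than only the first. This is only a rough sketch, assuming the ID is the first column and that all rows for one ID sit next to each other in the file. The names `iter_id_chunks` and `min_rows` are made up for this example, and every value comes back as a string because the csv module does no type conversion.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

import pandas as pd


def iter_id_chunks(lines, min_rows=4):
    """Yield DataFrames of at least min_rows rows that end on an ID boundary.

    Assumes the first column is the ID and rows sharing an ID are adjacent.
    """
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    batch = []
    non_blank = (row for row in reader if row)  # skip blank lines
    for _, rows in groupby(non_blank, key=itemgetter(0)):
        batch.extend(rows)  # always take a whole ID group, never part of one
        if len(batch) >= min_rows:
            yield pd.DataFrame(batch, columns=header)
            batch = []
    if batch:  # leftover rows after the last full chunk
        yield pd.DataFrame(batch, columns=header)


# Small in-memory sample shaped like the data in the question.
sample = io.StringIO(
    "ID, Time, x, y\n"
    "sasd, 10:12, 1, 3\n"
    "sasd, 10:14, 1, 4\n"
    "sasd, 10:32, 1, 2\n"
    "cgfb, 10:02, 1, 6\n"
    "cgfb, 10:13, 1, 3\n"
    "aenr, 11:54, 2, 5\n"
    "tory, 10:27, 1, 3\n"
    "tory, 10:48, 3, 5\n"
)
chunks = list(iter_id_chunks(sample, min_rows=4))
for chunk in chunks:
    print(chunk["ID"].tolist())
# ['sasd', 'sasd', 'sasd', 'cgfb', 'cgfb']
# ['aenr', 'tory', 'tory']
```

For a real file, something like `open(path, newline="")` could be passed in place of the `io.StringIO` sample. Only one chunk is held in memory at a time.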


asked Feb 14 '17 14:02

Josh Kidd

1 Answer

If you iterate through the CSV file row by row, you can use a generator to yield chunks that depend on the values in any column.

Working example:

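import pandas as pd

def iter_chunk_by_id(file):
    # Read one row at a time; header=None because data.csv has no header row.
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    current_id = first_chunk.iloc[0, 0]
    rows = [first_chunk]
    for row in csv_reader:
        row_id = row.iloc[0, 0]
        if row_id == current_id:
            rows.append(row)
            continue
        # The ID changed: emit the rows collected so far and start a new group.
        # pd.concat is used because DataFrame.append was removed in pandas 2.0.
        yield pd.concat(rows)
        current_id = row_id
        rows = [row]
    yield pd.concat(rows)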

## data.csv ##
# 1, foo, bla
# 1, off, aff
# 2, roo, laa
# 2, jkl, xds
# 3, asd, fds
# 3, qwe, tre
# 3, tre, yxc

chunk_iter = iter_chunk_by_id("data.csv")

for chunk in chunk_iter:
    print(chunk)
    print("_____")

Output:

   0     1     2
0  1   foo   bla
1  1   off   aff
_____
   0     1     2
2  2   roo   laa
3  2   jkl   xds
_____
   0     1     2
4  3   asd   fds
5  3   qwe   tre
6  3   tre   yxc
_____
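
Reading a single row per iteration through pandas is likely to be slow on a file of 35 GB. A possible variation, sketched here under the assumption that all rows for one ID are contiguous in the file, is to read with a normal chunksize and hold back the rows of the last ID in each chunk until the next chunk shows where that ID ends. The names `iter_chunks_by_id`, `id_col` and `carry` are made up for this illustration. Note that the chunks come out a little smaller or larger than chunksize, so the boundaries will not match the example in the question exactly, but no ID is split.

```python
import io

import pandas as pd


def iter_chunks_by_id(source, id_col="ID", chunksize=4, **read_csv_kwargs):
    """Yield DataFrames of roughly chunksize rows without splitting an ID.

    Assumes rows that share an ID are contiguous in the file.
    """
    carry = None  # rows of the last, possibly incomplete, ID group
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        if carry is not None:
            chunk = pd.concat([carry, chunk])
        # The last ID in this chunk may continue in the next one, so hold it back.
        last_id = chunk[id_col].iloc[-1]
        is_last = chunk[id_col] == last_id
        complete, carry = chunk[~is_last], chunk[is_last]
        if not complete.empty:
            yield complete
    if carry is not None and not carry.empty:
        yield carry  # the final ID group ends with the file


# Small in-memory sample using the rows from the question.
sample = io.StringIO(
    "ID,Time,x,y\n"
    "sasd,10:12,1,3\n"
    "sasd,10:14,1,4\n"
    "sasd,10:32,1,2\n"
    "cgfb,10:02,1,6\n"
    "cgfb,10:13,1,3\n"
    "aenr,11:54,2,5\n"
    "tory,10:27,1,3\n"
    "tory,10:48,3,5\n"
)
chunks = list(iter_chunks_by_id(sample, chunksize=4))
for chunk in chunks:
    print(chunk["ID"].tolist())
# ['sasd', 'sasd', 'sasd']
# ['cgfb', 'cgfb', 'aenr']
# ['tory', 'tory']
```

With a realistic chunksize (for example one million rows) only the tail of each chunk is carried over, so memory use should stay close to one chunk unless a single ID has an extremely large number of rows.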


answered Oct 25 '22 16:10

elcombato

Tags: python, pandas, chunks
