I'm trying to read and process a large JSON file (~16 GB), but it keeps raising a memory error even though I read it in small chunks by specifying chunksize=500. My code:
import pandas as pd

def process_chunk(chunk, header, i):
    pk_file = 'data/pk_files/500_chunk_' + str(i) + '.pk'
    get_data_pk(chunk, pk_file)   # load and process some columns and save into a pk file for future processing
    preds = get_preds(pk_file)    # SVM prediction
    chunk['prediction'] = preds   # append result column
    chunk.to_csv('result.csv', header=header, mode='a')

i = 0
header = True
for chunk in pd.read_json('filename.json.tsv', lines=True, chunksize=500):
    print("Processing chunk ", i)
    process_chunk(chunk, header, i)
    i += 1
    header = False
The process_chunk function basically takes each chunk and appends a new column to it.
It works when I use a smaller file, and it also works if I specify nrows=5000 in the read_json call. For some reason it still seems to require memory proportional to the full file size despite the chunksize parameter.
Any idea? Thanks!
One strategy for solving this kind of problem is to decrease the amount of data held in memory, either by reducing the number of rows or by reducing the number of columns in the dataset.
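For example, if only a few columns are needed for the prediction and the output, you can drop everything else as soon as each chunk is read. This is only a rough sketch: 'id_col' and 'text_col' are placeholder column names, not columns from your dataset.

import pandas as pd

wanted = ['id_col', 'text_col']   # placeholders for the columns you actually need

reader = pd.read_json('filename.json.tsv', lines=True, chunksize=500)
for i, chunk in enumerate(reader):
    small = chunk[wanted]                                   # discard unused columns early
    small.to_csv('reduced.csv', header=(i == 0), mode='a', index=False)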
The pandas read_json() function reads a JSON file or string into a DataFrame. It supports several JSON layouts via the orient parameter. JSON (JavaScript Object Notation) is one of the most widely used formats for exchanging data between systems or web applications.
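A quick illustration of the orient and lines parameters (the JSON content here is made up for the example):

import io
import pandas as pd

# A JSON array of records, read with orient='records'
records = io.StringIO('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]')
df = pd.read_json(records, orient='records')

# Line-delimited JSON (one object per line) is read with lines=True instead
nd = io.StringIO('{"a": 1, "b": "x"}\n{"a": 2, "b": "y"}')
df_lines = pd.read_json(nd, lines=True)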
When reading large datasets, the chunksize parameter divides the data into chunks of the given number of rows, so that only one chunk needs to be held in memory at a time. This reduces memory use and can make the code more efficient.
Technically, chunksize is the number of rows pandas reads from the file at a time. If the chunksize is 100, pandas loads 100 rows per iteration. The object returned is not a DataFrame but a reader object (a JsonReader for read_json, or a TextFileReader for read_csv) that has to be iterated to get the data.
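A minimal sketch of that behaviour, using the file from the question:

import pandas as pd

reader = pd.read_json('filename.json.tsv', lines=True, chunksize=100)
print(type(reader))                      # a JsonReader, not a DataFrame

for chunk in reader:
    print(type(chunk), len(chunk))       # each chunk is a DataFrame of up to 100 rows
    break                                # stop after the first chunk for the demo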
I had the same strange problem in one of my project's virtual environments with pandas v1.1.2. Downgrading pandas to v1.0.5 seemed to solve it.