pandas read_json in chunks but still getting a memory error

I'm trying to read and process a large JSON file (~16 GB), but I keep getting a memory error even when I read it in small chunks by specifying chunksize=500. My code:

import pandas as pd

def process_chunk(chunk, i, header):
    pk_file = 'data/pk_files/500_chunk_' + str(i) + '.pk'
    get_data_pk(chunk, pk_file)  # load and process some columns and save into a pk file for future processing
    preds = get_preds(pk_file)  # SVM prediction
    chunk['prediction'] = preds  # append result column
    chunk.to_csv('result.csv', header=header, mode='a')

header = True
for i, chunk in enumerate(pd.read_json('filename.json.tsv', lines=True, chunksize=500)):
    print("Processing chunk", i)
    process_chunk(chunk, i, header)
    header = False

The process_chunk function basically takes each chunk, runs the SVM prediction, and appends the result as a new column.

It works when I use a smaller file, and also works if I specify nrows=5000 in the read_json call. For some reason it still seems to require memory for the full file size despite the chunksize parameter.

Any idea? Thanks!

asked Aug 10 '20 by Sandy


People also ask

How can pandas avoid memory errors?

One strategy for solving this kind of problem is to decrease the amount of data, either by reducing the number of rows loaded or by reducing the number of columns loaded from the dataset.
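For instance, a minimal sketch of this strategy with read_csv (the filename and column names here are placeholders for illustration):

import pandas as pd

# Load only the columns you need, with a smaller dtype, and cap the row count.
df = pd.read_csv(
    'big.csv',
    usecols=['id', 'text'],
    dtype={'id': 'int32'},
    nrows=100_000,
)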

What does Read_json do in pandas?

pandas' read_json() function can be used to read a JSON file or string into a DataFrame. It supports several JSON layouts via the orient parameter. JSON (JavaScript Object Notation) is one of the most widely used formats for exchanging data between systems and web applications.
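For example (wrapping the literal string in StringIO keeps this working on newer pandas versions, where passing raw JSON strings to read_json is deprecated):

import io
import pandas as pd

# Default orient='columns': a mapping of column -> {row label -> value}.
df = pd.read_json(io.StringIO('{"a": {"0": 1, "1": 2}, "b": {"0": 3, "1": 4}}'))

# orient='records': a list of row objects, one dict per row.
df = pd.read_json(io.StringIO('[{"a": 1, "b": 3}, {"a": 2, "b": 4}]'), orient='records')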

How do you use Chunksize pandas?

When reading a large dataset, the chunksize parameter splits it into pieces of the given number of rows, so only one piece has to be held in memory at a time.
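A minimal sketch of the pattern (the filename is a placeholder):

import pandas as pd

total = 0
# Only one 500-row chunk is materialized at a time.
for chunk in pd.read_csv('big.csv', chunksize=500):
    total += len(chunk)
print(total)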

What is chunking in pandas?

Technically, the number of rows pandas reads from a file at a time is referred to as the chunksize. If the chunksize is 100, pandas loads 100 rows at a time. The object returned is not a DataFrame but a TextFileReader, which must be iterated to get the data.
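To see this concretely (again with a placeholder filename):

import pandas as pd

reader = pd.read_csv('big.csv', chunksize=100)
print(type(reader))         # a TextFileReader, not a DataFrame
first = reader.get_chunk()  # explicitly pull the next 100-row chunk
print(first.shape)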


1 Answer

I had the same strange problem in one of my projects' virtual envs with pandas v1.1.2. Downgrading pandas to v1.0.5 seems to solve the problem.
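To check which version you're on:

import pandas as pd
print(pd.__version__)  # if this shows 1.1.x, try pinning: pip install pandas==1.0.5

If downgrading isn't an option, a version-independent workaround is to batch the lines yourself and hand each batch to read_json, so pandas never touches the whole 16 GB file at once. A minimal sketch, assuming the file is newline-delimited JSON (iter_json_chunks is a hypothetical helper, not a pandas API; process_chunk is the function from the question):

import io
import itertools
import pandas as pd

def iter_json_chunks(path, chunksize=500):
    # Stream the file; only `chunksize` raw lines are held in memory at once.
    with open(path) as f:
        while True:
            lines = list(itertools.islice(f, chunksize))
            if not lines:
                break
            yield pd.read_json(io.StringIO(''.join(lines)), lines=True)

for i, chunk in enumerate(iter_json_chunks('filename.json.tsv')):
    process_chunk(chunk, i, header=(i == 0))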

answered Oct 11 '22 by Olmo