I'm trying to read and process a large JSON file (~16 GB), but it keeps raising a memory error even though I read it in small chunks by specifying chunksize=500. My code:
import pandas as pd

def process_chunk(chunk, header, i):
    pk_file = 'data/pk_files/500_chunk_' + str(i) + '.pk'
    get_data_pk(chunk, pk_file)   # load and process some columns and save into a pk file for future processing
    preds = get_preds(pk_file)    # SVM prediction
    chunk['prediction'] = preds   # append result column
    chunk.to_csv('result.csv', header=header, mode='a')

i = 0
header = True
for chunk in pd.read_json('filename.json.tsv', lines=True, chunksize=500):
    print("Processing chunk ", i)
    process_chunk(chunk, header, i)
    i += 1
    header = False
The process_chunk function basically takes each chunk and appends a new column to it.
It works when I use a smaller file, and it also works if I specify nrows=5000 in the read_json call. For some reason it still seems to require memory proportional to the full file size despite the chunksize parameter.
Any idea? Thanks!
One strategy for solving this kind of problem is to decrease the amount of data held in memory, either by reducing the number of rows or by reducing the number of columns in the dataset.
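For example, if only a few columns are needed for the prediction and the output, you can drop everything else as soon as each chunk is read. This is only a rough sketch: 'id_col' and 'text_col' are placeholder column names, not columns from your dataset.

import pandas as pd

wanted = ['id_col', 'text_col']   # placeholders for the columns you actually need

reader = pd.read_json('filename.json.tsv', lines=True, chunksize=500)
for i, chunk in enumerate(reader):
    small = chunk[wanted]                                   # discard unused columns early
    small.to_csv('reduced.csv', header=(i == 0), mode='a', index=False)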
The pandas read_json() function reads a JSON file or string into a DataFrame. It supports several JSON layouts via the orient parameter. JSON (JavaScript Object Notation) is one of the most widely used formats for exchanging data between systems or web applications.
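A quick illustration of the orient and lines parameters (the JSON content here is made up for the example):

import io
import pandas as pd

# A JSON array of records, read with orient='records'
records = io.StringIO('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]')
df = pd.read_json(records, orient='records')

# Line-delimited JSON (one object per line) is read with lines=True instead
nd = io.StringIO('{"a": 1, "b": "x"}\n{"a": 2, "b": "y"}')
df_lines = pd.read_json(nd, lines=True)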
When reading large datasets, the chunksize parameter divides the data into chunks of the given number of rows, so that only one chunk needs to be held in memory at a time. This reduces memory use and can make the code more efficient.
Technically, chunksize is the number of rows pandas reads from the file at a time. If the chunksize is 100, pandas loads 100 rows per iteration. The object returned is not a DataFrame but a reader object (a JsonReader for read_json, or a TextFileReader for read_csv) that has to be iterated to get the data.
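A minimal sketch of that behaviour, using the file from the question:

import pandas as pd

reader = pd.read_json('filename.json.tsv', lines=True, chunksize=100)
print(type(reader))                      # a JsonReader, not a DataFrame

for chunk in reader:
    print(type(chunk), len(chunk))       # each chunk is a DataFrame of up to 100 rows
    break                                # stop after the first chunk for the demo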
I had the same strange problem in one of my project's virtual environments with pandas v1.1.2. Downgrading pandas to v1.0.5 seemed to solve it.