
Reading a huge JSON file in R, issues

Tags: json, r

I am trying to read a very large JSON file in R using the rjson library, with this command:

json_data <- fromJSON(paste(readLines("myfile.json"), collapse=""))

The problem is that I am getting this error message:

Error in paste(readLines("myfile.json"), collapse = "") : 

could not allocate memory (2383 Mb) in C function 'R_AllocStringBuffer'

Can anyone help me with this issue?

asked Apr 17 '15 by Rabe

People also ask

How do I handle large JSON files?

Instead of reading the whole file at once, the 'chunksize' parameter generates a reader that reads a fixed number of lines at a time; depending on the length of your file, a certain number of chunks will be created and pushed into memory one by one. For example, if your file has 100,000 lines and you ... (An R sketch of the same chunked idea follows this section.)

How do I view large JSON files?

With Gigasheet, you can open large JSON files with millions of rows or billions of cells, and work with them just as easily as you'd work with a much smaller file in Excel or Google Sheets. In one spot, the JSON data is loaded, flattened, and ready for analysis.

How large can a JSON file be?

One of the more frequently asked questions about the native JSON data type is what size a JSON document can be. The short answer is that the maximum size is 1GB.
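
The 'chunksize' parameter quoted above comes from Python's pandas; in R the analogous mechanism is jsonlite's stream_in with its pagesize and handler arguments. A minimal sketch, assuming the input is newline-delimited JSON (NDJSON) and using a hypothetical count_rows handler (not from the original page):

library(jsonlite)

# Process an NDJSON file in pages of 10,000 records instead of loading it whole.
# 'count_rows' is a hypothetical per-chunk handler; stream_in calls it once per page.
total <- 0
count_rows <- function(df) {
  total <<- total + nrow(df)   # replace with real per-chunk processing
}
stream_in(file("myfile.json"), handler = count_rows, pagesize = 10000)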


3 Answers

Well, just sharing my experience of reading JSON files. Trying to read JSON files of 52.8MB, 19.7MB, 1.3GB, 93.9MB and 158.5MB cost me 30 minutes and finally made the R session restart on its own. After that I tried to apply parallel computing so that I could watch the progress, but that failed too.

https://github.com/hadley/plyr/issues/265

Then I added the parameter pagesize = 10000, and it worked, more efficiently than ever. We only need to read the file once and can then save the result in RData/Rda/Rds format with saveRDS (see the sketch after the console transcript below).

> suppressPackageStartupMessages(library('BBmisc'))
> suppressAll(library('jsonlite'))
> suppressAll(library('plyr'))
> suppressAll(library('dplyr'))
> suppressAll(library('stringr'))
> suppressAll(library('doParallel'))
> 
> registerDoParallel(cores=16)
> 
> ## https://www.kaggle.com/c/yelp-recsys-2013/forums/t/4465/reading-json-files-with-r-how-to
> ## https://class.coursera.org/dsscapstone-005/forum/thread?thread_id=12
> fnames <- c('business','checkin','review','tip','user')
> jfile <- paste0(getwd(),'/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_',fnames,'.json')
> dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000),.parallel=TRUE)
> dat
list()
> jfile
[1] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json"
[2] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_checkin.json" 
[3] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json"  
[4] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_tip.json"     
[5] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_user.json"    
> dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000),.progress='=')
opening file input connection.
 Imported 61184 records. Simplifying into dataframe...
closing file input connection.
opening file input connection.
 Imported 45166 records. Simplifying into dataframe...
closing file input connection.
opening file input connection.
 Found 470000 records...
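
Once the stream_in calls finish, the result can be cached so the slow JSON parse only happens once. A minimal sketch (the file name yelp_data.rds is illustrative, not from the original answer):

# Cache the parsed list of data frames; later sessions just deserialize it.
saveRDS(dat, 'yelp_data.rds')

# In a new session, skip the JSON parsing entirely:
dat <- readRDS('yelp_data.rds')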
answered Oct 05 '22 by Rγσ ξηg Lιαη Ημ


I got the same problem while working with huge datasets in R. I used the jsonlite package to read the JSON, with the following code:

library(jsonlite)
get_tweets <- stream_in(file("tweets.json"), pagesize = 10000)

Here tweets.json is the file name (including its location), and pagesize sets how many lines are read in each iteration. Hope it helps.

answered Oct 06 '22 by Kanwar_Singh


For some reason the above solutions all caused R to terminate or worse.

This solution worked for me, with the same data set:

library(jsonlite)
file_name <- 'C:/Users/Downloads/yelp_dataset/yelp_dataset~/dataset/business.JSON'
business <- jsonlite::stream_in(textConnection(readLines(file_name, n = 100000)), verbose = FALSE)

It took about 15 minutes.
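
Note that readLines(file_name, n = 100000) only pulls in the first 100,000 lines of the file. If the whole file is needed while still avoiding one giant read, one option (a sketch, not part of the original answer, assuming the file is newline-delimited JSON with a consistent schema) is to read from an open connection in blocks and bind the pieces:

library(jsonlite)

con <- file(file_name, open = "r")
chunks <- list()
repeat {
  lines <- readLines(con, n = 100000)   # next block of NDJSON lines
  if (length(lines) == 0) break         # end of file reached
  chunks[[length(chunks) + 1]] <- stream_in(textConnection(lines), verbose = FALSE)
}
close(con)
business <- do.call(rbind, chunks)      # assumes every block simplifies to the same columns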

answered Oct 05 '22 by user2723494