Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient alternative to merge() when building dataframe from json files with R?

I have written the following code which works, but is painfully slow once I start executing it over thousands of records:

require("RJSONIO")
people_data <- data.frame(person_id=numeric(0))

json_data <- fromJSON(json_file)
n_people <- length(json_data)
for(person in 1:n_people) {
        person_dataframe <- as.data.frame(t(unlist(json_data[[person]])))
        people_data <- merge(people_data, person_dataframe, all=TRUE)
    }

output_file <- paste("people_data",".csv")
write.csv(people_data, file=output_file)

I am attempting to build a unified data table from a series of json-formated files. The fromJSON() function reads in the data as lists of lists. Each element of the list is a person, which then contains a list of the attributes for that person.

For example:

[[1]]
    person_id
    name
    gender
    hair_color
[[2]]
    person_id
    name
    location
    gender
    height

[[...]]

structure(list(person_id = "Amy123", name = "Amy", gender = "F",
               hair_color = "brown"), 
          .Names = c("person_id", "name", "gender", "hair_color"))

structure(list(person_id = "matt53", name = "Matt", 
               location = structure(c(47231, "IN"), 
                                    .Names = c("zip_code", "state")), 
               gender = "M", height = 172), 
          .Names = c("person_id", "name", "location", "gender", "height"))

The end result of the code above is matrix where the columns are every person-attribute that appears in the structure above, and the rows are the relevant values for each person. As you can see though, some data is missing for some of the people, so I need to ensure those show up as NA and make sure things end up in the right columns. Further, location itself is a vector with two components: state and zip_code, meaning it needs to be flattened to location.state and location.zip_code before it can be merged with another person record; this is what I use unlist() for. I then keep the running master table in people_data.

The above code works, but do you know of a more efficient way to accomplish what I'm trying to do? It appears the merge() is slowing this to a crawl... I have hundreds of files with hundreds of people in each file.

Thanks! Bryan

UPDATE: Based on the feedback below, I tried to build a list of all the people, and then convert it all at one time into a dataframe. I let it run overnight and still didn't finish making the dataframe. There are around 1/2 million people in the list. That codes looks like this:

require("RJSONIO")
require("plyr")
people_data <- data.frame(person_id=numeric(0))
people_list <- list()

json_data <- fromJSON(json_file)
n_people <- length(json_data)
for(person in 1:n_people) {
        people_list[[person]] <- t(unlist(json_data[[person]]))
    }

#PROBLEM CODE, SLOW, 1/2 million records in people_list
people_data <- rbind.fill(lapply(people_list, as.data.frame))

output_file <- paste("people_data",".csv")
write.csv(people_data, file=output_file)
like image 538
Bryan Avatar asked Oct 12 '22 14:10

Bryan


2 Answers

If you don't expect duplicate records to exist, you can use rbind.fill from the plyr package.

like image 73
Daniel Dickison Avatar answered Oct 17 '22 14:10

Daniel Dickison


I assume that you know all names for fields. Here is solution.

Example data

data_list <- list(list(person_id = "Amy123", name = "Amy", gender = "F",hair_color = "brown"), list(person_id = "matt53", name = "Matt",location = list(c("zip_code"=47231, "state"="IN")),gender = "M", height = 172))

Names for example records

nm1 <- names(unlist(data_list[[1]])) nm2 <- names(unlist(data_list[[2]])) nm <- c(nm2,nm1[!nm1 %in% nm2])

Code of program

record_template <- vector(mode="character",length=length(nm))
names(record_template) <- nm
record_template <- "NA"
rec <- function(x) {
              tmp <- record_template
              tmp[names(unlist(x))] <- unlist(x)
              return(tmp)
       }

do.call(rbind,lapply(data_list,rec))

EDIT

You can get names from list using:

nm <- unique(unlist(lapply(data_list,function(x) names(unlist(x))),use.names=F))
like image 45
Wojciech Sobala Avatar answered Oct 17 '22 15:10

Wojciech Sobala