Is there a more efficient way to replace NULL with NA in a list?

I quite often come across data that is structured something like this:

    employees <- list(
        list(id = 1,
             dept = "IT",
             age = 29,
             sportsteam = "softball"),
        list(id = 2,
             dept = "IT",
             age = 30,
             sportsteam = NULL),
        list(id = 3,
             dept = "IT",
             age = 29,
             sportsteam = "hockey"),
        list(id = 4,
             dept = NULL,
             age = 29,
             sportsteam = "softball"))

In many cases such lists could be tens of millions of items long, so memory use and speed are always a concern.

I would like to turn the list into a dataframe but if I run:

    library(data.table)
    employee.df <- rbindlist(employees)

I get errors because of the NULL values. My normal strategy is to use a function like:

    nullToNA <- function(x) {
        x[sapply(x, is.null)] <- NA
        return(x)
    }

and then:

    employees <- lapply(employees, nullToNA)
    employee.df <- rbindlist(employees)

which returns

       id dept age sportsteam
    1:  1   IT  29   softball
    2:  2   IT  30         NA
    3:  3   IT  29     hockey
    4:  4   NA  29   softball

However, the nullToNA function is very slow when applied to 10 million cases so it'd be good if there was a more efficient approach.

One point that seems to slow the process down is that the is.null function can only be applied to one item at a time (unlike is.na, which can scan a full list in one go).
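One possible way around the element-by-element is.null calls is base R's lengths(), which reports the length of every list element in a single call (it needs R 3.2.0 or later). The sketch below is a variant of nullToNA under stated assumptions, not something from the original post: the name nullToNA2 is hypothetical, and it treats any zero-length element (for example character(0)) as missing, not only NULL. Whether it is fast enough for 10 million records has not been measured here.

```r
# Hypothetical variant of nullToNA that avoids calling is.null per element.
nullToNA2 <- function(x) {
    # lengths() returns the length of every element in one call;
    # NULL has length 0. Assumption: other zero-length values such as
    # character(0) should also become NA.
    x[lengths(x) == 0L] <- NA
    x
}

record <- list(id = 2, dept = "IT", age = 30, sportsteam = NULL)
cleaned <- nullToNA2(record)
```

It would be used the same way as the original helper, e.g. lapply(employees, nullToNA2).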

Any advice on how to do this operation efficiently on a large dataset?

239

asked Apr 04 '14 18:04

Jon M

1 Answer

Many efficiency problems in R are solved by first changing the original data into a form that makes the processes that follow as fast and easy as possible. Usually, this is matrix form.

If you bring all the data together with rbind, your nullToNA function no longer has to search through nested lists, and therefore sapply serves its purpose (looking through a matrix) more efficiently. In theory, this should make the process faster.

Good question, by the way.

    > dat <- do.call(rbind, lapply(employees, rbind))
    > dat
         id dept age sportsteam
    [1,] 1  "IT" 29  "softball"
    [2,] 2  "IT" 30  NULL
    [3,] 3  "IT" 29  "hockey"
    [4,] 4  NULL 29  "softball"

    > nullToNA(dat)
         id dept age sportsteam
    [1,] 1  "IT" 29  "softball"
    [2,] 2  "IT" 30  NA
    [3,] 3  "IT" 29  "hockey"
    [4,] 4  NA   29  "softball"
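Note that the result of nullToNA(dat) is still a matrix of mode list, not a data frame. A minimal sketch of one way to finish the conversion in base R follows; the column-by-column unlist step and the names cleaned, cols and employee.df2 are assumptions added for illustration, and the speed at 10 million rows has not been checked.

```r
employees <- list(
    list(id = 1, dept = "IT", age = 29, sportsteam = "softball"),
    list(id = 2, dept = "IT", age = 30, sportsteam = NULL),
    list(id = 3, dept = "IT", age = 29, sportsteam = "hockey"),
    list(id = 4, dept = NULL, age = 29, sportsteam = "softball"))

# Helper from the question, unchanged.
nullToNA <- function(x) {
    x[sapply(x, is.null)] <- NA
    return(x)
}

# Bind the records into a single list-matrix, then replace NULL cells once.
dat <- do.call(rbind, lapply(employees, rbind))
cleaned <- nullToNA(dat)

# Assumed finishing step: flatten each column of the list-matrix into an
# atomic vector, then assemble the columns into a data frame.
cols <- lapply(seq_len(ncol(cleaned)), function(j) unlist(cleaned[, j]))
names(cols) <- colnames(cleaned)
employee.df2 <- as.data.frame(cols, stringsAsFactors = FALSE)
```

Because every NULL has already been replaced by NA, each column keeps one value per row when it is unlisted, so the columns stay aligned.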

104

answered Sep 20 '22 00:09

Rich Scriven

Tags: performance, list, null, r
