
Reading csv files in chunks with `readr::read_csv_chunked()`

Tags: chunks, r, csv, readr

I want to read large CSV files but run into memory problems. I would therefore like to read them in chunks with read_csv_chunked() from the readr package. My problem is that I do not really understand the callback argument.

This is a minimal example of what I have tried so far (I know I would have to put the desired operations into f(), otherwise there would be no advantage in terms of memory usage, right?):

library(tidyverse)
data(diamonds)
write_csv(diamonds, "diamonds.csv") # to have a csv to read

f <- function(x) {x}
diamonds_chunked <- read_csv_chunked("diamonds.csv", 
                                     callback = DataFrameCallback$new(f),
                                     chunk_size = 10000)

I tried to keep the callback argument close to the example from the official documentation:

# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), 
                 DataFrameCallback$new(f), 
                 chunk_size = 5)

However, I receive the error below, which seems to occur after the first chunk has been read, since I see the progress bar move to 18%.

Error in eval(substitute(expr), envir, enclos) : unused argument (index)

I have already tried putting the manipulations that I want to make inside f(), but I still get the same error.

der_grund asked Apr 28 '17 09:04


2 Answers

I figured out that the function passed to DataFrameCallback$new() always needs a second argument (pos in the example from the documentation). This argument does not have to be used inside the function, so I do not really understand its purpose. But at least it works this way.
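As a sketch of that fix applied to the code from the question: the callback below accepts both x and pos, even though pos is unused. The filter on cut == "Ideal", the function name keep_ideal, and the temporary file path are illustrative assumptions, not part of the original question.

```r
library(readr)

# Write the example data to a temporary CSV so there is something to read.
# Assumes the ggplot2 package is installed (it provides the diamonds data).
csv_path <- tempfile(fileext = ".csv")
write_csv(ggplot2::diamonds, csv_path)

# The callback must accept two arguments: the chunk (x) and the position
# of the chunk's first row (pos). pos is accepted here but not used.
# The filter itself is a hypothetical example of per-chunk work.
keep_ideal <- function(x, pos) {
  x[x$cut == "Ideal", ]
}

# DataFrameCallback row-binds the data frames returned for each chunk.
diamonds_ideal <- read_csv_chunked(
  csv_path,
  callback = DataFrameCallback$new(keep_ideal),
  chunk_size = 10000
)
```

Because the filtering happens inside the callback, only the rows that pass the filter from each chunk are kept for the final result.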

Does anyone know more details about this second argument?

der_grund answered Nov 05 '22 07:11


pos stands for position: it is the index of the first line of the current chunk. readr passes it to the callback together with the chunk, so the function can tell where in the file each chunk starts while it processes the chunk's rows.
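One possible use of pos, shown here as a minimal sketch, is to tag every row with its position in the original file. The function name add_row_number and the column name source_row are made up for this illustration, and the sketch assumes that pos counts the rows handed to the callback, with each chunk starting where the previous one ended.

```r
library(readr)

# Hypothetical callback: record where each row of the chunk sits in the file.
# seq() starts at pos and produces one value per row in the chunk.
add_row_number <- function(x, pos) {
  x$source_row <- seq(from = pos, length.out = nrow(x))
  x
}

# DataFrameCallback combines the per-chunk data frames at the end.
numbered <- read_csv_chunked(
  readr_example("mtcars.csv"),
  DataFrameCallback$new(add_row_number),
  chunk_size = 5
)
```

Without pos, a callback that sees only one chunk at a time would have no way to reconstruct this information.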

Below are the callback class descriptions and examples from the official documentation at https://readr.tidyverse.org/reference/callback.html

ChunkCallback: Callback interface definition; all callback functions should inherit from this class.

SideEffectChunkCallback: Callback function that is used only for side effects; no results are returned.

DataFrameCallback: Callback function that combines each result together at the end.

AccumulateCallback: Callback function that accumulates a single result. Requires the parameter acc to specify the initial value of the accumulator. The parameter acc is NULL by default.

# Print starting line of each chunk
f <- function(x, pos) print(pos)
read_lines_chunked(readr_example("mtcars.csv"), SideEffectChunkCallback$new(f), chunk_size = 5)

# The ListCallback can be used for more flexible output
f <- function(x, pos) x$mpg[x$hp > 100]
read_csv_chunked(readr_example("mtcars.csv"), ListCallback$new(f), chunk_size = 5)
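The AccumulateCallback described above takes a function with a third argument, acc, which holds the running result. A minimal sketch, in which the function name count_powerful and the hp > 100 threshold are illustrative choices:

```r
library(readr)

# With AccumulateCallback the function receives the chunk (x), its starting
# position (pos), and the accumulated value so far (acc). Whatever it
# returns becomes acc for the next chunk.
count_powerful <- function(x, pos, acc) {
  acc + sum(x$hp > 100)
}

# acc = 0 sets the initial value of the accumulator.
n_powerful <- read_csv_chunked(
  readr_example("mtcars.csv"),
  AccumulateCallback$new(count_powerful, acc = 0),
  chunk_size = 5
)
```

This pattern keeps only a single number in memory between chunks, which suits the memory-constrained case from the question.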
苏东远 answered Nov 05 '22 07:11