I want to read larger CSV files but run into memory problems, so I would like to try reading them in chunks with read_csv_chunked() from the readr package. My problem is that I do not really understand the callback argument.
This is a minimal example of what I have tried so far (I know I would have to put the desired operations into f(), otherwise there would be no advantage in terms of memory usage, right?):
library(tidyverse)
data(diamonds)
write_csv(diamonds, "diamonds.csv") # to have a csv to read
f <- function(x) {x}
diamonds_chunked <- read_csv_chunked("diamonds.csv",
                                     callback = DataFrameCallback$new(f),
                                     chunk_size = 10000)
I tried to keep the callback argument close to the example from the official documentation:
# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"),
                 DataFrameCallback$new(f),
                 chunk_size = 5)
However, I receive the error below which seems to appear after the first chunk has been read since I see the progress bar moving to 18%.
Error in eval(substitute(expr), envir, enclos) : unused argument (index)
I already tried to include the manipulations that I want to make inside f(), but I still got the same error.
I figured out that the function passed to DataFrameCallback$new() always needs to have one additional argument (pos in the example from the documentation). This argument does not have to be used, so I do not really understand its purpose. But at least it works this way.
Does anyone know more details about this second argument?
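pos means position: it is the row number of the first row in each chunk. readr calls the callback once per chunk with two arguments, the chunk itself and this position, which is why a function that accepts only one argument fails with an "unused argument" error. Inside the callback you can process every row of the chunk.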
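As a minimal sketch of the point above, the callback below accepts both arguments and uses pos to record where each chunk started. The temporary file, the added column name first_row, and the chunk size are assumptions chosen for illustration, not part of the original question.

```r
library(readr)

# Write a small CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
write_csv(mtcars, path)

# The callback takes the chunk (x) and the position of its first row (pos)
f <- function(x, pos) {
  x$first_row <- pos    # hypothetical column: remember where the chunk started
  x[x$gear == 3, ]      # keep only the rows of interest from this chunk
}

# DataFrameCallback row-binds the filtered chunks into one data frame
result <- read_csv_chunked(path, DataFrameCallback$new(f), chunk_size = 10)
```

Because the filtering happens inside f(), only the matching rows of each chunk are kept, which is where the memory saving comes from.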
Below are the class descriptions and examples from the official documentation at https://readr.tidyverse.org/reference/callback.html
ChunkCallback: Callback interface definition, all callback functions should inherit from this class.
SideEffectChunkCallback: Callback function that is used only for side effects, no results are returned.
DataFrameCallback: Callback function that combines each result together at the end.
AccumulateCallback: Callback function that accumulates a single result. Requires the parameter acc to specify the initial value of the accumulator. The parameter acc is NULL by default.
# Print starting line of each chunk
f <- function(x, pos) print(pos)
read_lines_chunked(readr_example("mtcars.csv"), SideEffectChunkCallback$new(f), chunk_size = 5)
# The ListCallback can be used for more flexible output
f <- function(x, pos) x$mpg[x$hp > 100]
read_csv_chunked(readr_example("mtcars.csv"), ListCallback$new(f), chunk_size = 5)
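The examples quoted here do not show AccumulateCallback, so the following is a small sketch of how it might be used to keep a single running result instead of any rows. The function name count_gear3, the temporary file, and the chunk size are assumptions for illustration; note that the callback takes a third argument, acc, holding the value accumulated so far.

```r
library(readr)

# Write a small CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
write_csv(mtcars, path)

# acc carries the running total from one chunk to the next
count_gear3 <- function(x, pos, acc) acc + sum(x$gear == 3)

# acc = 0 sets the initial value of the accumulator
n <- read_csv_chunked(path,
                      AccumulateCallback$new(count_gear3, acc = 0),
                      chunk_size = 10)
```

This pattern suits summaries such as counts or sums, since only one value is held in memory between chunks.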