I am amazed by the speed of the fread function in data.table on large data files, but how does it manage to read so fast? What are the basic implementation differences between fread and read.csv?
There are a number of reasons why data.table is fast, but a key one is that, unlike many other tools, it allows you to modify things in your table by reference, so the table is changed in situ rather than the object being recreated with your modifications.
For files beyond 100 MB in size, fread() and read_csv() can be expected to be around five times faster than read.csv().
As mentioned above, fread() is a faster way to read files, particularly large files. A good thing about this function is that it automatically detects column types and separators, though both can also be specified manually. Once the data.table library is installed and loaded, we can use the fread() function to read files.
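For instance, a minimal sketch of both modes (the file name test.csv and the column types here are illustrative assumptions, not from the original answer):

library(data.table)

# Let fread detect the separator and the column types automatically
dt <- fread("test.csv")

# Or state them explicitly (hypothetical types for a three-column file)
dt <- fread("test.csv",
            sep = ",",
            colClasses = c("integer", "character", "numeric"))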
I assume we are comparing to read.csv with all known advice applied, such as setting colClasses, nrows, etc. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character and then attempts to coerce that to integer or numeric as a second step.

So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ...
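To make that comparison concrete, here is a minimal benchmark sketch (test.csv, the column classes, and the row count are illustrative assumptions; adjust them to your own file):

library(data.table)

# Naive read.csv: reads everything as character, then coerces
system.time(df1 <- read.csv("test.csv"))

# Tuned read.csv: column types and row count supplied up front
system.time(df2 <- read.csv("test.csv",
                            colClasses = c("integer", "integer", "numeric",
                                           "character", "numeric", "integer"),
                            nrows = 10000000))

# fread with defaults
system.time(dt <- fread("test.csv"))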
They are both written in C, so it's not that. There isn't one reason in particular, but essentially, fread memory-maps the file and then iterates through it using pointers, whereas read.csv reads the file into a buffer via a connection.
If you run fread with verbose=TRUE, it will tell you how it works and report the time spent in each of the steps. For example, notice that it skips straight to the middle and the end of the file to make a much better guess of the column types (although in this case the top 5 were enough).
> fread("test.csv",verbose=TRUE) Input contains no \n. Taking this to be a filename to open File opened, filesize is 0.486 GB File is opened and mapped ok Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=',' Found 6 columns First row with 6 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 10000001 Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows Type codes ( first 5 rows): 113431 Type codes (+ middle 5 rows): 113431 Type codes (+ last 5 rows): 113431 Type codes: 113431 (after applying colClasses and integer64) Type codes: 113431 (after applying drop or select (if supplied) Allocating 6 column slots (6 - 0 dropped) Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44 13.420s ( 31%) Memory map (rerun may be quicker) 0.000s ( 0%) sep and header detection 3.210s ( 7%) Count rows (wc -l) 0.000s ( 0%) Column type detection (first, middle and last 5 rows) 1.310s ( 3%) Allocation of 10000000x6 result (xMB) in RAM 25.580s ( 59%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.040s ( 0%) Changing na.strings to NA 43.560s Total
NB: these timings are from my very slow netbook with no SSD. Both the absolute and relative times of each step will vary widely from machine to machine. For example, if you rerun fread a second time, you may notice the time to mmap is much less because your OS has cached the file from the previous run.
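A quick way to see that caching effect (a sketch; test.csv is again a placeholder file name):

library(data.table)

system.time(fread("test.csv"))  # first run pays the full memory-map cost
system.time(fread("test.csv"))  # rerun is often faster: the OS has the file pages cached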
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000   # i.e. my slow netbook
BogoMIPS:              1995.01
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1