Fill option for fread

Tags: r, data.table

Let's say I have this txt file:

"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2

With read.csv I can read it by setting fill = TRUE, which pads the short rows with NA:

> read.csv("linktofile.txt", fill = TRUE, header = FALSE)
  V1 V2 V3 V4 V5 V6 V7
1 AA  3  3  3  3 NA NA
2 CC ad  2  2  2  2  2
3 ZZ  2 NA NA NA NA NA
4 AA  3  3  3  3 NA NA
5 CC ad  2  2  2  2  2

However, fread returns only a single row:

> library(data.table)

> fread("linktofile.txt")
   V1 V2 V3 V4 V5 V6 V7
1: CC ad  2  2  2  2  2

Can I get the same result with fread?

asked Sep 03 '13 16:09 by nigmastar


2 Answers

Major update

It looks like development plans for fread changed and fread has now gained a fill argument.

Using the same sample data from the end of this answer, here's what I get:

library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
#    V1 V2 V3 V4 V5 V6 V7
# 1: AA  3  3  3  3 NA NA
# 2: CC ad  2  2  2  2  2
# 3: ZZ  2 NA NA NA NA NA
# 4: AA  3  3  3  3 NA NA
# 5: CC ad  2  2  2  2  2

At the time of this update, the fill argument was only available in the development version of "data.table", which can be installed with:

install.packages("data.table", 
                 repos = "https://Rdatatable.github.io/data.table", 
                 type = "source")

Original answer

This doesn't answer your question about fread; that question has already been addressed by @Matt.

It does, however, give you an alternative to consider that should give you good speed improvements over base R's read.csv.

Unlike fread, this approach needs a little help: you have to provide some information about the data you are trying to read.

You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.

library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
           col_types = rep("character", max(count.fields(x, ","))))
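The same count.fields idea can also be used with base R's read.csv. With fill = TRUE, read.csv works out the number of columns from the first five lines only, so a longer row further down a bigger file may not get the columns it needs. A minimal sketch, assuming a comma-separated file with no header (the names n_cols and df are just for illustration):

```r
# Self-contained sample file with the same rows as in the question.
x <- tempfile()
writeLines(c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2',
             '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2'), x)

# Count the fields on every line, not just the first five.
n_cols <- max(count.fields(x, sep = ","))

# Supplying col.names fixes the number of columns up front,
# so short rows are padded with NA out to the widest row.
df <- read.csv(x, header = FALSE, fill = TRUE,
               col.names = paste0("V", seq_len(n_cols)))
dim(df)
```

This is slower than fread or iotools because the file is scanned twice, but it needs no extra packages.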

Sample data

x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")

## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")

answered Sep 17 '22 13:09 by A5C1D2H2I1M1N2O1R2T1


Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Variable-length vectors could then be read into a list column where each cell is itself a vector, but they would not be padded with NA.

Could you add it to the list please? That way you'll get notified when its status changes.

Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.

UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented; they would not be filled into separate columns as read.csv can do.
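To illustrate what a list column looks like, here is a rough sketch that builds one by hand with readLines and strsplit. It is not how fread or sep2 works, only a picture of the resulting shape. The column names id and values are made up for the example, and the naive split assumes that no quoted field contains a comma.

```r
library(data.table)

# Self-contained sample file with irregular rows.
x <- tempfile()
writeLines(c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2'), x)

# Split each line on commas and strip the double quotes.
fields <- strsplit(readLines(x), ",", fixed = TRUE)
fields <- lapply(fields, function(f) gsub('"', "", f, fixed = TRUE))

# The first field becomes a regular column; the remaining fields of
# each row are kept together as one vector in a list column.
dt <- data.table(
  id     = vapply(fields, `[`, character(1), 1L),
  values = lapply(fields, `[`, -1L)
)

# Each cell of 'values' has its own length; nothing is padded with NA.
lengths(dt$values)
```

Each row keeps exactly the fields it had in the file, which is the difference from the padded layout that read.csv(fill = TRUE) produces.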

answered Sep 20 '22 13:09 by Matt Dowle