
Reading aligned column data with fread

Tags: r, data.table

I came across a file like this:

COL1        COL2          COL3
weqw        asrg          qerhqetjw
weweg       ethweth       rqerhwrtjw
rhqerhqerhq qergqer       qerhqew5h
qerh        qergqer       wetjwryerj

I could not load it directly with fread, so I used sed to replace every run of whitespace (\s+) with a comma and then passed the result to fread, which worked. But is there a built-in way to read this kind of data with data.table?
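For reference, the same workaround can be sketched entirely in R, without the external sed step. This is only a minimal sketch: the variable names (raw, csv_lines, dt) are made up for illustration, and it assumes no field contains embedded spaces or commas.

```r
library(data.table)

# Sample lines copied from the file above; in practice they would come
# from something like readLines("path/to/file.txt") (hypothetical path).
raw <- c(
  "COL1        COL2          COL3",
  "weqw        asrg          qerhqetjw",
  "weweg       ethweth       rqerhwrtjw",
  "rhqerhqerhq qergqer       qerhqew5h",
  "qerh        qergqer       wetjwryerj"
)

# Collapse each run of whitespace into a single comma (the sed step).
# trimws() first, so trailing blanks do not create an empty last field.
csv_lines <- gsub("\\s+", ",", trimws(raw))

# fread treats an input string that contains newlines as the data itself.
dt <- fread(paste(csv_lines, collapse = "\n"))
```

This only works because the columns here are separated by at least one space and the values themselves contain none.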

statquant asked Jun 20 '15 15:06


2 Answers

fread does not (yet) have any capabilities for reading fixed-width files.

I, too, often come across files annoyingly stored like this. Feel free to add a feature request on the GitHub page.

It may not apply in your case, but your sed solution would not work on many of the fixed-width files (FWF) I come across, because they have no space between columns; e.g., you'll see strings like 00010 that actually comprise 3 fields.

If that's the case, you'll need a field width dictionary, at which point you have several options:

  1. read.fwf within R
  2. Write a fwf->csv program (I use one I wrote in Python and it's pretty fast; I could share the code if you'd like). This is basically a beefed-up version of your initial approach, so that you never have to deal with the FWF again
  3. Open it in Excel / LibreOffice / etc; there's a native FWF reader that tries (usually poorly) to guess the widths of the columns, which at least does half the work of specifying the column widths for you. Then you can save it as .csv or whatever from there.

I personally stick with the second option most often. read.fwf is not optimized like fread, so it will probably be slow on large files. And if you've got many (say 20+) FWF files to read, the third option is pretty tedious.
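As a rough illustration of the first option, here is a minimal sketch of read.fwf with a width dictionary. The layout (three fields of widths 2, 2 and 1), the column names and the sample records are all assumptions made up for the example; a real file needs its own widths.

```r
# Hypothetical records with no delimiter between fields, e.g. "00010".
# Assumed layout: id = 2 characters, code = 2 characters, flag = 1 character.
records <- c("00010", "12345", "99887")
path <- tempfile(fileext = ".txt")
writeLines(records, path)

field_widths <- c(2, 2, 1)
field_names  <- c("id", "code", "flag")

# read.fwf() ships with base R (package utils). colClasses is passed on to
# read.table(); reading as character keeps leading zeros such as "00".
fwf <- read.fwf(path,
                widths     = field_widths,
                col.names  = field_names,
                colClasses = "character")

unlink(path)
```

The result is a plain data.frame; if a data.table is wanted afterwards, data.table::setDT(fwf) converts it in place.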

But I agree it would be nice to have something like this built in to fread.

MichaelChirico answered Nov 15 '22 10:11


Fixed in current devel (v1.9.5) recently. Please upgrade and test (and report if any issues).

require(data.table) # v1.9.5+
fread("~/Downloads/tmp.txt")
#           COL1    COL2       COL3
# 1:        weqw    asrg  qerhqetjw
# 2:       weweg ethweth rqerhwrtjw
# 3: rhqerhqerhq qergqer  qerhqew5h
# 4:        qerh qergqer wetjwryerj

fread() gained a strip.white argument (default TRUE), among other new arguments. Please check the README on the project page for up-to-date NEWS.

Arun answered Nov 15 '22 10:11