
Import fixed width data file with no line separator

Tags: import, r, dbf

I have fixed-width data files (.dbf) that don't have line separators. Here is what two lines of such a file look like:

20141101 77h  3.210                                  0    3 20141102 76h  3.090                                  0    3 

The field widths for one line are c(8,4,7,41): date (8), some time measure (4), the data point (7), and some other columns that I can summarize in one "rest" column (41), which makes 60 characters per line. There is no separator after a line; the next line is simply appended to it, so all time steps are written consecutively in one massive line. The file contains only numbers, letters and white space.

With read.fwf('filepath', widths = c(8,4,7,41)) R stops reading after the first line because there is no line separator.

Is there an argument that tells read.fwf() where a new line starts when there is no line separator? Or should I use a different read command?

Thanks in advance.

Ben asked Feb 05 '16 10:02



2 Answers

Maybe not the best idea but this should work:

content <- scan('filepath', what = 'character', sep = '~') # Warning: choose a sep that does not appear in the data, so the whole file is read as one string.
# Split content into 60-character lines (8 + 4 + 7 + 41):
lines <- regmatches(content, gregexpr('.{60}', content))[[1]]
x <- tempfile()
write(lines,x)
data <- read.fwf(x, widths = c(8,4,7,41))
unlink(x)

The idea is to read the whole file, put each occurrence of 60 characters into a single entry, write these entries to a temporary file, read the data from that file with read.fwf, and then delete the temporary file.
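A minimal base R sketch of the same split-into-60-characters idea that skips the temporary file. It is an illustration rather than part of the original answer: the sample file is rebuilt from the two records in the question, and the names path, widths and col_names are assumptions made for the example.

```r
# Field widths from the question; one record is sum(widths) = 60 characters.
widths    <- c(8, 4, 7, 41)
col_names <- c("date", "time", "data_point", "rest")  # assumed column names
rec_len   <- sum(widths)

# Build a stand-in for the real file: two records, no line separator.
rec1 <- paste0("20141101", " 77h", "  3.210", strrep(" ", 34), "0    3 ")
rec2 <- paste0("20141102", " 76h", "  3.090", strrep(" ", 34), "0    3 ")
path <- tempfile(fileext = ".dbf")
cat(rec1, rec2, sep = "", file = path)

# Read the whole file as one string and drop a trailing newline, if any.
content <- readChar(path, nchars = file.size(path))
content <- sub("[\r\n]+$", "", content)

# Cut the string into 60-character records.
starts  <- seq(1, nchar(content), by = rec_len)
records <- substring(content, starts, starts + rec_len - 1)

# Cut each record into fields at the cumulative width boundaries.
field_end   <- cumsum(widths)
field_start <- field_end - widths + 1
fields <- lapply(seq_along(widths), function(i) {
  trimws(substring(records, field_start[i], field_end[i]))
})
d <- data.frame(setNames(fields, col_names), stringsAsFactors = FALSE)

unlink(path)
```

Here d has one row per record and four trimmed character columns. readChar is used instead of scan so that no unused separator character has to be picked; this assumes the file contains single-byte characters only, as the question describes.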

Another approach is doable with regexes and package stringr (still with content resulting from scan above):

library(stringr)
d <- data.frame( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5], stringsAsFactors=FALSE)

which gives:

        V1   V2      V3                                        V4
1 20141101  77h   3.210                                   0    3 
2 20141102  76h   3.090                                   0    3 

str_match_all returns a list, here with one element because the input is a single line, so we extract that element with [[1]].

Each match comes back as 5 columns: the first is the full match and the others are the capture groups, so we subset the matrix to columns 2 to 5 to keep only the 4 columns we need, and wrap it in data.frame to get a data frame at the end.

You can then name the columns with colnames(d) <- c('date','time','data_point','rest').

If you wish to clean up the white space, you can wrap the str_match_all result in trimws (thanks to @jaap for the reminder about this function) like this:

td <- data.frame( trimws( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5] ), stringsAsFactors=FALSE)

Output:

        X1  X2    X3     X4
1 20141101 77h 3.210 0    3
2 20141102 76h 3.090 0    3
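
For comparison, a base R sketch of the same capture-group idea using utils::strcapture, which names and types the columns in one step. This is an assumption-laden illustration, not the answer's method: content is rebuilt here from the two records in the question, and the column names in proto are made up for the example.

```r
# Assumed stand-in for `content`: the two records from the question in one string.
rec1 <- paste0("20141101", " 77h", "  3.210", strrep(" ", 34), "0    3 ")
rec2 <- paste0("20141102", " 76h", "  3.090", strrep(" ", 34), "0    3 ")
content <- paste0(rec1, rec2)

# strcapture matches once per element, so split into 60-character records first.
records <- regmatches(content, gregexpr(".{60}", content))[[1]]

# The prototype gives each capture group a column name and a type.
proto <- data.frame(date = character(), time = character(),
                    data_point = numeric(), rest = character(),
                    stringsAsFactors = FALSE)
d <- utils::strcapture("(.{8})(.{4})(.{7})(.{41})", records, proto)

# Trim the padding that is left in the character columns.
is_chr <- vapply(d, is.character, logical(1))
d[is_chr] <- lapply(d[is_chr], trimws)
```

With this prototype, data_point is converted to numeric while the other columns stay character. strcapture needs R 3.4.0 or later.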
Tensibai answered Oct 24 '22 08:10


A different, and probably less elegant, solution with readLines, substr, trimws, separate (tidyr) and mutate_all (dplyr):

txt <- readLines('filepath')
dfx <- data.frame(V1 = sapply(seq(from=1, to=nchar(txt), by=60),
                              function(x) substr(txt, x, x+59)))

library(dplyr)
library(tidyr)
dfx %>% 
  separate(V1, c(paste0("V",LETTERS[1:5])), c(8,12,19,55)) %>% 
  mutate_all(trimws)

which gives:

        VA  VB    VC VD VE
1 20141101 77h 3.210  0  3
2 20141102 76h 3.090  0  3

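To get different column names, just replace c(paste0("V",LETTERS[1:5])) with a vector of the column names you want.

If you want to convert the columns to their proper classes instead of leaving them as character, you can use funs(ul = type.convert(trimws(.))) inside mutate_all. Note that funs() has been deprecated since dplyr 0.8.0; with current dplyr, mutate(across(everything(), ~ type.convert(trimws(.x), as.is = TRUE))) does the same job.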
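
A base R sketch of the same trim-and-convert step, for readers who prefer not to depend on a particular dplyr version. The input data frame is an assumed stand-in for the separated, untrimmed result, and parsing VA as a date is an extra step added for illustration.

```r
# Assumed input: an all-character data frame with padded values,
# like the output of separate() before trimming.
dfx <- data.frame(VA = c("20141101", "20141102"),
                  VB = c(" 77h", " 76h"),
                  VC = c("  3.210", "  3.090"),
                  VD = c("0", "0"),
                  VE = c("3 ", "3 "),
                  stringsAsFactors = FALSE)

# Trim every column, then let type.convert choose integer, double or character.
dfx[] <- lapply(dfx, function(col) type.convert(trimws(col), as.is = TRUE))

# type.convert reads 20141101 as an integer; dates need explicit parsing.
dfx$VA <- as.Date(as.character(dfx$VA), format = "%Y%m%d")
```

Passing as.is = TRUE keeps text columns such as VB as character instead of turning them into factors.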

Jaap answered Oct 24 '22 08:10