
Importing a text file into R

Tags:

r

I have a text file with over 100,000 rows that I download weekly from SAP. It is downloaded as pages, and each page contains the same header along with dashed lines. A minimal example with two pages, each containing only two items, is below.

------------------------------------------------------------
|date              |Material          |Description         |
|----------------------------------------------------------|
|10/04/2013        |WM.5597394        |PNEUMATIC           |
|11/07/2013        |GB.D040790        |RING                |
------------------------------------------------------------

------------------------------------------------------------
|date              |Material          |Description         |
|----------------------------------------------------------|
|08/06/2013        |WM.4M01004A05     |TOUCHEUR            |
|08/06/2013        |WM.4M010108-1     |LEVER               |
------------------------------------------------------------

What I would like to do is import this file into R with only one header and no dashed lines. I tried:

read.table("myfile.txt", sep = "|", fill = TRUE)

Many thanks

Ragy Isaac asked Jan 14 '14 13:01


2 Answers

You can pre-process the file as text, then use read.table:

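lines <- readLines("myfile.txt")
# strip dashed rules (runs of two or more hyphens) and pipes; gsub is vectorized
lines <- gsub("[-]{2,}|[|]", "", lines)
# keep one header (line 2), drop empty lines and repeated headers
lines <- c(lines[2], lines[lines != "" & lines != lines[2]])

read.table(text = lines, header = TRUE)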

gives

        date      Material Description
1 10/04/2013    WM.5597394   PNEUMATIC
2 11/07/2013    GB.D040790        RING
3 08/06/2013 WM.4M01004A05    TOUCHEUR
4 08/06/2013 WM.4M010108-1       LEVER
redmode answered Sep 29 '22 12:09


Another readLines approach:

l <- readLines("myfile.txt")

# remove unnecessary lines
l <- grep("^\\|?-+\\|?$|^$", l, value = TRUE, invert = TRUE)

# remove duplicated headers
l2 <- c(l[1], l[-1][l[-1] != l[1]])

# split
lsplit <- strsplit(l2, "\\s*\\|")

# create data frame
dat <- setNames(data.frame(do.call(rbind, lsplit[-1])[ , -1]), lsplit[[1]][-1])


        date      Material Description
1 10/04/2013    WM.5597394   PNEUMATIC
2 11/07/2013    GB.D040790        RING
3 08/06/2013 WM.4M01004A05    TOUCHEUR
4 08/06/2013 WM.4M010108-1       LEVER
Sven Hohenstein answered Sep 29 '22 14:09
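
A possible variant, not taken from either answer above: the first approach relies on whitespace separation, so it would likely misparse a description that contains spaces. The sketch below filters the lines in the same way but lets read.table split on the pipe itself. The sample lines are copied from the question so the block runs on its own; with the real file, `raw` would presumably come from readLines("myfile.txt"). The names `raw`, `keep`, and `dat` are just illustrative.

```r
# Sample lines mirroring the question; in practice: raw <- readLines("myfile.txt")
raw <- c(
  "------------------------------------------------------------",
  "|date              |Material          |Description         |",
  "|----------------------------------------------------------|",
  "|10/04/2013        |WM.5597394        |PNEUMATIC           |",
  "|11/07/2013        |GB.D040790        |RING                |",
  "------------------------------------------------------------",
  "",
  "------------------------------------------------------------",
  "|date              |Material          |Description         |",
  "|----------------------------------------------------------|",
  "|08/06/2013        |WM.4M01004A05     |TOUCHEUR            |",
  "|08/06/2013        |WM.4M010108-1     |LEVER               |",
  "------------------------------------------------------------"
)

# drop blank lines and dashed rules (with or without enclosing pipes)
keep <- raw[!grepl("^\\|?-+\\|?$|^\\s*$", raw)]

# keep the first header, drop its repeats from later pages
keep <- c(keep[1], keep[-1][keep[-1] != keep[1]])

# split on the pipe; quote = "" and comment.char = "" stop apostrophes
# and '#' inside a field from being treated specially
dat <- read.table(text = keep, sep = "|", header = TRUE,
                  strip.white = TRUE, quote = "", comment.char = "",
                  stringsAsFactors = FALSE)

# the leading and trailing pipes yield two empty columns; drop them
dat <- dat[, -c(1, ncol(dat))]
```

All three columns come back as character vectors here; the date column is left unconverted because the question does not say whether the dates are day/month or month/day.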