I have a text file containing over 100,000 rows that I download weekly from SAP. It is downloaded as pages, and each page contains the same header along with dashed lines. A minimal example with two pages, each containing only two items, is below.
------------------------------------------------------------
|date |Material |Description |
|----------------------------------------------------------|
|10/04/2013 |WM.5597394 |PNEUMATIC |
|11/07/2013 |GB.D040790 |RING |
------------------------------------------------------------
------------------------------------------------------------
|date |Material |Description |
|----------------------------------------------------------|
|08/06/2013 |WM.4M01004A05 |TOUCHEUR |
|08/06/2013 |WM.4M010108-1 |LEVER |
------------------------------------------------------------
What I would like to do is import this file into R with only one header and no dashed lines. I tried:
read.table( "myfile.txt", sep = "|", fill=TRUE)
Many thanks
You can pre-process the file as text, then use read.table:
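lines <- readLines("myfile.txt")
# strip the dashed rules and the "|" separators (gsub is already vectorized)
lines <- gsub("[-]{2,}|[|]", "", lines)
# keep the first header, then every non-empty line that is not a repeated header
lines <- c(lines[2], lines[lines != "" & lines != lines[2]])
# columns are split on whitespace, so this assumes no field contains spaces
read.table(text = lines, header = TRUE)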
gives
date Material Description
1 10/04/2013 WM.5597394 PNEUMATIC
2 11/07/2013 GB.D040790 RING
3 08/06/2013 WM.4M01004A05 TOUCHEUR
4 08/06/2013 WM.4M010108-1 LEVER
Another readLines approach:
l <- readLines("myfile.txt")
# remove unnecessary lines
l <- grep("^\\|?-+\\|?$|^$", l, value = TRUE, invert = TRUE)
# remove duplicated headers
l2 <- c(l[1], l[-1][l[-1] != l[1]])
# split
lsplit <- strsplit(l2, "\\s*\\|")
# create data frame
dat <- setNames(data.frame(do.call(rbind, lsplit[-1])[ , -1]), lsplit[[1]][-1])
dat
date Material Description
1 10/04/2013 WM.5597394 PNEUMATIC
2 11/07/2013 GB.D040790 RING
3 08/06/2013 WM.4M01004A05 TOUCHEUR
4 08/06/2013 WM.4M010108-1 LEVER
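A further variant, offered only as a minimal sketch: if a description can contain spaces (an assumption, since the sample shows single-word descriptions only), splitting on whitespace would misalign the columns. The version below drops the rule lines and repeated headers first, then lets read.table split on "|". The function name read_sap_pages is hypothetical, and every column is kept as character so nothing is converted silently.

```r
# Hypothetical helper; the name and the single `path` argument are illustrative.
read_sap_pages <- function(path) {
  l <- readLines(path)
  # keep only lines containing something other than dashes, pipes, and spaces;
  # this drops the dashed rules and any blank lines
  l <- l[grepl("[^-| ]", l)]
  # keep the first header and drop its repeats from later pages
  l <- c(l[1], l[-1][l[-1] != l[1]])
  # split on "|" so that spaces inside a field are preserved
  dat <- read.table(text = l, sep = "|", header = TRUE, strip.white = TRUE,
                    quote = "", comment.char = "", colClasses = "character")
  # the leading and trailing "|" produce an empty first and last column
  dat[, -c(1, ncol(dat)), drop = FALSE]
}
```

Usage would be dat <- read_sap_pages("myfile.txt"). Dates stay as text here; converting them with as.Date needs the correct format string, which the sample alone does not settle.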