Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse tab-delimited data (of different formats) into a data.table/data.frame?

I am trying to parse tab-delimited data, which has been saved as a text file with extraneous data. I would like this to be an R data.table/data.frame.

The tab-delimited format is the following:

A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV
...

There are only two types of rows, A and B. A consistently has 5 columns, e.g. for the first row,

1092    -   1093    +   1X

B consistently has two columns:

1093    HRDCPMRFYT

Question: How do you parse a file with "alternating" rows with different formats?

Let's say that this was a text file which was only of this format, alternating rows of A and B, with 5 columns and 2 columns respectively. How do you parse this into an R data.table? My idea how be to create the following format:

1092    -    1093    +    1X    1093    HRDCPMRFYT
1093    +    1094    -    1X    1094    BSZSDFJRVF
1094    +    1095    +    1X    1095    SSTFCLEPVV
... 
like image 218
ShanZhengYang Avatar asked Jan 03 '23 09:01

ShanZhengYang


2 Answers

You can run shell commands using fread. In Win10, you can even run some linux utilities such as sed

Hence, you can simply do

fread("sed '$!N;s/\\n/ /' test.tab")
#      V1 V2   V3 V4      V5         V6
# 1: 1092  - 1093  + 1X 1093 HRDCPMRFYT
# 2: 1093  + 1094  - 1X 1094 BSZSDFJRVF
# 3: 1094  + 1095  + 1X 1095 SSTFCLEPVV

(sed syntax taken from here)


Data

text <- "1092    -   1093    +   1X
1093    HRDCPMRFYT
1093    +   1094    -   1X
1094    BSZSDFJRVF
1094    +   1095    +   1X
1095    SSTFCLEPVV"

# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")
like image 176
David Arenburg Avatar answered Jan 13 '23 14:01

David Arenburg


One way to go is to read in your data with readLines, pull out the bits you want, and pass to read.table to form the dataframe. So if the rows are alternating then:

txt <- 
'1092    -   1093    +   1X
1093    HRDCPMRFYT
1093    +   1094    -   1X
1094    BSZSDFJRVF
1094    +   1095    +   1X
1095    SSTFCLEPVV'


rd <- readLines(textConnection(txt))
data.frame(read.table(text=rd[c(TRUE, FALSE)]), 
           read.table(text=rd[c(FALSE, TRUE)]))

Change textConnection(txt) to your file path


Another way is to read in only once and then post-process

r <- read.table(text=txt, fill=TRUE, stringsAsFactors=FALSE, na.strings = "")
d <- cbind(r[c(TRUE, FALSE),], r[c(FALSE, TRUE),])
d[ colSums(is.na(d)) < nrow(d)]
like image 36
user20650 Avatar answered Jan 13 '23 15:01

user20650