
Exceptions to sep = " " when reading table into R? Dealing with whitespace within fields

Tags: import, r

I need to import a table into R that is separated by spaces. Unfortunately, within some of the fields, there are spaces which cause R to separate into a new row. Is there any way of making those fields 'stick together'?

For example, the table looks like this:

V1    V2    V3    V4
Text  More  0.11  (a)kdfs hdfa ag$
Text  More  1.12  a
Text  More  0.21  v
Text  More  1222  (a)sdfs sdfa->g
Text  More  1232  (a)sdfs sdfa->g

But it gets turned into this when R reads it (using read.delim):

V1    V2    V3    V4
Text  More  0.11  (a)kdfs 
hdfa  ag$
Text  More  1.12  a
Text  More  0.21  v
Text  More  1222  (a)sdfs 
sdfa->g
Text  More  1232  (a)sdfs 
sdfa->g

Those fields all have weird characters that aren't all shared with the other columns/rows. However, as seen, the spaces aren't flanked by the same characters.

In the original file, the rows are separated properly. Is there a way to do any of the following?

  1. Stop separating by spaces after the fourth column is created
  2. Have fields starting/ending with certain characters be stuck together as a string/add a non-space character where the spaces are
  3. Generically, allow exceptions to sep

Quite new to R so sorry if this is very naive. Here is what my script looks like up to then:

strs <- readLines("file")
dat <- read.delim(text = strs,
                  skip = 17,
                  col.names = c("V1", "V2", "V3", "V4"),
                  sep = " ", header = FALSE)

Is there anything I can add to either read.delim or readLines or in between those to fix this problem? As there is fluff that needs to be cut out (hence the skip) I can't use read.table (correct me if I'm wrong).

Some of the characters around the spaces are shared, so I would be willing to use a more tedious method to put other characters in place of the spaces in between e.g. 's' and 's'. Would that be possible with gsub if there isn't an easier method?

Thanks so much!

EDIT: Flash of insight, would it be possible to make the fourth column a new table (that's of course not separated by spaces), then replace all spaces in that table with something else? How would I go about 'breaking off' the fourth column/columns after the third column?

questionmark asked Dec 27 '13 03:12


1 Answer

1) Try this:

for(i in 1:3) strs <- sub(" +", ",", strs)
read.csv(text = strs)

The result of the last line is:

    V1   V2      V3               V4
1 Text More    0.11 (a)kdfs hdfa ag$
2 Text More    1.12                a
3 Text More    0.21                v
4 Text More 1222.00  (a)sdfs sdfa->g
5 Text More 1232.00  (a)sdfs sdfa->g
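
The loop works because sub replaces only the first match in each string, so three passes turn the first three runs of spaces into commas and leave the spaces inside the fourth field alone. Here is a self-contained sketch of that idea; it assumes strs holds only the header and data lines, and the sample vector below is retyped from the question rather than read from the real file:

```r
# Sample lines retyped from the question (assumption: in practice
# `strs` would come from readLines("file") with the fluff removed).
strs <- c("V1    V2    V3    V4",
          "Text  More  0.11  (a)kdfs hdfa ag$",
          "Text  More  1.12  a",
          "Text  More  0.21  v",
          "Text  More  1222  (a)sdfs sdfa->g",
          "Text  More  1232  (a)sdfs sdfa->g")

# sub() replaces only the first match per string, so each pass turns
# the next run of spaces into a comma; three passes set three separators.
for (i in 1:3) strs <- sub(" +", ",", strs)

# Keep V4 as character so the embedded spaces are easy to inspect.
dat <- read.csv(text = strs, stringsAsFactors = FALSE)
str(dat)
```

Note that this relies on the fourth field not containing commas. If it might, another replacement separator such as a tab, read back with read.delim, would probably be the safer choice.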

2) Here is a second solution:

strs.comma <- sub("^(\\S+) +(\\S+) +(\\S+) +", "\\1,\\2,\\3,", strs)
read.csv(text = strs.comma)
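
Neither snippet above deals with the 17 lines of fluff mentioned in the question. One way to handle that, as a rough sketch, is to drop those lines from the character vector before applying the regex. The fluff lines and the shortened data below are hypothetical stand-ins for the real file, and the names body and body.comma are made up for illustration:

```r
# Hypothetical stand-in for readLines("file"):
# 17 lines of fluff followed by data rows with no header line.
strs <- c(paste("fluff line", 1:17),
          "Text  More  0.11  (a)kdfs hdfa ag$",
          "Text  More  1.12  a",
          "Text  More  1222  (a)sdfs sdfa->g")

# Drop the fluff first so the regex only touches data rows.
body <- strs[-(1:17)]

# Capture the first three space-free fields and rejoin them with commas;
# everything after the third run of spaces is left untouched.
body.comma <- sub("^(\\S+) +(\\S+) +(\\S+) +", "\\1,\\2,\\3,", body)

# No header line here, so supply the column names explicitly.
dat <- read.csv(text = body.comma, header = FALSE,
                col.names = c("V1", "V2", "V3", "V4"),
                stringsAsFactors = FALSE)
```

Subsetting the vector instead of passing skip keeps the regex away from the fluff lines, which may not have the same four-field shape.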

G. Grothendieck answered Sep 29 '22 12:09