Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read a text file into GNU R with a multiple-byte separator?

Tags:

r

csv

I can use read.csv or read.csv2 to read data into R. But the issue I encountered is that my separator is a multiple-byte string instead of a single character. How can I deal with this?

like image 366
RobinMin Avatar asked Oct 25 '11 01:10

RobinMin


People also ask

What is the difference between read CSV and read table in R?

Remember that the read. csv() as well as the read. csv2() function are almost identical to the read. table() function, with the sole difference that they have the header and fill arguments set as TRUE by default.


2 Answers

Providing example data would help. However, you might be able to adapt the following to your needs.

I created an example data file, which is a just a text file containing the following:

1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3

I saved it as 'test.csv'. The separation character is the 'sep' string. I think read.csv() uses scan(), which only accepts a single character for sep. To get around it, consider the following:

dat <- readLines('test.csv')
dat <- gsub("sep", " ", dat)
dat <- textConnection(dat)
dat <- read.table(dat)

readLines() just reads the lines in. gsub substitutes the multi-character seperation string for a single ' ', or whatever is convenient for your data. Then textConnection() and read.data() reads everything back in conveniently. For smaller datasets, this should be fine. If you have very large data, consider preprocessing with something like AWK to substitute the multi-character separation string. The above is from http://tolstoy.newcastle.edu.au/R/e4/help/08/04/9296.html .

Update Regarding your comment, if you have spaces in your data, use a different replacement separator. Consider changing test.csv to :

1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3 

Then, with the following function:

readMulti <- function(x, sep, replace, as.is = T)
{
    dat <- readLines(x)
    dat <- gsub(sep, replace, dat)
    dat <- textConnection(dat)
    dat <- read.table(dat, sep = replace, as.is = as.is)

    return(dat)
}

Try:

readMulti('test.csv', sep = "sep", replace = "\t", as.is = T)

Here, you replace the original separator with tabs (\t). The as.is is passed to read.table() to prevent strings being read in is factors, but that's your call. If you have more complicated white space within your data, you might find the quote argument in read.table() helpful, or pre-process with AWK, perl, etc.

Something similar with crippledlambda's strsplit() is most likely equivalent for moderately sized data. If performance becomes an issue, try both and see which works for you.

like image 74
jthetzel Avatar answered Oct 09 '22 04:10

jthetzel


In this case you can replace textConnection(txt) with your file name, but essentially you can build a code or function around strsplit. Here I'm assuming you have a header line, but you can of course give define a header argument and generalize the creation of your data frame based on the function below:

read.multisep <- function(File,sep) {
    Lines <- readLines(File)
    Matrix <- do.call(rbind,strsplit(Lines, sep, fixed = TRUE))
    DataFrame <- structure(data.frame(Matrix[-1,]), names=Matrix[1,]) ## assuming header is present
    DataFrame[] <- lapply(DataFrame, type.convert)                    ## automatically convert modes
    DataFrame
}

example <- "a#*&b#*&c
            1#*&2#*&3
            4#*&5#*&6"

read.multisep(textConnection(example),sep="#*&")

  a b c
1 1 2 3
2 4 5 6
like image 36
hatmatrix Avatar answered Oct 09 '22 03:10

hatmatrix