I can use read.csv or read.csv2 to read data into R. But the issue I encountered is that my separator is a multiple-byte string instead of a single character. How can I deal with this?

How to read a text file into GNU R with a multiple-byte separator?

2 Answers

Providing example data would help. However, you might be able to adapt the following to your needs.

I created an example data file, which is just a text file containing the following:

1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3

I saved it as 'test.csv'. The separator is the string 'sep'. I think read.csv() uses scan(), which only accepts a single character for sep. To get around it, consider the following:
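To see the limitation for yourself, here is a small sketch that passes a multi-character sep to read.table() and captures whatever error comes back. The inline one-line input is just an assumption for illustration, and the exact wording of the message may differ between R versions.

```r
# Try a multi-character separator directly; read.table() is expected to fail
# because scan() only accepts a single-byte 'sep'.
res <- tryCatch(
    read.table(text = "1sep2sep3", sep = "sep"),
    error = function(e) conditionMessage(e)   # keep the error message as a string
)
print(res)
```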

dat <- readLines('test.csv')
dat <- gsub("sep", " ", dat)
dat <- textConnection(dat)
dat <- read.table(dat)

readLines() just reads the lines in. gsub() substitutes a single ' ' (or whatever is convenient for your data) for the multi-character separator string. Then textConnection() and read.table() read everything back in conveniently. For smaller datasets, this should be fine. If you have very large data, consider preprocessing with something like AWK to substitute the multi-character separator string. The above is from http://tolstoy.newcastle.edu.au/R/e4/help/08/04/9296.html .
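If you want to try this without creating test.csv by hand, the following is a minimal self-contained sketch of the same idea. It assumes a temporary file stands in for 'test.csv', and it uses the text argument of read.table() instead of an explicit textConnection(); fixed = TRUE is my addition so that the separator is matched literally rather than as a regular expression.

```r
# Write the seven-line example to a temporary file (stand-in for 'test.csv')
path <- tempfile(fileext = ".csv")
writeLines(rep("1sep2sep3", 7), path)

lines <- readLines(path)
lines <- gsub("sep", " ", lines, fixed = TRUE)  # literal match, not a regex
dat <- read.table(text = lines)                 # text= reads from the character vector

str(dat)      # inspect the resulting data frame
unlink(path)  # remove the temporary file
```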

Update: Regarding your comment, if you have spaces in your data, use a different replacement separator. Consider changing test.csv to:

1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3

Then, with the following function:

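readMulti <- function(x, sep, replace, as.is = TRUE)
{
    dat <- readLines(x)
    # fixed = TRUE treats sep as a literal string, not a regular expression
    dat <- gsub(sep, replace, dat, fixed = TRUE)
    con <- textConnection(dat)
    on.exit(close(con))  # release the connection when the function exits
    dat <- read.table(con, sep = replace, as.is = as.is)

    return(dat)
}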

Try:

readMulti('test.csv', sep = "sep", replace = "\t", as.is = TRUE)

Here, the original separator is replaced with tabs (\t). The as.is argument is passed to read.table() to prevent strings from being read in as factors, but that's your call. If you have more complicated white space within your data, you might find the quote argument in read.table() helpful, or you can pre-process with AWK, Perl, etc.
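As a rough sketch of what "use a different replacement separator" and the quote argument can look like in practice: the sample lines, the "||" separator, and the candidate list below are all made up for illustration. The idea is to pick a single character that does not occur anywhere in the data, and to switch off quote and comment handling so that apostrophes and '#' inside fields are left alone.

```r
# Hypothetical input: fields contain spaces, an apostrophe and a '#'
raw <- c("O'Brien||12||#1 priority",
         "Li Wei||7||none")

# Pick the first candidate character that does not occur anywhere in the data
candidates <- c("\t", ";", "|")
is_free <- vapply(candidates,
                  function(ch) !any(grepl(ch, raw, fixed = TRUE)),
                  logical(1))
replace <- candidates[is_free][1]
stopifnot(!is.na(replace))

clean <- gsub("||", replace, raw, fixed = TRUE)

# quote = "" and comment.char = "" stop read.table() from interpreting
# the apostrophe as a quote and the '#' as the start of a comment
dat <- read.table(text = clean, sep = replace,
                  quote = "", comment.char = "", as.is = TRUE)
print(dat)
```

Whether you want quoting disabled depends on your data; if your fields really are quoted, set quote to the quoting character instead.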

Something similar with crippledlambda's strsplit() is most likely equivalent for moderately sized data. If performance becomes an issue, try both and see which works for you.
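If you want to compare the two approaches on your own machine, something along these lines could serve as a starting point. The generated data, the sizes, and the helper names via_gsub and via_strsplit are assumptions for the sake of the example; timings will vary, so no numbers are shown.

```r
# Build some artificial three-column data with the multi-character separator
set.seed(1)
n <- 10000
lines <- paste(sample(100, n, TRUE), sample(100, n, TRUE),
               sample(100, n, TRUE), sep = "sep")

# Approach 1: substitute the separator, then let read.table() parse
via_gsub <- function(x) {
    read.table(text = gsub("sep", "\t", x, fixed = TRUE), sep = "\t")
}

# Approach 2: split each line and bind the pieces into a matrix
via_strsplit <- function(x) {
    m <- do.call(rbind, strsplit(x, "sep", fixed = TRUE))
    as.data.frame(apply(m, 2, as.integer))
}

print(system.time(a <- via_gsub(lines)))
print(system.time(b <- via_strsplit(lines)))

# Both approaches should produce the same values
print(isTRUE(all.equal(a, b, check.attributes = FALSE)))
```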

answered Oct 09 '22 04:10

jthetzel

In this case you can replace textConnection(example) with your file name, but essentially you can build your own code or function around strsplit. Here I'm assuming you have a header line, but you can of course define a header argument and generalize the creation of your data frame based on the function below:

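read.multisep <- function(File, sep) {
    Lines <- readLines(File)
    ## fixed = TRUE treats sep as a literal string, not a regular expression
    Matrix <- do.call(rbind, strsplit(Lines, sep, fixed = TRUE))
    ## assuming header is present; drop = FALSE keeps a matrix even with one data row
    DataFrame <- structure(data.frame(Matrix[-1, , drop = FALSE]), names = Matrix[1, ])
    ## automatically convert modes; as.is = TRUE keeps strings as character
    DataFrame[] <- lapply(DataFrame, type.convert, as.is = TRUE)
    DataFrame
}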

example <- "a#*&b#*&c
            1#*&2#*&3
            4#*&5#*&6"

read.multisep(textConnection(example),sep="#*&")

  a b c
1 1 2 3
2 4 5 6
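One possible way to generalize this with a header argument is sketched below. The name read.multisep2 and the V1, V2, ... default column names are my own choices, not part of the function above, and the sketch assumes every line has the same number of fields (strsplit() also drops a trailing empty field, so lines ending in the separator would need extra care).

```r
read.multisep2 <- function(File, sep, header = TRUE) {
    Lines <- trimws(readLines(File))     # strip indentation around each line
    Lines <- Lines[nzchar(Lines)]        # drop blank lines
    Matrix <- do.call(rbind, strsplit(Lines, sep, fixed = TRUE))
    if (header) {
        Names <- Matrix[1, ]                         # first row holds the names
        Matrix <- Matrix[-1, , drop = FALSE]
    } else {
        Names <- paste0("V", seq_len(ncol(Matrix)))  # read.table-style defaults
    }
    DataFrame <- as.data.frame(Matrix, stringsAsFactors = FALSE)
    names(DataFrame) <- Names
    DataFrame[] <- lapply(DataFrame, type.convert, as.is = TRUE)
    DataFrame
}

# Without a header line
con <- textConnection("1#*&2#*&3\n4#*&5#*&6")
noheader <- read.multisep2(con, sep = "#*&", header = FALSE)
close(con)

# With a header line
con <- textConnection("a#*&b#*&c\n1#*&2#*&3")
withheader <- read.multisep2(con, sep = "#*&")
close(con)

print(noheader)
print(withheader)
```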

answered Oct 09 '22 03:10

hatmatrix

Tags: r, csv

RobinMin
