 

How to efficiently read the first character from each line of a text file?


I'd like to read only the first character from each line of a text file, ignoring the rest.

Here's an example file:

x <- c(
  "Afklgjsdf;bosfu09[45y94hn9igf",
  "Basfgsdbsfgn",
  "Cajvw58723895yubjsdw409t809t80",
  "Djakfl09w50968509",
  "E3434t"
)
writeLines(x, "test.txt")

I can solve the problem by reading everything with readLines and using substring to get the first character:

lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"

This seems inefficient, though. Is there a way to persuade R to read only the first characters, rather than reading whole lines and then discarding most of them?

I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low level file manipulation (maybe with seek).


Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:

set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)    
x2 <- vapply(
  nch, 
  function(nch)
  {
    paste0(
      sample(letters, nch, replace = TRUE), 
      collapse = ""
    )    
  },
  character(1)
)
writeLines(x2, "bigtest.txt")
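
The exact benchmarking code referenced in the answers below isn't reproduced on this page; a minimal harness, assuming the microbenchmark package and timing the readLines() baseline, might look something like this:

library(microbenchmark)

## time the readLines() + substring() baseline on the big test file
microbenchmark(
    readLines_substring = substring(readLines("bigtest.txt"), 1, 1),
    times = 100
)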

Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to come from using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or from treating the file as binary (Martin Morgan's readBin solution).
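
As a quick sanity check (my own sketch, not from the thread), the approaches named above should all agree on the big test file:

a <- substring(readLines("bigtest.txt"), 1, 1)
b <- stringi::stri_sub(stringi::stri_read_lines("bigtest.txt"), 1, 1)
d <- substring(data.table::fread("bigtest.txt", sep = "\n", header = FALSE)[[1]], 1, 1)
identical(a, b) && identical(a, d)   # should be TRUE if the solutions agree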

asked by Richie Cotton, Jan 02 '15


4 Answers

If you have (or allow) access to Unix command-line tools, you can use

scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE) 

Obviously less portable but probably very fast.
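
If portability is a concern, one option (a sketch of my own, not part of the original answer) is to check whether cut is on the PATH and fall back to readLines() otherwise:

## fall back to readLines() + substring() when `cut` is not available
first_chars <- if (nzchar(Sys.which("cut"))) {
    scan(pipe("cut -c 1 test.txt"), what = "", quiet = TRUE)
} else {
    substring(readLines("test.txt"), 1, 1)
}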

Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:

Unit: milliseconds
           expr         min          lq        mean      median          uq
     RC readLines   14.797830   17.083849   19.261917   18.103020   20.007341
      RS read.fwf  125.113935  133.259220  148.122596  138.024203  150.528754
 BB scan pipe cut    6.277267    7.027964    7.686314    7.337207    8.004137
      RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
          RS scan   13.927765   14.752597   16.634288   15.274470   16.992124
answered by Ben Bolker

data.table::fread() seems to beat all of the solutions so far proposed, and has the great virtue of running comparably fast on both Windows and *NIX machines:

library(data.table)
substring(fread("bigtest.txt", sep="\n", header=FALSE)[[1]], 1, 1)

Here are microbenchmark timings on a Linux box (actually a dual-boot laptop, booted up as Ubuntu):

Unit: milliseconds
             expr         min          lq        mean      median          uq        max neval
     RC readLines   15.830318   16.617075   18.294723   17.116666   18.959381   27.54451   100
        JOB fread    5.532777    6.013432    7.225067    6.292191    7.727054   12.79815   100
      RS read.fwf  111.099578  113.803053  118.844635  116.501270  123.987873  141.14975   100
 BB scan pipe cut    6.583634    8.290366    9.925221   10.115399   11.013237   15.63060   100
      RC readChar 1347.017408 1407.878731 1453.580001 1450.693865 1491.764668 1583.92091   100

And here are timings from the same laptop booted up as a Windows machine (with the command-line tool cut supplied by Rtools):

Unit: milliseconds
             expr         min          lq       mean      median          uq        max neval   cld
     RC readLines   26.653266   27.493167   33.13860   28.057552   33.208309   61.72567   100  b 
        JOB fread    4.964205    5.343063    6.71591    5.538246    6.027024   13.54647   100 a  
      RS read.fwf  213.951792  217.749833  229.31050  220.793649  237.400166  287.03953   100   c 
 BB scan pipe cut  180.963117  263.469528  278.04720  276.138088  280.227259  387.87889   100    d 
      RC readChar 1505.263964 1572.132785 1646.88564 1622.410703 1688.809031 2149.10773   100     e
answered by Josh O'Brien


Figure out the file size, read the file in as a single binary blob, find the offsets of the characters of interest (don't count the final '\n' at the end of the file!), and coerce the result to its final form:

f0 <- function() {
    sz <- file.info("bigtest.txt")$size            # file size in bytes
    what <- charToRaw("\n")                        # the newline byte
    x <- readBin("bigtest.txt", raw(), sz)         # whole file as raw bytes
    idx <- which(x == what)                        # positions of the newlines
    ## first characters sit at position 1 and just after each newline
    ## (dropping the final newline at the end of the file)
    rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple = TRUE)
}

The data.table solution (I think the fastest proposed so far; note that header = FALSE is needed so the first line is treated as data rather than as column names):

library(data.table)
f1 <- function()
    substring(fread("bigtest.txt", header=FALSE)[[1]], 1, 1)

and in comparison

> identical(f0(), f1())
[1] TRUE
> library(microbenchmark)
> microbenchmark(f0(), f1())
Unit: milliseconds
 expr      min       lq     mean    median        uq       max neval
 f0() 5.144873 5.515219 5.571327  5.547899  5.623171  5.897335   100
 f1() 9.153364 9.470571 9.994560 10.162012 10.350990 11.047261   100

Still wasteful, since the entire file is read into memory before mostly being discarded.
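
If memory were a concern, the same idea could be applied block by block so that only one chunk of the file is held at a time. Here is a rough sketch of that idea (my own addition, not part of the original answer; the function name f0_chunked and the block size are arbitrary), which should give the same result as f0():

f0_chunked <- function(path = "bigtest.txt", block = 1e6L) {
    con <- file(path, "rb")
    on.exit(close(con))
    nl <- charToRaw("\n")
    firsts <- raw(0)
    at_line_start <- TRUE                       # does the next byte begin a line?
    repeat {
        x <- readBin(con, raw(), block)         # read the next chunk of bytes
        if (length(x) == 0L) break              # end of file
        idx <- which(x == nl)                   # newline positions in this chunk
        starts <- if (at_line_start) c(1L, idx + 1L) else idx + 1L
        starts <- starts[starts <= length(x)]   # a newline may end the chunk
        firsts <- c(firsts, x[starts])
        at_line_start <- length(x) %in% idx     # did the chunk end on a newline?
    }
    rawToChar(firsts, multiple = TRUE)
}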

answered by Martin Morgan


01/04/2015 Edited to bring the better solution to the top.


Update 2: Running scan() on an open connection, instead of opening and closing the file on every iteration, reads the lines in a single call and eliminates the looping. The timing improved quite a bit.

## scan() on open connection 
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)

I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.

## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)

Here are the timings for these two methods.

## timings
library(microbenchmark)

microbenchmark(
    scan = {
        conn <- file("bigtest.txt", "rt")
        substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
        close(conn)
    },
    stringi = {
        stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
    }
)
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval
#    scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646   100
# stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421   100

Original [slower] answer:

You could try read.fwf() (fixed-width file), setting the field width to 1 to capture the first character on each line.

read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"

Not fully tested of course, but it works for the test file and is a handy function for getting substrings without having to handle the rest of each line yourself.


Update 1: read.fwf() is not very efficient, since it calls scan() and read.table() internally. We can skip the middlemen and try scan() directly.

lines <- count.fields("test.txt")   ## length is num of lines in file
skip <- seq_along(lines) - 1        ## set up the 'skip' arg for scan()
read <- function(n) {
    ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet=TRUE)
    substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"

version$platform
# [1] "x86_64-pc-linux-gnu"
answered by Rich Scriven