I'd like to read only the first character from each line of a text file, ignoring the rest.
Here's an example file:
x <- c(
"Afklgjsdf;bosfu09[45y94hn9igf",
"Basfgsdbsfgn",
"Cajvw58723895yubjsdw409t809t80",
"Djakfl09w50968509",
"E3434t"
)
writeLines(x, "test.txt")
I can solve the problem by reading everything with readLines and using substring to get the first character:
lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"
This seems inefficient though. Is there a way to persuade R to read only the first characters, rather than reading everything and discarding the rest? I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low-level file manipulation (maybe with seek).
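For concreteness, here's a sketch of the sort of low-level approach I mean (the function name first_chars_lowlevel is my own, and I haven't tried it beyond small files): read one character, then skip byte-by-byte to the next newline.

```r
## Sketch only: read the first character of each line, then discard bytes
## until the next newline. Works for files without empty lines, but it is
## slow, since it makes one readBin() call per byte.
first_chars_lowlevel <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  nl <- charToRaw("\n")
  out <- character(0)
  repeat {
    ch <- readChar(con, 1L, useBytes = TRUE)
    if (length(ch) == 0L) break            # end of file
    out <- c(out, ch)
    repeat {                               # skip the rest of the line
      b <- readBin(con, raw(), 1L)
      if (length(b) == 0L || b == nl) break
    }
    if (length(b) == 0L) break             # EOF with no trailing newline
  }
  out
}
```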
Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:
set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)
x2 <- vapply(
nch,
function(nch)
{
paste0(
sample(letters, nch, replace = TRUE),
collapse = ""
)
},
character(1)
)
writeLines(x2, "bigtest.txt")
Update: It seems that you can't avoid scanning the whole file. The best speed gains come from using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or from treating the file as binary (Martin Morgan's readBin solution).
If you allow/have access to Unix command-line tools you can use
scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE)
Obviously less portable but probably very fast.
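If cut isn't available, any POSIX awk should do the same job (an untested variant of the above, wrapped in a function of my own naming; the same portability caveat applies):

```r
## Same idea with awk instead of cut (assumes a POSIX awk on the PATH)
first_chars_awk <- function(path) {
  scan(pipe(paste("awk '{ print substr($0, 1, 1) }'", shQuote(path))),
       what = "", quiet = TRUE)
}
```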
Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:
Unit: milliseconds
expr min lq mean median uq
RC readLines 14.797830 17.083849 19.261917 18.103020 20.007341
RS read.fwf 125.113935 133.259220 148.122596 138.024203 150.528754
BB scan pipe cut 6.277267 7.027964 7.686314 7.337207 8.004137
RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
RS scan 13.927765 14.752597 16.634288 15.274470 16.992124
data.table::fread() seems to beat all of the solutions proposed so far, and has the great virtue of running comparably fast on both Windows and *NIX machines:
library(data.table)
substring(fread("bigtest.txt", sep="\n", header=FALSE)[[1]], 1, 1)
Here are microbenchmark timings on a Linux box (actually a dual-boot laptop, booted up as Ubuntu):
Unit: milliseconds
expr min lq mean median uq max neval
RC readLines 15.830318 16.617075 18.294723 17.116666 18.959381 27.54451 100
JOB fread 5.532777 6.013432 7.225067 6.292191 7.727054 12.79815 100
RS read.fwf 111.099578 113.803053 118.844635 116.501270 123.987873 141.14975 100
BB scan pipe cut 6.583634 8.290366 9.925221 10.115399 11.013237 15.63060 100
RC readChar 1347.017408 1407.878731 1453.580001 1450.693865 1491.764668 1583.92091 100
And here are timings from the same laptop booted up as a Windows machine (with the command-line tool cut supplied by Rtools):
Unit: milliseconds
expr min lq mean median uq max neval cld
RC readLines 26.653266 27.493167 33.13860 28.057552 33.208309 61.72567 100 b
JOB fread 4.964205 5.343063 6.71591 5.538246 6.027024 13.54647 100 a
RS read.fwf 213.951792 217.749833 229.31050 220.793649 237.400166 287.03953 100 c
BB scan pipe cut 180.963117 263.469528 278.04720 276.138088 280.227259 387.87889 100 d
RC readChar 1505.263964 1572.132785 1646.88564 1622.410703 1688.809031 2149.10773 100 e
Figure out the file size, read it in as a single binary blob, find the offsets of the characters of interest (don't count the last '\n' at the end of the file!), and coerce to the final form:
f0 <- function() {
    sz <- file.info("bigtest.txt")$size
    what <- charToRaw("\n")
    x <- readBin("bigtest.txt", raw(), sz)
    idx <- which(x == what)
    rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple = TRUE)
}
The data.table solution (I think the fastest so far; note that the first line must be read as part of the data, not as a header!)
library(data.table)
f1 <- function()
substring(fread("bigtest.txt", header=FALSE)[[1]], 1, 1)
and in comparison
> identical(f0(), f1())
[1] TRUE
> library(microbenchmark)
> microbenchmark(f0(), f1())
Unit: milliseconds
expr min lq mean median uq max neval
f0() 5.144873 5.515219 5.571327 5.547899 5.623171 5.897335 100
f1() 9.153364 9.470571 9.994560 10.162012 10.350990 11.047261 100
Still wasteful, since the entire file is read in to memory before mostly being discarded.
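The whole-file read could be avoided by working in fixed-size chunks, keeping only the byte that follows each newline. This is a sketch of my own (the name f0_chunked and the default chunk size are my choices, and it hasn't been benchmarked against the timings above):

```r
## Chunked variant of f0(): peak memory is bounded by the chunk size
## rather than the file size. Assumes the file, like bigtest.txt,
## ends with a newline and has no empty lines.
f0_chunked <- function(path, chunk = 1e6L) {
  con <- file(path, "rb")
  on.exit(close(con))
  nl <- charToRaw("\n")
  out <- raw(0)
  at_line_start <- TRUE                    # does the next chunk begin a line?
  repeat {
    x <- readBin(con, raw(), chunk)
    if (length(x) == 0L) break
    idx <- which(x == nl)
    keep <- idx[idx < length(x)] + 1L      # byte after each newline in-chunk
    if (at_line_start) keep <- c(1L, keep) # chunk starts a fresh line
    out <- c(out, x[keep])
    at_line_start <- length(idx) > 0L && idx[length(idx)] == length(x)
  }
  rawToChar(out, multiple = TRUE)
}
```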
01/04/2015 Edited to bring the better solution to the top.
Update 2: Changing the scan() method to run on an open connection, instead of opening and closing the file on every iteration, allows reading line-by-line and eliminates the looping. The timing improved quite a bit.
## scan() on open connection
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.
## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
Here are the timings for these two methods.
## timings
library(microbenchmark)
microbenchmark(
scan = {
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
},
stringi = {
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646 100
# stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421 100
Original [slower] answer:
You could try read.fwf() (fixed-width file), setting the width to 1 to capture the first character on each line.
read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"
Not fully tested of course, but works for the test file and is a nice function for getting substrings without having to read the entire file.
Update 1: read.fwf() is not very efficient, calling scan() and read.table() internally. We can skip the middle-men and try scan() directly.
lines <- count.fields("test.txt") ## length is num of lines in file
skip <- seq_along(lines) - 1 ## set up the 'skip' arg for scan()
read <- function(n) {
ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet=TRUE)
substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"
version$platform
# [1] "x86_64-pc-linux-gnu"