 

How to Sample a specific proportion of lines from a big file in R?

I have a huge file of coordinates, about 125 million lines. I want to sample these lines to obtain, say, 1% of all the lines so that I can plot them. Is there a way to do this in R? The file is very simple: it has only 3 columns, and I am only interested in the first two. A sample of the file would be as follows:

1211 2234
1233 2348
.
.
.

Any help / pointer is highly appreciated.

asked Sep 09 '13 by Sam


2 Answers

If you want to select a fixed sample size, and you do not know ahead of time how many rows the file has, then here is some sample code that produces a simple random sample of the data (via reservoir sampling) without storing the whole dataset in memory:

n <- 1000
con <- file("jan08.csv", open = "r")
head <- readLines(con, 1)        # keep the header line aside
sampdat <- readLines(con, n)     # fill the reservoir with the first n rows
k <- n                           # number of rows seen so far
while (length(curline <- readLines(con, 1))) {
    k <- k + 1
    if (runif(1) < n/k) {
        # with probability n/k, replace a random reservoir element
        sampdat[sample(n, 1)] <- curline
    }
}
close(con)
delaysamp <- read.csv(textConnection(c(head, sampdat)))

If you are working with the large dataset more than once, then it may be better to read the data into a database and sample from there.
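A minimal sketch of the database route, assuming the DBI and RSQLite packages are available. The file name "coords.txt" and the small fabricated file are stand-ins for the real 125-million-line file so the sketch runs end to end:

```r
library(DBI)

# Fabricate a small stand-in for the real coordinates file.
writeLines(paste(sample(1e4, 1000, TRUE), sample(1e4, 1000, TRUE), 0), "coords.txt")

coords <- read.table("coords.txt", col.names = c("x", "y", "z"))

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "coords", coords)

# Let SQLite do the sampling: keep each row with roughly 1% probability.
samp <- dbGetQuery(con, "SELECT x, y FROM coords WHERE abs(random() % 100) = 0")
dbDisconnect(con)
```

For a file this large you would load the table once (in chunks, or with the sqlite3 command-line tool) and then rerun only the sampling query each time you need a new plot.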

The ff package is another option: it stores a large data object in a file while still letting you grab parts of it from within R in a simple manner.
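A hedged sketch of the ff route, assuming the ff package is installed; "coords.csv" is a hypothetical file, fabricated here so the sketch runs:

```r
library(ff)

# Fabricate a small stand-in for the real file.
write.csv(data.frame(x = 1:1000, y = 1001:2000, z = 0), "coords.csv", row.names = FALSE)

ffd <- read.csv.ffdf(file = "coords.csv")    # columns are stored on disk, not in RAM
idx <- sample(nrow(ffd), nrow(ffd) %/% 100)  # row numbers for a ~1% sample
samp <- ffd[idx, 1:2]                        # only the sampled rows are pulled into RAM
```

Subsetting the ffdf by row index returns an ordinary data frame, so only the 1% sample ever occupies memory.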

answered Sep 22 '22 by Greg Snow


The LaF package and its sample_lines function are another option for reading a sample straight from the file:

library(LaF)
datafile <- "file.txt"            # file in the working directory
n <- determine_nlines(datafile)   # count the lines first
sample_lines(datafile, round(n / 100))  # this gives 1% of the lines

More about sample_lines: https://rdrr.io/cran/LaF/man/sample_lines.html

answered Sep 25 '22 by vtenhunen