Extract number of rows from fread without reading the whole file

I have a large text file (about 475,000,000 lines). I would like to get the number of rows in the file quickly, without reading the whole file into R.

fread from data.table reports the row count quite quickly (~10 seconds) before it goes on to read the whole file:

fread('D:/text_file.txt', select = 1, colClasses = "character")
Read 7.1% of 472933221 rows #number of rows appears after 10 seconds

Is there a way to extract this row count without reading the whole file afterwards? For the record, reading the whole file takes 36 seconds.

I have tried countLines from R.utils, but it takes 53 seconds. The difference might be that fread can select a single column, whereas countLines reads everything.

R.utils::countLines("D:/text_file.txt") #53 seconds
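
For comparison, counting lines without parsing any columns can be sketched in base R by reading the file in raw chunks and counting newline bytes. This is only an illustrative sketch, not how fread or countLines is implemented; the helper name count_newlines and the chunk size are assumptions made for the example, and no timing is claimed for it.

```r
# Hypothetical helper: count newline bytes by reading the file in raw chunks.
# chunk_size (in bytes) is an arbitrary assumption, not a tuned value.
count_newlines <- function(path, chunk_size = 2^24) {
  con <- file(path, open = "rb")      # binary mode: no encoding or parsing work
  on.exit(close(con))
  newline <- as.raw(10L)              # the LF byte; CRLF files still contain one LF per line
  total <- 0                          # a double, so very large counts cannot overflow
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break    # end of file reached
    total <- total + sum(chunk == newline)
  }
  total
}
```

Calling count_newlines("D:/text_file.txt") would return the number of newline bytes, so a final line with no trailing newline is not counted.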

I have also tried other command-line methods on Windows, such as:

find /v /c "" "D:\text_file.txt" #takes 1 minute 50 seconds
grep "^" D:\text_file.txt | wc -l #takes 2 minutes

These work, but they're not as fast as fread. I'm on Windows.

Pierre Lapointe asked Nov 18 '17 18:11


1 Answer

@d.b asked me to provide a detailed answer to my own question. As @G. Grothendieck suggested, the answer is to use wc, which ships with Rtools, a collection of tools for building R packages on Microsoft Windows.

Once Rtools is installed, make sure C:\Rtools\bin is on your PATH (set under Environment Variables in Windows).

wc then becomes available from R through system or shell:

shell('wc -l "D:/text_file.txt"', intern = TRUE)
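
The call above returns a character string that holds the count followed by the file name, so the number still has to be pulled out. A minimal sketch of that step follows, assuming wc is on the PATH and prints the count as the first whitespace-separated token; parse_wc_count and count_lines_wc are hypothetical helper names introduced here, and system2 is used in place of shell so the same code also runs outside Windows.

```r
# Hypothetical helper: take the first whitespace-separated token of `wc -l` output.
parse_wc_count <- function(wc_output) {
  first_token <- strsplit(trimws(wc_output[1]), "[[:space:]]+")[[1]][1]
  as.numeric(first_token)             # numeric rather than integer, to be safe for huge files
}

# Hypothetical wrapper: assumes a `wc` executable can be found on the PATH.
count_lines_wc <- function(path) {
  out <- system2("wc", args = c("-l", shQuote(path)), stdout = TRUE)
  parse_wc_count(out)
}
```

Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline is reported as one line shorter.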
Pierre Lapointe answered Oct 19 '22 01:10