I have a large text file (475,000,000 lines). I would like to quickly get the number of rows in the file without reading it.
fread from data.table actually comes up with the row count quite rapidly (~10 seconds) before it proceeds to read the whole file:
fread('D:/text_file.txt',select=1,colClasses="character")
Read 7.1% of 472933221 rows #number of rows appears after 10 seconds
Is there a way to extract this row number without reading the whole file afterwards? For the record, reading the whole file takes 36 seconds.
I have tried countLines from R.utils, but it takes 53 seconds. The difference might be that fread has an option to select only one column, whereas countLines reads everything.
R.utils::countLines("D:/text_file.txt") #53 seconds
I have also tried other methods from the Windows command line, such as:
find /v /c "" "D:\text_file.txt" #takes 1 minute 50 seconds
grep "^" D:\text_file.txt | wc -l #takes 2 minutes
These work, but they're not as fast as fread. I'm on Windows.
Option 1: Through a file connection, use count.fields(). It counts the number of fields on each line of the file based on some sep value (which we don't care about here). So if we take the length of that result, we should, in theory, end up with the number of lines (and rows) in the file.
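The count.fields() idea can be sketched in base R as follows. This is only a minimal illustration: count_rows is a hypothetical helper name, the sep default is an assumption, and the call still scans the whole file, so it is not expected to be faster than the other approaches in this thread.

```r
# Hypothetical helper: count the lines of a file via utils::count.fields().
count_rows <- function(path, sep = ",") {
  fields_per_line <- count.fields(
    path,
    sep = sep,
    quote = "",               # treat quote characters as ordinary text
    comment.char = "",        # do not stop at '#' characters
    blank.lines.skip = FALSE  # keep empty lines in the count
  )
  # One element per line, so the length is the number of lines.
  length(fields_per_line)
}

# Small self-contained check on a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a,b,c", "d,e,f", "g,h,i"), tmp)
n <- count_rows(tmp)
```

Turning off quote and comment handling keeps a stray quote or '#' in the data from changing how lines are split.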
Using the len() function: with this method, we read the CSV file using the pandas library and then call len() on the imported data, which returns the number of lines/rows in the CSV file as an int.
For files beyond 100 MB in size, fread() and read_csv() can be expected to be around 5 times faster than read.csv().
@d.b asked me to provide a detailed answer to my own question. As @G. Grothendieck suggested, the answer is to use wc, which is part of Rtools, a collection of resources for building packages for R under Microsoft Windows.
Once installed, make sure C:\Rtools\bin is in your PATH in the Windows environment variables.
Then, wc becomes available to R using system or shell:
shell('wc -l "D:/text_file.txt"', intern = TRUE)
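With intern = TRUE the call returns a character string rather than a number, typically the count followed by the file name. A minimal sketch for turning that into a numeric value is below. parse_wc_count is a hypothetical helper name, and the output format is assumed to be the usual "<count> <file name>" shape.

```r
# Hypothetical helper (not part of R.utils or data.table): extract the count
# from one line of `wc -l` output, assumed to look like "<count> <file name>".
parse_wc_count <- function(wc_output) {
  first_line <- trimws(wc_output[1])
  # as.numeric rather than as.integer, so counts above 2^31 - 1 still fit.
  as.numeric(strsplit(first_line, "[[:space:]]+")[[1]][1])
}

# On Windows with Rtools on the PATH, the call above would feed it like this:
# out <- shell('wc -l "D:/text_file.txt"', intern = TRUE)
# n_rows <- parse_wc_count(out)

# Illustration on a literal string in the same shape as wc's output.
n_rows <- parse_wc_count("472933221 D:/text_file.txt")
```

Note that wc -l counts newline characters, so a final line without a trailing newline is not included in the result.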