I have a CSV file of about 1 GB, and because my laptop has a basic configuration, I can't open the file in Excel or R. Out of curiosity, I would still like to get the number of rows in the file. How can I do this, if it is possible at all?
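In Python, call list() on a csv.reader and take len() of the result to count the rows in a CSV file. Note that list() holds every row in memory at once, which may be too much for a 1 GB file on a small machine.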
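Because list() materializes every row in memory, a streaming variant may suit a large file better. The following is a minimal sketch, not part of the original answer: the function name count_csv_rows and the has_header parameter are hypothetical, and it assumes the file is UTF-8 encoded. It iterates over csv.reader one record at a time, so memory use should stay roughly constant.

```python
import csv


def count_csv_rows(path, has_header=True, encoding="utf-8"):
    """Count data records in a CSV file without loading it into memory.

    count_csv_rows and has_header are illustrative names, not a library API.
    """
    # newline="" lets the csv module handle line endings itself, so a quoted
    # field that contains an embedded newline is still counted as one record.
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # discard the header row, if there is one
        # csv.reader yields an empty list for a blank line; skip those.
        # Only one row is held in memory at a time.
        return sum(1 for row in reader if row)
```

Calling count_csv_rows("data.csv") (a hypothetical file name) returns the number of data records. Unlike a plain line count, this treats a quoted multi-line field as a single row.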
If you need a quick way to count rows that contain data, select all the cells in the first column of that data (it may not be column A). Just click the column header. The status bar, in the lower-right corner of your Excel window, will tell you the row count.
In pandas, use the len() function to count the rows and columns of a DataFrame. len(df) returns the number of rows, and len(df.columns) returns the number of columns.
To count the number of records (or rows) across several CSV files, wc can be used in conjunction with pipes. For example, given five CSV files, the requirement might be to find the total number of records in all five. This can be achieved by piping the output of the cat command to wc -l. Note that each file's header line, if present, is included in the total.
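For systems without cat and wc, the same total can be sketched in Python. This is a swapped-in technique, not the shell pipeline described above, and the function name count_lines is hypothetical. It counts newline bytes in fixed-size chunks, which mirrors what wc -l reports: header lines are included, and a final line without a trailing newline is not counted.

```python
def count_lines(paths, chunk_size=1024 * 1024):
    """Return the total number of newline characters across the given files.

    count_lines is an illustrative helper, roughly equivalent to
    `cat file1 file2 ... | wc -l`.
    """
    total = 0
    for path in paths:
        # Binary mode avoids decoding costs and encoding errors.
        with open(path, "rb") as handle:
            # Read fixed-size chunks so memory use stays bounded.
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += chunk.count(b"\n")
    return total
```

For example, count_lines(["a.csv", "b.csv"]) (hypothetical file names) gives the combined line count. Subtract one per file if each has a header row.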
For Linux/Unix:
wc -l filename
For Windows:
find /c /v "A String that is extremely unlikely to occur" filename
Option 1:
Through a file connection, count.fields() counts the number of fields per line of the file based on some sep value (that we don't care about here). So if we take the length of that result, theoretically we should end up with the number of lines (and rows) in the file.
length(count.fields(filename))
If you have a header row, you can skip it with skip = 1:
length(count.fields(filename, skip = 1))
There are other arguments that you can adjust for your specific needs, like skipping blank lines.
args(count.fields)
# function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE,
#     comment.char = "#")
# NULL
See help(count.fields) for more.
It's not too bad as far as speed goes. I tested it on one of my baseball files that contains 99846 rows.
nrow(data.table::fread("Batting.csv"))
# [1] 99846
system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
#    user  system elapsed
#   0.528   0.000   0.503
l
# [1] 99846
file.info("Batting.csv")$size
# [1] 6153740
(The more efficient) Option 2: Another idea is to use data.table::fread() to read the first column only, then take the number of rows. This would be very fast.
system.time(nrow(data.table::fread("Batting.csv", select = 1L)))
#    user  system elapsed
#   0.063   0.000   0.063