I commonly work with text files of roughly 20 GB in size, and I find myself counting the number of lines in a given file very often.
The way I do it now is just cat fname | wc -l, and it takes very long. Is there any solution that would be much faster?
I work on a high-performance cluster with Hadoop installed. I was wondering if a MapReduce approach could help.
I'd like the solution to be as simple as a one-line run, like the wc -l solution, but I'm not sure how feasible that is.
Any ideas?
Using wc -l: There are several ways to count lines in a file, but one of the easiest and most widely used is wc -l. The wc utility displays the number of lines, words, and bytes contained in each input file, or in standard input if no file is specified.
If you are on a *nix system, you can call the command wc -l, which gives the number of lines in a file.
Use readlines() to get the line count: this is the most straightforward way to count the number of lines in a text file in Python. The readlines() method reads all lines from a file and stores them in a list. Next, use the len() function to find the length of the list, which is the total number of lines in the file.
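A minimal sketch of that approach (fname is just a placeholder path); note that readlines() loads the entire file into memory, so it is a poor fit for files as large as the ones in the question:

with open("fname") as f:     # "fname" is a placeholder path
    lines = f.readlines()    # one list entry per line; the whole file ends up in memory
print(len(lines))            # total number of lines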
Try: sed -n '$=' filename (in sed, $ addresses the last line and = prints the current line number, so this prints the total line count).
Also, cat is unnecessary: wc -l filename is enough in your present way.
Your limiting speed factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help: the difference in execution speed between those programs is likely to be dwarfed by your much slower disk/storage.
But if you have the same file copied across disks/devices, or if the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know the specifics of Hadoop, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results up:
$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &
Notice the & at the end of each command line, so all four will run in parallel; dd works like cat here, but allows us to specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example, I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script that does this for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done here for lack of shell-scripting ability; a sketch of that step follows.
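For example, a minimal sketch of the whole pipeline in bash, assuming the same four copies and offsets as above (the 2>/dev/null only silences dd's transfer statistics, so that nothing but the counts reaches awk):

(
    dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file 2>/dev/null | wc -l &
    dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file 2>/dev/null | wc -l &
    dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file 2>/dev/null | wc -l &
    dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file 2>/dev/null | wc -l &
    wait    # block until all four background jobs have printed their counts
) | awk '{ total += $1 } END { print total }'    # add up the four partial counts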
If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split, running many parallel jobs but using the same file path, and you may still get some speed gain.
EDIT: Another idea that occurred to me: if the lines inside the file all have the same size, you can get the exact number of lines by dividing the size of the file by the size of a line, both in bytes. You can do it almost instantaneously in a single job. If you only have the mean line size and don't care about the exact line count but want an estimate, you can do the same operation and get a satisfactory result much faster than the exact methods.
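A minimal sketch of that calculation in the shell, assuming a fixed-width file (file.txt is a placeholder name) and GNU stat:

line_bytes=$(head -n 1 file.txt | wc -c)    # bytes in the first line, newline included
file_bytes=$(stat -c %s file.txt)           # total size in bytes (BSD/macOS: stat -f %z)
echo $(( file_bytes / line_bytes ))         # exact count if every line has this size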