Consider this Python program:
import sys
lc = 0
for line in open(sys.argv[1]):
    lc = lc + 1
print lc, sys.argv[1]
Running it on my 6 GB text file, it completes in ~2 minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so I suspect the answer to my question is just a plain "no".
Note also that my real program does something more interesting than just counting lines, so please give a generic answer, not line-counting tricks (like keeping a line count as metadata in the file).
PS: I tagged this question "linux" because I'm interested only in Linux-specific answers. Feel free to give OS-agnostic, or even other-OS, answers if you have them.
See also the follow-up question
Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So, no, you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
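If you want to confirm that the disk, and not the Python loop, is the limiting factor, one rough check is to time a plain sequential read of the same file in large blocks and compare the resulting throughput with your drive's rated speed. A minimal sketch (the block size and variable names are my own choices, not from the original question):

import sys
import time

# Read the file in 1 MiB blocks and report the effective throughput.
# Note: on a second run the OS page cache may serve the data from RAM,
# so the number only reflects the disk itself on a cold cache.
path = sys.argv[1]
start = time.time()
total = 0
with open(path, 'rb') as f:
    while True:
        block = f.read(1024 * 1024)
        if not block:
            break
        total += len(block)
elapsed = time.time() - start
print("%d bytes in %.1f s -> %.1f MB/s" % (total, elapsed, total / elapsed / 1e6))

If this raw read takes about as long as your real program, the work is I/O-bound and faster hardware is the only lever; if it is much faster, the Python-side processing is worth profiling.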
Edit: Another good point by gs: you could also use a RAID configuration to improve your speed. This can be done with either hardware or software (e.g. OS X, Linux, Windows Server, etc.).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
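The same arithmetic as a tiny helper, just to make the units explicit (the function name and numbers below are mine, restating the figures above):

# time (s) = amount to transfer (MB) / transfer rate (MB/s)
def transfer_time_seconds(size_mb, rate_mb_per_s):
    return size_mb / rate_mb_per_s

print(transfer_time_seconds(6000.0, 60.0))   # 100.0 s (typical single HDD)
print(transfer_time_seconds(6000.0, 125.0))  # 48.0 s (faster drive or RAID)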
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).