I commonly work with text files of roughly 20 GB in size, and I find myself counting the number of lines in a given file very often.
The way I do it now is just cat fname | wc -l, and it takes very long. Is there any solution that would be much faster?
I work on a high-performance cluster with Hadoop installed. I was wondering if a MapReduce approach could help.
I'd like the solution to be as simple as a one-line run, like the wc -l solution, but I'm not sure how feasible that is.
Any ideas?
Using wc -l: There are several ways to count lines in a file, but one of the easiest and most widely used is wc -l. The wc utility displays the number of lines, words, and bytes contained in each input file, or in standard input if no file is specified.
If you are on a *nix system, you can call the command wc -l, which gives the number of lines in a file.
Use readlines() to get the line count: this is the most straightforward way to count the number of lines in a text file in Python. The readlines() method reads all lines from a file and stores them in a list. Next, use the len() function to find the length of the list, which is the total number of lines in the file.
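A minimal sketch of that approach (fname is just a placeholder path); note that readlines() loads the entire file into memory, so it is a poor fit for files as large as the ones in the question:

with open("fname") as f:     # "fname" is a placeholder path
    lines = f.readlines()    # one list entry per line; the whole file ends up in memory
print(len(lines))            # total number of lines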
Try: sed -n '$=' filename (in sed, $ addresses the last line and = prints the current line number, so this prints the total line count).
Also, cat is unnecessary: wc -l filename is enough in your present way.
Your limiting speed factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help: the difference in execution speed between those programs is likely to be dwarfed by your much slower disk/storage.
But if you have the same file copied across disks/devices, or if the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know the specifics of Hadoop, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results up:
$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &
Notice the & at the end of each command line, so all four will run in parallel; dd works like cat here, but allows us to specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example, I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script that does this for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done here for lack of shell-scripting ability; a sketch of that step follows.
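For example, a minimal sketch of the whole pipeline in bash, assuming the same four copies and offsets as above (the 2>/dev/null only silences dd's transfer statistics, so that nothing but the counts reaches awk):

(
    dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file 2>/dev/null | wc -l &
    dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file 2>/dev/null | wc -l &
    dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file 2>/dev/null | wc -l &
    dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file 2>/dev/null | wc -l &
    wait    # block until all four background jobs have printed their counts
) | awk '{ total += $1 } END { print total }'    # add up the four partial counts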
If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split, running many parallel jobs but using the same file path, and you may still get some speed gain.
EDIT: Another idea that occurred to me: if the lines inside the file all have the same size, you can get the exact number of lines by dividing the size of the file by the size of a line, both in bytes. You can do it almost instantaneously in a single job. If you only have the mean line size and don't care about the exact line count but want an estimate, you can do the same operation and get a satisfactory result much faster than the exact methods.
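A minimal sketch of that calculation in the shell, assuming a fixed-width file (file.txt is a placeholder name) and GNU stat:

line_bytes=$(head -n 1 file.txt | wc -c)    # bytes in the first line, newline included
file_bytes=$(stat -c %s file.txt)           # total size in bytes (BSD/macOS: stat -f %z)
echo $(( file_bytes / line_bytes ))         # exact count if every line has this size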