I have a text file, infile.txt, that looks like this:
abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?
Each line in the file is processed by this Perl command, and the output goes to out.txt:
`cat infile.txt | perl dosomething > out.txt`
Imagine the text file has 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:
$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n l/3 ../infile.txt
$ for i in *; do perl dosomething < "$i" > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt
But is there a less verbose way to do the same?
To recombine split files, double-click the first file in the sequence from within the 7-Zip interface, then select the file within the split set and click 'Extract'. In the resulting dialog box, select the directory into which you want to extract the file.
To split a file into pieces, you simply use the split command. By default, the split command uses a very simple naming scheme. The file chunks will be named xaa, xab, xac, etc., and, presumably, if you break up a file that is sufficiently large, you might even get chunks named xza and xzz.
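A minimal sketch of that default naming, assuming a POSIX `split`; the scratch directory and the ten-line sample file are made up for illustration:

```shell
# Hypothetical sample: a ten-line file in a scratch directory.
work=$(mktemp -d)
seq 1 10 > "$work/infile.txt"

# Split into chunks of at most 4 lines each, using the default "x" prefix.
(cd "$work" && split -l 4 infile.txt)

# List the chunks that were created.
(cd "$work" && ls x*)
```

With ten lines and `-l 4`, this leaves three chunks: xaa, xab and xac, the last one holding the remaining two lines.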
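To split a file into two equally sized files, use the '-n' option: '-n 2' splits the file into two parts of equal size in bytes. Because the cut is made by byte count, it can fall in the middle of a line; with GNU split, '-n l/2' splits into two files without breaking lines.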
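For line-oriented processing, the line-preserving form is usually the safer choice. A small sketch, assuming GNU coreutils `split` (the `l/N` chunk syntax is a GNU extension) and an invented seven-line sample file and `part_` prefix:

```shell
# Hypothetical sample file with seven short lines.
work=$(mktemp -d)
printf 'line %s\n' 1 2 3 4 5 6 7 > "$work/infile.txt"

# l/3 asks for 3 chunks, cutting only at line boundaries (GNU split).
split -n l/3 "$work/infile.txt" "$work/part_"

# Concatenating the chunks in glob order restores the original file.
cat "$work"/part_* | cmp - "$work/infile.txt" && echo "chunks reassemble to the original"
```

Since the suffixes sort alphabetically (aa, ab, ac), a plain `cat part_*` puts the chunks back in their original order.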
The "csplit" command splits a file into several files based on a pattern in the file or on line numbers. For example, it can split a file into two new files, each holding part of the contents of the original file.
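As an illustration, here is a short sketch that splits a file in front of the first line matching a pattern. The sample contents and the `chunk_` prefix are invented; `-s` suppresses the byte counts that `csplit` normally prints, and `-f` sets the output prefix:

```shell
# Hypothetical four-line sample file.
work=$(mktemp -d)
printf '%s\n' 'abc one' 'foobar two' 'xyz three' 'kmeans four' > "$work/infile.txt"

# Split just before the first line that starts with "xyz".
csplit -s -f "$work/chunk_" "$work/infile.txt" '/^xyz/'

# chunk_00 holds the lines before the match, chunk_01 the match and the rest.
head "$work"/chunk_*
```

Passing a line number instead of a pattern (for example `3`) would cut at that line in the same way.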
The answer from @Ulfalizer gives you a good hint about the solution, but it lacks some details.
You can use GNU parallel (`apt-get install parallel` on Debian).
Your problem can then be solved with the following command:
parallel -a infile.txt -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt
Here is the meaning of the arguments:
-a: read the input from a file instead of stdin
-l 1000: pass the input to the command in records of 1000 lines
-j 10: run up to 10 jobs in parallel
-k: keep the output in the same order as the input
--spreadstdin: send those blocks of lines to the stdin of the command (an alias of --pipe)
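Before running this on 100,000,000 lines, it may be worth checking on a small sample that the parallel output matches a serial run. A rough sketch, in which `tr` is only a stand-in for `perl dosomething` and the sample file is invented; it uses `--pipe` with stdin redirection rather than `-a`, and skips the comparison when GNU parallel is not installed:

```shell
# Hypothetical sample input; tr stands in for `perl dosomething`.
work=$(mktemp -d)
printf '%s\n' "abc what's the foo bar." "foobar hello world" \
    "xyz trying to do this in parallel" "kmeans you're mean" > "$work/sample.txt"

# Serial reference run.
tr 'a-z' 'A-Z' < "$work/sample.txt" > "$work/serial.txt"

# Parallel run, only attempted when GNU parallel is available.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel --pipe -L 2 -j 2 -k tr 'a-z' 'A-Z' < "$work/sample.txt" > "$work/parallel.txt"
    cmp "$work/serial.txt" "$work/parallel.txt" && echo "outputs match"
else
    echo "GNU parallel not found; skipping the comparison" >&2
fi
```

The comparison only makes sense because of `-k`: without it, blocks may be written in the order the jobs finish, so the two files could differ even when every line is processed correctly.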