
How to split files up and process them in parallel and then stitch them back? unix

I have a text file infile.txt as such:

abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?

Each line in the file is processed by this Perl command, which writes the result to out.txt:

`cat infile.txt | perl dosomething > out.txt`

Imagine the text file has 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:

$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n l/3 ../infile.txt
$ for i in *; do perl dosomething < "$i" > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt

But is there a less verbose way to do the same?

alvas asked Mar 13 '15 13:03


1 Answer

The answer from @Ulfalizer gives you a good hint about the solution, but it lacks some details.

You can use GNU parallel (apt-get install parallel on Debian).

So your problem can be solved using the following command:

parallel -a infile.txt -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt

Here is the meaning of the arguments:

-a: read input from a file instead of stdin
-l 1000: send blocks of 1000 lines to the command
-j 10: run 10 jobs in parallel
-k: keep the output in the same order as the input
--spreadstdin: send each 1000-line block to the stdin of the command (an alias for --pipe)
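
If GNU parallel is not installed, the split, process and reassemble steps from the question can be wrapped in a small shell function. This is only a minimal sketch: parallel_filter is a hypothetical helper name, it assumes GNU split (for the l/N chunk mode) and a command that handles each line independently, and it runs one background job per chunk rather than a managed job pool.

```shell
#!/usr/bin/env bash
# parallel_filter N CMD [ARGS...]
# Hypothetical helper (not part of any existing tool).
# Reads stdin, splits it into N line-aligned chunks, runs CMD on every chunk
# concurrently, then prints the per-chunk results in the original order.
parallel_filter() {
    local n=$1; shift
    local tmp chunk
    tmp=$(mktemp -d) || return 1

    # Buffer stdin in a regular file so split can compute the chunk sizes.
    cat > "$tmp/input"

    # GNU split: "l/N" creates N chunks without breaking any line in two.
    split -n "l/$n" "$tmp/input" "$tmp/chunk."

    # Start one background job per chunk; each writes to its own output file.
    for chunk in "$tmp"/chunk.*; do
        "$@" < "$chunk" > "$chunk.out" &
    done
    wait

    # Glob expansion is sorted, so the outputs are joined in input order.
    cat "$tmp"/chunk.*.out
    rm -rf "$tmp"
}
```

With the files from the question, it could be called as `parallel_filter 10 perl dosomething < infile.txt > infile_processed.txt`. The temporary directory replaces the splitfiles and splitfiles_processed directories, and the sorted glob plays the role that -k plays for GNU parallel.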

Adam answered Oct 08 '22 11:10