
GNU Parallel, too many input files, Argument list too long

I run a command like this on my macbook, using GNU Parallel:

parallel "sample operation" ::: samplefolder/*.txt

The problem is that I have 20,000 txt files in samplefolder, which causes an "Argument list too long" error.

There was no such problem when I ran the same script on an Ubuntu machine.

I tried googling and reading some man pages, but no luck. How can I solve this problem?

Thanks!

asked Apr 16 '14 by zachguo



3 Answers

Try:

ls samplefolder | grep \.txt | parallel "sample operation samplefolder/{}" 
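A variant of the same idea (a sketch, not part of the original answer) that also copes with spaces in file names: expand the glob with the shell's built-in printf, which is not subject to ARG_MAX because no new process is exec'd.

```shell
# printf is a shell builtin, so the expanded glob never passes through
# execve() and the ARG_MAX limit does not apply; parallel then reads
# one path per line from stdin (this still breaks on names containing
# newlines, and "sample operation" is a placeholder for your command)
printf '%s\n' samplefolder/*.txt | parallel "sample operation" {}
```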
answered Dec 12 '22 by Ole Tange


Here's how you can deal with this on a typical UNIX box (I assume OSX has find and xargs too):

find samplefolder -name \*.txt -print0 | xargs -P 8 -n 1 -0 sample operation

find prints the names of all .txt files in samplefolder, separated by NUL characters. xargs in turn reads this NUL-separated list (-0) and for every N file names (-n 1, i.e. one file at a time here) launches sample operation path/file.txt, running up to 8 invocations (-P 8) in parallel.
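As a quick sanity check of that pipeline, here is a self-contained run with wc -l standing in for the hypothetical "sample operation":

```shell
# Build a scratch folder with two .txt files to play the role of samplefolder
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a.txt"
printf 'one\ntwo\n' > "$tmp/b.txt"

# NUL-separated list from find; xargs runs one file per invocation (-n 1),
# with up to 8 invocations at a time (-P 8)
find "$tmp" -name '*.txt' -print0 | xargs -P 8 -n 1 -0 wc -l

rm -rf "$tmp"
```

This prints one line count per file; the order may vary because the invocations run in parallel.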

answered Dec 12 '22 by ArtemB


Handle that operation in smaller batches using -N, and pipe the input file list rather than giving it on the command line.

For example, expanding on ArtemB's answer, to process in batches of 16 files (warning, this will break with paths containing newlines):

find samplefolder -type f -name "*.txt" | parallel -N16 "sample operation" {}

To choose a batch size, check the maximum argument-list size with getconf ARG_MAX in your environment. For example:

$ getconf ARG_MAX
2097152

Given that paths on *nix can typically be up to 4096 characters, that leaves room for 2097152/4096 = 512 file paths on the command line (excluding the "sample operation" command itself, of course).
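That arithmetic can be scripted as a rough sketch (the 4096 here is an assumed worst-case path length, not queried from the system):

```shell
# Conservative upper bound on how many worst-case (4096-byte) paths
# fit in one argument list; real paths are usually much shorter,
# so in practice you could batch more
arg_max=$(getconf ARG_MAX)
batch=$(( arg_max / 4096 ))
echo "$batch"
```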

So something like

find samplefolder -name "*.txt" | parallel -N500 "sample operation" {}

would let me process in batches of 500. Of course, depending on what tool you are running, you may want to experiment and optimize the batch size for speed.

answered Dec 12 '22 by Florian Castellane