
Perl Search and replace in 400'000 files

Tags: regex, perl

I've got about 400'000 files that need some text to be replaced.

I tried the following Perl script:

@files = <*.html>;

foreach $file (@files) {
    `perl -0777 -i -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;' $file`;

    `perl -0777 -i -pe 's{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;' $file`;

    `perl -0777 -i -pe 's{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;' $file`;

    `perl -p -i -e 's/.css.html/.css/g;' $file`;
}

I don't know Perl well, and the script runs far too slowly: it updates only about 180 files per day.

Is there a way to speed it up?

Thank you in advance!

PS: When I tested it on a smaller number of files, I noticed much better performance.

asked Dec 06 '22 10:12 by user1751343


1 Answer

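Calling perl from perl is always slower than doing all the work in one process, and the script starts four new perl processes for every file. So, one solution might be to run all four substitutions in a single process: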

perl -i -pe 'BEGIN { undef $/ }
             s{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;
             s{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;
             s{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;
             s/\.css\.html/.css/g;
    ' *.html
answered Dec 20 '22 23:12 by choroba