
Find and replace double newlines with Perl?

Tags: string, regex, perl

I'm cleaning up some web pages that for some reason have about 8 line breaks between tags. I wanted to remove most of them, and I tried this

perl -pi -w -e "s/\n\n//g" *.html

But no luck. For good measure, I tried

perl -pi -w -e "s/\n//g" *.html

and it did remove all my line breaks. What am I doing wrong?

Edit: I also tried \r\n\r\n, same deal. It works for a single line break, but doesn't do anything for two consecutive ones.

user151841 asked Aug 21 '10 01:08





2 Answers

Use -0:

perl -pi -0 -w -e "s/\n\n//g" *.html

The problem is that by default -p reads the file one line at a time. There's no such thing as a line with two newlines, so you didn't find any. The -0 changes the line-ending character to "\0", which probably doesn't exist in your file, so it processes the whole file at once. (Even if the file did contain NULs, you're looking for consecutive newlines, so processing it in NUL-delimited chunks won't be a problem.)

You probably want to adjust your regex as well, but it's hard to be sure exactly what you want. Try s/\n\n+/\n/g, which will replace any run of two or more consecutive newlines with a single newline.

If the file is very large, you may not have enough memory to load it in a single chunk. A workaround for this is to pick some character that is common enough to split the file into manageable chunks, and tell Perl to use that as the line-ending character. But it also has to be a character that will not appear inside the matches you're trying to replace. For example, -0x2e will split the file on "." (ASCII 0x2E).

cjm answered Nov 15 '22 00:11



I was trying to replace a double newline with a single one using the above recommendation on a large file (2.3 GB). With huge files, it will segfault when trying to read the entire file at once. So instead of looking for a double newline, just look for lines where the only character is a newline:

perl -pi -w -e 's/^\n$//' file.txt
Ian answered Nov 15 '22 00:11
