Remove hard line breaks from text with Ruby

I have some text with hard line breaks in it like this:

This should all be on one line 
since it's one sentence.

This is a new paragraph that
should be separate.

I want to remove the single newlines but keep the double newlines so it looks like this:

This should all be on one line since it's one sentence.

This is a new paragraph that should be separate.

Is there a single regular expression that does this, or some other easy way?

So far my only solution is the following, which works but feels hackish:

txt = txt.gsub(/(\r\n|\n|\r)/,'[[[NEWLINE]]]')
txt = txt.gsub('[[[NEWLINE]]][[[NEWLINE]]]', "\n\n")
txt = txt.gsub('[[[NEWLINE]]]', " ")

Brian Armstrong asked Jan 28 '11 21:01


2 Answers

Replace every newline that is neither preceded nor followed by another newline:

text = <<END
This should all be on one line
since it's one sentence.

This is a new paragraph that
should be separate.
END

p text.gsub(/(?<!\n)\n(?!\n)/, ' ')
#=> "This should all be on one line since it's one sentence.\n\nThis is a new paragraph that should be separate. "
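
The question's own attempt also matched \r\n and a lone \r, which the pattern above does not cover. A minimal sketch, assuming the input may carry Windows or old Mac line endings, is to normalize them to \n first. The helper name unwrap_lines is hypothetical and not part of the original answer:

```ruby
# Hypothetical helper (the name is illustrative, not from the answer above).
def unwrap_lines(text)
  text
    .gsub(/\r\n?/, "\n")           # normalize CRLF and lone CR to LF
    .gsub(/(?<!\n)\n(?!\n)/, ' ')  # replace each single newline with a space
end

sample = "This should all be on one line\r\nsince it's one sentence.\r\n\r\n" \
         "This is a new paragraph that\r\nshould be separate."

p unwrap_lines(sample)
#=> "This should all be on one line since it's one sentence.\n\nThis is a new paragraph that should be separate."
```

Note that the paragraph break comes back as "\n\n" rather than the original "\r\n\r\n"; if the original line endings matter, they would need to be restored afterwards.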

Or, for Ruby 1.8, which does not support lookbehind:

txt.gsub!(/([^\n])\n([^\n])/, '\1 \2')
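
One difference is worth checking. This version consumes the characters on both sides of the newline instead of merely looking at them, so a newline at the very end of the string, or two single newlines separated by only one character, will not all be replaced. A small sketch of that behavior follows; the method names and sample strings are made up for illustration:

```ruby
# Both method names are illustrative, not from the original answer.
def join_with_lookarounds(text)
  text.gsub(/(?<!\n)\n(?!\n)/, ' ')       # needs lookbehind (Ruby 1.9+)
end

def join_consuming(text)
  text.gsub(/([^\n])\n([^\n])/, '\1 \2')  # lookaround-free version
end

p join_with_lookarounds("a\nb\nc")     #=> "a b c"
p join_consuming("a\nb\nc")            #=> "a b\nc" ("b" was consumed by the first match)
p join_with_lookarounds("one\ntwo\n")  #=> "one two "
p join_consuming("one\ntwo\n")         #=> "one two\n"
```

For ordinary prose, where every line holds more than one character, the two versions should agree except for a trailing newline.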

Phrogz answered Nov 10 '22 20:11


text.gsub!(/(\S)[^\S\n]*\n[^\S\n]*(\S)/, '\1 \2')

The two (\S) groups serve the same purposes as the lookarounds ((?<!\s)(?<!^) and (?!\s)(?!$)) in @sln's regexes:

  • they confirm that the linefeed really is in the middle of a sentence, and
  • they ensure that the [^\S\n]*\n[^\S\n]* part consumes any other whitespace surrounding the linefeed, making it possible for us to normalize it to a single space.

They also make the regex easier to read, and (perhaps most importantly) they work in pre-1.9 versions of Ruby that don't support lookbehinds.
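
To illustrate the whitespace normalization, here is a rough comparison against a plain lookaround pattern such as /(?<!\n)\n(?!\n)/. It assumes the input has stray spaces around the line breaks (the sample string is adapted from the question, with extra spaces added), and the method names are hypothetical:

```ruby
# Method names are illustrative only.
def join_plain(text)
  text.gsub(/(?<!\n)\n(?!\n)/, ' ')                 # replaces only the newline
end

def join_normalized(text)
  text.gsub(/(\S)[^\S\n]*\n[^\S\n]*(\S)/, '\1 \2')  # also swallows nearby spaces
end

sample = "This should all be on one line \nsince it's one sentence.\n\n" \
         "This is a new paragraph that\n  should be separate."

p join_plain(sample)
#=> "This should all be on one line  since it's one sentence.\n\nThis is a new paragraph that   should be separate."
p join_normalized(sample)
#=> "This should all be on one line since it's one sentence.\n\nThis is a new paragraph that should be separate."
```

The plain pattern leaves the doubled and tripled spaces in place, while the (\S) version collapses each join to a single space and still leaves the paragraph break alone.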

Alan Moore answered Nov 10 '22 21:11