
Replace all line breaks not preceded by a period with a regular expression?

Tags:

regex

Is it possible to select only line breaks that are not preceded by a period using regular expressions? I am editing subtitle files for students. To make the printed version dead-tree friendly, I am trying to replace all line breaks not preceded by a period or question mark with a space.

Option 1
Select all the line breaks not preceded by a period or question mark. The regex [a-z]\n works for that, but it of course also selects the last letter of the word before the line break. Is it possible to somehow save the last letter of the word before the line break and insert it together with a space using regular expressions, or do I have to write a script for that (say, in PHP)?
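
The "save and insert" idea in option 1 is what capture groups and backreferences are for. A minimal Python sketch, assuming Unix line endings (\n) and a made-up sample text that is not from the subtitle files in question:

```python
import re

# Hypothetical sample subtitle text (an assumption for illustration).
text = "This is a line\nthat continues here.\nIs this new?\nYes it\nis."

# Capture the lowercase letter before the line break in group 1,
# then write it back followed by a space instead of the line break.
joined = re.sub(r"([a-z])\n", r"\1 ", text)

print(joined)
```

Line breaks after "." and "?" are left alone because neither character matches [a-z].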

Option 2
Select only line breaks that are preceded by a letter. I tried looking into lookbehind.

While writing this question the solution hit me. To select a line break preceded by a lowercase letter, use (?<=[a-z])\n and then replace it with a space.
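
For comparison, here is a small Python sketch of that lookbehind next to a negative-lookbehind variant. The variant (?<![.?])\n and the sample text are assumptions for illustration, not part of the original solution; it targets the stated goal (no period or question mark before the break) more directly, so it also joins lines ending in a comma or an uppercase letter.

```python
import re

# Hypothetical sample text with Unix line endings.
text = "Well,\nthis is line one\nand line two.\nReally?\nYES\nit is."

# Positive lookbehind from the question: break preceded by a lowercase letter.
by_letter = re.sub(r"(?<=[a-z])\n", " ", text)

# Assumed variant: break NOT preceded by a period or question mark.
by_punct = re.sub(r"(?<![.?])\n", " ", text)

print(repr(by_letter))
print(repr(by_punct))
```

Neither pattern consumes the character before the line break, so nothing has to be saved and reinserted.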

I searched Stack Overflow and could not find what I was looking for. I hope I will not offend anybody by posting the question and solution at the same time. It might help someone else in the future.

wim hendrix asked May 22 '13 00:05




1 Answer

I had this problem recently and solved it like this:

search:

"(?<!\.|\?)(\r\n)+([^?\.]+)"

replace (note the leading space):

" $2"


(?<!\.|\?) -> the line break must not be preceded by . or ?
(\r\n)+ -> one or more line breaks (Windows-style \r\n)
([^?\.]+) -> the text that follows, up to the next ? or . (capture group 2)

" $2" -> a space followed by the second capture group

I used RegexBuddy. If it doesn't work for you, I can try to convert it to another programming language using RegexBuddy.
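
As one possible conversion, here is a Python sketch of the pattern above (with " $2" written as " \2"). The sample strings are made up. One thing to watch: because [^?\.]+ also matches \r and \n, a sentence spread over three or more lines can keep some of its line breaks. The lookbehind-only pattern shown next to it is an assumed alternative, not part of the original answer.

```python
import re

# Hypothetical samples with Windows line endings (\r\n).
two_line = "We were walking\r\nto the station.\r\nDid you see\r\n\r\nthe train?\r\nYes."
three_line = "one\r\ntwo\r\nthree."

# The answer's search pattern, transcribed as-is.
answer_pattern = re.compile(r"(?<!\.|\?)(\r\n)+([^?\.]+)")

# Assumed alternative: the lookbehind alone, no trailing capture group.
lookbehind_only = re.compile(r"(?<![.?])(?:\r\n)+")

answer_two = answer_pattern.sub(r" \2", two_line)
answer_three = answer_pattern.sub(r" \2", three_line)
simple_two = lookbehind_only.sub(" ", two_line)
simple_three = lookbehind_only.sub(" ", three_line)

print(repr(answer_three))  # group 2 swallowed the second line break
print(repr(simple_three))
```

On the first sample both patterns give the same result; they differ only on the three-line sentence.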

Hans answered Sep 22 '22 00:09