Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular Expression to Clean a numbered list

Tags:

regex

I've only just started playing with Regex and seem to be a little stuck! I have written a bulk find and replace using multiline in TextSoap. It is for cleaning up recipes that I have OCR'd and because there is Ingredients and Directions I cannot change a "1 " to become "1. " as this could rewrite "1 Tbsp" as "1. Tbsp".

I therefore did a check to see if the following two lines (possibly with extra rows) was the next sequential numbers using this code as the find:

^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))

and the following as the replace for each of the above:

$1. $2 $3 $4$5

My Problem is that although it works as I wanted it to, it will never perform the task for the last three numbers...

An example of the text I want to clean up:

1 This is the first step in the list

2 Second lot if instructions to run through
3 Doing more of the recipe instruction

4 Half way through cooking up a storm

5 almost finished the recipe

6 Serve and eat

And what I want it to look like:

1. This is the first step in the list

2. Second lot if instructions to run through

3. Doing more of the recipe instruction

4. Half way through cooking up a storm

5. almost finished the recipe

6. Serve and eat

Is there a way to check the previous line or two above to run this backwards? I have looked at lookahead and lookbehind and I am somewhat confused at that point. Does anybody have a method to clean up my numbered list or help me with the regex I desire please?

like image 299
Palendrone Avatar asked Jan 16 '13 13:01

Palendrone


1 Answers

dan1111 is right. You may run into trouble with similar looking data. But given the sample you provided, this should work:

^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search

$1. $2\r\n\r\n                 // replace

If you're not using Windows, remove the \rs from the replace string.

Explanation:

^           // beginning of the line
(\d+)       // capture group 1. one or more digits
\s+         // any spaces after the digit. don't capture
([^\r\n]+)  // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture

Replace:

$1.       // group 1 (the digit), then period and a space
$2        // group 2
\r\n\r\n  // two EOLs, to create a blank line
          // (remove both \r for Linux)
like image 172
alan Avatar answered Sep 21 '22 21:09

alan