
Bible Verse Regex

I am trying to match bible verses that can be any of these formats:

1 John 4:5 - 6
2 john 4:5 - 4:6
3 john 4:5 - 3 John 4:6
John 4:5 - 6
john 4:5 - 4:6
John 4:5 - 1 John 4:6
1john4:6
john 4
john 4-5
1 john 4-5

- Any spaces in the above examples should be ignored when matching.
- Any of the above can appear anywhere in a string of text:

text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text

This is what I have, but it barely works and doesn't match correctly in a long string of text:

\b[a-zA-Z]+(?:\s+\d+)?(?::\d+(?:–\d+)?(?:,\s*\d+(?:–\d+)?)*)?
user3071933 asked Mar 07 '14 15:03


3 Answers

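This is about as specific as the pattern can get. It allows only an optional capital letter at the start of the book name, so a mixed-case word such as "jOhn" is not matched as a whole (without a leading \b, the tail "Ohn" followed by a number can still match).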

(?:\d\s*)?[A-Z]?[a-z]+\s*\d+(?:[:-]\d+)?(?:\s*-\s*\d+)?(?::\d+|(?:\s*[A-Z]?[a-z]+\s*\d+:\d+))?
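
As a quick sanity check, the pattern can be run against the ten formats from the question and against the sample sentence. This is only a minimal Python sketch, not part of the original answer; the names `PATTERN`, `FORMATS`, `unmatched` and `found` are illustrative.

```python
import re

# The pattern from above, unchanged (split over two lines for readability).
PATTERN = re.compile(
    r"(?:\d\s*)?[A-Z]?[a-z]+\s*\d+(?:[:-]\d+)?(?:\s*-\s*\d+)?"
    r"(?::\d+|(?:\s*[A-Z]?[a-z]+\s*\d+:\d+))?"
)

# The ten formats listed in the question.
FORMATS = [
    "1 John 4:5 - 6",
    "2 john 4:5 - 4:6",
    "3 john 4:5 - 3 John 4:6",
    "John 4:5 - 6",
    "john 4:5 - 4:6",
    "John 4:5 - 1 John 4:6",
    "1john4:6",
    "john 4",
    "john 4-5",
    "1 john 4-5",
]

# Collect every format that the pattern does not consume in full.
unmatched = [s for s in FORMATS if PATTERN.fullmatch(s) is None]
print(unmatched)

# The sample sentence from the question. The pattern has no capturing
# groups, so findall returns the whole matched substrings.
text = "text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text"
found = PATTERN.findall(text)
print(found)
```

`fullmatch` is used for the individual formats so that a partial match does not count as a pass.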
tenub answered Sep 30 '22 20:09


FWIW, I've found RegexPal to be a huge help in these cases. Here's what I ended up with:

([\d ]*[a-zA-Z]+( \d*:\d*)?)(( - )| )?(((\d* )?[a-zA-Z]+ )?\d*([:-]+\d*)?)

Which breaks down as:

// zero or more digits or spaces
[\d ]*

// any number of upper/lowercase letters
[a-zA-Z]+

// optionally: a space, any number of digits, a colon,
// and any number of digits again
( \d*:\d*)?

// an optional hyphen with a space either side, or a single space
(( - )| )?

Repeat for the other side of the optional hyphen except for this difference:

// one or more of either a colon or a hyphen
[:-]+
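
With a pattern this permissive it is worth checking what else gets accepted. Everything after the first `[a-zA-Z]+` is optional, so a bare word matches as well, and a word in front of a reference can be pulled into the match. The Python sketch below is an illustration only, with made-up variable names:

```python
import re

# The full pattern from the top of this answer, unchanged.
PATTERN = re.compile(
    r"([\d ]*[a-zA-Z]+( \d*:\d*)?)(( - )| )?(((\d* )?[a-zA-Z]+ )?\d*([:-]+\d*)?)"
)

# A few of the formats from the question are consumed in full.
samples = ["John 4:5 - 1 John 4:6", "1john4:6", "john 4-5"]
accepted = [s for s in samples if PATTERN.fullmatch(s)]
print(accepted)

# Every part after the book name is optional, so a plain word passes too.
bare_word_ok = PATTERN.fullmatch("text") is not None
print(bare_word_ok)

# In running text, the word before a reference can be taken as the first
# name and the real book name as the optional second name.
swallowed = PATTERN.search("read john 4-5").group(0)
print(swallowed)
```

So this pattern is probably best applied to strings already known to be references, or followed by a stricter check on each match.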
Andy answered Sep 30 '22 20:09


Let's break down your format.

First of all, the main thing I see is that "there can be a dash followed by stuff" so let's split this problem up into two parts: first deal with the start bit, then the optional dash and end bit.

Your first bit is focussed around the name, and there may be a number before it. After it there is a number, which may be followed by a colon then another number. So we have:

(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?

Now for the bit after the dash. It's a number, which may be followed by the name and another number. The whole thing may then be followed by a colon and another number. And remember the whole thing is optional:

(\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?

Put the two together, wrap the result in a regex literal with the case-insensitive flag, and you get:

/(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?(\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?/i

Which, depending on how devout you are, may be described by any variety of colourful language.

But since when were Regexes pretty?

Anyway, in your result match, you will have:

  1. Initial number
  2. Name
  3. Second number
  4. Number after the colon
  5. The whole optional part from the dash onwards
  6. Number after the dash
  7. Second name
  8. Number after the name
  9. Final number after the second colon

Of course, any of these can be empty, except for 2 and 3.
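
To see how the groups come out in practice, here is a small Python sketch (an illustration only; `re.IGNORECASE` stands in for the `/i` flag, and the variable names are made up). Note that the parentheses around the whole dash part also capture, so a match carries nine groups in total, and because the pattern allows `\s*` before the name, a match can begin with whitespace.

```python
import re

# The combined pattern from above; re.IGNORECASE plays the role of /i.
VERSE = re.compile(
    r"(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?"
    r"(\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?",
    re.IGNORECASE,
)

# The sample sentence from the question.
text = "text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text"

# finditer scans the whole string; strip() drops the whitespace that the
# \s* in front of the book name may have picked up.
references = [m.group(0).strip() for m in VERSE.finditer(text)]
print(references)

# Capture groups of the first reference. Group 1 is an empty string here
# because \d* matched nothing; group 5 is the whole dash part.
first_groups = VERSE.search(text).groups()
print(first_groups)
```

For the first reference the groups are `''`, `'John'`, `'4'`, `'5'`, `' - 1 John 4:6'`, `'1'`, `'John'`, `'4'` and `'6'`.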

Niet the Dark Absol answered Sep 30 '22 21:09