I am trying to match Bible verse references that can be in any of these formats:
1 John 4:5 - 6
2 john 4:5 - 4:6
3 john 4:5 - 3 John 4:6
John 4:5 - 6
john 4:5 - 4:6
John 4:5 - 1 John 4:6
1john4:6
john 4
john 4-5
1 john 4-5
- Any spaces in the above examples should be ignored when matching.
- Any of the above can appear anywhere in a string of text, for example:
text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text
This is what I have, but it barely works and doesn't match correctly in a long string of text:
\b[a-zA-Z]+(?:\s+\d+)?(?::\d+(?:–\d+)?(?:,\s*\d+(?:–\d+)?)*)?
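This is about as specific as one can get. It allows an optional capital letter only at the start of the name, so mixed-case input like "jOhn" is not matched as a whole word (without a leading \b, the pattern can still pick up the tail "Ohn 4" from "jOhn 4").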
(?:\d\s*)?[A-Z]?[a-z]+\s*\d+(?:[:-]\d+)?(?:\s*-\s*\d+)?(?::\d+|(?:\s*[A-Z]?[a-z]+\s*\d+:\d+))?
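As a quick sanity check, here is a minimal JavaScript sketch that runs the pattern above over the sample sentence from the question. The `g` flag and the variable names are my own additions for illustration; the pattern itself is unchanged.

```javascript
// Pattern from the answer above, with the global flag added so that
// every reference in the text is returned (an assumption for this demo).
const versePattern = /(?:\d\s*)?[A-Z]?[a-z]+\s*\d+(?:[:-]\d+)?(?:\s*-\s*\d+)?(?::\d+|(?:\s*[A-Z]?[a-z]+\s*\d+:\d+))?/g;

const sampleText =
  "text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text";

// String.prototype.match with a global regex returns all full matches,
// or null when there are none.
const found = sampleText.match(versePattern) || [];

console.log(found);
// [ 'John 4:5 - 1 John 4:6', 'john 4-5' ]
```

Plain words such as "text" are skipped because the pattern insists on at least one digit after the name.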
FWIW, I've found RegexPal to be a huge help in these cases. Here's what I ended up with:
([\d ]*[a-zA-Z]+( \d*:\d*)?)(( - )| )?(((\d* )?[a-zA-Z]+ )?\d*([:-]+\d*)?)
Which breaks down as:
// zero or more digits or spaces
[\d ]*
// any number of upper/lowercase letters
[a-zA-Z]+
// optionally: a space, any number of digits, a colon,
// and any number of digits again
( \d*:\d*)?
// an optional hyphen with a space either side, or a space.
(( - )| )?
Repeat for the other side of the optional hyphen except for this difference:
// one or more of either a colon or a hyphen
[:-]+
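To see how this behaves, here is a small JavaScript sketch that tries the pattern on a few of the formats from the question. The helper name `firstMatch` is hypothetical and only there for the demo.

```javascript
// Pattern from the answer above, unchanged.
const refPattern = /([\d ]*[a-zA-Z]+( \d*:\d*)?)(( - )| )?(((\d* )?[a-zA-Z]+ )?\d*([:-]+\d*)?)/;

// Hypothetical helper: returns the full matched text, or null.
function firstMatch(text) {
  const m = refPattern.exec(text);
  return m ? m[0] : null;
}

console.log(firstMatch("John 4:5 - 1 John 4:6")); // "John 4:5 - 1 John 4:6"
console.log(firstMatch("john 4-5"));              // "john 4-5"
console.log(firstMatch("1john4:6"));              // "1john4:6"

// Caveat: everything after the letters is optional, so a bare word matches too.
console.log(firstMatch("text"));                  // "text"
```

Because of that last case, running this pattern over a long string of text will also return ordinary words, so the results would need filtering (for example, keeping only matches that contain a digit).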
Let's break down your format.
First of all, the main thing I see is that "there can be a dash followed by stuff" so let's split this problem up into two parts: first deal with the start bit, then the optional dash and end bit.
Your first bit is focussed around the name, and there may be a number before it. After it there is a number, which may be followed by a colon then another number. So we have:
(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?
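Here is a rough JavaScript sketch of just this first part, to show which capture group ends up holding what. The `i` flag is added here already, and `parseStart` and the property names are hypothetical labels of my own.

```javascript
// First half of the pattern, made case-insensitive for the demo.
const startPart = /(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?/i;

// Hypothetical helper that gives the four capture groups readable names.
function parseStart(text) {
  const m = startPart.exec(text);
  if (!m) return null;
  const [, bookNumber, name, chapter, verse] = m;
  return { bookNumber, name, chapter, verse };
}

console.log(parseStart("1 John 4:5"));
// { bookNumber: '1', name: 'John', chapter: '4', verse: '5' }
console.log(parseStart("john 4"));
// { bookNumber: '', name: 'john', chapter: '4', verse: undefined }
console.log(parseStart("no digits here"));
// null
```

Note that a missing book number comes back as an empty string (the group is `\d*`), while a missing verse comes back as `undefined` (the group sits inside an optional `(?:...)?`).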
Now for the bit after the dash. It's a number, which may be followed by the name and another number. The whole thing may then be followed by a colon and another number. And remember the whole thing is optional:
(\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?
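The same kind of sketch for the end bit, tried on the three range styles from the question. The `^` anchor is only there so the fragment can be tested on its own, and `parseEnd` is a hypothetical helper name.

```javascript
// Source of the optional "dash and end" part from the answer above.
const endSource = String.raw`(\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?`;

// Anchored with ^ only so it can be tried in isolation (an assumption for
// this demo; inside the full pattern it simply follows the first part).
const endPart = new RegExp("^" + endSource, "i");

// Hypothetical helper: drops the full match and the wrapper group and
// returns the four inner groups as [number, name, number, number].
// The whole fragment is optional, so exec never returns null here.
function parseEnd(tail) {
  return endPart.exec(tail).slice(2);
}

console.log(parseEnd(" - 6"));          // [ '6', undefined, undefined, undefined ]
console.log(parseEnd(" - 4:6"));        // [ '4', undefined, undefined, '6' ]
console.log(parseEnd(" - 3 John 4:6")); // [ '3', 'John', '4', '6' ]
```

The first number after the dash is therefore ambiguous on its own: it is a verse, a chapter, or a book number depending on which of the later groups are filled in.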
Put the two together and wrap it in a literal with case-insensitivity and you get:
/(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?(\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?/i
Which, depending on how devout you are, may be described by any variety of colourful language.
But since when were Regexes pretty?
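Anyway, in your result match, you will have:
1. the number before the first name
2. the first name
3. the first number after the name
4. the number after the first colon
5. the whole "dash and end" bit
6. the first number after the dash
7. the second name
8. the number after the second name
9. the number after the last colon
Of course, any of these can be empty, except for 2 and 3.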
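For what it's worth, here is a minimal JavaScript sketch of using the full regex on the sample sentence from the question. The `g` flag, the `trim()` call, and the variable names are my additions, not part of the regex above.

```javascript
// Full regex from above, with g added so matchAll can find every reference.
const verseRegex = /(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?(\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?/gi;

const text =
  "text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text";

// The leading \s* can swallow the space before the name, hence trim().
const refs = [...text.matchAll(verseRegex)].map((m) => ({
  matched: m[0].trim(),
  groups: m.slice(1), // capture groups 1 to 9
}));

console.log(refs[0].matched); // "John 4:5 - 1 John 4:6"
console.log(refs[0].groups);
// [ '', 'John', '4', '5', ' - 1 John 4:6', '1', 'John', '4', '6' ]

console.log(refs[1].matched); // "john 4-5"
console.log(refs[1].groups);
// [ '', 'john', '4', undefined, '-5', '5', undefined, undefined, undefined ]
```

Groups that did not take part in the match come back as `undefined`, while group 1 is an empty string when there is no leading number.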