I'm pulling titles from novel-related posts. The aim is to use a regex to determine which chapter(s) each post is about. Each site identifies the chapters in a different way. Here are the most common cases:
$title = 'text chapter 25.6 text'; // c25.6
$title = 'text chapters 23, 24, 25 text'; // c23-25
$title = 'text chapters 23+24+25 text'; // c23-25
$title = 'text chapter 23, 25 text'; // c23 & 25
$title = 'text chapter 23 & 24 & 25 text'; // c23-25
$title = 'text c25.5-30 text'; // c25.5-30
$title = 'text c99-c102 text'; // c99-102
$title = 'text chapter 99 - chapter 102 text'; // c99-102
$title = 'text chapter 1 - 3 text'; // c1-3
$title = '33 text chapter 1, 2 text 3'; // c1-2
$title = 'text v2c5-10 text'; // c5-10
$title = 'text chapters 23, 24, 25, 29, 31, 32 text'; // c23-25 & 29 & 31-32
The chapter numbers are always listed in the title, just in different variations as displayed above.
So far, I have a regex to determine single cases of chapters, like:
$title = '9 text chapter 25.6 text'; // c25.6
Using this code (try ideone):
function get_chapter($text, $terms) {
    if (empty($text)) return;
    if (empty($terms) || !is_array($terms)) return;
    $values = false;
    $terms_quoted = array();
    foreach ($terms as $term)
        $terms_quoted[] = preg_quote($term, '/');
    // search for matches in $text
    // matches with lowercase, and ignores white spaces...
    if (preg_match('/('.implode('|', $terms_quoted).')\s*(\d+(\.\d+)?)/i', $text, $matches)) {
        if (!empty($matches[2]) && is_numeric($matches[2])) {
            $values = array(
                'term' => $matches[1],
                'value' => $matches[2]
            );
        }
    }
    return $values;
}
$text = '9 text chapter 25.6 text'; // c25.6
$terms = array('chapter', 'chapters');
$chapter = get_chapter($text, $terms);
print_r($chapter);
if ($chapter) {
    echo 'Chapter is: c'. $chapter['value'];
}
How do I make this work with the other examples listed above? Given the complexity of this question, I will bounty it 200 points when eligible.
I suggest the following approach that combines a regex and common string processing logic:
- Use preg_match with the appropriate regex to match the first occurrence of the whole chunk of text starting with the keyword from the $terms array till the last number (+ optional section letter) related to the term.
- Post-process the match to handle the numbers joined with +, & or , chars. This requires a multi-step operation: 1) match the hyphen-separated substrings in the previous overall match and trim off unnecessary zeros and whitespace, 2) split the number chunks into separate items and pass them to a separate function that will generate the number ranges.
- The buildNumChain($arr) function will create the number ranges and, if a letter follows a number, will convert it to a section X suffix.
You may use
$strs = ['c0', 'c0-3', 'c0+3', 'c0 & 9', 'c0001, 2, 03', 'c01-03', 'c1.0 - 2.0', 'chapter 2A Hello', 'chapter 2AHello', 'chapter 10.4c', 'chapter 2B', 'episode 23.000 & 00024', 'episode 23 & 24', 'e23 & 24', 'text c25.6 text', '001 & 2 & 5 & 8-20 & 100 text chapter 25.6 text 98', 'hello 23 & 24', 'ep 1 - 2', 'chapter 1 - chapter 2', 'text chapter 25.6 text', 'text chapters 23, 24, 25 text','text chapter 23, 25 text', 'text chapter 23 & 24 & 25 text','text c25.5-30 text', 'text c99-c102 text', 'text chapter 1 - 3 text', '33 text chapter 1, 2 text 3','text chapters 23, 24, 25, 29, 31, 32 text', 'c19 & c20', 'chapter 25.6 & chapter 29', 'chapter 25+c26', 'chapter 25 + 26 + 27'];
$terms = ['episode', 'chapter', 'ch', 'ep', 'c', 'e', ''];
usort($terms, function($a, $b) {
    return strlen($b) - strlen($a);
});
$chapter_main_rx = "\b(?|" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? "(" . substr($term, 0, 1) . ")(" . substr($term, 1) . "s?)" : "()()";
}, $terms)) . ")\s*";
$chapter_aux_rx = "\b(?:" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? substr($term, 0, 1) . "(?:" . substr($term, 1) . "s?)" : "";
}, $terms)) . ")\s*";
$reg = "~$chapter_main_rx((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:$chapter_aux_rx)?(?4))*)~ui";
foreach ($strs as $s) {
    if (preg_match($reg, $s, $m)) {
        $p3 = preg_replace_callback(
            "~(\d*(?:\.\d+)?)([A-Z]?)\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?(?1))*~ui", function($x) use ($chapter_aux_rx) {
                return (isset($x[3]) && strlen($x[3])) ? buildNumChain(preg_split("~\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?~ui", $x[0]))
                    : ((isset($x[1]) && strlen($x[1])) ? ($x[1] + 0) : "") . ((isset($x[2]) && strlen($x[2])) ? ord(strtolower($x[2])) - 96 : "") . "-";
            }, $m[3]);
        print_r(["original" => $s, "found_match" => trim($m[0]), "converted" => $m[1] . $p3]);
        echo "\n";
    } else {
        echo "No match for '$s'!\n";
    }
}
function buildNumChain($arr) {
    $ret = "";
    $rngnum = "";
    for ($i=0; $i < count($arr); $i++) {
        $val = $arr[$i];
        $part = "";
        if (preg_match('~^(\d+(?:\.\d+)?)([A-Z]?)$~i', $val, $ms)) {
            $val = $ms[1];
            if (!empty($ms[2])) {
                $part = ' part ' . (ord(strtolower($ms[2])) - 96);
            }
        }
        $val = $val + 0;
        if (($i < count($arr) - 1) && $val == ($arr[$i+1] + 0) - 1) {
            if (empty($rngnum)) {
                $ret .= ($i == 0 ? "" : " & ") . $val;
            }
            $rngnum = $val;
        } else if (!empty($rngnum) || $i == count($arr)) {
            $ret .= '-' . $val;
            $rngnum = "";
        } else {
            $ret .= ($i == 0 ? "" : " & ") . $val . $part;
        }
    }
    return $ret;
}
See the PHP demo.
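The range-building idea behind buildNumChain can also be looked at in isolation. The snippet below is only a minimal sketch, assuming whole numbers and ignoring section letters; collapse_ranges is a hypothetical helper name, not part of the code above.

```php
<?php
// Hypothetical helper (not part of the answer's code): collapses a list of
// whole numbers into ranges and joins the pieces with " & ".
function collapse_ranges(array $nums): string {
    $parts = [];
    $count = count($nums);
    for ($i = 0; $i < $count; $i++) {
        $start = $end = (int) $nums[$i];
        // Extend the current run while the next value is exactly one higher.
        while ($i + 1 < $count && (int) $nums[$i + 1] === $end + 1) {
            $end = (int) $nums[++$i];
        }
        $parts[] = $start === $end ? (string) $start : "$start-$end";
    }
    return implode(' & ', $parts);
}

echo collapse_ranges(['23', '24', '25', '29', '31', '32']), "\n"; // 23-25 & 29 & 31-32
```

Unlike buildNumChain, this sketch casts everything to int, so values such as 25.6 or 2A would need extra handling.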
Notes:
- Match c or chapter/chapters with the numbers that follow them, and capture just c and the numbers.
- <number>-c?<number> substrings should be stripped of whitespaces and of c before/in between the numbers.
- ,/&-separated numbers should be post-processed with buildNumChain, which generates ranges out of consecutive numbers (whole numbers are assumed).
If $terms = ['episode', 'chapter', 'ch', 'ep', 'c', 'e', ''], the main regex will look like:
'~(?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()())\s*((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*)~ui'
See the regex demo.
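To see which capture groups end up holding what, the regex printed above can be run against a couple of titles from the question. This is just an illustrative sketch; the $samples and $results names are introduced here for the example.

```php
<?php
// The main regex as printed above (terms: episode, chapter, ch, ep, c, e, '').
$reg = '~(?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()())\s*((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*)~ui';

// Sample titles taken from the question.
$samples = ['text chapters 23, 24, 25 text', 'text c99-c102 text'];
$results = [];
foreach ($samples as $title) {
    if (preg_match($reg, $title, $m)) {
        // Group 1: first letter of the matched term.
        // Group 3: the whole number chunk, still unprocessed.
        $results[$title] = ['prefix' => $m[1], 'numbers' => $m[3]];
    }
}
print_r($results);
```

For the second title the number chunk still contains the inner c (99-c102), which is why a post-processing step is needed afterwards.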
Pattern details
- (?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()()) - a branch reset group that captures the first letter of the search term into Group 1 and the rest of the term into an obligatory Group 2. If there is an empty term, ()() is added to make sure the branches in the group contain the same number of groups
- \s* - 0+ whitespaces
- ((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*) - Group 3:
  - (\d+(?:\.\d+)?(?:[A-Z]\b)?) - Group 4: 1+ digits, followed with an optional sequence of . and 1+ digits, and then an optional ASCII letter that should be followed with a non-word char or end of string (note the case insensitive modifier will make [A-Z] also match lowercase ASCII letters)
  - (?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))* - zero or more sequences of
    - \s*(?:[,&+-]|and)\s* - a , or & or + or - or and, enclosed with optional 0+ whitespaces
    - (?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|) - any of the terms with an optional plural ending s added
    - (?4) - Group 4 pattern recursed / repeated

When the regex matches a chapter term, the Group 1 value is c, so it will be the first part of the result. Then,
"~(\d*(?:\.\d+)?)([A-Z]?)\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?(?1))*~ui"
is used inside preg_replace_callback to remove the whitespaces around - (if any) and the terms (if any) followed with 0+ whitespace chars. If Group 3 matches (the group the callback checks), the match is split with the
"~\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?~ui"
regex (it matches & or , or + or and, enclosed with optional 0+ whitespaces, and then an optional term followed with 0+ whitespaces), and the resulting array is passed to the buildNumChain function that builds the resulting string.
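The splitting step can be tried on its own. The following is a simplified sketch: the $aux pattern below is an assumption that only covers c, chapter and chapters, rather than the full $chapter_aux_rx built from $terms.

```php
<?php
// Assumption: a reduced stand-in for $chapter_aux_rx that only knows
// "c", "chapter" and "chapters", each followed by optional whitespace.
$aux = '(?:c(?:hapters?)?\s*)?';

// Split on , & + or "and", swallowing surrounding whitespace and an
// optional repeated term, so only the bare numbers remain.
$parts = preg_split("~\s*(?:[,&+]|and)\s*$aux~ui", '23, 24 & chapter 25 and c29');
print_r($parts);
```

The resulting array of bare numbers is the kind of input that a range-building function such as buildNumChain expects.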
I think it is very hard to build something like this without producing some false positives, because some of the patterns may also appear elsewhere in the title, and in those cases the code will detect them.
Anyway, I'll present one solution that might be interesting to you; experiment with it when you have some time. I have not tested it deeply, so if you find any problem with this implementation, let me know and I'll try to find a solution to it.
Looking at your patterns, all of them can be separated into two big groups:
- G1: chapter ranges (chapter number - chapter number)
- G2: single chapter numbers, or chapter numbers separated by characters (, & + and)
So, if we can separate these two groups, we can treat them differently. From the following titles, I'll try to extract the chapter numbers in this way:
+-------------------------------------------+-------+------------------------+
| TITLE                                     | GROUP | EXTRACT                |
+-------------------------------------------+-------+------------------------+
| text chapter 25.6 text                    | G2    | 25.6                   |
| text chapters 23, 24, 25 text             | G2    | 23, 24, 25             |
| text chapters 23+24+25 text               | G2    | 23, 24, 25             |
| text chapter 23, 25 text                  | G2    | 23, 25                 |
| text chapter 23 & 24 & 25 text            | G2    | 23, 24, 25             |
| text c25.5-30 text                        | G1    | 25.5 - 30              |
| text c99-c102 text                        | G1    | 99 - 102               |
| text chapter 99 - chapter 102 text        | G1    | 99 - 102               |
| text chapter 1 - 3 text                   | G1    | 1 - 3                  |
| 33 text chapter 1, 2 text 3               | G2    | 1, 2                   |
| text v2c5-10 text                         | G1    | 5 - 10                 |
| text chapters 23, 24, 25, 29, 31, 32 text | G2    | 23, 24, 25, 29, 31, 32 |
| text chapters 23 and 24 and 25 text       | G2    | 23, 24, 25             |
| text chapters 23 and chapter 30 text      | G2    | 23, 30                 |
+-------------------------------------------+-------+------------------------+
To extract just the chapter numbers and tell the two groups apart, one solution is to build a regular expression that captures two groups for the chapter ranges (G1) and one single group for the numbers separated by characters (G2). After extracting the chapter numbers, we can process the result to show the chapters correctly formatted.
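The "two captures means G1, one capture means G2" idea can be sketched on a reduced scale. The patterns and the classify_title name below are assumptions made for this illustration (only c, chapter and chapters are recognised); they are not the patterns used in the full solution.

```php
<?php
// Hypothetical, simplified patterns: a range has two capture groups,
// a separated list (or a single number) has one.
$num   = '\d+(?:\.\d+)?';
$term  = 'c(?:hapters?)?';
$range = "$term *($num) *- *(?:$term)? *($num)";
$list  = "$term *($num(?: *(?:[&+,]|and) *$num)*)";
$rx    = "/(?:$range|$list)/i";

function classify_title(string $title, string $rx): array {
    if (!preg_match($rx, $title, $m)) {
        return ['group' => null, 'numbers' => []];
    }
    // Drop the full match and any empty (non-participating) groups.
    $caps = array_values(array_filter(array_slice($m, 1), 'strlen'));
    if (count($caps) === 2) {
        return ['group' => 'G1', 'numbers' => $caps];
    }
    // One capture left: pull the individual numbers out of it.
    preg_match_all('/\d+(?:\.\d+)?/', $caps[0], $nums);
    return ['group' => 'G2', 'numbers' => $nums[0]];
}

print_r(classify_title('text c99-c102 text', $rx));
print_r(classify_title('text chapters 23, 24, 25 text', $rx));
```

The range alternative is listed first so that a title like c99-c102 is not consumed as a single-number list.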
I've seen that you are still adding more cases in the comments that are not contained in the question. If you want to add a new case, just create a new matching pattern and add it to the final regexp. Just follow the rule of two matching groups for the ranges and a single matching group for the numbers separated by characters. Also, take into account that the most verbose patterns should be placed before the shorter ones. For example, ccc N - ccc N should be placed before cc N - cc N, and this last one before c N - c N.
Here is the code:
$model = ['chapters?', 'chap', 'c']; // different types of chapter names
$c = '(?:' . implode('|', $model) . ')'; // non-capturing group for chapter names
$n = '\d+\.?\d*'; // chapter number
$s = '(?:[\&\+,]|and)'; // non-capturing group of valid separators
$e = '(?: |$)'; // end of a match (a space or the end of the string); inside [...] a $ would be a literal character
// Different patterns to match each case
$g1 = "$c *($n) *\- *$c *($n)$e"; // match chapter number - chapter number in all its variants (G1)
$g2 = "$c *($n) *\- *($n)$e"; // match chapter number - number in all its variants (G1)
$g3 = "$c *((?:(?:$n) *$s *)+(?:$n))$e"; // match chapter numbers separated by something in all its variants (G2)
$g4 = "((?:$c *$n *$s *)+$c *$n)$e"; // match chapter number and chapter number ... and chapter number in all its variants (G2)
$g5 = "$c *($n)$e"; // match chapter number in all its variants (G2)
// Build a big non-capturing group with all the patterns
$reg = "/(?:$g1|$g2|$g3|$g4|$g5)/";
// Function to process each title
function getChapters ($title) {
    global $n, $reg;
    // Store the matches in one flattened array:
    // arrays with three indexes correspond to G1,
    // arrays with two indexes correspond to G2
    if (!preg_match($reg, $title, $matches)) return '';
    $numbers = array_values(array_filter($matches));
    // Show the formatted chapters for G1
    if (count($numbers) == 3) return "c{$numbers[1]}-{$numbers[2]}";
    // Show the formatted chapters for G2
    if (!preg_match_all("/$n/", $numbers[1], $nmatches, PREG_PATTERN_ORDER)) return '';
    $m = $nmatches[0];
    $t = count($m);
    $str = "c{$m[0]}";
    foreach ($m as $i => $mn) {
        if ($i == 0) continue;
        if ($mn == $m[$i - 1] + 1) {
            // consecutive number: open a range, and close it on the last consecutive value
            if (substr($str, -1) != '-') $str .= '-';
            if ($i == $t - 1 || $mn != $m[$i + 1] - 1) $str .= $mn;
        } else {
            if ($i < $t) $str .= ' & ';
            $str .= $mn;
        }
    }
    // return only after every number has been processed
    return $str;
}
You can check the code working on Ideone.