
Determining chapter number in different types of text

I'm pulling titles from novel-related posts. The aim is to use regex to determine which chapter(s) each post is about. Each site identifies chapters differently. Here are the most common cases:

$title = 'text chapter 25.6 text'; // c25.6
$title = 'text chapters 23, 24, 25 text'; // c23-25
$title = 'text chapters 23+24+25 text'; // c23-25
$title = 'text chapter 23, 25 text'; // c23 & 25
$title = 'text chapter 23 & 24 & 25 text'; // c23-25
$title = 'text c25.5-30 text'; // c25.5-30
$title = 'text c99-c102 text'; // c99-102
$title = 'text chapter 99 - chapter 102 text'; // c99-102
$title = 'text chapter 1 - 3 text'; // c1-3
$title = '33 text chapter 1, 2 text 3'; // c1-2
$title = 'text v2c5-10 text'; // c5-10
$title = 'text chapters 23, 24, 25, 29, 31, 32 text'; // c23-25 & 29 & 31-32

The chapter numbers are always listed in the title, just in different variations as displayed above.

What I have so far

So far, I have a regex to determine single cases of chapters, like:

$title = '9 text chapter 25.6 text'; // c25.6

Using this code (try ideone):

function get_chapter($text, $terms) {

    if (empty($text)) return;
    if (empty($terms) || !is_array($terms)) return;

    $values = false;

    $terms_quoted = array();
    foreach ($terms as $term)
        $terms_quoted[] = preg_quote($term, '/');

    // search for matches in $text
    // matches with lowercase, and ignores white spaces...
    if (preg_match('/('.implode('|', $terms_quoted).')\s*(\d+(\.\d+)?)/i', $text, $matches)) {
        if (!empty($matches[2]) && is_numeric($matches[2])) {
            $values = array(
                'term' => $matches[1],
                'value' => $matches[2]
            );
        }
    }

    return $values;
}

$text = '9 text chapter 25.6 text'; // c25.6
$terms = array('chapter', 'chapters');
$chapter = get_chapter($text, $terms);

print_r($chapter);

if ($chapter) {
    echo 'Chapter is: c'. $chapter['value'];
}

How do I make this work with the other examples listed above? Given the complexity of this question, I will bounty it 200 points when eligible.

Henrik Petterson asked Jul 16 '18 13:07


2 Answers

Logic

I suggest the following approach that combines a regex and common string processing logic:

  • use preg_match with the appropriate regex to match the first occurrence of the whole chunk of text starting with the keyword from the $terms array till the last number (+ optional section letter) related to the term
  • once the match is obtained, create an array that includes the input string, the match value, and the post-processed match
  • post-processing can be done by removing spaces in between hyphenated numbers and rebuilding numeric ranges in case of numbers joined with +, & or , chars. This requires a multi-step operation: 1) match the hyphen-separated substrings in the previous overall match and trim off unnecessary zeros and whitespace, 2) split the number chunks into separate items and pass them to a separate function that will generate the number ranges
  • the buildNumChain($arr) function will create the number ranges and, if a letter follows a number, will convert it to a part X suffix.
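
The range-building step can be illustrated on its own. The following is a minimal sketch, not the buildNumChain function from this answer: collapse_ranges is a hypothetical name, and it assumes the input is a list of whole numbers already sorted in ascending order (no decimals, no letter suffixes).

```php
<?php
// Hypothetical helper (not the answer's buildNumChain): collapses whole
// chapter numbers into ranges. Assumes integers sorted in ascending order.
function collapse_ranges(array $nums): string
{
    $parts = [];
    $count = count($nums);
    $i = 0;
    while ($i < $count) {
        $start = $nums[$i];
        $end = $start;
        // Extend the current run while the next number is exactly one higher.
        while ($i + 1 < $count && $nums[$i + 1] === $end + 1) {
            $end = $nums[++$i];
        }
        // A run of one number stays as-is; longer runs become "start-end".
        $parts[] = $start === $end ? (string) $start : "$start-$end";
        $i++;
    }
    return implode(' & ', $parts);
}

echo 'c' . collapse_ranges([23, 24, 25, 29, 31, 32]), "\n"; // c23-25 & 29 & 31-32
```

Decimal chapters such as 25.6 would need a different notion of "consecutive", which is why this sketch restricts itself to integers.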

Solution

You may use

$strs = ['c0', 'c0-3', 'c0+3', 'c0 & 9', 'c0001, 2, 03', 'c01-03', 'c1.0 - 2.0', 'chapter 2A Hello', 'chapter 2AHello', 'chapter 10.4c', 'chapter 2B', 'episode 23.000 & 00024', 'episode 23 & 24', 'e23 & 24', 'text c25.6 text', '001 & 2 & 5 & 8-20 & 100 text chapter 25.6 text 98', 'hello 23 & 24', 'ep 1 - 2', 'chapter 1 - chapter 2', 'text chapter 25.6 text', 'text chapters 23, 24, 25 text','text chapter 23, 25 text', 'text chapter 23 & 24 & 25 text','text c25.5-30 text', 'text c99-c102 text', 'text chapter 1 - 3 text', '33 text chapter 1, 2 text 3','text chapters 23, 24, 25, 29, 31, 32 text', 'c19 & c20', 'chapter 25.6 & chapter 29', 'chapter 25+c26', 'chapter 25 + 26 + 27'];
$terms = ['episode', 'chapter', 'ch', 'ep', 'c', 'e', ''];

usort($terms, function($a, $b) {
    return strlen($b) - strlen($a);
});
 
$chapter_main_rx = "\b(?|" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? "(" . substr($term, 0, 1) . ")(" . substr($term, 1) . "s?)": "()()" ;},
  $terms)) . ")\s*";
$chapter_aux_rx = "\b(?:" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? substr($term, 0, 1) . "(?:" . substr($term, 1) . "s?)": "" ;},
  $terms)) . ")\s*";

$reg = "~$chapter_main_rx((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:$chapter_aux_rx)?(?4))*)~ui";

foreach ($strs as $s) {
    if (preg_match($reg, $s, $m)) {
        $p3 = preg_replace_callback(
            "~(\d*(?:\.\d+)?)([A-Z]?)\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?(?1))*~ui", function($x) use ($chapter_aux_rx) {
                return (isset($x[3]) && strlen($x[3])) ? buildNumChain(preg_split("~\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?~ui", $x[0])) 
                : ((isset($x[1]) && strlen($x[1])) ? ($x[1] + 0) : "") . ((isset($x[2]) && strlen($x[2])) ? ord(strtolower($x[2])) - 96 : "") . "-";
            }, $m[3]);
        print_r(["original" => $s, "found_match" => trim($m[0]), "converted" => $m[1] . $p3]);
        echo "\n";
    } else {
        echo "No match for '$s'!\n";
    
    }
}

function buildNumChain($arr) {
    $ret = "";
    $rngnum = "";
    for ($i=0; $i < count($arr); $i++) {
        $val = $arr[$i];
        $part = "";
        if (preg_match('~^(\d+(?:\.\d+)?)([A-Z]?)$~i', $val, $ms)) {
            $val = $ms[1];
            if (!empty($ms[2])) {
                $part = ' part ' . (ord(strtolower($ms[2])) - 96);
            }
        }
        $val = $val + 0;
        if (($i < count($arr) - 1) && $val == ($arr[$i+1] + 0) - 1) {
            if (empty($rngnum))  {
                $ret .= ($i == 0 ? "" : " & ") . $val;
            }
            $rngnum = $val;
        } else if (!empty($rngnum) || $i == count($arr)) {
            $ret .= '-' . $val;
            $rngnum = "";
        } else {
            $ret .= ($i == 0 ? "" : " & ") . $val . $part;
        }
    }
    return $ret;
}

See the PHP demo.

Main points

  • Match c or chapter/chapters with the numbers that follow them, capturing just c and the numbers
  • After a match is found, process Group 3, which contains the number sequences
  • All <number>-c?<number> substrings should be stripped of whitespace and of c before/in between numbers
  • All ,/&-separated numbers should be post-processed with buildNumChain, which generates ranges out of consecutive numbers (whole numbers are assumed).

With $terms = ['episode', 'chapter', 'ch', 'ep', 'c', 'e', ''], the main regex will look like this:

'~(?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()())\s*((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*)~ui'

See the regex demo.

Pattern details

  • (?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()()) - a branch reset group that captures the first letter of the search term and captures the rest of the term into an obligatory Group 2. If there is an empty term, the ()() are added to make sure the branches in the group contain the same number of groups
  • \s* - 0+ whitespaces
  • ((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*) - Group 3:
    • (\d+(?:\.\d+)?(?:[A-Z]\b)?) - Group 4: 1+ digits, followed with an optional sequence of ., 1+ digits and then an optional ASCII letter that should be followed with a non-word char or end of string (note the case insensitive modifier will make [A-Z] also match lowercase ASCII letters)
    • (?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))* - zero or more sequences of
      • \s*(?:[,&+-]|and)\s* - a ,, &, +, - or and enclosed with optional 0+ whitespaces
      • (?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|) - any of the terms with an optional plural ending s
      • (?4) - Group 4 pattern recursed / repeated
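
To see what the branch reset group buys you, here is a reduced sketch. It uses only two terms instead of the full $terms array, and term_letter_and_number is a hypothetical helper name. Because every alternative inside (?|...) reuses Groups 1 and 2, the number always lands in Group 3 regardless of which term matched.

```php
<?php
// Hypothetical helper illustrating a branch reset group; the two terms are
// a simplified subset of the answer's $terms array.
function term_letter_and_number(string $title): ?string
{
    // (?|...) makes every alternative reuse Groups 1 and 2,
    // so the number is always captured in Group 3.
    $rx = '~\b(?|(e)(pisodes?)|(c)(hapters?))\s*(\d+(?:\.\d+)?)~i';
    if (!preg_match($rx, $title, $m)) {
        return null;
    }
    // $m[1] = first letter of the term, $m[2] = rest of the term, $m[3] = number
    return strtolower($m[1]) . $m[3];
}

echo term_letter_and_number('text episode 12 text'), "\n";   // e12
echo term_letter_and_number('text Chapters 7.5 text'), "\n"; // c7.5
```

Without the branch reset, each alternative would get its own pair of group numbers and the code would have to check which pair was filled.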

When the regex matches, the Group 1 value is the first letter of the matched term (c for the chapter terms), so it will be the first part of the result. Then,

 "~(\d*(?:\.\d+)?)([A-Z]?)\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?(?1))*~ui"

is used inside preg_replace_callback to remove whitespace around - (if any) and any terms followed with 0+ whitespace chars, and if Group 3 matches, the match is split with

"~\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?~ui"

regex (it matches &, ,, + or and enclosed with 0+ whitespaces, followed by an optional term with 0+ trailing whitespaces) and the resulting array is passed to the buildNumChain function that builds the resulting string.
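
The splitting step can be tried in isolation. In this sketch, $aux is a simplified stand-in for $chapter_aux_rx that only knows chapter(s) and c, and split_chapter_numbers is a hypothetical wrapper around the separator regex shown above.

```php
<?php
// Simplified stand-in for $chapter_aux_rx (assumption: only "chapter(s)" and "c").
$aux = '\b(?:chapters?|c)\s*';

// Hypothetical wrapper around the separator regex from the answer.
function split_chapter_numbers(string $chunk, string $aux): array
{
    // A separator is , & + or "and", with optional whitespace around it,
    // optionally followed by a repeated term such as "chapter ".
    return preg_split("~\s*(?:[,&+]|and)\s*(?:$aux)?~ui", $chunk);
}

print_r(split_chapter_numbers('25.6 & chapter 29', $aux)); // elements: 25.6, 29
print_r(split_chapter_numbers('23, 24 and c25', $aux));    // elements: 23, 24, 25
```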

Wiktor Stribiżew answered Oct 02 '22 12:10


I think it is very hard to build something like this without producing some false positives, because some of the patterns might appear in a title for other reasons, and in those cases the code will still detect them.

Anyway, I'll present one solution that might be interesting to you; experiment with it when you have some time. I have not tested it thoroughly, so if you find any problem with this implementation, let me know and I'll try to find a solution.

Looking at your patterns, all of them can be separated into two big groups:

  • from one number to another number (G1)
  • one or multiple numbers separated by commas, plus signs, or ampersands (G2)

So, if we can separate these two groups, we can treat them differently. From the following titles, I'll try to extract the chapter numbers in this way:

+-------------------------------------------+-------+------------------------+
| TITLE                                     | GROUP | EXTRACT                |
+-------------------------------------------+-------+------------------------+
| text chapter 25.6 text                    |  G2   | 25.6                   |
| text chapters 23, 24, 25 text             |  G2   | 23, 24, 25             |
| text chapters 23+24+25 text               |  G2   | 23, 24, 25             |
| text chapter 23, 25 text                  |  G2   | 23, 25                 |
| text chapter 23 & 24 & 25 text            |  G2   | 23, 24, 25             |
| text c25.5-30 text                        |  G1   | 25.5 - 30              |
| text c99-c102 text                        |  G1   | 99 - 102               |
| text chapter 99 - chapter 102 text        |  G1   | 99 - 102               |
| text chapter 1 - 3 text                   |  G1   | 1 - 3                  |
| 33 text chapter 1, 2 text 3               |  G2   | 1, 2                   |
| text v2c5-10 text                         |  G1   | 5 - 10                 |
| text chapters 23, 24, 25, 29, 31, 32 text |  G2   | 23, 24, 25, 29, 31, 32 |
| text chapters 23 and 24 and 25 text       |  G2   | 23, 24, 25             | 
| text chapters 23 and chapter 30 text      |  G2   | 23, 30                 | 
+-------------------------------------------+-------+------------------------+

To extract just the number of the chapters and differentiate them, one solution could be building a regular expression that captures two groups for the chapter ranges (G1) and one single group for the numbers separated by characters (G2). After the chapter numbers extraction, we can process the result to show the chapters correctly formatted.
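
The G1/G2 split described above can be sketched in a few lines. This is a simplified illustration, not the code of this answer: classify_title is a hypothetical name, and the patterns only recognise chapter(s) and c as chapter words.

```php
<?php
// Hypothetical sketch of the two-group idea; the patterns are simplified
// and only cover "chapter(s)" and "c" as chapter words (assumption).
function classify_title(string $title): ?array
{
    $c = '(?:chapters?|c)';
    $n = '\d+(?:\.\d+)?';

    // G1: "<word> N - [<word>] M" yields two captures (range start and end).
    if (preg_match("/$c *($n) *- *(?:$c *)?($n)/", $title, $m)) {
        return ['group' => 'G1', 'numbers' => [$m[1], $m[2]]];
    }
    // G2: "<word> N", optionally followed by more numbers joined by , + & or "and".
    if (preg_match("/$c *($n(?: *(?:[&+,]|and) *(?:$c *)?$n)*)/", $title, $m)) {
        preg_match_all("/$n/", $m[1], $all);
        return ['group' => 'G2', 'numbers' => $all[0]];
    }
    return null;
}

print_r(classify_title('text c99-c102 text'));            // G1: 99, 102
print_r(classify_title('text chapters 23, 24, 25 text')); // G2: 23, 24, 25
```

Trying the range pattern first matters here: a title such as c99-c102 would otherwise be picked up by the list pattern as the single number 99.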

I've seen that you are still adding more cases in the comments that are not contained in the question. If you want to add a new case, just create a new matching pattern and add it to the final regexp. Just follow the rule of two matching groups for the ranges and a single matching group for the numbers separated by characters. Also, take into account that the most verbose patterns should be placed before the less verbose ones. For example, ccc N - ccc N should be placed before cc N - cc N, and this last one before c N - c N.

Here is the code:

$model = ['chapters?', 'chap', 'c']; // different type of chapter names
$c = '(?:' . implode('|', $model) . ')'; // non-capturing group for chapter names
$n = '\d+\.?\d*'; // chapter number
$s = '(?:[\&\+,]|and)'; // non-capturing group of valid separators
$e = '(?: |$)'; // end of a match (a space or the end of the string); inside [...] the $ would be a literal dollar sign

// Different patterns to match each case
$g1 = "$c *($n) *\- *$c *($n)$e"; // match chapter number - chapter number in all its variants (G1)
$g2 = "$c *($n) *\- *($n)$e"; // match chapter number - number in all its variants (G1)
$g3 = "$c *((?:(?:$n) *$s *)+(?:$n))$e"; // match chapter numbers separated by something in all its variants (G2) 
$g4 = "((?:$c *$n *$s *)+$c *$n)$e"; // match chapter number and chapter number ... and chapter number in all its variants (G2)
$g5 = "$c *($n)$e"; // match chapter number in all its variants (G2)

// Build a big non-capturing group with all the patterns
$reg = "/(?:$g1|$g2|$g3|$g4|$g5)/";

// Function to process each title
function getChapters ($title) {

    global $n, $reg;
    // Store the matches in one flattened array
    // arrays with three indexes correspond to G1
    // arrays with two indexes correspond to G2
    if (!preg_match($reg, $title, $matches)) return '';
    $numbers = array_values(array_filter($matches));

    // Show the formatted chapters for G1
    if (count($numbers) == 3) return "c{$numbers[1]}-{$numbers[2]}";

    // Show the formatted chapters for G2        
    if(!preg_match_all("/$n/", $numbers[1], $nmatches, PREG_PATTERN_ORDER)) return '';
    $m = $nmatches[0];
    $t = count($m);
    $str = "c{$m[0]}";
    foreach($m as $i => $mn) {
        if ($i == 0) continue;
        if ($mn == $m[$i - 1] + 1) {
            if (substr($str, -1) != '-') $str .= '-';
            if ($i == $t - 1 || $mn != $m[$i + 1] - 1) $str .= $mn;
        } else {
            if ($i < $t) $str .= ' & ';
            $str .= $mn;
        }
    }

    // Return only after every number has been processed
    return $str;
}

You can check the code working on Ideone.
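
The ordering rule mentioned earlier (most verbose pattern first) follows from how regex alternation works: the engine takes the first alternative that succeeds at a given position. A small sketch with two simplified patterns, where first_match is a hypothetical helper, shows the difference:

```php
<?php
$range  = 'c *(\d+) *- *(\d+)'; // "c N - M"
$single = 'c *(\d+)';           // "c N"

// Hypothetical helper: returns the full text matched by the pattern, or '' if none.
function first_match(string $pattern, string $title): string
{
    return preg_match($pattern, $title, $m) ? $m[0] : '';
}

// Short pattern first: the range is cut off after the first number.
echo first_match("/(?:$single|$range)/", 'text c5-10 text'), "\n"; // c5
// Verbose pattern first: the whole range is matched.
echo first_match("/(?:$range|$single)/", 'text c5-10 text'), "\n"; // c5-10
```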

ElChiniNet answered Oct 02 '22 11:10