Regex pattern isn't matching certain show titles

I am using a C# regex to parse show data out of a title string, but it returns unreliable results.

The pattern I am using is as follows:

Regex r=new Regex( 
      @"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})",
      RegexOptions.IgnoreCase
);

The following test cases fail:


Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]
The Soup 2015.05.22 [mp4]
Big Brother UK Live From The House (May 22, 2015)

Should return

  • Show Name (eg, Ellen)
  • Date (eg, 2015.05.22)
  • Extra Info (eg, Joseph Gordon Levitt [REPOST])

Alaskan Bush People S02 Wild Times Special

Should return

  • Show Name (eg, Alaskan Bush People)
  • Season (eg, 02)
  • Extra Info (eg, Wild Times Special)

500 Questions S01E03

Should return

  • Show Name (eg, 500 Questions)
  • Season (eg, 01)
  • Episode (eg, 03)

Examples that work and return proper data

Boyster S01E13 – E14
Mysteries at the Museum S08E08
Mysteries at the National Parks S01E07 – E08
The Last Days Of… S01E06
Born Naughty? S01E02
Have I Got News For You S49E07

It seems that when the S and the E are not found, the pattern skips them and fills the season and episode slots with the first run of digits it can find.
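A minimal sketch of that behaviour, running the pattern above unchanged against the first failing title. The variable names `r` and `m` and the printed labels are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
var r = new Regex(
    @"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})",
    RegexOptions.IgnoreCase);

var m = r.Match("Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]");

// Because S and E are optional, the year is split across the two number groups.
Console.WriteLine("name:    '{0}'", m.Groups[1].Value);  // 'Ellen '
Console.WriteLine("season:  '{0}'", m.Groups[2].Value);  // '20'
Console.WriteLine("episode: '{0}'", m.Groups[3].Value);  // '15'
Console.WriteLine("rest:    '{0}'", m.Groups[4].Value);  // '.05.22 Joseph Gordon Levitt [REPOST]'
```

The match succeeds, but the "season" and "episode" are just the two halves of 2015, which is consistent with the guess above.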

The pattern clearly needs more work to handle the varied strings above. Any help is much appreciated.

Kraang Prime asked May 23 '15 09:05


1 Answer

Divide and Conquer

You're trying to parse too much with one simple expression. That's not going to work very well. The best approach in this case is to divide the problem into smaller problems, and solve each one separately. Then, we can combine everything into one pattern later.

Let's write some patterns for the data you want to extract.

  • Season/episode:

    S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?
    

    I used \p{Pd} (the Unicode "dash punctuation" category) instead of - to accommodate any dash type.

  • Date:

    \d{4}\.\d{1,2}\.\d{1,2}
    

    Or...

    (?i:January|February|March|April|May|June|July|August|September|October|November|December)
    \s*\d{1,2},\s*\d{4}
    
  • Extra info:

    .*?
    

    (yeah, that's pretty generic)

  • We can also detect the show format like this:

    \[.*?\]
    
  • You can add additional parts as required.
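Before combining them, each part can be tried in isolation. The following is a minimal sketch of that idea, not part of the final solution: the variable names and the `Find` helper are made up for illustration, and the patterns are copied from the list above.

```csharp
using System;
using System.Text.RegularExpressions;

// The building blocks from the list above, each compiled on its own.
var episode = new Regex(@"S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?");
var dottedDate = new Regex(@"\d{4}\.\d{1,2}\.\d{1,2}");
var longDate = new Regex(
    @"(?i:January|February|March|April|May|June|July|August|September|October|November|December)" +
    @"\s*\d{1,2},\s*\d{4}");
var format = new Regex(@"\[.*?\]");

// Hypothetical helper: the first match of `part` in `input`, or "" if there is none.
static string Find(Regex part, string input)
{
    var m = part.Match(input);
    return m.Success ? m.Value : string.Empty;
}

// The en dash in the first title is matched by \p{Pd}, just like a plain hyphen.
Console.WriteLine(Find(episode, "Boyster S01E13 – E14"));                                 // S01E13 – E14
Console.WriteLine(Find(episode, "Alaskan Bush People S02 Wild Times Special"));           // S02
Console.WriteLine(Find(dottedDate, "Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]"));    // 2015.05.22
Console.WriteLine(Find(longDate, "Big Brother UK Live From The House (May 22, 2015)"));   // May 22, 2015
Console.WriteLine(Find(format, "The Soup 2015.05.22 [mp4]"));                             // [mp4]
```

Depending on the console's output encoding, the en dash may be displayed as a different character even though the match itself is correct.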

Now, we can put everything into one pattern, using group names to extract data:

^\s*
(?<name>.*?)
(?<info> \s+ (?:
    (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
    |
    (?<date>\d{4}\.\d{1,2}\.\d{1,2})
    |
    \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
    |
    \[(?<format>.*?)\]
    |
    (?<extra>(?(info)|(?!)).*?)
))*
\s*$

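You can ignore the info group in the results. It exists only for the conditional in extra: (?(info)|(?!)) fails unless info has already captured something, so extra can only match after the first episode, date, or format part and never consumes what should be part of the show name. The extra group can capture several times (once per word), so concatenate its captures with a space between each part.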

Sample code:

using System;
using System.Text.RegularExpressions;

var inputData = new[]
{
    "Boyster S01E13 – E14",
    "Mysteries at the Museum S08E08",
    "Mysteries at the National Parks S01E07 – E08",
    "The Last Days Of… S01E06",
    "Born Naughty? S01E02",
    "Have I Got News For You S49E07",
    "Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]",
    "The Soup 2015.05.22 [mp4]",
    "Big Brother UK Live From The House (May 22, 2015)",
    "Alaskan Bush People S02 Wild Times Special",
    "500 Questions S01E03"
};

var re = new Regex(@"
    ^\s*
    (?<name>.*?)
    (?<info> \s+ (?:
        (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
        |
        (?<date>\d{4}\.\d{1,2}\.\d{1,2})
        |
        \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
        |
        \[(?<format>.*?)\]
        |
        (?<extra>(?(info)|(?!)).*?)
    ))*
    \s*$
", RegexOptions.IgnorePatternWhitespace);

foreach (var input in inputData)
{
    Console.WriteLine();
    Console.WriteLine("--- {0} ---", input);

    var match = re.Match(input);
    if (!match.Success)
    {
        Console.WriteLine("FAIL");
        continue;
    }

    foreach (var groupName in re.GetGroupNames())
    {
        if (groupName == "0" || groupName == "info")
            continue;

        var group = match.Groups[groupName];
        if (!group.Success)
            continue;

        foreach (Capture capture in group.Captures)
            Console.WriteLine("{0}: '{1}'", groupName, capture.Value);
    }
}

And the output of this is...

--- Boyster S01E13 - E14 ---
name: 'Boyster'
episode: 'S01E13 - E14'

--- Mysteries at the Museum S08E08 ---
name: 'Mysteries at the Museum'
episode: 'S08E08'

--- Mysteries at the National Parks S01E07 - E08 ---
name: 'Mysteries at the National Parks'
episode: 'S01E07 - E08'

--- The Last Days Of… S01E06 ---
name: 'The Last Days Of…'
episode: 'S01E06'

--- Born Naughty? S01E02 ---
name: 'Born Naughty?'
episode: 'S01E02'

--- Have I Got News For You S49E07 ---
name: 'Have I Got News For You'
episode: 'S49E07'

--- Ellen 2015.05.22 Joseph Gordon Levitt [REPOST] ---
name: 'Ellen'
date: '2015.05.22'
format: 'REPOST'
extra: 'Joseph'
extra: 'Gordon'
extra: 'Levitt'

--- The Soup 2015.05.22 [mp4] ---
name: 'The Soup'
date: '2015.05.22'
format: 'mp4'

--- Big Brother UK Live From The House (May 22, 2015) ---
name: 'Big Brother UK Live From The House'
date: 'May 22, 2015'

--- Alaskan Bush People S02 Wild Times Special ---
name: 'Alaskan Bush People'
episode: 'S02'
extra: 'Wild'
extra: 'Times'
extra: 'Special'

--- 500 Questions S01E03 ---
name: '500 Questions'
episode: 'S01E03'
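
In the output above, extra is captured one word at a time. Concatenating those captures with a space, as suggested earlier, could look roughly like this. It is a sketch only: `JoinCaptures` is a hypothetical helper name, and the pattern is the same one used in the sample code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as in the sample code.
var re = new Regex(@"
    ^\s*
    (?<name>.*?)
    (?<info> \s+ (?:
        (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
        |
        (?<date>\d{4}\.\d{1,2}\.\d{1,2})
        |
        \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
        |
        \[(?<format>.*?)\]
        |
        (?<extra>(?(info)|(?!)).*?)
    ))*
    \s*$
", RegexOptions.IgnorePatternWhitespace);

// Hypothetical helper: joins every capture of a named group with single spaces.
// Returns "" when the group did not participate in the match.
static string JoinCaptures(Match match, string groupName) =>
    string.Join(" ", match.Groups[groupName].Captures.Cast<Capture>().Select(c => c.Value));

var match = re.Match("Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]");
Console.WriteLine("extra: '{0}'", JoinCaptures(match, "extra"));  // extra: 'Joseph Gordon Levitt'
```

Note that joining with a single space normalizes the original spacing, so runs of several spaces in the title are not preserved.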

Lucas Trzesniewski answered Oct 26 '22 19:10