
How to write a regex that matches only when there's a slash OR nothing after the match?

Tags: regex, php

I'm trying to use preg_match() to extract 10-character ASIN numbers from Amazon URLs. The URLs could be in any of these basic formats:

http://www.amazon.com/gp/product/ASIN
http://www.amazon.com/gp/product/[text]/ASIN
http://www.amazon.com/o/ASIN
http://www.amazon.com/dp/ASIN
http://www.amazon.com/[text]/dp/ASIN
http://www.amazon.com/[text]/dp/[text]/ASIN

NOTE: The problem I'm having stems from the fact that there may or may not be slashes and variables at the end of the URLs, after the ASIN.

With the help I received in a previous question, I came up with this:

\/([A-Za-z0-9]{10})

Which I thought was working, until I tried it on this URL:

http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W/ref=sr_1_4?ie=UTF8&qid=1389314719&sr=8-4&keywords=playstation+1

The output of preg_match() for that is:

Array
(
    [0] => /PlayStatio
    [1] => PlayStatio
)

So then I tried adding a slash at the end of the regex, like this:

\/([A-Za-z0-9]{10})\/

Which fixes the problem, giving the following output for the above URL:

Array
(
    [0] => /B000TLU67W/
    [1] => B000TLU67W
)

However, there won't always be a slash at the end of the URL. For example, the above URL works just fine on Amazon if modified to this:

http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W

My modified regex doesn't work for this URL, because there's no slash on the end.
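
For reference, the behaviour described so far can be reproduced with a short script. This is only a sketch: the ~ delimiter (which avoids having to escape the slash) and the helper name firstGroup are assumptions, not part of the original code.

```php
<?php
$withTail = 'http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W/ref=sr_1_4?ie=UTF8&qid=1389314719&sr=8-4&keywords=playstation+1';
$bare     = 'http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W';

// First attempt: any ten alphanumerics directly after a slash.
$first  = '~/([A-Za-z0-9]{10})~';
// Second attempt: the same, but a trailing slash is required.
$second = '~/([A-Za-z0-9]{10})/~';

// Hypothetical helper: returns capture group 1, or null when nothing matches.
function firstGroup(string $pattern, string $url): ?string
{
    return preg_match($pattern, $url, $m) ? $m[1] : null;
}

var_dump(firstGroup($first, $withTail));  // string(10) "PlayStatio"
var_dump(firstGroup($second, $withTail)); // string(10) "B000TLU67W"
var_dump(firstGroup($second, $bare));     // NULL
```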

I think maybe having an OR condition to see if there's either a slash after the match, or nothing after it, might work, but I'm not sure how to do it.

Is there any way to get the regex to work with both of the above URLs?

Nate asked Dec 15 '22 02:12


2 Answers

You can use this regex:

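'~/([A-Z0-9]{10})(?=$|[/?#])~i'

i.e. 10 alphanumeric characters followed by a slash, a ?, a #, or the end of input. The ~ delimiter is used because the character class contains a literal #, so # itself could not serve as the delimiter without being escaped.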

Online Demo: http://regex101.com/r/aE0jU8
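
A minimal sketch of how a lookahead pattern of this kind behaves on the two URLs from the question. The function name extractAsin is hypothetical, and ~ is assumed as the delimiter so the # in the character class needs no escaping.

```php
<?php
// Hypothetical helper: returns the first 10-character alphanumeric path
// segment that is followed by "/", "?", "#" or the end of the string.
function extractAsin(string $url): ?string
{
    // The lookahead (?=...) checks the terminator without consuming it,
    // so the match itself is only the slash plus the ten characters.
    if (preg_match('~/([A-Z0-9]{10})(?=$|[/?#])~i', $url, $m)) {
        return $m[1];
    }
    return null;
}

$urls = [
    'http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W/ref=sr_1_4?ie=UTF8&qid=1389314719&sr=8-4&keywords=playstation+1',
    'http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W',
];

foreach ($urls as $url) {
    echo extractAsin($url), "\n"; // prints B000TLU67W for both URLs
}
```

Note that this returns the first segment that fits, so a 10-character alphanumeric [text] segment earlier in the path would be picked up instead of the ASIN.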

anubhava answered Dec 17 '22 17:12


Easy, just find the last possible ASIN value in the URL path, like so:

if (preg_match('%
    # Fetch ASIN value from Amazon URL.
    (?<=/)                  # ASIN value always preceded by slash.
    [A-Za-z0-9]{10}         # The ASIN value is exactly 10 alphanum.
    (?=                     # Assert no more ASIN values in path.
      (?:                   # Zero or more non-ASIN path segments.
        /                   # Path segment always begins with slash.
        (?!                 # Assert this path segment not ASIN.
          [A-Za-z0-9]{10}   # Is valid ASIN value if followed by
          (?:$|[/?\#])      # EOL/EOS or / or ? or # terminator.
        )                   # End assert this path segment not ASIN.
        (?:                 # Zero or more URI path characters.
          [A-Za-z0-9\-._~!$&\'()*+,;=:@]  # Either URI path char,
        | \%[0-9A-Fa-f]{2}  # or URI encoded value.
        )*                  # Zero or more URI path characters.
      )*                    # Zero or more non-ASIN path segments.
      (?=$|[?\#])           # Path ends on EOS, query or fragment.
    )                       # End assert no more ASIN values in path.
    %x', $subject, $matches)) {
    $ASIN = $matches[0];
} else {
    $ASIN = "";
}

Edited 20140110 12:30MDT: First version did not correctly handle a lone slash at end of path.
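
As a rough usage sketch, the same pattern (comments stripped) can be wrapped in a function and tried on a few inputs. The name lastAsin and the sample URLs are made up for illustration; the first URL deliberately contains a 10-character [text] segment ahead of the ASIN.

```php
<?php
// Hypothetical wrapper; returns "" when no ASIN-like segment is found.
function lastAsin(string $url): string
{
    $pattern = '%
        (?<=/) [A-Za-z0-9]{10}
        (?=
          (?: /
              (?! [A-Za-z0-9]{10} (?:$|[/?\#]) )
              (?: [A-Za-z0-9\-._~!$&\'()*+,;=:@] | \%[0-9A-Fa-f]{2} )*
          )*
          (?=$|[?\#])
        )
        %x';
    return preg_match($pattern, $url, $m) ? $m[0] : '';
}

// A 10-character segment that is followed by another one is skipped.
var_dump(lastAsin('http://www.amazon.com/gp/product/ABCDEFGHIJ/B000TLU67W?ref=x')); // string(10) "B000TLU67W"
// A lone slash at the end of the path is accepted.
var_dump(lastAsin('http://www.amazon.com/dp/B000TLU67W/'));                         // string(10) "B000TLU67W"
// No 10-character segment at all.
var_dump(lastAsin('http://www.amazon.com/gp/help'));                                // string(0) ""
```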

ridgerunner answered Dec 17 '22 15:12