
Youtube complete Java Regex

I need to parse several pages to get all of their YouTube IDs.

I found many regular expressions on the web, but the Java ones are not complete: they either give me garbage in addition to the IDs, or they miss some IDs.

The one that seems to be complete is hosted here, but it is written in JavaScript and PHP. Unfortunately, I couldn't translate it into Java.

Can somebody help me rewrite this PHP regex, or the JavaScript one that follows it, in Java?

'~
    https?://         # Required scheme. Either http or https.
    (?:[0-9A-Z-]+\.)? # Optional subdomain.
    (?:               # Group host alternatives.
      youtu\.be/      # Either youtu.be,
    | youtube\.com    # or youtube.com followed by
      \S*             # Allow anything up to VIDEO_ID,
      [^\w\-\s]       # but char before ID is non-ID char.
    )                 # End host alternatives.
    ([\w\-]{11})      # $1: VIDEO_ID is exactly 11 chars.
    (?=[^\w\-]|$)     # Assert next char is non-ID or EOS.
    (?!               # Assert URL is not pre-linked.
      [?=&+%\w]*      # Allow URL (query) remainder.
      (?:             # Group pre-linked alternatives.
        [\'"][^<>]*>  # Either inside a start tag,
      | </a>          # or inside <a> element text contents.
      )               # End recognized pre-linked alts.
    )                 # End negative lookahead assertion.
    [?=&+%\w]*        # Consume any URL (query) remainder.
    ~ix'
/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube\.com\S*[^\w\-\s])([\w\-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:['"][^<>]*>|<\/a>))[?=&+%\w]*/ig;
mossaab asked Oct 25 '11 19:10


2 Answers

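First of all, you need to insert an extra backslash (\) for each backslash in the original regex; otherwise Java treats the backslash as the start of a string-literal escape sequence, which is not what you want here. The double quote inside the character class also has to be escaped (\") because the pattern is written as a Java string literal.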

https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*
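To see what the doubling does, here is a minimal sketch using a small fragment of the pattern. The class name EscapeDemo and the matchesFragment helper are made up for illustration; only java.util.regex.Pattern from the standard library is used.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The Java source "youtu\\.be\\/" becomes the 11-character string youtu\.be\/,
    // which is what the regex engine actually receives.
    static final String FRAGMENT = "youtu\\.be\\/";

    // Hypothetical helper: true when the whole input matches the fragment.
    static boolean matchesFragment(String input) {
        return Pattern.matches(FRAGMENT, input);
    }

    public static void main(String[] args) {
        System.out.println(FRAGMENT);                     // the regex as the engine sees it
        System.out.println(matchesFragment("youtu.be/")); // the escaped dot matches a literal dot
        System.out.println(matchesFragment("youtuXbe/")); // but not an arbitrary character
    }
}
```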

Next, when you compile your pattern, you need to add the CASE_INSENSITIVE flag, which is the equivalent of the i modifier in the original regex. Here's an example:

String pattern = "https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*";

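// Requires java.util.regex.Pattern and java.util.regex.Matcher.
Pattern compiledPattern = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = compiledPattern.matcher(link); // link holds the text to search
while (matcher.find()) {
    // group(1) is the 11-character video ID; group() would return the whole matched URL.
    System.out.println(matcher.group(1));
}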
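Putting the pieces above together, a self-contained sketch might look like the following. The class name YouTubeIdExtractor, the extractIds helper, and the sample text are illustrative assumptions, not part of the original answer. The pattern is the one shown above, and group(1) is used because the first capturing group holds the 11-character ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YouTubeIdExtractor {
    // Compiled once and reused; CASE_INSENSITIVE mirrors the "i" modifier of the original.
    private static final Pattern YOUTUBE_URL = Pattern.compile(
        "https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])"
        + "([\\w\\-]{11})(?=[^\\w\\-]|$)"
        + "(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the captured ID of every match, in order of appearance.
    static List<String> extractIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = YOUTUBE_URL.matcher(text);
        while (matcher.find()) {
            ids.add(matcher.group(1)); // group(1) is the ID; group() would be the whole URL
        }
        return ids;
    }

    public static void main(String[] args) {
        String sample = "Watch https://www.youtube.com/watch?v=dQw4w9WgXcQ "
                      + "and http://youtu.be/9bZkp7q19f0 today.";
        System.out.println(extractIds(sample));
    }
}
```

Returning a list rather than printing inside the loop makes it easier to deduplicate or store the IDs afterwards when several pages are parsed.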
Marcus answered Oct 09 '22 17:10


Marcus's regex above is good, but I found that it doesn't recognize YouTube links that have "www" but no "http(s)" in them, for example www.youtube....

Here is an updated version:

^(?:https?:\\/\\/)?(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*

It's the same except for the start: the scheme is now optional, and the pattern is anchored with ^.
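One thing to keep in mind with this version: because of the leading ^ (and without Pattern.MULTILINE), it can only match at the very start of the input, so it suits checking a single link rather than scanning a whole page. The sketch below illustrates this. The class name, the firstId helper, and the UNANCHORED variant (the same pattern with the ^ dropped) are hypothetical additions for illustration, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalSchemeDemo {
    // The updated pattern without its leading ^, split only for readability.
    private static final String BODY =
        "(?:https?:\\/\\/)?(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])"
        + "([\\w\\-]{11})(?=[^\\w\\-]|$)"
        + "(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*";

    // ANCHORED is the pattern as posted; UNANCHORED is an assumed variant for scanning text.
    static final Pattern ANCHORED = Pattern.compile("^" + BODY, Pattern.CASE_INSENSITIVE);
    static final Pattern UNANCHORED = Pattern.compile(BODY, Pattern.CASE_INSENSITIVE);

    // Returns the first captured ID, or null when nothing matches.
    static String firstId(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String bare = "www.youtube.com/watch?v=dQw4w9WgXcQ";
        String embedded = "see " + bare;
        System.out.println(firstId(ANCHORED, bare));       // link at the start of the input
        System.out.println(firstId(ANCHORED, embedded));   // link preceded by other text
        System.out.println(firstId(UNANCHORED, embedded)); // same text, no anchor
    }
}
```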

Blagoj Atanasovski answered Oct 09 '22 18:10