Finding a DOI in a document or page

Tags: regex, doi

The DOI system places basically no useful limitations on what constitutes a reasonable identifier. However, being able to pull DOIs out of PDFs, web pages, etc. is quite useful for citation information, etc.

Is there a reliable way to identify a DOI in a block of text without assuming the 'doi:' prefix? (any language acceptable, regexes preferred, and avoiding false positives a must)

864

asked Aug 26 '08 12:08

Kai

2 Answers

Ok, I'm currently extracting thousands of DOIs from free-form text (XML), and I realized that my previous approach had a few problems, namely with encoded entities and trailing punctuation. So I went back to the specification, and this is the best I could come up with.

The DOI prefix shall be composed of a directory indicator followed by a registrant code. These two components shall be separated by a full stop (period).

The directory indicator shall be "10". The directory indicator distinguishes the entire set of character strings (prefix and suffix) as digital object identifiers within the resolution system.

Easy enough. The initial \b prevents us from "matching" a "DOI" that doesn't start with 10.:

$pattern = '\b(10[.]';

The second element of the DOI prefix shall be the registrant code. The registrant code is a unique string assigned to a registrant.

Also, all assigned registrant codes are numeric and at least 4 digits long, so:

$pattern = '\b(10[.][0-9]{4,}';

The registrant code may be further divided into sub-elements for administrative convenience if desired. Each sub-element of the registrant code shall be preceded by a full stop.

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*';

The DOI syntax shall be made up of a DOI prefix and a DOI suffix separated by a forward slash.

However, this isn't absolutely necessary: section 2.2.3 states that uncommon suffix systems may use other conventions (such as 10.1000.123456 instead of 10.1000/123456). Let's cut ourselves some slack and require the slash anyway:

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/';
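As a rough check of the prefix rules collected so far, here is a minimal Python sketch. It assumes Python's `re` module instead of PCRE, and the names `PREFIX` and `find_prefixes` are made up for illustration.

```python
import re

# Prefix part of the pattern built so far: "10.", a registrant code of at
# least four digits, optional dotted sub-elements, then the forward slash.
PREFIX = re.compile(r'\b10[.][0-9]{4,}(?:[.][0-9]+)*/')

def find_prefixes(text):
    """Return every substring that looks like a DOI prefix followed by a slash."""
    return PREFIX.findall(text)

# "10.12/abc" is skipped because its registrant code has fewer than four digits.
print(find_prefixes("see 10.1000/123456 and 10.1016.12.31/nature, not 10.12/abc"))
```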

The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters of Unicode. The DOI suffix shall consist of a character string of any length chosen by the registrant. Each suffix shall be unique to the prefix element that precedes it. The unique suffix can be a sequential number, or it might incorporate an identifier generated from or based on another system.

Now this is where it gets trickier. In all the DOIs I have processed, I saw the characters ".-()/:-" in their suffixes (besides [0-9a-zA-Z], of course). So, while it doesn't exist, the DOI 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7 is completely plausible.

The logical choice would be to use \S or the [[:graph:]] PCRE POSIX class, so let's do that:

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/\S+'; // or

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/[[:graph:]]+';

Now we have a difficult problem: the [[:graph:]] class is a superset of the [[:punct:]] class, which includes characters easily found in free text or any markup language: "'&<> among others.

Let's just filter out the markup ones for now using a negative lookahead:

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+'; // or

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])[[:graph:]])+';

The above should cover encoded entities (&), attribute quotes (["']) and open / close tags ([<>]).
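To see what the lookahead buys, the following sketch compares the plain \S version with the filtered one on a small piece of markup. It uses Python's `re` module rather than PCRE (so only the \S variant, since `re` has no [[:graph:]] class), the closing parenthesis of the capture group is added so the patterns compile, and `GREEDY`, `FILTERED` and the sample anchor tag are illustrative assumptions.

```python
import re

# Suffix as any run of non-whitespace characters.
GREEDY = re.compile(r'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/\S+)')

# Same, but each suffix character must not be one of " & ' < >.
FILTERED = re.compile(r'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)')

snippet = '<a href="https://doi.org/10.1000/182">10.1000/182</a>'

# Without the lookahead, the match runs through the quote and the tags.
print(GREEDY.findall(snippet))

# With the lookahead, each DOI stops at the first markup character.
print(FILTERED.findall(snippet))
```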

Unlike markup languages, free text usually doesn't employ punctuation characters unless they are bounded by at least one space or placed at the end of a sentence, for instance:

This is a long DOI: 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7!!!

The solution here is to close our capture group and assert another word boundary:

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b'; // or

$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])[[:graph:]])+)\b';

And voilà, here is a demo.
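For readers without PCRE at hand, the final \S pattern can be tried in Python as sketched below. This is an adaptation, not the original demo: `DOI_PATTERN` and `extract_dois` are hypothetical names, and Python's `re` module stands in for PHP's preg functions.

```python
import re

# Final pattern from above, \S variant, compiled with Python's re module.
DOI_PATTERN = re.compile(
    r'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b'
)

def extract_dois(text):
    """Return all DOI-like strings found in a block of free text."""
    return DOI_PATTERN.findall(text)

text = "This is a long DOI: 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7!!!"

# The closing \b makes the match back off past the trailing "!!!".
print(extract_dois(text))
```

One consequence of the closing \b is worth knowing: a suffix that really does end in a non-word character, such as a closing parenthesis, is trimmed in the same way as trailing punctuation.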

answered Oct 01 '22 03:10

Alix Axel

CrossRef has a recommendation that they tested successfully on 99.3% of the DOIs known to them:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
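A possible Python rendering of that pattern is sketched below. Two things are adaptations rather than part of the recommendation as quoted: the dot after 10 is escaped so that it only matches a literal full stop, and the /i flag becomes `re.IGNORECASE`. The name `is_probable_doi` is hypothetical. Because of the ^ and $ anchors, this validates a candidate string as a whole and does not search inside a larger block of text.

```python
import re

# CrossRef-style pattern, adapted for Python: escaped dot, IGNORECASE for /i.
CROSSREF = re.compile(r'^10\.\d{4,9}/[-._;()/:A-Z0-9]+$', re.IGNORECASE)

def is_probable_doi(candidate):
    """Check whether the whole candidate string has the shape of a DOI."""
    return CROSSREF.match(candidate) is not None

print(is_probable_doi("10.1000/182"))                    # True
print(is_probable_doi("10.1016/S0735-1097(98)00347-7"))  # True
print(is_probable_doi("doi:10.1000/182"))                # False: anchored at the start
```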

answered Oct 01 '22 03:10

Katrin Leinweber
