
Given a table of citations, how to reverse-lookup the Digital Object Identifier for each of the citations?

I have a table of citations that includes the last name of the first author, the title, journal, year, and page numbers for each citation.

I have posted the first few lines of the table on Google Docs; it is also available in the form of a CSV file. (Notice that some records do not have a DOI.)

I would like to be able to query the DOI for each of these citations. For the titles, it would be best if the query could handle some form of fuzzy matching.

How can I do this?

The table is currently in MySQL, but it would be sufficient to start and end with a CSV file or, since I mostly use R, an R data frame. (I would appreciate an answer that goes from start to finish.)

David LeBauer asked Mar 14 '12 at 22:03



2 Answers

I don’t know of any complete packages or functions that do this already, but this is the general approach I would use. The Crossref DOI registration agency offers a Web-based approach for determining the DOI from bibliographic data at https://www.crossref.org/guestquery/.

That page offers several ways to search, including the last one, which takes an XML-formatted query. The page includes information on how to create the appropriate XML. You would need to submit the XML over HTTP (working out the details by picking apart the page to find the form destination and any additional fields that must be included), and then parse the response.

Additionally, you would need to verify that doing this in an automated manner does not violate the terms of service of the website in any way.
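The submit-and-parse step might look roughly like the following sketch. The endpoint URL and the parameter names `pid`, `format`, and `qdata` are assumptions based on my recollection of Crossref's XML query interface, not something stated on the page above, so check them against Crossref's current documentation before relying on them. The parser does not assume a particular response schema; it simply collects the text of every element whose local name is `doi`.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: endpoint and parameter names must be verified against
# Crossref's current documentation; they are not given in the answer above.
QUERY_ENDPOINT = "https://doi.crossref.org/servlet/query"


def build_query_url(query_xml, email, endpoint=QUERY_ENDPOINT):
    """Return a GET URL that carries the XML query as a form parameter."""
    params = {"pid": email, "format": "unixref", "qdata": query_xml}
    return endpoint + "?" + urllib.parse.urlencode(params)


def extract_dois(response_xml):
    """Collect the text of every <doi> element, ignoring XML namespaces."""
    root = ET.fromstring(response_xml)
    dois = []
    for element in root.iter():
        # Tags look like "{namespace}name"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "doi" and element.text and element.text.strip():
            dois.append(element.text.strip())
    return dois
```

The resulting URL could be fetched with `urllib.request.urlopen`, ideally with a pause between requests. A response may contain more than one DOI per record, so the returned list still needs inspection before it is written back to the table.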


Below is the XML form for the Crossref free DOI lookup, where the searchable terms include article_title, author, year, journal_title, volume, and first_page:

<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>[email protected]</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
  <query enable-multiple-hits="false|exact|multi_hit_per_rule|one_hit_per_rule|true"
            list-components="false"
            expanded-results="false" key="key">
    <article_title match="fuzzy"></article_title>
    <author search-all-authors="false"></author>
    <component_number></component_number>
    <edition_number></edition_number>
    <institution_name></institution_name>
    <isbn></isbn>
    <issn></issn>
    <volume></volume>
    <issue></issue>
    <year></year>
    <first_page></first_page>
    <journal_title></journal_title>
    <proceedings_title></proceedings_title>
    <series_title></series_title>
    <volume_title></volume_title>
    <unstructured_citation></unstructured_citation>
  </query>
</body>
</query_batch>
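A minimal sketch of filling in this form from a table of citations, using only the Python standard library. The column names (`author`, `title`, `journal`, `year`, `pages`) are hypothetical stand-ins for the asker's table, the sample row is invented, and the `xsi:schemaLocation` attribute from the form above is omitted for brevity. Elements are emitted in the same order as in the form, in case the schema is order-sensitive.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "http://www.crossref.org/qschema/2.0"
# Serialize NS as the default namespace (no prefix); this is a global setting.
ET.register_namespace("", NS)


def q(tag):
    """Qualify a tag name with the query schema namespace."""
    return f"{{{NS}}}{tag}"


def build_query_batch(rows, email, batch_id="doi-lookup"):
    """Build one query_batch document with a <query> per citation row."""
    root = ET.Element(q("query_batch"), {"version": "2.0"})
    head = ET.SubElement(root, q("head"))
    ET.SubElement(head, q("email_address")).text = email
    ET.SubElement(head, q("doi_batch_id")).text = batch_id
    body = ET.SubElement(root, q("body"))
    for index, row in enumerate(rows):
        # The key lets each result be matched back to its input row.
        query = ET.SubElement(
            body, q("query"),
            {"key": f"row{index}", "enable-multiple-hits": "false"},
        )
        ET.SubElement(query, q("article_title"), {"match": "fuzzy"}).text = row["title"]
        ET.SubElement(query, q("author"), {"search-all-authors": "false"}).text = row["author"]
        ET.SubElement(query, q("year")).text = row["year"]
        # Assumes pages look like "10-20"; only the first page is searchable.
        first_page = row["pages"].split("-")[0].strip()
        if first_page:
            ET.SubElement(query, q("first_page")).text = first_page
        ET.SubElement(query, q("journal_title")).text = row["journal"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Invented example row, standing in for the real CSV file.
    sample_csv = (
        "author,title,journal,year,pages\n"
        "Smith,An example article title,Journal of Examples,2001,10-20\n"
    )
    rows = list(csv.DictReader(io.StringIO(sample_csv)))
    print(build_query_batch(rows, "you@example.org"))
```

Building the document with `ElementTree` rather than string concatenation means characters such as `&` in titles are escaped automatically.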
Brian Diggs answered Sep 22 '22 at 04:09


This is an open problem. There are better and worse ways to attack it. Start by reading Karen Coyle’s summary of the problem. The bibliography at the end of that article is excellent.

In short, the problem of quantifying sameness between two bibliographic records is hard, and a substantial amount of machine-learning research has been done around this topic.
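As a very simple baseline for scoring "sameness" between two titles, one could normalize both strings and compare them with the standard library's `difflib`. This is only an illustrative sketch, not one of the machine-learning approaches mentioned above; the function names and the 0.9 threshold are arbitrary choices for the example.

```python
import difflib
import re


def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.sub(r"[\W_]+", " ", title.lower()).split())


def title_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two normalized titles."""
    matcher = difflib.SequenceMatcher(None, normalize_title(a), normalize_title(b))
    return matcher.ratio()


def best_match(title, candidates, threshold=0.9):
    """Return the most similar candidate, or None if none reaches threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: title_similarity(title, candidate))
    return best if title_similarity(title, best) >= threshold else None
```

A title-only score like this will confuse errata, replies, and multi-part articles that share a title, which is one reason to compare year, journal, and page numbers as well.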

meawoppl answered Sep 19 '22 at 04:09