I have a table of citations that includes the last name of the first author, the title, journal, year, and page numbers for each citation. I have posted the first few lines of the table on Google Docs; it is also available in the form of a CSV file. (Notice that some records do not have a DOI.) I would like to be able to query the DOI for each of these citations. For the titles, it would be best if the query could handle some form of fuzzy matching. How can I do this? The table is currently in MySQL, but it would be sufficient to start and end with a CSV file or, since I mostly use R, an R data frame. (I would appreciate an answer that goes from start to finish.)

Given a table of citations, how to reverse-lookup the Digital Object Identifier for each of the citations?

2 Answers

I don’t know of any complete packages or functions that do this already, but this is the general approach I would use. The Crossref DOI registration agency offers a Web-based approach for determining the DOI from bibliographic data at https://www.crossref.org/guestquery/.

On that page are several different ways to search, including the last one, which takes an XML-formatted query. The page includes information on how to create the appropriate XML. You would need to submit the XML over HTTP (determining the details by picking apart the page to figure out form destinations and any additional information that needs to be included), and then parse the response.

Additionally, you would need to verify that doing this in an automated manner does not violate the terms of service of the website in any way.
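
As a rough illustration of the submit-and-parse step, here is a minimal Python sketch. It only prepares the HTTP request and pulls DOIs out of a response string. The endpoint URL, the form field names (`pid`, `format`, `qdata`), and the use of POST are all assumptions on my part and must be checked against the Crossref page and documentation; `build_request` and `extract_dois` are hypothetical helper names.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: placeholder endpoint. Confirm the real form destination
# (and whether it expects GET or POST) before using this.
QUERY_URL = "https://doi.crossref.org/servlet/query"


def build_request(query_xml, email, url=QUERY_URL):
    """Prepare (but do not send) a form-encoded request carrying the query XML."""
    # Assumption: these field names are illustrative, not confirmed.
    form = {"pid": email, "format": "unixref", "qdata": query_xml}
    data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def extract_dois(response_xml):
    """Return the text of every element whose local name is 'doi'."""
    root = ET.fromstring(response_xml)
    dois = []
    for element in root.iter():
        # Strip any '{namespace}' prefix so the lookup is namespace-agnostic.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "doi" and element.text and element.text.strip():
            dois.append(element.text.strip())
    return dois
```

Sending the request would then be a call such as `urllib.request.urlopen(build_request(xml_text, "you@example.org"))`, followed by `extract_dois` on the decoded body. Matching on the local element name is a deliberate shortcut, since I am not certain of the exact response schema.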

Below is the XML form for the Crossref free DOI lookup, where the searchable terms include article_title, author, year, journal_title, volume, and first_page:

<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
  <query enable-multiple-hits="false|exact|multi_hit_per_rule|one_hit_per_rule|true"
            list-components="false"
            expanded-results="false" key="key">
    <article_title match="fuzzy"></article_title>
    <author search-all-authors="false"></author>
    <component_number></component_number>
    <edition_number></edition_number>
    <institution_name></institution_name>
    <isbn></isbn>
    <issn></issn>
    <volume></volume>
    <issue></issue>
    <year></year>
    <first_page></first_page>
    <journal_title></journal_title>
    <proceedings_title></proceedings_title>
    <series_title></series_title>
    <volume_title></volume_title>
    <unstructured_citation></unstructured_citation>
  </query>
</body>
</query_batch>
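
To go from a table of citations to a query like the one above, something along these lines could work. This is a sketch, not a tested client: the dictionary keys (`id`, `author`, `title`, `journal`, `year`, `pages`) are hypothetical column names for the citation table, `build_query_batch` is a made-up helper, and the `xsi:schemaLocation` attribute is left out for brevity. Element order follows the template above.

```python
import xml.etree.ElementTree as ET

NS = "http://www.crossref.org/qschema/2.0"
ET.register_namespace("", NS)  # serialize with a default xmlns, as in the template


def _add(parent, tag, text, attrs=None):
    """Append a namespaced child element only when the text is non-empty."""
    text = (text or "").strip()
    if text:
        child = ET.SubElement(parent, f"{{{NS}}}{tag}", attrs or {})
        child.text = text


def build_query_batch(citations, email, batch_id="doi-lookup"):
    """Build a query_batch document with one <query> per citation dict."""
    root = ET.Element(f"{{{NS}}}query_batch", {"version": "2.0"})
    head = ET.SubElement(root, f"{{{NS}}}head")
    _add(head, "email_address", email)
    _add(head, "doi_batch_id", batch_id)
    body = ET.SubElement(root, f"{{{NS}}}body")
    for index, row in enumerate(citations):
        query = ET.SubElement(body, f"{{{NS}}}query", {
            "key": str(row.get("id", index)),  # lets you match results to rows
            "enable-multiple-hits": "false",
        })
        _add(query, "article_title", row.get("title"), {"match": "fuzzy"})
        _add(query, "author", row.get("author"), {"search-all-authors": "false"})
        _add(query, "year", row.get("year"))
        # Hypothetical 'pages' column such as "12-20": keep only the first page.
        pages = (row.get("pages") or "").replace("\u2013", "-")
        _add(query, "first_page", pages.split("-")[0])
        _add(query, "journal_title", row.get("journal"))
    return ET.tostring(root, encoding="unicode")
```

Rows read with `csv.DictReader` can be passed in directly as `citations`, provided the column names are mapped to the keys used here. Empty fields are skipped rather than sent as empty elements.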

answered Sep 22 '22 04:09

Brian Diggs

This is an open problem. There are better and worse ways to attack it. Start by reading Karen Coyle’s summary of the problem. The bibliography at the end of that article is excellent.

In short, the problem of quantifying sameness between two bibliographic records is hard, and a substantial amount of machine-learning research has been done around this topic.
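
For a sense of what "quantifying sameness" can mean in the simplest case, here is a small Python sketch that scores two titles with the standard library's `difflib`. This is a crude baseline, not one of the machine-learning approaches from the literature; the helper names and the 0.9 threshold are arbitrary choices for illustration.

```python
import difflib
import re


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())


def title_similarity(a, b):
    """Similarity in [0, 1] between two titles after normalization."""
    matcher = difflib.SequenceMatcher(None, normalize_title(a), normalize_title(b))
    return matcher.ratio()


def same_title(a, b, threshold=0.9):
    """Treat two titles as the same work when similarity meets the threshold."""
    return title_similarity(a, b) >= threshold
```

A check like `same_title(table_title, returned_title)` could be used to sanity-check a DOI lookup result against the title in the original table. Titles that differ in subtitle, translation, or abbreviation will defeat a character-level score like this one.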

answered Sep 19 '22 04:09

meawoppl

Tags:

r

xml

web-scraping

mechanize

doi

David LeBauer