Natural Language Processing - Converting unstructured bibliography to structured metadata

Question

I am currently working on a natural language processing project in which I need to convert the unstructured bibliography section (found at the end of a research article) into structured metadata such as "Year", "Author", "Journal", "Volume ID", "Page Number", "Title", etc.

Example input:

McCallum, A.; Nigam, K.; and Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, 169–178

Expected output:

<Author> McCallum, A.</Author> <Author>Nigam, K.</Author> <Author>Ungar, L. H.</Author>
<Year> 2000 </Year>
<Title>Efficient clustering of high-dimensional data sets with application to reference matching</Title> and so on

Tool used: CRFsuite

Data set: 12,000 references. It contains:

Journal titles,
Article title words,
Location names.

Each word in a given line is treated as a token, and for each token I derive the following features:

BOR: the token is at the start of the line,
EOR: the token is at the end of the line,
digitFeature: the token is a digit,
Year: the token is in a year format such as 19** or 20**,
whether the token is available in the current data set.
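To make the feature list above concrete, here is a minimal Python sketch of how such per-token features might be computed before they are written out for CRFsuite. The function names, the `inDataset` feature name, the whitespace tokenization, and the `lexicon` argument (standing in for the journal titles, title words and location names of the data set) are assumptions for illustration, not part of the original setup.

```python
import re

# Assumption: a "year" is a 4-digit number starting with 19 or 20,
# optionally wrapped in parentheses and followed by punctuation.
YEAR_RE = re.compile(r"^\(?(19|20)\d{2}\)?[.,;]?$")


def token_features(tokens, i, lexicon=frozenset()):
    """Return a feature dict for the i-th token of one reference line."""
    tok = tokens[i]
    core = tok.strip("().,;:")  # token without surrounding punctuation
    return {
        "BOR": i == 0,                         # beginning of reference
        "EOR": i == len(tokens) - 1,           # end of reference
        "digitFeature": core.isdigit(),        # token consists of digits only
        "Year": bool(YEAR_RE.match(tok)),      # looks like 19** or 20**
        "inDataset": core.lower() in lexicon,  # hypothetical lexicon lookup
    }


def reference_features(line, lexicon=frozenset()):
    """Split a reference on whitespace and featurize every token."""
    tokens = line.split()
    return [token_features(tokens, i, lexicon) for i in range(len(tokens))]
```

For example, `reference_features("Ungar, L. H. (2000). Efficient clustering", lexicon={"clustering"})` yields one dict per token, which can then be serialized in whatever attribute format the CRF tool expects.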

With the above tool and data set I got only 63.7% accuracy. Accuracy is very low for "Title" and good for "Year" and "Volume".
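A per-label breakdown like the one described here can be computed from token-level gold and predicted labels. The sketch below assumes both are available as flat, equal-length lists of label strings; the function name and the toy labels are hypothetical and are not the actual results.

```python
from collections import defaultdict


def per_label_accuracy(gold, predicted):
    """Fraction of tokens of each gold label that were predicted correctly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy illustration only (made-up labels, not the real data set).
gold = ["Author", "Author", "Year", "Title", "Title", "Title", "Title"]
pred = ["Author", "Author", "Year", "Title", "Journal", "Journal", "Title"]
scores = per_label_accuracy(gold, pred)
```

Looking at which labels the "Title" tokens are confused with (for instance "Journal") usually says more about which features are missing than the overall accuracy does.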

Questions:

Can I derive any additional features?
Can I use any other tool?

Nikita Astrakhantsev · Accepted Answer

I'd suggest basing the solution on existing approaches. Take a look, for example, at this paper:

Park, Sung Hee, Roger W. Ehrich, and Edward A. Fox. "A hybrid two-stage approach for discipline-independent canonical representation extraction from references." Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries. ACM, 2012.

Sections 3.2 and 4.2 provide descriptions of dozens of features.
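As a rough illustration of the kind of additional features that are often tried for reference parsing, here is a small Python sketch covering word shape, punctuation, relative position and neighbouring tokens. These particular features and all names below are assumptions chosen for illustration; they are not taken from the paper's feature list.

```python
import re

# Assumption: a page range is two numbers joined by a hyphen or en dash.
PAGE_RANGE_RE = re.compile(r"^\d+[-\u2013]\d+[.,]?$")


def word_shape(token):
    """Collapse a token to its shape, e.g. 'McCallum,' becomes 'XxXx,'."""
    shape = []
    for ch in token:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch
        if not shape or shape[-1] != c:  # merge runs of the same class
            shape.append(c)
    return "".join(shape)


def extra_features(tokens, i):
    """Orthographic, positional and context features for the i-th token."""
    tok = tokens[i]
    return {
        "shape": word_shape(tok),
        "isInitial": len(tok) == 2 and tok[0].isupper() and tok[1] == ".",
        "endsWithComma": tok.endswith(","),
        "endsWithPeriod": tok.endswith("."),
        "hasPageRange": bool(PAGE_RANGE_RE.match(tok)),
        "posBucket": int(4 * i / len(tokens)),  # quarter of the line, 0..3
        "prev": tokens[i - 1].lower() if i > 0 else "<BOR>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOR>",
    }
```

Context features such as `prev` and `next` tend to matter for fields like "Title", whose tokens look like ordinary words and are mostly identifiable by where they sit relative to the year and the venue.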

As for CRF implementations, there are other tools like this one, but I don't think the tool is the source of the low accuracy.

Tags:

java

nlp

crf++

Somnath Kadam
