I am working on a natural language processing project in which I need to convert the unstructured bibliography section at the end of a research article into structured metadata such as "Year", "Author", "Journal", "Volume ID", "Page Number", "Title", etc.
Example input:
McCallum, A.; Nigam, K.; and Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, 169–178
Expected output:
<Author>McCallum, A.</Author> <Author>Nigam, K.</Author> <Author>Ungar, L. H.</Author>
<Year>2000</Year>
<Title>Efficient clustering of high-dimensional data sets with application to reference matching</Title> and so on
Tool used: CRFsuite
Data set: 12,000 references
Each word in a line is treated as a token, and for each token I derive the following features
With the above tool and data set I get only 63.7% accuracy. Accuracy is very low for "Title" and good for "Year" and "Volume".
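A per-field breakdown like the one above can be computed from the gold and predicted label sequences. The following is only a minimal sketch: the function name `per_label_accuracy` and the toy labels are hypothetical, and it measures the fraction of gold tokens of each label that were predicted correctly, which may differ from how CRFsuite itself reports scores.

```python
from collections import Counter

def per_label_accuracy(gold, pred):
    """Fraction of tokens of each gold label that were predicted correctly."""
    total = Counter()
    correct = Counter()
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / total[label] for label in total}

# Toy data (made up for illustration, not taken from the real data set).
gold = ["Author", "Author", "Year", "Title", "Title", "Title"]
pred = ["Author", "Author", "Year", "Title", "Author", "Year"]
scores = per_label_accuracy(gold, pred)
for label, score in scores.items():
    print(f"{label}: {score:.2f}")
```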
Questions:
I'd suggest basing your solution on existing approaches. Take a look, for example, at this paper:
Park, Sung Hee, Roger W. Ehrich, and Edward A. Fox. "A hybrid two-stage approach for discipline-independent canonical representation extraction from references." Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries. ACM, 2012.
Sections 3.2 and 4.2 provide descriptions of dozens of features.
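To give an idea of what token-level features for reference parsing can look like, here is a minimal sketch in Python. The feature names and regular expressions are my own assumptions for illustration, not the feature set from the paper or from the question. One dict per token is, as far as I know, a shape the Python wrappers around CRFsuite accept as input.

```python
import re

def token_features(tokens, i):
    """Build a feature dict for tokens[i]. The feature set is hypothetical."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        # A four-digit year, optionally in parentheses, e.g. "(2000)."
        "is_year": bool(re.fullmatch(r"\(?(1[89]|20)\d{2}\)?[.,;]?", tok)),
        # A page range with a hyphen or en dash, e.g. "169–178"
        "is_page_range": bool(re.fullmatch(r"\d+[-–]\d+[.,;]?", tok)),
        # A single-letter initial, e.g. "A.;" or "L."
        "is_initial": bool(re.fullmatch(r"[A-Z]\.[,;]?", tok)),
        "ends_with_punct": tok[-1:] in ".,;:",
        # Rough relative position of the token within the reference
        "rel_pos": round(i / max(len(tokens) - 1, 1), 1),
    }
    # Context features: neighbouring tokens, or sequence boundary markers.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

def reference_features(line):
    """Split a reference on whitespace and featurize every token."""
    tokens = line.split()
    return [token_features(tokens, i) for i in range(len(tokens))]

feats = reference_features("Ungar, L. H. (2000). Efficient clustering 169–178")
```

Context features (previous and next token) and position within the reference are the kind of signal that tends to matter for long, free-form fields such as the title, where the token itself says little about its label.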
As for CRF implementations, there are other tools like this one, but I don't think the choice of tool is the source of the low accuracy.