
Natural Language Processing - Converting unstructured bibliography to structured metadata

Tags:

java

nlp

crf++

I am currently working on a natural language processing project in which I need to convert the unstructured bibliography section (found at the end of a research article) into structured metadata such as "Year", "Author", "Journal", "Volume ID", "Page Number", "Title", etc.


For example, input:

McCallum, A.; Nigam, K.; and Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, 169–178

Expected output:

<Author>McCallum, A.</Author> <Author>Nigam, K.</Author> <Author>Ungar, L. H.</Author>
<Year>2000</Year>
<Title>Efficient clustering of high-dimensional data sets with application to reference matching</Title> and so on

Tool used: CRFsuite


Data-set: 12000 references, which contain:

  1. Journal titles
  2. Words from article titles
  3. Location names

Each word in a given line is treated as a token, and for each token I derive the following features:

  1. BOR: the token is at the start of the line
  2. EOR: the token is at the end of the line
  3. digitFeature: the token is a digit
  4. Year: the token is in year format, like 19** or 20**
  5. available in current data-set,
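
The first four features above could be computed roughly as follows. This is a minimal sketch, not the asker's actual code: the class and method names are hypothetical, stripping surrounding punctuation before the digit and year checks is an assumption (so that "(2000)." counts as a year), and the tab-separated `label<TAB>attribute...` line reflects CRFsuite's data format as commonly documented. The fifth feature is left out because the list does not say what it is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: derives BOR, EOR, digitFeature and Year for each token.
class ReferenceFeatures {
    // Year format from the question: 19** or 20**
    private static final Pattern YEAR = Pattern.compile("(19|20)\\d{2}");
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Assumption: drop ASCII punctuation at both ends, e.g. "(2000)." -> "2000".
    static String core(String token) {
        return token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    // One feature list per whitespace-separated token of the reference line.
    static List<List<String>> extract(String reference) {
        String[] tokens = reference.trim().split("\\s+");
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String> features = new ArrayList<>();
            String core = core(tokens[i]);
            if (i == 0) features.add("BOR");
            if (i == tokens.length - 1) features.add("EOR");
            if (DIGITS.matcher(core).matches()) features.add("digitFeature");
            if (YEAR.matcher(core).matches()) features.add("Year");
            rows.add(features);
        }
        return rows;
    }

    // One training line for one token: label, then attributes, tab-separated.
    static String toCrfsuiteLine(String label, List<String> features) {
        return features.isEmpty() ? label : label + "\t" + String.join("\t", features);
    }

    public static void main(String[] args) {
        String ref = "McCallum, A. (2000). Efficient clustering 169-178";
        for (List<String> row : extract(ref)) {
            System.out.println(row);
        }
    }
}
```

Each reference would be written as one block of such lines, with a blank line between references.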

With this tool and data-set I get only 63.7% accuracy. Accuracy is very low for "Title" and good for "Year" and "Volume".
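
A per-label breakdown makes it easier to see which fields pull the overall figure down. The following is a minimal sketch, assuming the gold and predicted labels are available as two parallel lists. The class name and the sample labels in `main` are made up for illustration. The value computed for each label is the fraction of gold tokens with that label that were predicted correctly (per-label recall).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: share of correctly predicted tokens per gold label.
class PerLabelAccuracy {
    static Map<String, Double> compute(List<String> gold, List<String> predicted) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("label lists differ in length");
        }
        // counts[0] = correct predictions, counts[1] = total gold tokens
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (int i = 0; i < gold.size(); i++) {
            int[] c = counts.computeIfAbsent(gold.get(i), k -> new int[2]);
            if (gold.get(i).equals(predicted.get(i))) {
                c[0]++;
            }
            c[1]++;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        counts.forEach((label, c) -> result.put(label, (double) c[0] / c[1]));
        return result;
    }

    public static void main(String[] args) {
        // Made-up labels, only to show the call.
        List<String> gold = Arrays.asList("Title", "Title", "Year", "Title");
        List<String> predicted = Arrays.asList("Title", "Journal", "Year", "Author");
        System.out.println(compute(gold, predicted));
    }
}
```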

Questions:

  1. Are there any additional features I could derive?
  2. Is there another tool I could use?
Somnath Kadam asked Aug 26 '15 08:08



1 Answer

I'd suggest basing your solution on existing approaches. Take a look, for example, at this paper:

Park, Sung Hee, Roger W. Ehrich, and Edward A. Fox. "A hybrid two-stage approach for discipline-independent canonical representation extraction from references." Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries. ACM, 2012.

Sections 3.2 and 4.2 provide descriptions of dozens of features.
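
The paper's feature list is not reproduced here. Purely as an illustration of the kind of feature often added in CRF sequence labelling (an assumption, not something taken from the paper), the sketch below computes a coarse word shape for each token and copies the shapes of the neighbouring tokens as context features. The class name and the `shape[-1]=` style attribute names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: word-shape features with a one-token context window.
class ContextFeatures {
    // Uppercase -> 'A', lowercase -> 'a', digit -> '9', other characters kept;
    // runs of the same shape character are collapsed.
    static String shape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char ch : token.toCharArray()) {
            char s = Character.isUpperCase(ch) ? 'A'
                   : Character.isLowerCase(ch) ? 'a'
                   : Character.isDigit(ch) ? '9' : ch;
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != s) {
                sb.append(s);
            }
        }
        return sb.toString();
    }

    // For each token: its own shape plus the shapes of its left and right neighbours.
    static List<List<String>> extract(String[] tokens) {
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String> features = new ArrayList<>();
            features.add("shape[0]=" + shape(tokens[i]));
            if (i > 0) {
                features.add("shape[-1]=" + shape(tokens[i - 1]));
            }
            if (i + 1 < tokens.length) {
                features.add("shape[+1]=" + shape(tokens[i + 1]));
            }
            rows.add(features);
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] tokens = {"McCallum,", "A.", "(2000)."};
        for (List<String> row : extract(tokens)) {
            System.out.println(row);
        }
    }
}
```

Context features of this kind let the model use the neighbouring tokens, for instance that a capitalised word follows a parenthesised year, instead of looking at each token in isolation.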

As for CRF implementations, there are other tools, like this one, but I don't think the tool is the source of the low accuracy.

Nikita Astrakhantsev answered Oct 26 '22 13:10
