
How to search for a person's name in a text? (heuristic)

I have a huge list of people's full names that I must search for in a huge text.

Only part of a name may appear in the text, and it may be misspelled, mistyped, or abbreviated. The text is not tokenized, so I don't know where a person's name starts. I also don't know whether a given name appears in the text at all.

Example:

I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:

  • ...The candidate Barack Obama was elected the president of the United States... (incomplete)
  • ...The candidate Barack Hussein was elected the president of the United States... (incomplete)
  • ...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
  • ...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
  • ...The candidate Barack OVama was elected the president of the United States... (mistyped, B is next to V)
  • ...The candidate John McCain lost the election... (no occurrence of Obama's name)

Certainly there isn't a deterministic solution for it, but...

What is a good heuristic for this kind of search?

If you had to, how would you do it?

Daniel Silveira asked Dec 10 '22 22:12


2 Answers

You said it's about 200 pages.

Divide it into 200 one-page PDFs.

Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.

Joel Spolsky answered Jan 21 '23 20:01


Split the text on whitespace and strip special characters (commas, periods, etc.). Then use a phonetic algorithm such as Soundex to handle misspellings. If you need to search a lot of documents, you could go with a full-text search library such as Lucene instead.
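A rough sketch of that split-and-Soundex idea, in Python (the question doesn't name a language, so that choice is an assumption). The helper names `soundex`, `tokenize`, `part_matches`, and `find_name` are hypothetical, as are the `min_parts` threshold and the rule that treats a lone letter as an initial; neither comes from the question. It is a heuristic, not a complete solution: Soundex keeps the first letter, so a typo in the first letter of a name part will be missed.

```python
import re

# American Soundex digit for each consonant group.
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of a word."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def tokenize(text: str) -> list[str]:
    """Split on anything that is not an ASCII letter (drops punctuation)."""
    return re.findall(r"[A-Za-z]+", text)


def part_matches(part: str, token: str) -> bool:
    # Assumption: a single-letter token such as "H." is an initial.
    if len(token) == 1:
        return part[0].lower() == token.lower()
    return soundex(part) == soundex(token)


def find_name(full_name: str, text: str, min_parts: int = 2) -> list[str]:
    """Return runs of tokens that match at least min_parts name parts, in order."""
    parts = tokenize(full_name)
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        matched, p = [], 0
        for tok in tokens[i:i + len(parts)]:
            # Skip name parts (e.g. a missing middle name) until one matches.
            q = p
            while q < len(parts) and not part_matches(parts[q], tok):
                q += 1
            if q == len(parts):
                break
            matched.append(tok)
            p = q + 1
        if len(matched) >= min_parts:
            hits.append(" ".join(matched))
            i += len(matched)  # do not report overlapping hits
        else:
            i += 1
    return hits
```

Called as `find_name("Barack Hussein Obama", text)` on each of the example sentences from the question, this flags the first five and returns an empty list for the John McCain sentence. Raising `min_parts` reduces false positives at the cost of missing names that appear with a single part only.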

joegtp answered Jan 21 '23 19:01