I have a huge list of people's full names that I must search for in a huge text.
Only part of a name may appear in the text, and it may be misspelled, mistyped, or abbreviated. The text is not tokenized, so I don't know where a person's name starts. I also don't know whether a given name appears in the text at all.
Example:
I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:
Certainly there isn't a deterministic solution for it, but...
What is a good heuristic for this kind of search?
If you had to, how would you do it?
You said it's about 200 pages.
Divide it into 200 one-page PDFs.
Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.
Split the text on spaces and strip special characters (commas, periods, etc.). Then use something like Soundex to handle misspellings. If you need to search a lot of documents, you could go with something like Lucene instead.
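A rough sketch of that idea in plain Python, in case it helps. The function names `soundex` and `find_names` are made up for this example, and the Soundex here is a hand-rolled version of the classic American variant, so treat it as a starting point rather than a finished matcher. It only catches misspellings that keep the first letter and sound alike. It will not catch bare initials or nicknames.

```python
import re

# Classic Soundex consonant groups; vowels, h, w and y get no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return a 4-character Soundex code, or "" if the word has no letters."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def find_names(names, text):
    """Map each full name to the parts of it that sound like some token in text."""
    tokens = re.findall(r"[A-Za-z]+", text)  # drops commas, periods, digits
    token_codes = {soundex(t) for t in tokens}
    hits = {}
    for name in names:
        matched = [part for part in name.split() if soundex(part) in token_codes]
        if matched:
            hits[name] = matched
    return hits


if __name__ == "__main__":
    names = ["Barack Hussein Obama", "Angela Merkel"]
    text = "Yesterday Barak Obamma spoke, says H. Clinton."
    print(find_names(names, text))
```

With the sample text above this reports `Barack` and `Obama` as matched parts of "Barack Hussein Obama" and nothing for "Angela Merkel". You would probably want to require more than one matching part, or a rare surname, before calling it a hit, since short common words can collide on the same code.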