Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

I have a list of names, some of them are fake, I need to use NLP and Python 3.1 to keep the real names and throw out the fake names

Tags:

python-3.x

nlp

I have no clue of where to start on this. I've never done any NLP and only programmed in Python 3.1, which I have to use. I'm looking at the site http://www.linkedin.com and I have to gather all of the public profiles and some of them have very fake names, like 'aaaaaa k dudujjek' and I've been told I can use NLP to find the real names, where would I even start?

like image 997
VolatileRig Avatar asked Oct 14 '22 08:10

VolatileRig


2 Answers

This is a difficult problem to solve, and one which starts with acquiring valid given name & surname lists.

How large is the set of names that you're evaluating, and where do they come from? These are both important things for you to consider. If you're evaluating a small set of "American" names, your valid name lists will differ greatly from lists of Japanese or Indian names, for instance.

Your idea of scraping LinkedIn is on the right track, but you were right to catch the fake profile/name flaw. A better website would probably be something like IMDB (perhaps scraping names by iterating over different birth years), or Wikipedia's lists of most popular given names and most common surnames.

When it comes down to it, this is a precision vs. recall problem: in order to miss fewer fakes, you're inevitably going to throw out some real names. If you loosen up your restrictions, you'll get more fakes, but you'll also throw out fewer real names.

like image 133
Mike Cialowicz Avatar answered Oct 18 '22 13:10

Mike Cialowicz


Several possibilities here, but the most obvious seems to be with HMMs, i.e. Hidden Markov Models. The NLTK kit includes [at least] one module for HMMs, although I must admit I never used it.

Another possible snag is that AFAIK, NTLK is not yet ported to Python 3.0

This said, and while I'm quite keen on using NLP techniques where applicable, I think that a process which would use several paradigms, including some NLP tricks may be a better solution for this particular problem. For example, storing even a reduced dictionary of common family names (and first names) in a traditional database may offer both a more reliable and more computationally efficient way of filtering a significant portion of the input data, leaving precious CPU resources to be spent on less obvious cases.

like image 27
mjv Avatar answered Oct 18 '22 15:10

mjv