
How to find abnormal ids among a large number of ids

We run an affiliate program. Users who sign up can gain points when they successfully recruit other users. However, spammers are abusing this program, and automatically signing up large numbers of accounts. We want to prevent this from happening by closing down clearly machine-generated accounts. My idea for this is to write a program to identify machine-generated account names, or at least select a subset for manual inspection.

So far, we have found that there are two types of abnormal ids:

  1. The first type consists of ids that look very similar to other ids, such as:

    • wss12345
    • wss12346
    • wss12347
    • test1
    • test2
    • ...
  2. The second type consists of ids that look randomly generated, with no apparent pattern, such as:

    • MiDjiSxxiDekiE
    • NiMjKhJixLy
    • DAFDAB7643
    • ...

For the first type, I use the Levenshtein (edit) distance. This method finds some of the ids illustrated in type 1. (I have implemented this, and it performs well.)
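For reference, the edit-distance approach can be sketched roughly as follows. This is a minimal illustration rather than the asker's actual implementation; the function names `levenshtein` and `near_duplicates` and the threshold `max_distance=1` are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def near_duplicates(ids, max_distance=1):
    """Return the ids that have at least one other id within max_distance.

    max_distance is a hypothetical threshold; tune it on real data.
    """
    flagged = set()
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            if levenshtein(x, y) <= max_distance:
                flagged.update((x, y))
    return flagged
```

Note that the pairwise comparison is quadratic in the number of ids, so on a large user table it would probably need some form of bucketing first (for example, by a shared prefix).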

For the second type, I can calculate the probability of each id, like this:

id = "DAFDAB7643"
p(id) = p(D)*p(A|D)*p(F|A)*p(D|F)*...*p(3|4)

So I can use the probability to filter out the abnormal ids. (Just an idea; I haven't tried it out.)
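A rough sketch of that idea is a character bigram model trained on ids believed to be genuine. Everything here is an assumption for illustration: the class name `BigramModel`, the `^` start marker (which turns p(D) into p(D|^)), add-one smoothing, and the use of the average log-probability per character so that long ids are not penalised merely for being long.

```python
import math
from collections import Counter

START = "^"  # hypothetical start-of-id marker, assumed never to occur in ids


class BigramModel:
    def __init__(self, training_ids):
        self.pair_counts = Counter()     # counts of (previous, current) pairs
        self.context_counts = Counter()  # counts of each previous character
        self.alphabet = {START}
        for name in training_ids:
            chars = START + name
            self.alphabet.update(chars)
            for prev, cur in zip(chars, chars[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def avg_log_prob(self, name):
        """Average log p(cur | prev) over the id, with add-one smoothing."""
        chars = START + name
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            num = self.pair_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + vocab
            total += math.log(num / den)
        return total / max(len(chars) - 1, 1)
```

Ids whose score falls below some threshold, chosen by looking at the score distribution of known-good accounts, could then be queued for manual inspection.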

Can anyone give me other suggestions about this topic? How else could I approach this problem? Can you see flaws or omissions in my attempts?

asked Nov 12 '22 23:11 by Tim


1 Answer

  1. Assuming that these new accounts refer back to the recruiter's ID, I'd look at the rate and/or sheer number of new accounts associated with a given recruiter.

  2. Some analysis on IP addresses or similar may also indicate if multiple users are coming from the same computer.

  3. I'd use a dictionary of words, and do roughly the reverse of detecting poor passwords -- human user names tend to contain dictionary words or personal names, lack punctuation, avoid repeated characters, be mostly lower case, etc.

  4. Going back to 1. above -- if a recruiter has an anomalously tight cluster of IDs, measured using the features you've already identified, that would be a good flag. I think this is, essentially, @larsmans' comment directly under the question.
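To make item 1 concrete, here is a minimal sketch that finds, for each recruiter, the largest number of referrals falling inside any one-hour window. The input shape (a list of `(recruiter_id, signup_time)` pairs), the function name, and the one-hour default are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def max_signups_in_window(signups, window=timedelta(hours=1)):
    """Map each recruiter to the peak number of referrals in any window."""
    by_recruiter = defaultdict(list)
    for recruiter, when in signups:
        by_recruiter[recruiter].append(when)

    peaks = {}
    for recruiter, times in by_recruiter.items():
        times.sort()
        best = 0
        left = 0  # index of the oldest signup still inside the window
        for right, t in enumerate(times):
            while t - times[left] > window:
                left += 1
            best = max(best, right - left + 1)
        peaks[recruiter] = best
    return peaks
```

Recruiters whose peak is far above what is typical for the program would be natural candidates for the closer name-based checks.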

I'd be curious to know if re-purposing password checking algorithms (item 3) provides any benefit.
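As a starting point for experimenting with item 3, the checks could be expressed as simple per-name features. This is only a sketch: the function name `name_features`, the `words` collection (any lowercase word or personal-name list you have available), and the minimum word length of 3 are assumptions, not an established algorithm.

```python
import string


def name_features(name, words, min_word_len=3):
    """Compute a few human-likeness features for a user name.

    `words` is a hypothetical collection of lowercase dictionary words
    and personal names.
    """
    lowered = name.lower()
    letters = [c for c in name if c.isalpha()]
    return {
        # True if any sufficiently long known word occurs inside the name
        "has_dictionary_word": any(
            w in lowered for w in words if len(w) >= min_word_len
        ),
        # Share of alphabetic characters that are lower case
        "lowercase_ratio": (
            sum(c.islower() for c in letters) / len(letters) if letters else 0.0
        ),
        "has_punctuation": any(c in string.punctuation for c in name),
        # True if the same character appears twice in a row
        "has_repeated_char": any(a == b for a, b in zip(name, name[1:])),
    }
```

These features could be combined with the recruiter-level signals above rather than used on their own, since plenty of legitimate names will fail one check or another.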

answered Dec 24 '22 12:12 by Dave