
Odd issue with .isin() and strings (Python/Pandas)

I'm having a strange problem with the Pandas .isin() method. I'm working on a project in which I need to identify bad passwords by length, common word/password lists, etc. (don't worry, the data is from a public source). One of the checks is whether someone is using part of their name as their password. I'm using .isin() to determine that, but it's giving me weird results. To show:

# Extracting first and last names into their own columns
# (raw strings avoid invalid-escape warnings for the regex "\.")
users['first_name'] = users.user_name.str.extract(r'(^.+)(\.)', expand=False)[0]
users['last_name'] = users.user_name.str.extract(r'\.(.+)', expand=False)

# Flagging the users with passwords that matches their names
users['uses_name'] = (users['password'].isin(users.first_name)) | (users['password'].isin(users.last_name))

# Looking at the new data
print(users[users['uses_name']][['password','user_name','first_name','last_name','uses_name']].head())

The output of this is:

   password            user_name first_name  last_name uses_name
7    murphy          noreen.hale     noreen       hale      True
11  hubbard      milford.hubbard    milford    hubbard      True
22  woodard        jenny.woodard      jenny    woodard      True
30     reid         rosanna.reid    rosanna       reid      True
58   golden  rosalinda.rodriquez  rosalinda  rodriquez      True

Mostly it's good; milford.hubbard is using 'hubbard' as the password, etc. But then we have several examples like the first one. Noreen Hale is being flagged, despite her password being "murphy", which shares only a single letter with her name.

I can't for the life of me figure out what is causing this. Does anyone know why this is happening, and how to fix it?

tq343 asked Mar 05 '18 23:03


2 Answers

Since you need to compare columns within the same row, vectorisation isn't much of an option here. As such, you could use what is (possibly) the fastest alternative at your disposal, a list comprehension:

df['uses_name'] = [
    pwd in name for name, pwd in zip(df.user_name, df.password)
]

Or, if you dislike loops, you can hide them with np.vectorize:

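import numpy as np

def f(name, pwd):
    # True if the password appears anywhere in the user name
    return pwd in name

# np.vectorize still loops in Python; it only hides the loop
v = np.vectorize(f)
df['uses_name'] = v(df.user_name, df.password)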

df
   password            user_name  uses_name
7    murphy          noreen.hale      False
11  hubbard      milford.hubbard       True
22  woodard        jenny.woodard       True
30     reid         rosanna.reid       True
58   golden  rosalinda.rodriquez      False

Since first_name and last_name are both extracted from user_name, I don't think you need those columns here; checking the password against user_name directly is enough.
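
One caveat worth noting: pwd in name is a substring test, so a password such as 'mit' would be flagged for a user named 'al.smith'. If only an exact match against the first or last name should count, something along these lines might work. This is only a sketch: it assumes every user_name has the form first.last, and the sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question's dataset
df = pd.DataFrame({
    "user_name": ["noreen.hale", "milford.hubbard", "al.smith"],
    "password": ["murphy", "hubbard", "mit"],
})

# Split each user name on "." and check whether the password
# equals one of the resulting parts exactly (same row only)
df["uses_name"] = [
    pwd in name.split(".") for name, pwd in zip(df.user_name, df.password)
]

print(df)
```

Here 'mit' is not flagged, because it is only a fragment of 'smith' rather than the whole name part.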

cs95 answered Sep 18 '22 17:09


Regarding the reason why this error occurs:

If you do users['password'].isin(users.first_name), you ask, for each row of users['password'], whether the value equals ANY of the values in the first_name column, not just the one in the same row. The same goes for last_name. Therefore I assume that 'murphy' appears somewhere in one of those columns, as another user's name.
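
To make the difference concrete, here is a small sketch comparing .isin() with an element-wise comparison. The three users are invented for illustration, and the name columns are built with str.split on the assumption that each user_name contains exactly one dot.

```python
import pandas as pd

# Hypothetical sample data: "murphy" is a different user's last name
users = pd.DataFrame({
    "user_name": ["noreen.hale", "milford.hubbard", "kate.murphy"],
    "password": ["murphy", "hubbard", "x9!kQ2"],
})
parts = users["user_name"].str.split(".", n=1, expand=True)
users["first_name"] = parts[0]
users["last_name"] = parts[1]

# isin: does the password equal ANY last name in the whole column?
any_row = users["password"].isin(users["last_name"])

# ==: does the password equal the name in the SAME row?
same_row = (
    (users["password"] == users["first_name"])
    | (users["password"] == users["last_name"])
)

print(any_row.tolist())   # [True, True, False]
print(same_row.tolist())  # [False, True, False]
```

With .isin(), noreen.hale is flagged only because kate.murphy exists elsewhere in the table; the element-wise comparison looks at her own row and does not flag her.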

DZurico answered Sep 19 '22 17:09