I'm having a strange problem with the Pandas .isin() method. I'm doing a project in which I need to identify bad passwords by length, common word/password lists, etc (don't worry, this is from a public source). One of the ways is to see if someone is using part of their name as a password. I'm using .isin() to determine if that is the case, but it's giving me weird results. To show:
# Extracting first and last names into their own columns
users['first_name'] = users.user_name.str.extract(r'(^.+)(\.)', expand = False)[0]
users['last_name'] = users.user_name.str.extract(r'\.(.+)', expand = False)
# Flagging the users whose passwords match their names
users['uses_name'] = (users['password'].isin(users.first_name)) | (users['password'].isin(users.last_name))
# Looking at the new data
print(users[users['uses_name']][['password','user_name','first_name','last_name','uses_name']].head())
The output of this is:
   password            user_name first_name  last_name  uses_name
7    murphy          noreen.hale     noreen       hale       True
11  hubbard      milford.hubbard    milford    hubbard       True
22  woodard        jenny.woodard      jenny    woodard       True
30     reid         rosanna.reid    rosanna       reid       True
58   golden  rosalinda.rodriquez  rosalinda  rodriquez       True
Mostly it's good; milford.hubbard is using 'hubbard' as the password, etc. But then we have several examples like the first one. Noreen Hale is being flagged, despite her password being "murphy", which shares only a single letter with her name.
I can't for the life of me figure out what is causing this. Does anyone know why this is happening, and how to fix it?
Pandas DataFrame isin() method: isin() checks whether each element of the DataFrame is one of the specified values. It returns a DataFrame of the same shape as the original, with each value replaced by True if it was one of the specified values and False otherwise.
The isin() method is often used to filter DataFrames: it helps select the rows that have a particular value (or one of several values) in a particular column. Syntax: DataFrame.isin(values). Parameters: values: an iterable, Series, list, tuple, DataFrame or dictionary to check against the caller Series/DataFrame.
Since you need to check, row by row, whether the string in one column contains the string in another, vectorisation isn't much of an option here. As such, you could use what is (possibly) the fastest alternative at your disposal, a list comprehension:
df['uses_name'] = [
    pwd in name for name, pwd in zip(df.user_name, df.password)
]
Or, if you dislike loops, you can hide them with np.vectorize:
import numpy as np

def f(name, pwd):
    # True if the password appears anywhere inside the user name
    return pwd in name

v = np.vectorize(f)
df['uses_name'] = v(df.user_name, df.password)
df
   password            user_name  uses_name
7    murphy          noreen.hale      False
11  hubbard      milford.hubbard       True
22  woodard        jenny.woodard       True
30     reid         rosanna.reid       True
58   golden  rosalinda.rodriquez      False
Considering that first_name and last_name are both extracted from user_name, I don't think you need them here; checking the password against user_name is enough.
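If what you actually want is an exact match (the password equals the first name or the last name, rather than appearing somewhere inside user_name), a row-wise comparison can be vectorised after all. The following is only a sketch under that assumption; the three-row frame is a cut-down sample built from rows shown in the question, and splitting on the first dot is my guess at the user_name format.

```python
import pandas as pd

# Cut-down sample using three rows from the question's output.
users = pd.DataFrame(
    {
        "user_name": ["noreen.hale", "milford.hubbard", "rosalinda.rodriquez"],
        "password": ["murphy", "hubbard", "golden"],
    },
    index=[7, 11, 58],
)

# Assumes user_name looks like "first.last"; split at the first dot only.
names = users["user_name"].str.split(".", n=1, expand=True)
users["first_name"] = names[0]
users["last_name"] = names[1]

# Element-wise comparison: each password is compared only with the
# first and last name in the same row.
users["uses_name"] = (
    users["password"].eq(users["first_name"])
    | users["password"].eq(users["last_name"])
)

print(users[["password", "user_name", "uses_name"]])
```

Note that this is stricter than the substring check above: a password such as "ann" would be flagged for rosanna.reid by `pwd in name`, but not by the equality test.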
As for why this happens: users['password'].isin(users.first_name) asks, for each row of users['password'], whether that value equals ANY of the elements in the first_name column, not just the one in the same row. So I assume that murphy appears somewhere in that column (or in last_name, for the second check) as another user's name.
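To make the difference concrete, here is a small sketch. The data is hypothetical: the first two rows come from the question, and the third user (murphy.stone) is invented purely to reproduce the effect.

```python
import pandas as pd

# Hypothetical sample; "murphy.stone" is made up for illustration.
users = pd.DataFrame({
    "user_name": ["noreen.hale", "milford.hubbard", "murphy.stone"],
    "password": ["murphy", "hubbard", "secret"],
})
users["first_name"] = users["user_name"].str.split(".").str[0]

# isin() tests membership in the whole first_name column,
# so noreen's password matches the third user's first name.
column_wide = users["password"].isin(users["first_name"])

# An element-wise comparison only looks at the value in the same row.
same_row = users["password"] == users["first_name"]

print(column_wide.tolist())
print(same_row.tolist())
```

Here noreen.hale is flagged by isin() only because some other row has murphy as a first name, which matches the behaviour described in the question.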