Say I have the following dataframe (a column of integers and a column with a list of integers)...
ID Found_IDs
0 12345 [15443, 15533, 3433]
1 15533 [2234, 16608, 12002, 7654]
2 6789 [43322, 876544, 36789]
And also a separate list of IDs...
bad_ids = [15533, 876544, 36789, 11111]
Given that, and ignoring the df['ID'] column and any index, I want to see if any of the IDs in the bad_ids list are mentioned in the df['Found_IDs'] column. The code I have so far is:
df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]
This works, but only if the bad_ids list is at least as long as the dataframe, and for the real dataset the bad_ids list is going to be a lot shorter than the dataframe. If I set the bad_ids list to only two elements...
bad_ids = [15533, 876544]
I get a very popular error (I have read many questions with the same error)...
ValueError: Length of values does not match length of index
I have tried converting the list to a Series (no change in the error). I have also tried adding the new column and setting all values to False before doing the comprehension line (again no change in the error).
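Two questions:
1. How do I get this to work when the bad_ids list is shorter than the dataframe?
2. How would I get the actual ID written to the df['bad_id'] column (more useful than True/False)?
Expected output for bad_ids = [15533, 876544]: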
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] True
1 15533 [2234, 16608, 12002, 7654] False
2 6789 [43322, 876544, 36789] True
Ideal output for bad_ids = [15533, 876544] (ID(s) are written to a new column or columns):
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] 15533
1 15533 [2234, 16608, 12002, 7654] False
2 6789 [43322, 876544, 36789] 876544
Code:
import pandas as pd
result_list = [[12345,[15443,15533,3433]],
[15533,[2234,16608,12002,7654]],
[6789,[43322,876544,36789]]]
df = pd.DataFrame(result_list,columns=['ID','Found_IDs'])
# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]
# fails if list has two elements (fewer elements than the dataframe has rows)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]
# converting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))
# setting up a new column of false values doesn't change things
# df['bad_id'] = False
print(df)
df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]
print(bad_ids)
print(df)
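A minimal sketch of why the comprehension fails, assuming the same df and two-element bad_ids as above: zip pairs the first bad ID with the first row, the second with the second row, and stops at the shorter input, so it yields two values for a three-row dataframe. Checking each row against the whole list avoids the length mismatch. The names pairwise and bad_ids_set are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]

# zip stops at the shorter input: two bad IDs give two values for three rows,
# and each bad ID is only compared against the row in the same position.
pairwise = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]
print(len(pairwise), len(df))

# Instead, test every row against the full set of bad IDs.
bad_ids_set = set(bad_ids)
df['bad_id'] = [any(i in bad_ids_set for i in found) for found in df['Found_IDs']]
print(df)
```

Because the comprehension now iterates over df['Found_IDs'] alone, it always produces one value per row, whatever the length of bad_ids.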
Using np.intersect1d (with NumPy imported as np) to get the intersection of the two lists:
df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] [15533]
1 15533 [2234, 16608, 12002, 7654] []
2 6789 [43322, 876544, 36789] [876544]
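For reference, a self-contained version of the np.intersect1d approach might look like the sketch below. It assumes the sample data from the question. The has_bad_id column is a hypothetical addition, included only to show how the True/False flag could be derived from the matched IDs. Note that np.intersect1d returns sorted, unique values, so the order within Found_IDs is not preserved.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]

# Matched IDs per row; .tolist() turns each NumPy array into a plain Python list.
df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids).tolist())

# Hypothetical extra column: True where at least one bad ID was found.
df['has_bad_id'] = df['bad_id'].apply(len) > 0
print(df)
```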
Or with just vanilla Python, using set intersection:
bad_ids_set = set(bad_ids)
df['bad_id'] = df['Found_IDs'].apply(lambda x: list(set(x) & bad_ids_set))
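If the goal is closer to the "ideal output" in the question (the matched ID(s), or False when nothing matches), one possible sketch is shown below. The helper name matching_ids is hypothetical, and returning a list of matches rather than a bare integer is a design assumption, made so that rows with several bad IDs are handled the same way as rows with one. Unlike a set intersection, this keeps the IDs in the order they appear in Found_IDs.

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids_set = set([15533, 876544])


def matching_ids(found, bad):
    """Return the IDs in `found` that are also in `bad`, or False if none match."""
    matches = [i for i in found if i in bad]  # set lookup; keeps original order
    return matches if matches else False


df['bad_id'] = df['Found_IDs'].apply(lambda x: matching_ids(x, bad_ids_set))
print(df)
```

Mixing lists and False in one column gives it object dtype, so a separate boolean column may be easier to filter on in practice.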