Add ID found in list to new column in pandas dataframe

Say I have the following dataframe (a column of integers and a column with a list of integers)...

      ID                   Found_IDs
0  12345        [15443, 15533, 3433]
1  15533  [2234, 16608, 12002, 7654]
2   6789      [43322, 876544, 36789]

And also a separate list of IDs...

bad_ids = [15533, 876544, 36789, 11111]

Given that, and ignoring the df['ID'] column and any index, I want to see if any of the IDs in the bad_ids list are mentioned in the df['Found_IDs'] column. The code I have so far is:

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

This runs without error only if the bad_ids list is at least as long as the dataframe, and for the real dataset the bad_ids list is going to be a lot shorter than the dataframe. If I set the bad_ids list to only two elements...

bad_ids = [15533, 876544]

I get a very popular error (I have read many questions with the same error)...

ValueError: Length of values does not match length of index

I have tried converting the list to a series (no change in the error). I have also tried adding the new column and setting all values to False before doing the comprehension line (again no change in the error).

Two questions:

How do I get my code (below) to work for a list that is shorter than a dataframe?
How would I get the code to write the actual ID found back to the df['bad_id'] column (more useful than True/False)?

Expected output for bad_ids = [15533, 876544]:

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True

Ideal output for bad_ids = [15533, 876544] (ID(s) are written to a new column or columns):

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]   15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]  876544

Code:

import pandas as pd

result_list = [[12345, [15443, 15533, 3433]],
               [15533, [2234, 16608, 12002, 7654]],
               [6789, [43322, 876544, 36789]]]

df = pd.DataFrame(result_list, columns=['ID', 'Found_IDs'])

# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]

# fails if list has two elements (fewer elements than the dataframe has rows)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]

# converting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))

# setting up a new column of false values doesn't change things
# df['bad_id'] = False

print(df)

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

print(bad_ids)

print(df)


asked Apr 02 '20 10:04

MDR

1 Answer
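Some background on the ValueError in the question may help first. zip stops at its shortest input, so a two-element bad_ids list yields only two values for a three-row dataframe. It also pairs items by position, so each bad ID is only checked against a single row. A minimal sketch, reusing the sample data from the question (the name paired is just for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]

# zip pairs bad_ids[0] with row 0 and bad_ids[1] with row 1, then stops,
# because bad_ids has run out of elements
paired = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

print(len(paired), len(df))  # the lengths differ, hence the ValueError on assignment
```

Because the comparison is positional, the four-element list only appears to work: it never checks every bad ID against every row.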

Use np.intersect1d to get the intersection of each row's list with bad_ids:

import numpy as np
df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]
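If only the True/False column from the first question is needed, one possible sketch is to test each row's list against a set of bad IDs with set.isdisjoint. The column name has_bad_id is an assumption made here so the result does not overwrite bad_id:

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]
bad_ids_set = set(bad_ids)

# True when the row's list shares at least one value with bad_ids;
# this works regardless of how long bad_ids is
df['has_bad_id'] = df['Found_IDs'].apply(
    lambda found: not bad_ids_set.isdisjoint(found)
)

print(df)
```

Every row is checked against the whole set, so the length of bad_ids no longer has to match the length of the dataframe.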

Or with plain Python, using set intersection:

bad_ids_set = set(bad_ids)
df['bad_id'] = df['Found_IDs'].apply(lambda x: list(set(x) & bad_ids_set))
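To get closer to the "ideal output" in the question (the ID itself, or False when nothing matches), a rough sketch could look like the following. The helper find_bad is hypothetical, and returning a list when several IDs match is an assumption, since the question does not say what should happen in that case:

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]
bad_ids_set = set(bad_ids)


def find_bad(found):
    """Return False, a single ID, or a list of IDs found in bad_ids_set."""
    # A list comprehension keeps the order of the row's own list,
    # which a set intersection does not guarantee
    matches = [i for i in found if i in bad_ids_set]
    if not matches:
        return False
    return matches[0] if len(matches) == 1 else matches


df['bad_id'] = df['Found_IDs'].apply(find_bad)

print(df)
```

Note that the resulting column mixes booleans and integers, so it has object dtype. If later processing matters, keeping the list output from the answer above may be easier to work with.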


answered Oct 25 '22 19:10

Erfan

Tags: python, python-3.x, pandas, dataframe
