
Fastest Way To Filter A Pandas Dataframe Using A List

Tags: python, pandas

Suppose I have a DataFrame such as:

   col1  col2
0     1     A
1     2     B
2     6     A
3     5     C
4     9     C
5     3     A
6     5     B
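
For a reproducible setup, the frame above can be built like this (values read directly off the table):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 6, 5, 9, 3, 5],
                   'col2': ['A', 'B', 'A', 'C', 'C', 'A', 'B']})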

And multiple lists such as:

list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]

I can update the value of col2 depending on whether the value of col1 is included in a list, for example:

for i in list_1:
    df.loc[df.col1 == i, 'col2'] = 'A'

for i in list_2:
    df.loc[df.col1 == i, 'col2'] = 'B'

for i in list_3:
    df.loc[df.col1 == i, 'col2'] = 'C'

However, this is very slow: with a DataFrame of 30,000 rows and each list containing approx. 5,000-10,000 items, it can take a long time, especially compared to other pandas operations. Is there a better (faster) way of doing this?

asked Apr 30 '20 by Alan


3 Answers

You can use isin with np.select here:

import numpy as np

# One condition per list; the matching label is taken from the choices
df['col2'] = np.select([df['col1'].isin(list_1),
                        df['col1'].isin(list_2),
                        df['col1'].isin(list_3)],
                       ['A', 'B', 'C'])
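
One caveat worth adding (not part of the original answer): np.select fills rows that match none of the conditions with a default of 0, which would put the integer 0 into col2. If some col1 values fall outside all three lists, you can pass the existing column as the default, e.g.:

# Rows matching no list keep their current col2 value instead of 0
df['col2'] = np.select([df['col1'].isin(list_1),
                        df['col1'].isin(list_2),
                        df['col1'].isin(list_3)],
                       ['A', 'B', 'C'],
                       default=df['col2'])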

With Map:

# Keys are the lists (as tuples), values are the labels; the comprehension
# then flattens this so each individual number maps directly to its label.
d = dict(zip(map(tuple, [list_1, list_2, list_3]), ['A', 'B', 'C']))
df['col2'] = df['col1'].map({val: v for k, v in d.items() for val in k})

   col1 col2
0     1    A
1     2    A
2     6    C
3     5    C
4     9    C
5     3    B
6     5    C
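
One note to add (not in the original answer): map leaves NaN for any col1 value that appears in none of the lists. If that can happen in your data, fall back to the existing column:

flat = {val: v for k, v in d.items() for val in k}
df['col2'] = df['col1'].map(flat).fillna(df['col2'])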
answered Oct 14 '22 by anky


You can first convert the lists to dicts and then map them to col1.

d1 = {k:'A' for k in list_1}
d2 = {k:'B' for k in list_2}
d3 = {k:'C' for k in list_3}

# .get(x) returns None for keys not in the dict, which combine_first
# treats as missing, so each lookup falls through to the next one;
# the final lookup falls back to the original value itself.
df['col2'] = (
    df.col1.apply(lambda x: d1.get(x))
    .combine_first(df.col1.apply(lambda x: d2.get(x)))
    .combine_first(df.col1.apply(lambda x: d3.get(x, x)))
)

If there are no duplicates across the lists, you can make it even faster by merging them into a single dict:

d = {**{k:'A' for k in list_1}, 
     **{k:'B' for k in list_2}, 
     **{k:'C' for k in list_3}}
df['col2'] = df.col1.apply(lambda x: d.get(x,x))
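
As a possible further speedup (my suggestion, not part of the original answer), Series.map with the merged dict avoids calling a Python lambda per row; fillna lets unmatched rows keep their col1 value, mirroring d.get(x, x):

# Equivalent to d.get(x, x): unmatched rows fall back to their col1 value
df['col2'] = df.col1.map(d).fillna(df.col1)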
answered Oct 13 '22 by Allen


I would suggest putting your lists in a dictionary and iterating through it, updating col2 conditionally with np.where:

import numpy as np

# Create your update dictionary
col_dict = {
    "A": [1, 2, 4],
    "B": [3, 8],
    "C": [5, 6, 7, 9]
}

# Iterate and update
for key, value in col_dict.items():
    # key is the new col2 label; value is the lookup list
    df["col2"] = np.where(df["col1"].isin(value), key, df["col2"])

One concern is overwriting: since a row can technically match multiple lists, later iterations will overwrite earlier assignments, and the dictionary's iteration order determines which label wins.

If rows don't match multiple keys, consider a dynamic-programming-style approach where a running index of "unmatched" rows is used for each iteration, updating as you proceed so that the number of rows you test shrinks with each pass; a sketch of that idea follows.
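
A minimal sketch of that idea (my interpretation of the suggestion above, assuming the lists are disjoint):

# Index of rows not yet assigned; it shrinks after every list
remaining = df.index
for key, value in col_dict.items():
    # Test only the rows that no earlier list has claimed
    mask = df.loc[remaining, "col1"].isin(value).to_numpy()
    df.loc[remaining[mask], "col2"] = key
    remaining = remaining[~mask]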

answered Oct 14 '22 by Yaakov Bressler