Use Pandas string method 'contains' on a Series containing lists of strings

Q: How do you check if a series contains a string?

Series.str.contains() tests whether a pattern or regex is contained within each string of a Series or Index. It returns a boolean Series or Index indicating, element by element, whether the pattern matched.
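
A minimal sketch of this behaviour, assuming a small made-up Series (the fruit names are purely illustrative). The `regex` and `na` parameters are part of `Series.str.contains`; `na=False` is used here so that missing values come out as False.

```python
import pandas as pd

# Hypothetical sample data, including one missing value.
s = pd.Series(["apple pie", "banana", None, "pineapple"])

# Literal substring test; regex=False treats the pattern as plain text.
# na=False maps missing values to False.
mask_filled = s.str.contains("apple", regex=False, na=False)

# Regex test: does the string start with "b"?
starts_b = s.str.contains(r"^b", na=False)

print(mask_filled.tolist())
print(starts_b.tolist())
```

The resulting boolean Series can be used directly as a filter, for example `s[mask_filled]`.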

Q: How do you find a string in a series in Python?

Series.str.find() searches for a substring in each string of a Series. For each element it returns the lowest index at which the substring occurs, or -1 if the substring is not found. Optional start and end positions restrict the search to a specific part of each string.
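
As a rough illustration, assuming a made-up Series of short strings, the positions returned by `Series.str.find` look like this:

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series(["hello world", "world", "hi"])

# Lowest index of the substring in each element, or -1 if absent.
pos = s.str.find("world")

# Same idea, but only searching from position 5 onwards.
pos_from_5 = s.str.find("o", 5)

print(pos.tolist())
print(pos_from_5.tolist())
```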

Q: How do I check if a string contains a substring in pandas?

Using "contains" to find a substring in a pandas DataFrame: the contains method returns a boolean value for each element of the Series, True if the original value contains the substring and False if not. A basic application of contains looks like Series.str.contains("substring").
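
A small sketch of how this is typically applied to a DataFrame column. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical DataFrame; "name" and "age" are illustrative column names.
df = pd.DataFrame({
    "name": ["Ann Lee", "Bob Ray", "Lee Chan"],
    "age": [31, 45, 27],
})

# Boolean mask: True where the name contains the substring "Lee".
mask = df["name"].str.contains("Lee")

# Keep only the rows where the mask is True.
lee_rows = df[mask]
print(lee_rows)
```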

Q: Can pandas series contain different data types?

A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.).
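
For instance, a Series can mix numbers, strings and even lists, in which case pandas stores the values as generic Python objects. The sketch below uses made-up values:

```python
import pandas as pd

# Hypothetical Series mixing an int, a str, a float and a list.
mixed = pd.Series([1, "two", 3.0, ["four", "five"]])

# Each element keeps its own Python type; the Series dtype is object.
element_types = [type(v).__name__ for v in mixed]

print(mixed.dtype)
print(element_types)
```

This matters for the question below: a Series of lists is an object Series whose elements are not strings.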

Tags: python, string, regex, pandas

Given a simple Pandas Series that contains some strings which can consist of more than one sentence:

In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])

Out:
0    This is a long text. It has multiple sentences.
1                Do you see? More than one sentence!
2             This one has only one sentence though.
dtype: object

I use the pandas string method split with a regex pattern to split each row into its individual sentences (this produces unnecessary empty list elements; any suggestions on how to improve the regex?).

In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')

Out:
0    [, This is a long text.,  , It has multiple se...
1        [, Do you see?,  , More than one sentence!, ]
2         [, This one has only one sentence though., ]
dtype: object

This converts each row into a list of strings, each element holding one sentence.
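
Regarding the empty list elements: one possible way to avoid them, sketched here as an assumption rather than the approach used in the rest of this post, is to collect the sentences with `str.findall` instead of splitting on them. The pattern is the same as above without the capturing group.

```python
import pandas as pd

s = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])

# findall returns every match of the pattern, so no empty strings
# or whitespace-only pieces end up in the lists.
sentences = s.str.findall(r'[A-Z][^.!?]*[.!?]')
print(sentences.tolist())
```

The rest of the question keeps working with the output of split shown above.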

Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern, and to create a new Series that stores the resulting boolean values, each indicating whether the regex matched at least one of the list elements.

I would expect something like:

In:
s.str.contains('you')

Out:
0   False
1   True
2   False

That is, rows 0 and 2 do not contain 'you' in any of their elements, but row 1 does.

However, when I run the above, the result is:

0   NaN
1   NaN
2   NaN
dtype: float64

I also tried a list comprehension, which does not work:

result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'

Any suggestions on how this can be achieved?


asked Dec 04 '14 17:12

Dirk

1 Answer

You can use Python's str.find() method:

>>> s.apply(lambda x : any((i for i in x if i.find('you') >= 0)))
0    False
1     True
2    False
dtype: bool
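
Since the question asks for a regex match, a variant using the standard `re` module may fit better. This is a sketch under the assumption that the Series holds the lists produced by the split above; they are written out literally here so the example runs on its own. It also shows a working form of the list comprehension from the question.

```python
import re
import pandas as pd

# The lists produced by the split in the question, written out literally.
s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

pattern = re.compile(r'you')

# True for a row if the regex matches at least one of its list elements.
result = s.apply(lambda sentences: any(pattern.search(x) for x in sentences))

# The same logic as a plain list comprehension: the elements are ordinary
# Python strings, so they have no .str accessor, but re works on them.
result_list = [any(pattern.search(x) for x in y) for y in s]

print(result)
print(result_list)
```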

I guess s.str.contains('you') is not working because the elements of your Series are not strings, but lists. But you can also do something like this:

>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0    False
1     True
2    False
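
Another option, assuming a pandas version that provides `Series.explode` (0.25 or later), is to flatten the lists first so that `str.contains` sees plain strings, and then combine the results per original row. Again the lists are written out literally so the sketch is self-contained.

```python
import pandas as pd

s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

# One row per list element; the original index label is repeated.
exploded = s.explode()

# Vectorised regex test on the strings, then "any" per original row.
result = exploded.str.contains('you', na=False).groupby(level=0).any()
print(result)
```

This keeps the regex handling inside pandas, at the cost of building a longer intermediate Series.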


answered Sep 24 '22 03:09

Roman Pekar
