In pandas, how to select phrases in a dataframe from word list or word set?

Tags:

In Python3 and pandas I have the dataframe:

df_projetos_api_final.info()                                                      
<class 'pandas.core.frame.DataFrame'>                                                      
Int64Index: 93631 entries, 1 to 93667                                                      
Data columns (total 21 columns):                                                           
AnoMateria                       93631 non-null object                                     
CodigoMateria                    93631 non-null object                                     
DescricaoIdentificacaoMateria    93631 non-null object                                     
DescricaoSubtipoMateria          93631 non-null object                                     
IndicadorTramitando              93631 non-null object                                     
NomeCasaIdentificacaoMateria     93631 non-null object                                     
NumeroMateria                    93631 non-null object                                     
ApelidoMateria                   891 non-null object                                       
DataApresentacao                 93631 non-null object                                     
DataLeitura                      54213 non-null object                                     
EmentaMateria                    93631 non-null object                                     
ExplicacaoEmentaMateria          9461 non-null object                                      
IndicadorComplementar            93631 non-null object                                     
DescricaoNatureza                54352 non-null object                                     
NomeAutor                        93100 non-null object                                     
IndicadorOutrosAutores           93214 non-null object                                     
CodigoParlamentar                49786 non-null object                                     
NomeParlamentar                  49786 non-null object                                     
NomeCompletoParlamentar          49786 non-null object                                     
UfParlamentar                    45613 non-null object                                     
DescricaoSituacao                78783 non-null object                                     
dtypes: object(21)                                                                         
memory usage: 8.2+ MB

The column "EmentaMateria" has in each row a series of sentences. I plan to create a new dataframe from rows that contain any or several of these words (or group of words) in this column:

str_choice = "MULHER|MULHERES|TRABALHO DOMESTICO|VIOLENCIA CONTRA A MULHER|VIOLENCIA DOMESTICA|VIOLENCIA DE GENERO|MARIA DA PENHA|ABORTO|ABORTAMENTO|INTERRUPCAO DE GRAVIDEZ|INTERRUPCAO DE GESTACAO|DIREITO REPRODUTIVO|DIREITOS REPRODUTIVOS|DIREITO A VIDA|CONCEPCAO|CONTRACEPCAO|CONTRACEPTIVO|MISOPROSTOL|MIFEPRISTONE|CYTOTEC|UTERO|GESTACAO|GRAVIDEZ|PARTO|VIOLENCIA OBSTETRICA|FETO|BEBE|CRIANCA|VIOLENCIA SEXUAL|FEMINICIDIO|MORTE DE MULHER|MORTE DE MULHERES|HOMICIDIO DE MULHER|HOMICIDIO DE MULHERES|ASSEDIO SEXUAL|ASSEDIO|ESTUPRO|VIOLENCIA SEXUAL|ABUSO SEXUAL|ESTUPRO DE VULNERAVEL|LICENCA MATERNIDADE|FEMININO|MULHER NEGRA|MULHERES NEGRAS|MULHERES QUILOMBOLAS|MULHERES INDIGENAS|NEGRAS|NEGRA|RACISMO|RACA|RACIAL|ABUSO SEXUAL|MATERNIDADE|MAE|AMAMENTACAO|SEXUALIDADE|SEXO|GENERO|FEMINISMO|MACHISMO|GUARDA DE FILHOS|GUARDA DOS FILHOS|IGUALDADE DE GENERO|IDENTIDADE DE GENERO|IDEOLOGIA DE GENERO|EDUCACAO SEXUAL|ESCOLA SEM PARTIDO|TRANSEXUAL|TRANSEXUALIDADE|MULHER TRANS|MULHERES TRANS|MUDANCA DE SEXO|READEQUACAO SEXUAL|EXPLORACAO SEXUAL|PROSTITUICAO|ORIENTACAO SEXUAL|HOMOSSEXUAL|HOMOSSEXUALIDADE|HOMOSSEXUALISMO|LESBICA|LESBICAS|DIREITO DOS HOMENS|EDUCACAO RELIGIOSA|DEUS|RELIGIAO|EDUCACAO DOMICILIAR|HOMESCHOOLING|CRECHE|EDUCACAO INFANTIL|CASAMENTO INFANTIL"

So I did like this:

seleciona2 = df_projetos_api_final [df_projetos_api_final['EmentaMateria'].\
                               str.contains(str_choice, na=False)]

The newly generated dataframe has collected multiple sentences that have one or more of these words. However, many lines do not have these words, as
"ENCAMINHA AO SENADO FEDERAL, UM ADENDO AS SUGESTOES DE EMENDAS A
PROPOSTA ORCAMENTARIA DO DISTRITO FEDERAL, REFERENTE A ALTERACAO DO
PROGRAMA DE TRABALHO DO FUNDEPE - FUNDO DE DESENVOLVIMENTO DO
DISTRITO FEDERAL, VISANDO A ACRESCENTAR MAIS CZ 3.453.977.000,00
(TRES BILHOES, QUATROCENTOS E CINQUENTA E TRES MILHOES, NOVECENTOS E
SETENTA E SETE MIL CRUZADOS) AO PROJETO DE EXECUCAO DE OBRAS E
EQUIPAMENTOS DO SISTEMA DE EDUCACAO E CULTURA."

Please, is this because similar words are also being searched for in sentences? Or because many sentences have too much white space or line breaks between some words?

Edit 7/12/2019

Thanks so much all friends for your attention. After you wrote I went back to reviewing the database and again the suggested codes. I got the original database, with accents in Brazilian Portuguese. I think this is the original problem - I didn't know the original base had been changed

I found that the database I was working on had gone through unidecode to remove accents from Portuguese. So I repeated the tests with a str_choice with accents and with the original database, and then it worked - I haven't checked all the lines yet, but all I've seen are correct so far

So the new str_choice (I used the name search_list), I used was this:

df_projetos_api_final['EmentaMateria'] = df_projetos_api_final['EmentaMateria'].str.upper()
search_list = ["MULHER", "MULHERES", "TRABALHO DOMÉSTICO", "VIOLÊNCIA CONTRA A MULHER", "VIOLÊNCIA DOMÉSTICA", "VIOLÊNCIA DE GÊNERO", "MARIA DA PENHA", "ABORTO", "ABORTAMENTO", "INTERRUPÇÃO DE GRAVIDEZ", "INTERRUPÇÃO DE GESTAÇÃO", "DIREITO REPRODUTIVO", "DIREITOS REPRODUTIVOS", "DIREITO À VIDA", "CONCEPÇÃO", "CONTRACEPÇÃO", "CONTRACEPTIVO", "MISOPROSTOL", "MIFEPRISTONE", "CYTOTEC", "ÚTERO", "GESTAÇÃO", "GRAVIDEZ", "PARTO", "VIOLÊNCIA OBSTÉTRICA", "FETO", "BEBÊ", "CRIANÇA", "VIOLÊNCIA SEXUAL", "FEMINICÍDIO", "MORTE DE MULHER", "MORTE DE MULHERES", "HOMICÍDIO DE MULHER", "HOMICÍDIO DE MULHERES", "ASSÉDIO SEXUAL", "ASSÉDIO", "ESTUPRO", "VIOLÊNCIA SEXUAL", "ABUSO SEXUAL", "ESTUPRO DE VULNERÁVEL", "LICENÇA MATERNIDADE", "FEMININO", "MULHER NEGRA", "MULHERES NEGRAS", "MULHERES QUILOMBOLAS", "MULHERES INDÍGENAS", "NEGRAS", "NEGRA", "RACISMO", "RAÇA", "RACIAL", "ABUSO SEXUAL", "MATERNIDADE", "MÃE", "AMAMENTAÇÃO", "SEXUALIDADE", "SEXO", "GÊNERO", "FEMINISMO", "MACHISMO", "GUARDA DE FILHOS", "GUARDA DOS FILHOS", "IGUALDADE DE GÊNERO", "IDENTIDADE DE GÊNERO", "IDEOLOGIA DE GÊNERO", "EDUCAÇÃO SEXUAL", "ESCOLA SEM PARTIDO", "TRANSEXUAL", "TRANSEXUALIDADE", "MULHER TRANS", "MULHERES TRANS", "MUDANÇA DE SEXO", "READEQUAÇÃO SEXUAL", "EXPLORAÇÃO SEXUAL", "PROSTITUIÇÃO", "ORIENTAÇÃO SEXUAL", "HOMOSSEXUAL", "HOMOSSEXUALIDADE", "HOMOSSEXUALISMO",  "LÉSBICA",  "LÉSBICAS",  "DIREITO DOS HOMENS", "EDUCAÇÃO RELIGIOSA",  "DEUS", "RELIGIÃO", "EDUCACÃO DOMICILIAR", "HOMESCHOOLING", "CRECHE",  "EDUCAÇÃO INFANTIL",  "CASAMENTO INFANTIL"]
mask = df_projetos_api_final['EmentaMateria'].str.contains('|'.join(search_list))
seleciona = df_projetos_api_final[mask]
seleciona.info()

671

asked Nov 29 '19 00:11

Reinaldo Chaves

Video Answer

1 Answers

Diagnosis

First, let's take a look at why your code doesn't work. @jorijnsmit gave it away (and shared a useful answer), your regex matches characters regardless of where they are. Let us illustrate with a simpler example, which I will be using throughout:

We want to match the words 'app' and 'he', so we construct a regex much like yours.

strings_to_match = ['app', 'he']

match_pattern = '|'.join(strings_to_match) # "app|he"

We join the strings we want to match using the alternation operator and we're good to go, right? Thanks to the magic of regex101, here are the results of applying our pattern to a few strings (matches are in square brackets):

[he]llo
brot[he]r
[app]lication
[he]
[app]le
h[app]ier
[app]
[he]ll

Our pattern matches the strings 'app' and 'he' anywhere, when we want just the words themselves!

RegEx Solution

What can we do to fix this? Our first thought may be to change our pattern to ' app | he ', which does fix the issue with strings like 'application'. Unfortunately, this isn't foolproof. That pattern fails to recognize the word 'app' in 'I downloaded an app.', which is perfectly valid to us. Fortunately for us, regex has just the solution we need: Word boundaries, represented by the token '\b', whose principle is rather self-explanatory.

Here are some results for the new pattern, '\bapp\b|\bhe\b':

'[he]'
'apple'
'happier'
' [app] '
'hell'
'I downloaded an [app]!'

Exactly what we expect! While it does work correctly, that pattern is needlessly difficult to read. We can use only one set of word boundary tokens by putting all our substrings into a non-capturing group: '\b(?:app|he)\b'. A capturing group, well, groups and captures a subset of a regex. In this case, the group would return the same thing as the entire match. A non-capturing group eliminates that redundancy while still allowing us to logically separate part of our expression.

Here is a complete program which demonstrates constructing the pattern and using it on a Pandas Series:

import pandas as pd

test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
             'I downloaded an app.']

test_series = pd.Series(data=test_strs)

strings_to_match = ['app', 'he']

match_pattern = fr"\b(?:{'|'.join(strings_to_match)})\b"  # "\b(?:app|he)\b"

match_res = test_series.str.contains(match_pattern, case=False)

Output of print(match_res):

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8     True
9     True
dtype: bool

A few notes on other solutions

1.

Please note that these methods can only match words, not arbitrary substrings. As such, they are not actually valid solutions for this particular problem, and are only covered here for the sake of completeness.

This is the same style as the solution by @FBruzzesi, which we will call version 1. For reference:

# Convert string into list of strings
str_list = str_choice.split(|)

# Control if any word is in the sentence after splitting the sentence by space
df['has_match'] = df.apply(lambda r: [x for x in str_list if x in r['EmentaMateria'].split(' ')], axis=1)

#This will create a list of words you find, then you can filter only those which has a match
df = df[df.apply(lambda r: len(r['has_match'])>0, axis=1)]

While their solution collects all the matches, we are only concerned with whether or not there is a match in. Let's look at a refactored version of their solution, version 2:

import pandas as pd

test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
             'I downloaded an app.']

test_series = pd.Series(data=test_strs)

strings_to_match = ['app', 'he']

series_split = test_series.str.split()

match_res = series_split.map(lambda curr_words: any((curr_sub in curr_words for curr_sub in strings_to_match)))

Unlike version 1, version 2 keeps the number of split() operations, which were a real cause for concern (I estimated around 8,000,000 split() operations for the entire column), to a minimum. It should also be more efficient since the number of iterations depends on the number of substrings to match, which should often be lower than the number of words in the string to check.

2.

I have seen a few mentions of the regex parameter which I feel are unclear or misleading. Yes, passing regex=False will match a literal string, no, just changing the parameter won't make your current code work (why would it?).

I hope this is the sort of thing you had in mind when you asked for a canonical answer. Let me know if anything is unclear or you have any further questions :)

123

answered Sep 18 '22 15:09

AMC

Related questions
                            
                                How to perform assert_has_calls for a __getitem__() call?
                            
                                cannot install tensorflow-text using pip despite having tensorflow 2.0.0-beta1 installed
                            
                                How to record mouse and keyboard movement simultaneously with Python?
                            
                                Hypothesis - reuse @given between tests
                            
                                Set background colour for a custom QWidget
                            
                                Airflow can not enter the /admin page after initdb
                            
                                One hot encoding of multi label images in keras
                            
                                How remove numbering from output after extract xls file with pandas [Python]
                            
                                Assign values to array efficiently
                            
                                Remove 2d slice from 3d numpy array
                            
                                Find all months between two date columns and generate row for each month
                            
                                How to set background color on image to white with OpenCV in Python
                            
                                torch.softmax and torch.sigmoid are not equivalent in the binary case
                            
                                How to list objects based on prefixes with wildcard using Python Boto3?
                            
                                Plotly: How to create subplots from each column in a pandas dataframe?
                            
                                Python GraphQL How to declare a self-referencing graphene object type
                            
                                Python pool not working in windows but works in linux
                            
                                How to scale train, validation and test sets properly using StandardScaler?
                            
                                How do you save a model, dictionary and corpus to disk in Gensim, and then load them again?
                            
                                Does order matter in Python's swap notation? (a, b = b, a)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

In pandas, how to select phrases in a dataframe from word list or word set?

Tags:

python

string

pandas

dataframe

similarity