Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In pandas, how to select phrases in a dataframe from word list or word set?

In Python3 and pandas I have the dataframe:

df_projetos_api_final.info()                                                      
<class 'pandas.core.frame.DataFrame'>                                                      
Int64Index: 93631 entries, 1 to 93667                                                      
Data columns (total 21 columns):                                                           
AnoMateria                       93631 non-null object                                     
CodigoMateria                    93631 non-null object                                     
DescricaoIdentificacaoMateria    93631 non-null object                                     
DescricaoSubtipoMateria          93631 non-null object                                     
IndicadorTramitando              93631 non-null object                                     
NomeCasaIdentificacaoMateria     93631 non-null object                                     
NumeroMateria                    93631 non-null object                                     
ApelidoMateria                   891 non-null object                                       
DataApresentacao                 93631 non-null object                                     
DataLeitura                      54213 non-null object                                     
EmentaMateria                    93631 non-null object                                     
ExplicacaoEmentaMateria          9461 non-null object                                      
IndicadorComplementar            93631 non-null object                                     
DescricaoNatureza                54352 non-null object                                     
NomeAutor                        93100 non-null object                                     
IndicadorOutrosAutores           93214 non-null object                                     
CodigoParlamentar                49786 non-null object                                     
NomeParlamentar                  49786 non-null object                                     
NomeCompletoParlamentar          49786 non-null object                                     
UfParlamentar                    45613 non-null object                                     
DescricaoSituacao                78783 non-null object                                     
dtypes: object(21)                                                                         
memory usage: 8.2+ MB 

The column "EmentaMateria" has in each row a series of sentences. I plan to create a new dataframe from rows that contain any or several of these words (or group of words) in this column:

str_choice = "MULHER|MULHERES|TRABALHO DOMESTICO|VIOLENCIA CONTRA A MULHER|VIOLENCIA DOMESTICA|VIOLENCIA DE GENERO|MARIA DA PENHA|ABORTO|ABORTAMENTO|INTERRUPCAO DE GRAVIDEZ|INTERRUPCAO DE GESTACAO|DIREITO REPRODUTIVO|DIREITOS REPRODUTIVOS|DIREITO A VIDA|CONCEPCAO|CONTRACEPCAO|CONTRACEPTIVO|MISOPROSTOL|MIFEPRISTONE|CYTOTEC|UTERO|GESTACAO|GRAVIDEZ|PARTO|VIOLENCIA OBSTETRICA|FETO|BEBE|CRIANCA|VIOLENCIA SEXUAL|FEMINICIDIO|MORTE DE MULHER|MORTE DE MULHERES|HOMICIDIO DE MULHER|HOMICIDIO DE MULHERES|ASSEDIO SEXUAL|ASSEDIO|ESTUPRO|VIOLENCIA SEXUAL|ABUSO SEXUAL|ESTUPRO DE VULNERAVEL|LICENCA MATERNIDADE|FEMININO|MULHER NEGRA|MULHERES NEGRAS|MULHERES QUILOMBOLAS|MULHERES INDIGENAS|NEGRAS|NEGRA|RACISMO|RACA|RACIAL|ABUSO SEXUAL|MATERNIDADE|MAE|AMAMENTACAO|SEXUALIDADE|SEXO|GENERO|FEMINISMO|MACHISMO|GUARDA DE FILHOS|GUARDA DOS FILHOS|IGUALDADE DE GENERO|IDENTIDADE DE GENERO|IDEOLOGIA DE GENERO|EDUCACAO SEXUAL|ESCOLA SEM PARTIDO|TRANSEXUAL|TRANSEXUALIDADE|MULHER TRANS|MULHERES TRANS|MUDANCA DE SEXO|READEQUACAO SEXUAL|EXPLORACAO SEXUAL|PROSTITUICAO|ORIENTACAO SEXUAL|HOMOSSEXUAL|HOMOSSEXUALIDADE|HOMOSSEXUALISMO|LESBICA|LESBICAS|DIREITO DOS HOMENS|EDUCACAO RELIGIOSA|DEUS|RELIGIAO|EDUCACAO DOMICILIAR|HOMESCHOOLING|CRECHE|EDUCACAO INFANTIL|CASAMENTO INFANTIL"  

So I did like this:

seleciona2 = df_projetos_api_final [df_projetos_api_final['EmentaMateria'].\
                               str.contains(str_choice, na=False)]

The newly generated dataframe has collected multiple sentences that have one or more of these words. However, many lines do not have these words, as
"ENCAMINHA AO SENADO FEDERAL, UM ADENDO AS SUGESTOES DE EMENDAS A
PROPOSTA ORCAMENTARIA DO DISTRITO FEDERAL, REFERENTE A ALTERACAO DO
PROGRAMA DE TRABALHO DO FUNDEPE - FUNDO DE DESENVOLVIMENTO DO
DISTRITO FEDERAL, VISANDO A ACRESCENTAR MAIS CZ 3.453.977.000,00
(TRES BILHOES, QUATROCENTOS E CINQUENTA E TRES MILHOES, NOVECENTOS E
SETENTA E SETE MIL CRUZADOS) AO PROJETO DE EXECUCAO DE OBRAS E
EQUIPAMENTOS DO SISTEMA DE EDUCACAO E CULTURA."

Please, is this because similar words are also being searched for in sentences? Or because many sentences have too much white space or line breaks between some words?

Edit 7/12/2019

Thanks so much all friends for your attention. After you wrote I went back to reviewing the database and again the suggested codes. I got the original database, with accents in Brazilian Portuguese. I think this is the original problem - I didn't know the original base had been changed

I found that the database I was working on had gone through unidecode to remove accents from Portuguese. So I repeated the tests with a str_choice with accents and with the original database, and then it worked - I haven't checked all the lines yet, but all I've seen are correct so far

So the new str_choice (I used the name search_list), I used was this:

df_projetos_api_final['EmentaMateria'] = df_projetos_api_final['EmentaMateria'].str.upper()
search_list = ["MULHER", "MULHERES", "TRABALHO DOMÉSTICO", "VIOLÊNCIA CONTRA A MULHER", "VIOLÊNCIA DOMÉSTICA", "VIOLÊNCIA DE GÊNERO", "MARIA DA PENHA", "ABORTO", "ABORTAMENTO", "INTERRUPÇÃO DE GRAVIDEZ", "INTERRUPÇÃO DE GESTAÇÃO", "DIREITO REPRODUTIVO", "DIREITOS REPRODUTIVOS", "DIREITO À VIDA", "CONCEPÇÃO", "CONTRACEPÇÃO", "CONTRACEPTIVO", "MISOPROSTOL", "MIFEPRISTONE", "CYTOTEC", "ÚTERO", "GESTAÇÃO", "GRAVIDEZ", "PARTO", "VIOLÊNCIA OBSTÉTRICA", "FETO", "BEBÊ", "CRIANÇA", "VIOLÊNCIA SEXUAL", "FEMINICÍDIO", "MORTE DE MULHER", "MORTE DE MULHERES", "HOMICÍDIO DE MULHER", "HOMICÍDIO DE MULHERES", "ASSÉDIO SEXUAL", "ASSÉDIO", "ESTUPRO", "VIOLÊNCIA SEXUAL", "ABUSO SEXUAL", "ESTUPRO DE VULNERÁVEL", "LICENÇA MATERNIDADE", "FEMININO", "MULHER NEGRA", "MULHERES NEGRAS", "MULHERES QUILOMBOLAS", "MULHERES INDÍGENAS", "NEGRAS", "NEGRA", "RACISMO", "RAÇA", "RACIAL", "ABUSO SEXUAL", "MATERNIDADE", "MÃE", "AMAMENTAÇÃO", "SEXUALIDADE", "SEXO", "GÊNERO", "FEMINISMO", "MACHISMO", "GUARDA DE FILHOS", "GUARDA DOS FILHOS", "IGUALDADE DE GÊNERO", "IDENTIDADE DE GÊNERO", "IDEOLOGIA DE GÊNERO", "EDUCAÇÃO SEXUAL", "ESCOLA SEM PARTIDO", "TRANSEXUAL", "TRANSEXUALIDADE", "MULHER TRANS", "MULHERES TRANS", "MUDANÇA DE SEXO", "READEQUAÇÃO SEXUAL", "EXPLORAÇÃO SEXUAL", "PROSTITUIÇÃO", "ORIENTAÇÃO SEXUAL", "HOMOSSEXUAL", "HOMOSSEXUALIDADE", "HOMOSSEXUALISMO",  "LÉSBICA",  "LÉSBICAS",  "DIREITO DOS HOMENS", "EDUCAÇÃO RELIGIOSA",  "DEUS", "RELIGIÃO", "EDUCACÃO DOMICILIAR", "HOMESCHOOLING", "CRECHE",  "EDUCAÇÃO INFANTIL",  "CASAMENTO INFANTIL"]
mask = df_projetos_api_final['EmentaMateria'].str.contains('|'.join(search_list))
seleciona = df_projetos_api_final[mask]
seleciona.info()
like image 671
Reinaldo Chaves Avatar asked Nov 29 '19 00:11

Reinaldo Chaves


People also ask

How do you separate words in pandas?

split() function is used to split strings around given separator/delimiter. The function splits the string in the Series/Index from the beginning, at the specified delimiter string. Equivalent to str. split().

How to select rows from pandas Dataframe?

Steps to Select Rows from Pandas DataFrame 1 Step 1: Gather your data#N#Firstly, you’ll need to gather your data. Here is an example of a data gathered about... 2 Step 2: Create a DataFrame#N#Once you have your data ready, you’ll need to create a DataFrame to capture that data in... 3 Step 3: Select Rows from Pandas DataFrame More ...

What is indexing and selecting data in pandas?

Indexing and Selecting Data with Pandas. Indexing in Pandas : Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns.

What is a pandas Dataframe?

Let’s take a look at what you’ll learn! The pandas Dataframe class is described as a two-dimensional, size-mutable, potentially heterogeneous tabular data. This, in plain-language, means: You can create an empty dataframe by simply writing df = pd.DataFrame (), which creates an empty dataframe object.

How do you call a column from a list in pandas?

Since Pandas doesn’t actually know what to call the column, we need to more explicit and use the columns= argument. The columns= argument takes a list-like object, passing in column headers in the order in which the columns are created.


Video Answer


1 Answers

Diagnosis

First, let's take a look at why your code doesn't work. @jorijnsmit gave it away (and shared a useful answer), your regex matches characters regardless of where they are. Let us illustrate with a simpler example, which I will be using throughout:

We want to match the words 'app' and 'he', so we construct a regex much like yours.

strings_to_match = ['app', 'he']

match_pattern = '|'.join(strings_to_match) # "app|he"

We join the strings we want to match using the alternation operator and we're good to go, right? Thanks to the magic of regex101, here are the results of applying our pattern to a few strings (matches are in square brackets):

  • [he]llo
  • brot[he]r
  • [app]lication
  • [he]
  • [app]le
  • h[app]ier
  • [app]
  • [he]ll

Our pattern matches the strings 'app' and 'he' anywhere, when we want just the words themselves!


RegEx Solution

What can we do to fix this? Our first thought may be to change our pattern to ' app | he ', which does fix the issue with strings like 'application'. Unfortunately, this isn't foolproof. That pattern fails to recognize the word 'app' in 'I downloaded an app.', which is perfectly valid to us. Fortunately for us, regex has just the solution we need: Word boundaries, represented by the token '\b', whose principle is rather self-explanatory.

Here are some results for the new pattern, '\bapp\b|\bhe\b':

  • '[he]'
  • 'apple'
  • 'happier'
  • ' [app] '
  • 'hell'
  • 'I downloaded an [app]!'

Exactly what we expect! While it does work correctly, that pattern is needlessly difficult to read. We can use only one set of word boundary tokens by putting all our substrings into a non-capturing group: '\b(?:app|he)\b'. A capturing group, well, groups and captures a subset of a regex. In this case, the group would return the same thing as the entire match. A non-capturing group eliminates that redundancy while still allowing us to logically separate part of our expression.

Here is a complete program which demonstrates constructing the pattern and using it on a Pandas Series:

import pandas as pd

test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
             'I downloaded an app.']

test_series = pd.Series(data=test_strs)

strings_to_match = ['app', 'he']

match_pattern = fr"\b(?:{'|'.join(strings_to_match)})\b"  # "\b(?:app|he)\b"

match_res = test_series.str.contains(match_pattern, case=False)

Output of print(match_res):

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8     True
9     True
dtype: bool

A few notes on other solutions

1.

Please note that these methods can only match words, not arbitrary substrings. As such, they are not actually valid solutions for this particular problem, and are only covered here for the sake of completeness.

This is the same style as the solution by @FBruzzesi, which we will call version 1. For reference:

# Convert string into list of strings
str_list = str_choice.split(|)

# Control if any word is in the sentence after splitting the sentence by space
df['has_match'] = df.apply(lambda r: [x for x in str_list if x in r['EmentaMateria'].split(' ')], axis=1)

#This will create a list of words you find, then you can filter only those which has a match
df = df[df.apply(lambda r: len(r['has_match'])>0, axis=1)]

While their solution collects all the matches, we are only concerned with whether or not there is a match in. Let's look at a refactored version of their solution, version 2:

import pandas as pd

test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
             'I downloaded an app.']

test_series = pd.Series(data=test_strs)

strings_to_match = ['app', 'he']

series_split = test_series.str.split()

match_res = series_split.map(lambda curr_words: any((curr_sub in curr_words for curr_sub in strings_to_match)))

Unlike version 1, version 2 keeps the number of split() operations, which were a real cause for concern (I estimated around 8,000,000 split() operations for the entire column), to a minimum. It should also be more efficient since the number of iterations depends on the number of substrings to match, which should often be lower than the number of words in the string to check.

2.

I have seen a few mentions of the regex parameter which I feel are unclear or misleading. Yes, passing regex=False will match a literal string, no, just changing the parameter won't make your current code work (why would it?).


I hope this is the sort of thing you had in mind when you asked for a canonical answer. Let me know if anything is unclear or you have any further questions :)

like image 123
AMC Avatar answered Sep 18 '22 15:09

AMC