In Python3 and pandas I have the dataframe:
df_projetos_api_final.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 93631 entries, 1 to 93667
Data columns (total 21 columns):
AnoMateria 93631 non-null object
CodigoMateria 93631 non-null object
DescricaoIdentificacaoMateria 93631 non-null object
DescricaoSubtipoMateria 93631 non-null object
IndicadorTramitando 93631 non-null object
NomeCasaIdentificacaoMateria 93631 non-null object
NumeroMateria 93631 non-null object
ApelidoMateria 891 non-null object
DataApresentacao 93631 non-null object
DataLeitura 54213 non-null object
EmentaMateria 93631 non-null object
ExplicacaoEmentaMateria 9461 non-null object
IndicadorComplementar 93631 non-null object
DescricaoNatureza 54352 non-null object
NomeAutor 93100 non-null object
IndicadorOutrosAutores 93214 non-null object
CodigoParlamentar 49786 non-null object
NomeParlamentar 49786 non-null object
NomeCompletoParlamentar 49786 non-null object
UfParlamentar 45613 non-null object
DescricaoSituacao 78783 non-null object
dtypes: object(21)
memory usage: 8.2+ MB
Each row of the column "EmentaMateria" holds a series of sentences. I want to create a new dataframe from the rows that contain one or more of these words (or groups of words) in this column:
str_choice = "MULHER|MULHERES|TRABALHO DOMESTICO|VIOLENCIA CONTRA A MULHER|VIOLENCIA DOMESTICA|VIOLENCIA DE GENERO|MARIA DA PENHA|ABORTO|ABORTAMENTO|INTERRUPCAO DE GRAVIDEZ|INTERRUPCAO DE GESTACAO|DIREITO REPRODUTIVO|DIREITOS REPRODUTIVOS|DIREITO A VIDA|CONCEPCAO|CONTRACEPCAO|CONTRACEPTIVO|MISOPROSTOL|MIFEPRISTONE|CYTOTEC|UTERO|GESTACAO|GRAVIDEZ|PARTO|VIOLENCIA OBSTETRICA|FETO|BEBE|CRIANCA|VIOLENCIA SEXUAL|FEMINICIDIO|MORTE DE MULHER|MORTE DE MULHERES|HOMICIDIO DE MULHER|HOMICIDIO DE MULHERES|ASSEDIO SEXUAL|ASSEDIO|ESTUPRO|VIOLENCIA SEXUAL|ABUSO SEXUAL|ESTUPRO DE VULNERAVEL|LICENCA MATERNIDADE|FEMININO|MULHER NEGRA|MULHERES NEGRAS|MULHERES QUILOMBOLAS|MULHERES INDIGENAS|NEGRAS|NEGRA|RACISMO|RACA|RACIAL|ABUSO SEXUAL|MATERNIDADE|MAE|AMAMENTACAO|SEXUALIDADE|SEXO|GENERO|FEMINISMO|MACHISMO|GUARDA DE FILHOS|GUARDA DOS FILHOS|IGUALDADE DE GENERO|IDENTIDADE DE GENERO|IDEOLOGIA DE GENERO|EDUCACAO SEXUAL|ESCOLA SEM PARTIDO|TRANSEXUAL|TRANSEXUALIDADE|MULHER TRANS|MULHERES TRANS|MUDANCA DE SEXO|READEQUACAO SEXUAL|EXPLORACAO SEXUAL|PROSTITUICAO|ORIENTACAO SEXUAL|HOMOSSEXUAL|HOMOSSEXUALIDADE|HOMOSSEXUALISMO|LESBICA|LESBICAS|DIREITO DOS HOMENS|EDUCACAO RELIGIOSA|DEUS|RELIGIAO|EDUCACAO DOMICILIAR|HOMESCHOOLING|CRECHE|EDUCACAO INFANTIL|CASAMENTO INFANTIL"
So I did this:
seleciona2 = df_projetos_api_final[
    df_projetos_api_final['EmentaMateria'].str.contains(str_choice, na=False)
]
The newly generated dataframe collected many rows whose sentences contain one or more of these words. However, many of the selected rows do not contain any of these words, for example:
"ENCAMINHA AO SENADO FEDERAL, UM ADENDO AS SUGESTOES DE EMENDAS A
PROPOSTA ORCAMENTARIA DO DISTRITO FEDERAL, REFERENTE A ALTERACAO DO
PROGRAMA DE TRABALHO DO FUNDEPE - FUNDO DE DESENVOLVIMENTO DO
DISTRITO FEDERAL, VISANDO A ACRESCENTAR MAIS CZ 3.453.977.000,00
(TRES BILHOES, QUATROCENTOS E CINQUENTA E TRES MILHOES, NOVECENTOS E
SETENTA E SETE MIL CRUZADOS) AO PROJETO DE EXECUCAO DE OBRAS E
EQUIPAMENTOS DO SISTEMA DE EDUCACAO E CULTURA."
Is this because similar words are also being matched in the sentences? Or is it because many sentences have extra whitespace or line breaks between some words?
Edit 7/12/2019
Thank you all for your attention. After reading your answers, I went back and reviewed the database and the suggested code again. I got hold of the original database, with accents in Brazilian Portuguese. I think this was the original problem: I didn't know the original database had been changed.
I found that the database I was working on had been run through unidecode to remove the Portuguese accents. So I repeated the tests with an accented str_choice and the original database, and then it worked. I haven't checked all the rows yet, but all the ones I've seen so far are correct.
So the new str_choice I used (which I named search_list) was this:
df_projetos_api_final['EmentaMateria'] = df_projetos_api_final['EmentaMateria'].str.upper()
search_list = ["MULHER", "MULHERES", "TRABALHO DOMÉSTICO", "VIOLÊNCIA CONTRA A MULHER", "VIOLÊNCIA DOMÉSTICA", "VIOLÊNCIA DE GÊNERO", "MARIA DA PENHA", "ABORTO", "ABORTAMENTO", "INTERRUPÇÃO DE GRAVIDEZ", "INTERRUPÇÃO DE GESTAÇÃO", "DIREITO REPRODUTIVO", "DIREITOS REPRODUTIVOS", "DIREITO À VIDA", "CONCEPÇÃO", "CONTRACEPÇÃO", "CONTRACEPTIVO", "MISOPROSTOL", "MIFEPRISTONE", "CYTOTEC", "ÚTERO", "GESTAÇÃO", "GRAVIDEZ", "PARTO", "VIOLÊNCIA OBSTÉTRICA", "FETO", "BEBÊ", "CRIANÇA", "VIOLÊNCIA SEXUAL", "FEMINICÍDIO", "MORTE DE MULHER", "MORTE DE MULHERES", "HOMICÍDIO DE MULHER", "HOMICÍDIO DE MULHERES", "ASSÉDIO SEXUAL", "ASSÉDIO", "ESTUPRO", "VIOLÊNCIA SEXUAL", "ABUSO SEXUAL", "ESTUPRO DE VULNERÁVEL", "LICENÇA MATERNIDADE", "FEMININO", "MULHER NEGRA", "MULHERES NEGRAS", "MULHERES QUILOMBOLAS", "MULHERES INDÍGENAS", "NEGRAS", "NEGRA", "RACISMO", "RAÇA", "RACIAL", "ABUSO SEXUAL", "MATERNIDADE", "MÃE", "AMAMENTAÇÃO", "SEXUALIDADE", "SEXO", "GÊNERO", "FEMINISMO", "MACHISMO", "GUARDA DE FILHOS", "GUARDA DOS FILHOS", "IGUALDADE DE GÊNERO", "IDENTIDADE DE GÊNERO", "IDEOLOGIA DE GÊNERO", "EDUCAÇÃO SEXUAL", "ESCOLA SEM PARTIDO", "TRANSEXUAL", "TRANSEXUALIDADE", "MULHER TRANS", "MULHERES TRANS", "MUDANÇA DE SEXO", "READEQUAÇÃO SEXUAL", "EXPLORAÇÃO SEXUAL", "PROSTITUIÇÃO", "ORIENTAÇÃO SEXUAL", "HOMOSSEXUAL", "HOMOSSEXUALIDADE", "HOMOSSEXUALISMO", "LÉSBICA", "LÉSBICAS", "DIREITO DOS HOMENS", "EDUCAÇÃO RELIGIOSA", "DEUS", "RELIGIÃO", "EDUCAÇÃO DOMICILIAR", "HOMESCHOOLING", "CRECHE", "EDUCAÇÃO INFANTIL", "CASAMENTO INFANTIL"]
mask = df_projetos_api_final['EmentaMateria'].str.contains('|'.join(search_list))
seleciona = df_projetos_api_final[mask]
seleciona.info()
First, let's take a look at why your code doesn't work. @jorijnsmit gave it away (and shared a useful answer): your regex matches characters regardless of where they are. Let us illustrate with a simpler example, which I will be using throughout:
We want to match the words 'app' and 'he', so we construct a regex much like yours.
strings_to_match = ['app', 'he']
match_pattern = '|'.join(strings_to_match) # "app|he"
We join the strings we want to match using the alternation operator and we're good to go, right? Thanks to the magic of regex101, here are the results of applying our pattern to a few strings (matches are in square brackets):
[he]llo
brot[he]r
[app]lication
[he]
[app]le
h[app]ier
[app]
[he]ll
Our pattern matches the strings 'app' and 'he' anywhere, when we want just the words themselves!
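To see the same effect on data closer to the question, the following sketch asks pandas which keyword produced each hit. The three keywords are only a small subset of the question's list, and the sample sentences are made up for illustration, apart from the first one, which is a fragment of the example quoted in the question.

```python
import pandas as pd

# Small illustrative subset of the question's keywords (the real list is much longer).
keywords = ["MULHER", "RACA", "PARTO"]
pattern = "|".join(keywords)  # "MULHER|RACA|PARTO"

ementas = pd.Series([
    "REFERENTE A ALTERACAO DO PROGRAMA DE TRABALHO DO FUNDEPE",  # fragment from the question
    "AUTORIZA OPERACAO DE CREDITO EXTERNO",                      # made-up sample
    "INSTITUI O DIA DA MULHER",                                  # made-up sample
    "DISPOE SOBRE O ORCAMENTO ANUAL",                            # made-up sample
])

# With expand=False, str.extract returns a Series holding the first keyword
# found in each row, or NaN when nothing matched.
first_hit = ementas.str.extract(f"({pattern})", expand=False)
print(first_hit.tolist())  # ['RACA', 'RACA', 'MULHER', nan]

# 'RACA' is found inside 'ALTERACAO' and 'OPERACAO', not as a word of its own.
print("RACA" in "ALTERACAO")   # True
# With the accents restored, this particular false positive disappears.
print("RAÇA" in "ALTERAÇÃO")   # False
```

This is consistent with the edit in the question: on the accented data, 'RAÇA' no longer occurs inside 'ALTERAÇÃO', so that false positive goes away. Other keywords can still match inside longer words, which is why a check like this on the real data is worth running.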
What can we do to fix this? Our first thought may be to change our pattern to ' app | he ', which does fix the issue with strings like 'application'. Unfortunately, this isn't foolproof. That pattern fails to recognize the word 'app' in 'I downloaded an app.', which is perfectly valid to us. Fortunately for us, regex has just the solution we need: word boundaries, represented by the token '\b', which matches at the edge of a word (between a word character and a non-word character, or at the start or end of the string).
Here are some results for the new pattern, '\bapp\b|\bhe\b':
'[he]'
'apple'
'happier'
' [app] '
'hell'
'I downloaded an [app]!'
Exactly what we expect! While it does work correctly, that pattern is needlessly difficult to read. We can use only one set of word boundary tokens by putting all our substrings into a non-capturing group: '\b(?:app|he)\b'. A capturing group, well, groups and captures a subset of a regex. In this case, the group would return the same thing as the entire match. A non-capturing group eliminates that redundancy while still allowing us to logically separate part of our expression.
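As a small illustration of the difference, here is a sketch using Python's built-in re module. The sample sentence is made up; both patterns find the same text, but only the capturing version stores a separate group.

```python
import re

text = "he downloaded an app, not an apple"  # made-up sample sentence

capturing = re.compile(r"\b(app|he)\b")
non_capturing = re.compile(r"\b(?:app|he)\b")

m_cap = capturing.search(text)
m_non = non_capturing.search(text)

# group(0) is the whole match; groups() lists the captured subgroups.
print(m_cap.group(0), m_cap.groups())  # he ('he',)
print(m_non.group(0), m_non.groups())  # he ()

# Both patterns find the same whole words; 'apple' is skipped by both.
print(non_capturing.findall(text))     # ['he', 'app']
```

In pandas, passing a pattern with a capturing group to str.contains typically triggers a UserWarning that points to str.extract, which is another reason to prefer the non-capturing form when only a True/False answer is needed.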
Here is a complete program which demonstrates constructing the pattern and using it on a Pandas Series:
import pandas as pd
test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
'I downloaded an app.']
test_series = pd.Series(data=test_strs)
strings_to_match = ['app', 'he']
match_pattern = fr"\b(?:{'|'.join(strings_to_match)})\b" # "\b(?:app|he)\b"
match_res = test_series.str.contains(match_pattern, case=False)
Output of print(match_res):
0 False
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 True
9 True
dtype: bool
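The same construction should carry over to keywords like the ones in the question. The sketch below is an assumption about how that might look, not a tested solution for the real data: the three keywords are a subset of the question's list, the sentences are made up, and re.escape is added in case a keyword ever contains a character that is special in a regex.

```python
import re
import pandas as pd

# Illustrative subset of the question's keywords, including a multi-word phrase.
search_list = ["RACA", "MULHER", "MARIA DA PENHA"]

# Escape each keyword, join with |, and wrap once in word boundaries.
pattern = r"\b(?:" + "|".join(re.escape(term) for term in search_list) + r")\b"

ementas = pd.Series([
    "REFERENTE A ALTERACAO DO PROGRAMA DE TRABALHO",  # 'RACA' only inside 'ALTERACAO'
    "PROIBE A DISCRIMINACAO POR RACA OU COR",         # 'RACA' as a whole word
    "ALTERA A LEI MARIA DA PENHA",                    # multi-word phrase
    "INSTITUI O DIA DAS MULHERES",                    # plural is not in this subset
    None,                                             # missing value
])

# na=False turns missing values into False instead of NaN.
mask = ementas.str.contains(pattern, na=False)
print(mask.tolist())  # [False, True, True, False, False]
```

Note that 'MULHERES' is not matched by 'MULHER' once word boundaries are in place, so plural forms have to be listed explicitly, as the question's list already does.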
Please note that the split-based methods below can only match whole words, not arbitrary substrings such as multi-word phrases. As such, they are not actually valid solutions for this particular problem, and are only covered here for the sake of completeness.
This is the same style as the solution by @FBruzzesi, which we will call version 1. For reference:
# Convert the pattern string into a list of strings
str_list = str_choice.split('|')
# Check whether any word is in the sentence after splitting the sentence on spaces
df['has_match'] = df.apply(lambda r: [x for x in str_list if x in r['EmentaMateria'].split(' ')], axis=1)
# This creates a list of the words found; then keep only the rows that have a match
df = df[df.apply(lambda r: len(r['has_match'])>0, axis=1)]
While their solution collects all the matches, we are only concerned with whether or not there is a match. Let's look at a refactored version of their solution, version 2:
import pandas as pd
test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
'I downloaded an app.']
test_series = pd.Series(data=test_strs)
strings_to_match = ['app', 'he']
series_split = test_series.str.split()
match_res = series_split.map(lambda curr_words: any((curr_sub in curr_words for curr_sub in strings_to_match)))
Unlike version 1, version 2 keeps the number of split() operations, which were a real cause for concern (I estimated around 8,000,000 split() operations for the entire column), to a minimum. It should also be more efficient since the number of iterations depends on the number of substrings to match, which should often be lower than the number of words in the string to check.
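To make the limits of the split-based approach concrete, here is a small sketch in the style of version 2. It swaps the list for a set and uses set.isdisjoint, which is my own variation rather than part of either solution above, and the sample strings are made up.

```python
import pandas as pd

test_series = pd.Series([
    "he",
    "apple",
    " app ",
    "I downloaded an app.",   # the token is 'app.', punctuation included
    "maria da penha law",     # a phrase is split into separate tokens
])

# Hypothetical targets, including one multi-word phrase.
targets = {"app", "he", "maria da penha"}

# A row matches when at least one of its whitespace-separated tokens is a target.
match_res = test_series.str.split().map(lambda words: not targets.isdisjoint(words))
print(match_res.tolist())  # [True, False, True, False, False]
```

The last two rows show where this differs from the word-boundary regex: trailing punctuation stays attached to the token, and a multi-word phrase can never equal a single token.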
I have seen a few mentions of the regex parameter which I feel are unclear or misleading. Yes, passing regex=False will match a literal string; no, just changing the parameter won't make your current code work (why would it?).
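A quick sketch of what regex=False actually does with a joined pattern, using made-up strings: the | is no longer an alternation operator, so pandas looks for the literal text 'app|he'.

```python
import pandas as pd

s = pd.Series(["apple", "hello", "app|he"])  # made-up sample strings

# Default behaviour: the pattern is a regex, so | means "or".
as_regex = s.str.contains("app|he", regex=True)
print(as_regex.tolist())    # [True, True, True]

# regex=False: the whole string 'app|he' must appear literally.
as_literal = s.str.contains("app|he", regex=False)
print(as_literal.tolist())  # [False, False, True]
```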
I hope this is the sort of thing you had in mind when you asked for a canonical answer. Let me know if anything is unclear or you have any further questions :)