
Remove all articles, connector words, etc., from a string in Python

Tags: python, string

I have a list that contains many sentences. I want to iterate through the list, removing from all sentences words like "and", "the", "a", "are", etc.

I tried this:

def removearticles(text):
    articles = {'a': '', 'an': '', 'and': '', 'the': ''}
    for i, j in articles.items():
        text = text.replace(i, j)
    return text

As you can probably tell, however, this will also remove "a" and "an" when they appear in the middle of a word. I need to remove only the instances of these words that are delimited by whitespace, not those inside another word. What is the most efficient way of going about this?
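
To make the failure concrete, here is a small reproduction of the substring problem described above. The function name `remove_articles_naive` and the sample sentence are illustrative only, and `items()` is used in place of `iteritems()` so the snippet also runs on Python 3.

```python
def remove_articles_naive(text):
    # str.replace works on substrings, so it has no notion of whole words.
    articles = {'a': '', 'an': '', 'and': '', 'the': ''}
    for word, replacement in articles.items():
        text = text.replace(word, replacement)
    return text


result = remove_articles_naive("the cat and a banana")
# Every "a" is gone, including the ones inside "cat" and "banana".
print(repr(result))
```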

Parseltongue asked Jan 17 '11 03:01




1 Answer

I would go for regex, something like:

import re

def removearticles(text):
  return re.sub(r'(\s+)(a|an|and|the)(\s+)', r'\1\3', text)

or if you want to remove the leading whitespace as well:

def removearticles(text):
  return re.sub(r'\s+(a|an|and|the)(\s+)', r'\2', text)
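
The patterns above require whitespace on both sides, so they skip a word at the very start or end of the string, and they can miss the second of two adjacent words (as in "and the") because the shared space is consumed by the first match. A possible variation, sketched here under my own assumptions, uses `\b` word boundaries instead, or avoids regex entirely by splitting on whitespace. The names `STOP_WORDS`, `remove_stop_words` and `remove_stop_words_split`, the case-insensitive matching, and the whitespace collapsing are all illustrative choices, not part of the answer above.

```python
import re

# Hypothetical word list; extend it to whatever should be dropped.
STOP_WORDS = ("a", "an", "and", "the", "are")

# \b matches only at word boundaries, so the "a" inside "banana" is untouched.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in STOP_WORDS) + r")\b",
    re.IGNORECASE,
)


def remove_stop_words(text):
    """Regex variant: delete whole-word matches, then tidy the spacing."""
    stripped = _PATTERN.sub("", text)
    # split() with no arguments drops the whitespace runs left behind.
    return " ".join(stripped.split())


def remove_stop_words_split(text, stop_words=frozenset(STOP_WORDS)):
    """Split variant: keep only whitespace-delimited tokens not in the set."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


sentences = ["The cat and a banana are an example"]
cleaned = [remove_stop_words(s) for s in sentences]
print(cleaned)
```

The two variants differ on punctuation: `\b` treats a hyphen or apostrophe as a boundary, so "a-frame" would lose its "a", while the split variant only drops tokens that match exactly, so "the," with a trailing comma is kept. Which behaviour is preferable depends on the input.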
Nemo157 answered Nov 15 '22 08:11