Extracting words from a string, removing punctuation and returning a list with separated words

Tags: python, string, list

I was wondering how to implement a function get_words() that returns the words of a string as a list, stripping away the punctuation.

I would like to implement it by replacing every character that is not in string.ascii_letters with '' and then returning the result of .split().

def get_words(text):
    '''The function should take one argument which is a string'''
    return text.split()

For example:

>>> get_words('Hello world, my name is...James!')

returns:

['Hello', 'world', 'my', 'name', 'is', 'James']

asked Oct 03 '11 09:10

James Smith

2 Answers

This has nothing to do with splitting and punctuation; you just care about the letters (and numbers), and just want a regular expression:

import re

def get_words(text):
    # \w+ matches each run of letters, digits and underscores
    return re.compile(r'\w+').findall(text)

Demo:

>>> re.compile(r'\w+').findall('Hello world, my name is...James the 2nd!')
['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']

If you don't want digits, replace \w with [A-Za-z] for ASCII letters only, or [A-Za-z'] to also keep contractions such as don't together. Note that \w matches the underscore as well as letters and digits. Other regex character classes can probably match alphabetic but non-numeric characters more generally (e.g. letters with accents).
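A quick sketch of how these character-class choices behave, assuming Python 3, where \w on a str pattern matches Unicode letters, digits and the underscore. The pattern [^\W\d_]+ is one option (not taken from the answer itself) for matching letters only, accents included, and the helper name find_words is made up for illustration.

```python
import re

def find_words(pattern, text):
    """Return every non-overlapping match of `pattern` in `text` (hypothetical helper)."""
    return re.findall(pattern, text)

sample = "Élise isn't the 2nd_one"

ascii_only = find_words(r"[A-Za-z]+", sample)         # ASCII letters only
with_apostrophe = find_words(r"[A-Za-z']+", sample)   # keeps contractions together
word_chars = find_words(r"\w+", sample)               # letters, digits and underscore
letters_only = find_words(r"[^\W\d_]+", sample)       # Unicode letters, no digits or underscore

print(ascii_only)       # ['lise', 'isn', 't', 'the', 'nd', 'one']
print(with_apostrophe)  # ['lise', "isn't", 'the', 'nd', 'one']
print(word_chars)       # ['Élise', 'isn', 't', 'the', '2nd_one']
print(letters_only)     # ['Élise', 'isn', 't', 'the', 'nd', 'one']
```

Which class is right depends on what should count as a word: the ASCII classes drop the accented É, while \w keeps digits and underscores attached.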

I almost answered this question here: Split Strings with Multiple Delimiters?

But your question is actually under-specified: Do you want 'this is: an example' to be split into:

['this', 'is', 'an', 'example']
or ['this', 'is', '', 'an', 'example']?

I assumed it was the first case.
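To make the two readings concrete, here is a small sketch, assuming a "word" means a run of ASCII letters. Splitting on each single non-letter leaves an empty string wherever two delimiters are adjacent, while matching runs of letters does not.

```python
import re

text = 'this is: an example'

# Split on every single non-letter: ':' followed by ' ' yields an empty string.
split_each = re.split(r'[^A-Za-z]', text)

# Match runs of letters instead: adjacent delimiters are skipped together.
runs = re.findall(r'[A-Za-z]+', text)

print(split_each)  # ['this', 'is', '', 'an', 'example']
print(runs)        # ['this', 'is', 'an', 'example']
```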

['this', 'is', 'an', 'example'] is what I want. Is there a method without importing regex? If we just replace the characters that are not in ascii_letters with '' and then split the string into a list of words, would that work? – James Smith 2 mins ago

The regexp is the most elegant, but yes, you could do this as follows:

def get_words(text):
    """
        Returns a list of words, where a word is defined as a
        maximal run of alphanumeric characters, as defined by
        str.isalnum()

        >>> get_words('Hello world, my name is... Élise!')  # works in python3
        ['Hello', 'world', 'my', 'name', 'is', 'Élise']
    """
    return ''.join((c if c.isalnum() else ' ') for c in text).split()

Use c.isalpha() instead of c.isalnum() if digits should not count as part of a word.
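For the variant asked about in the comment, restricting words to string.ascii_letters, a minimal sketch could look like the following. It assumes non-letters are replaced with a space rather than '': replacing them with the empty string would fuse is...James into isJames before the split. The name get_ascii_words is hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)

def get_ascii_words(text):
    """Split text into runs of ASCII letters, without using the re module."""
    # Turn every non-letter into a space so neighbouring words stay separated.
    cleaned = ''.join(c if c in LETTERS else ' ' for c in text)
    return cleaned.split()

print(get_ascii_words('Hello world, my name is...James!'))
# ['Hello', 'world', 'my', 'name', 'is', 'James']
```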

Sidenote: You could also do the following, though it requires importing another standard-library module:

from itertools import groupby

# groupby is generally overkill and makes for unreadable code
# ... but is fun

def get_words(text):
    return [
        ''.join(chars)
        for is_word, chars in groupby(text, lambda c: c.isalnum())
        if is_word
    ]

If this is homework, they're probably looking for an imperative thing like a two-state Finite State Machine where the state is "was the last character a letter" and if the state changes from letter -> non-letter then you output a word. Don't do that; it's not a good way to program (though sometimes the abstraction is useful).
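For reference, the two-state approach described above might be sketched like this. It illustrates the idea and is not a recommendation; the name get_words_fsm is made up here. The state is whether the previous character was a letter, and a word is emitted when that state flips from letter to non-letter, plus once at the end of the input, a detail the description leaves implicit.

```python
def get_words_fsm(text):
    """Collect runs of letters by tracking whether we are inside a word."""
    words = []
    current = []          # characters of the word being built
    in_word = False       # state: was the last character a letter?
    for c in text:
        if c.isalpha():
            current.append(c)
            in_word = True
        elif in_word:
            # Transition letter -> non-letter: output the finished word.
            words.append(''.join(current))
            current = []
            in_word = False
    if in_word:
        # The input ended while still inside a word.
        words.append(''.join(current))
    return words

print(get_words_fsm('Hello world, my name is...James!'))
# ['Hello', 'world', 'my', 'name', 'is', 'James']
```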

answered Nov 09 '22 00:11

ninjagecko

Try to use re:

>>> import re
>>> [w for w in re.split(r'\W', 'Hello world, my name is...James!') if w]
['Hello', 'world', 'my', 'name', 'is', 'James']

Although I'm not sure that it will catch all your use cases.
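One thing worth seeing is why the if w filter is needed. In this small sketch, using the same input, re.split on single non-word characters leaves empty strings between adjacent delimiters and after the trailing '!':

```python
import re

text = 'Hello world, my name is...James!'

# Splitting on each single non-word character keeps the empty pieces.
raw = re.split(r'\W', text)

# Dropping the empty strings recovers just the words.
words = [w for w in raw if w]

print(raw)
print(words)  # ['Hello', 'world', 'my', 'name', 'is', 'James']
```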

If you want to solve it another way, you can specify the characters that you want in the result:

>>> import re, string
>>> re.findall('[%s]+' % string.ascii_letters, 'Hello world, my name is...James!')
['Hello', 'world', 'my', 'name', 'is', 'James']

answered Nov 09 '22 01:11

Roman Bodnarchuk
