
Splitting a string into words and punctuation



This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but what to include in the tokens.

Caveats:

  • The underscore (_) counts as an inner-word character. Replace \w if you don't want that (see the sketch after this list).
  • Text quoted with single quotes will not tokenize cleanly: the quotes attach to the neighboring words instead of being split off.
  • Put any additional punctuation marks you want to recognize in the right half of the regular expression.
  • Anything not explicitly mentioned in the re is silently dropped.
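
For instance, a sketch covering the first and third caveats, with [A-Za-z'] standing in for \w (so underscores are no longer word characters) and a couple of extra marks added to the punctuation class:

>>> import re
>>> # the underscore now matches neither alternative, so it's dropped
>>> re.findall(r"[A-Za-z']+|[.,!?;:-]", "snake_case words, and-dashes!")
['snake', 'case', 'words', ',', 'and', '-', 'dashes', '!']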

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
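
A quick demonstration (in Python 3, str patterns are Unicode-aware by default, so the re.UNICODE flag is redundant there, but harmless):

>>> import re
>>> re.findall(r"\w+|[^\w\s]", "Voilà, I'm a résumé!", re.UNICODE)
['Voilà', ',', 'I', "'", 'm', 'a', 'résumé', '!']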


If you are going to work in English (or some other common language), you can use NLTK (there are many other tools for this, such as FreeLing).

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
sentence = "help, me"
nltk.word_tokenize(sentence)
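
For the sentence above this should return ['help', ',', 'me'], with the comma split off as its own token, much like the regex approaches.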

Here's my entry.

I have my doubts about its efficiency, and about whether it catches all cases (note that the "!!!" stays grouped together; this may or may not be a good thing).

>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(str.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
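
A sketch of what that might look like (tokenize_line and the sample lines are just illustrative):

import re

SPLITTER = re.compile(r"(\W+)")  # compiled once, reused for every line

def tokenize_line(line):
    return [item for item in map(str.strip, SPLITTER.split(line)) if item]

for line in ["Helo, my name is Joe!", "and i live!!! in a button; factory:"]:
    print(tokenize_line(line))
# ['Helo', ',', 'my', 'name', 'is', 'Joe', '!']
# ['and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']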


Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might be slightly faster, since ''.join() is used in place of +=; join is generally the faster way to build up strings.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # punctuation: flush any pending word, then emit the mark itself
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # whitespace ends the current word
        if word:
            result.append(word)
            word = ''

if word:
    # flush a word left over at the end of the string
    result.append(word)

print(result)
['Hello', ',', "I'm", 'a', 'string', '!']
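
If you want to lean on the join-is-faster idiom fully, the usual pattern is to collect characters in a list and join once per word; here's a minimal sketch of the same loop restructured that way (tokenize is just an illustrative name):

import string

def tokenize(text):
    result, word = [], []
    for char in text:
        if char in string.whitespace:
            if word:                      # whitespace ends the current word
                result.append(''.join(word))
                word = []
        elif char not in string.ascii_letters + "'":
            if word:                      # punctuation also ends the word
                result.append(''.join(word))
                word = []
            result.append(char)
        else:
            word.append(char)
    if word:                              # flush a trailing word
        result.append(''.join(word))
    return result

print(tokenize("Hello, I'm a string!"))
# ['Hello', ',', "I'm", 'a', 'string', '!']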