
Splitting a string into words and punctuation



This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but what to include in the tokens.

Caveats:

  • The underscore (_) counts as an inner-word character. Replace \w if you don't want that (see the sketch after this list).
  • Text quoted with single quotes will not tokenize cleanly: the quotes attach to the neighboring words instead of being split off.
  • Put any additional punctuation marks you want to recognize in the right half of the regular expression.
  • Anything not explicitly mentioned in the re is silently dropped.
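
For instance, a sketch covering the first and third caveats, with [A-Za-z'] standing in for \w (so underscores are no longer word characters) and a couple of extra marks added to the punctuation class:

>>> import re
>>> # the underscore now matches neither alternative, so it's dropped
>>> re.findall(r"[A-Za-z']+|[.,!?;:-]", "snake_case words, and-dashes!")
['snake', 'case', 'words', ',', 'and', '-', 'dashes', '!']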

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
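
A quick demonstration (in Python 3, str patterns are Unicode-aware by default, so the re.UNICODE flag is redundant there, but harmless):

>>> import re
>>> re.findall(r"\w+|[^\w\s]", "Voilà, I'm a résumé!", re.UNICODE)
['Voilà', ',', 'I', "'", 'm', 'a', 'résumé', '!']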


If you are going to work in English (or some other common language), you can use NLTK (there are many other tools for this, such as FreeLing).

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
sentence = "help, me"
nltk.word_tokenize(sentence)
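
For the sentence above this should return ['help', ',', 'me'], with the comma split off as its own token, much like the regex approaches.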

Here's my entry.

I have my doubts about its efficiency, and about whether it catches all cases (note that the "!!!" stays grouped together; this may or may not be a good thing).

>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(str.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
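
A sketch of what that might look like (tokenize_line and the sample lines are just illustrative):

import re

SPLITTER = re.compile(r"(\W+)")  # compiled once, reused for every line

def tokenize_line(line):
    return [item for item in map(str.strip, SPLITTER.split(line)) if item]

for line in ["Helo, my name is Joe!", "and i live!!! in a button; factory:"]:
    print(tokenize_line(line))
# ['Helo', ',', 'my', 'name', 'is', 'Joe', '!']
# ['and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']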


Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might be slightly faster, since ''.join() is used in place of +=; join is generally the faster way to build up strings.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # punctuation: flush any pending word, then emit the mark itself
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # whitespace ends the current word
        if word:
            result.append(word)
            word = ''

if word:
    # flush a word left over at the end of the string
    result.append(word)

print(result)
['Hello', ',', "I'm", 'a', 'string', '!']
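
If you want to lean on the join-is-faster idiom fully, the usual pattern is to collect characters in a list and join once per word; here's a minimal sketch of the same loop restructured that way (tokenize is just an illustrative name):

import string

def tokenize(text):
    result, word = [], []
    for char in text:
        if char in string.whitespace:
            if word:                      # whitespace ends the current word
                result.append(''.join(word))
                word = []
        elif char not in string.ascii_letters + "'":
            if word:                      # punctuation also ends the word
                result.append(''.join(word))
                word = []
            result.append(char)
        else:
            word.append(char)
    if word:                              # flush a trailing word
        result.append(''.join(word))
    return result

print(tokenize("Hello, I'm a string!"))
# ['Hello', ',', "I'm", 'a', 'string', '!']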