
How to split a sentence string into words, but also make punctuation a separate element

Tags: python, token, nlp

I'm currently trying to tokenize some language data using Python and was curious if there was an efficient or built-in method for splitting strings of sentences into separate words and also separate punctuation characters. For example:

"Hello, my name is John. What's your name?"

If I use split() on this sentence, I get:

['Hello,', 'my', 'name', 'is', 'John.', "What's", 'your', 'name?']

What I want to get is:

['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']

I've tried searching the string for punctuation, storing the indices, removing the punctuation, splitting the string, and then re-inserting the punctuation at the stored positions, but this seems too inefficient, especially when dealing with large corpora.

Does anybody know if there's a more efficient way to do this?

Thank you.

Sean asked Jul 30 '19 05:07


2 Answers

You can use a simple trick: pad each punctuation mark with spaces, then split on whitespace:

text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ")  # Add a space before and after each comma
text = text.replace(".", " . ")  # Add a space before and after each period
text = text.replace("?", " ? ")  # Same for the question mark in the example
mList = text.split()  # split() with no argument collapses repeated whitespace

Or as a one-liner that reads the sentence from input():

mList = input().replace(",", " , ").replace(".", " . ").replace("?", " ? ").split()
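
The same padding idea can be applied to a larger set of punctuation in a single pass with str.translate. This is only a minimal sketch: the PUNCT set and the helper name pad_and_split are assumptions for illustration, not part of the answer above.

```python
# Hypothetical helper: pad a chosen set of punctuation marks with spaces
# in one pass, then split on whitespace.
PUNCT = ".,?!;:"  # assumed set; the apostrophe is left out so "What's" stays whole

# str.maketrans accepts a dict mapping single characters to replacement strings.
PAD_TABLE = str.maketrans({ch: f" {ch} " for ch in PUNCT})

def pad_and_split(text):
    # split() with no argument collapses the extra spaces added by the padding
    return text.translate(PAD_TABLE).split()

tokens = pad_and_split("Hello, my name is John. What's your name?")
print(tokens)
```

Building the translation table once and reusing it avoids chaining one replace() call per punctuation mark, which may matter on a large corpus.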
Alexandre Aragão answered Sep 23 '22 19:09


Here is an approach using re.finditer which at least seems to work with the sample data you provided:

import re

inp = "Hello, my name is John. What's your name?"
parts = []
for match in re.finditer(r'[^.,?!\s]+|[.,?!]', inp):
    parts.append(match.group())

print(parts)

Output:

['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']

The idea here is to match one of the following two patterns:

[^.,?!\s]+    which matches one or more characters that are neither punctuation nor whitespace (i.e. a word)
[.,?!]        which matches a single punctuation character

Presumably anything which is not whitespace or punctuation should be a matching word/term in the sentence.
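
If you only need the list of tokens, re.findall with the same pattern should give the same result without the explicit loop. A small sketch, where the names TOKEN_RE and tokenize are just illustrative choices:

```python
import re

# Same two-alternative pattern, compiled once so it can be reused
# across many sentences.
TOKEN_RE = re.compile(r"[^.,?!\s]+|[.,?!]")

def tokenize(sentence):
    # findall returns the matched substrings directly, in order of appearance
    return TOKEN_RE.findall(sentence)

print(tokenize("Hello, my name is John. What's your name?"))
```

Compiling the pattern up front is a design choice aimed at the large-corpus case; for a handful of sentences a plain re.findall call would do just as well.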

Note that a neater way to solve this problem would be a regex split on punctuation or whitespace. But re.split only supports splitting on zero-width matches (such as lookarounds) from Python 3.7 onward, so re.finditer is used here as an approach that works on any version.
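
For readers on Python 3.7 or later, where re.split accepts patterns that can match an empty string, the split-based idea can be sketched roughly like this. The helper name split_tokens is hypothetical, and the punctuation set is assumed to be the same four characters used in the finditer pattern.

```python
import re

# Split on runs of whitespace, or at the zero-width positions just before
# and just after a punctuation character. Requires Python 3.7+.
SPLIT_RE = re.compile(r"\s+|(?=[.,?!])|(?<=[.,?!])")

def split_tokens(sentence):
    # A zero-width split at the very end of the string (or around leading or
    # trailing whitespace) can leave empty strings, so filter them out.
    return [tok for tok in SPLIT_RE.split(sentence) if tok]

print(split_tokens("Hello, my name is John. What's your name?"))
```

Whether this reads better than the finditer version is a matter of taste; both rely on the same idea of treating each listed punctuation character as its own token.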

Tim Biegeleisen answered Sep 22 '22 19:09