
Splitting sentences into words in Python

I am trying to split sentences into words.

words = content.lower().split()

This gives me a list of words like:

'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'

And with this code:

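def clean_up_list(word_list):
    clean_word_list = []
    # Characters to strip; the backslash is escaped so it is a literal "\".
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
    for word in word_list:
        for symbol in symbols:
            word = word.replace(symbol, "")
        # Skip words that were made up entirely of symbols.
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list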

I get something like:

'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'

Notice the word "morningthe" in the list: it originally had "--" between the two words. Is there any way I can split it into two words, like "morning", "the"?

Yun Tae Hwang asked Feb 06 '23 05:02



1 Answer

I would suggest a regex-based solution:

import re

def to_words(text):
    return re.findall(r'\w+', text)

This finds all words, meaning runs of word characters (letters, digits and underscores), and skips symbols, separators and whitespace.

>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
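Applied to the text from the question, the same function splits on the double hyphen as well. This is only a sketch: the `content` string below is a stand-in assumed for illustration, not the asker's actual input.

```python
import re

def to_words(text):
    # \w+ matches runs of letters, digits and underscores
    return re.findall(r'\w+', text)

# Assumed sample input, built from the words shown in the question
content = "Evening, and there was morning--the first day."

words = to_words(content.lower())
print(words)
# ['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
```

Lowercasing before matching keeps the behaviour of the original `content.lower().split()` call.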

Note that if you're only looping over the words, re.finditer is probably a better choice: it returns an iterator of match objects, so you don't have to store the whole list of words at once.
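A minimal sketch of the lazy variant might look like this. The helper name `iter_words` is hypothetical; `re.finditer` yields match objects, so `.group()` is needed to get the matched text.

```python
import re

def iter_words(text):
    # Hypothetical helper: yields one word at a time instead of building a list
    for match in re.finditer(r'\w+', text):
        yield match.group()

for word in iter_words("morning--the first day."):
    print(word)
```

This only pays off for large inputs; for a short string, re.findall is simpler and just as good.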

FlipTack answered Feb 08 '23 15:02
