I currently have a file that contains a list that looks like this:
example = ['Mary had a little lamb',
           'Jack went up the hill',
           'Jill followed suit',
           'i woke up suddenly',
           'it was a really bad dream...']
"example" is a list of such sentences , and i want the output to look as :
mod_example = ["'Mary' 'had' 'a' 'little' 'lamb'" , 'Jack' 'went' 'up' 'the' 'hill' ....]
and so on.
I need the sentences kept separate, with each word tokenized, so that I can compare each word from a sentence of mod_example (one at a time, using a for loop) with a reference sentence.
I tried this:
for sentence in example:
    text3 = sentence.split()
    print(text3)
and got the following output:
['it', 'was', 'a', 'really', 'bad', 'dream...']
How do I get this for all the sentences? It keeps overwriting. Also, is my approach right? The output should remain a list of sentences, with the words tokenized. Thanks!
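For reference, the minimal fix for the overwriting is to append each result to a list instead of rebinding text3 every iteration; a quick sketch:

mod_example = []
for sentence in example:
    # collect each tokenized sentence instead of discarding the previous one
    mod_example.append(sentence.split())
print(mod_example)  # [['Mary', 'had', 'a', 'little', 'lamb'], ['Jack', 'went', 'up', 'the', 'hill'], ...]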
Word tokenize: we use the word_tokenize() method to split a sentence into tokens or words. Sentence tokenize: we use the sent_tokenize() method to split a document or paragraph into sentences.
word_tokenize(), which returns a list, produces no empty strings, whereas split() with an explicit delimiter keeps an empty string when the delimiter appears twice in succession. Note that str.split() cannot take a regex as the delimiter; for that you need re.split().
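For instance, a quick sketch of the difference (this assumes NLTK is installed and its punkt tokenizer models are downloaded):

from nltk.tokenize import sent_tokenize, word_tokenize

doc = 'Mary had a little lamb. Jack went up the hill.'
print(sent_tokenize(doc))  # ['Mary had a little lamb.', 'Jack went up the hill.']
print(word_tokenize(doc))  # ['Mary', 'had', 'a', 'little', 'lamb', '.', 'Jack', 'went', 'up', 'the', 'hill', '.']
print(doc.split())         # ['Mary', 'had', 'a', 'little', 'lamb.', 'Jack', 'went', 'up', 'the', 'hill.']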
You could use the word tokenizer in NLTK (http://nltk.org/api/nltk.tokenize.html) with a list comprehension (see http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions). You may need to download the tokenizer models first with nltk.download('punkt').
>>> from nltk.tokenize import word_tokenize
>>> example = ['Mary had a little lamb' ,
... 'Jack went up the hill' ,
... 'Jill followed suit' ,
... 'i woke up suddenly' ,
... 'it was a really bad dream...']
>>> tokenized_sents = [word_tokenize(i) for i in example]
>>> for i in tokenized_sents:
...     print(i)
...
['Mary', 'had', 'a', 'little', 'lamb']
['Jack', 'went', 'up', 'the', 'hill']
['Jill', 'followed', 'suit']
['i', 'woke', 'up', 'suddenly']
['it', 'was', 'a', 'really', 'bad', 'dream', '...']
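Since you want to compare each word against a reference sentence, you can then loop over the tokenized sentences; a minimal sketch (the reference sentence here is made up for illustration):

>>> reference = word_tokenize('Mary had a little lamb')
>>> for sent in tokenized_sents:
...     print([w for w in sent if w in reference])
...
['Mary', 'had', 'a', 'little', 'lamb']
[]
[]
[]
['a']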
I wrote this script to show everyone how tokenization works, so they can build their own Natural Language Processing engine themselves.
import re
from contextlib import redirect_stdout
from io import StringIO

example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'

def token_to_sentence(text):
    f = StringIO()
    with redirect_stdout(f):
        # capture runs of word characters and spaces, stopping at punctuation
        regex_of_sentence = re.findall(r'([\w\s]{0,})[^\w\s]', text)
        regex_of_sentence = [x for x in regex_of_sentence if x != '']
        for i in regex_of_sentence:
            print(i)
    first_step_to_sentence = f.getvalue().splitlines()
    g = StringIO()
    with redirect_stdout(g):
        # strip the leading space left over from the first pass
        for i in first_step_to_sentence:
            try:
                regex_to_clear_sentence = re.search(r'^\s([\w\s]{0,})', i)
                print(regex_to_clear_sentence.group(1))
            except AttributeError:
                # no leading space: keep the sentence as it is
                print(i)
    sentence = g.getvalue().splitlines()
    return sentence

def token_to_words(sentences):
    f = StringIO()
    with redirect_stdout(f):
        for i in sentences:
            # capture each run of word characters as a word
            regex_of_word = re.findall(r'([\w]{0,})', i)
            regex_of_word = [x for x in regex_of_word if x != '']
            for word in regex_of_word:
                print(word)
    words = f.getvalue().splitlines()
    return words
Here I use a different approach: I restart the process from a paragraph, so everyone can follow the word-processing steps better. The paragraph to process is:
example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'
Tokenize the paragraph into sentences:
sentence = token_to_sentence(example)
will give:
['Mary had a little lamb', 'Jack went up the hill', 'Jill followed suit', 'i woke up suddenly', 'it was a really bad dream']
Tokenize the sentences into words:
words = token_to_words(sentence)
will give:
['Mary', 'had', 'a', 'little', 'lamb', 'Jack', 'went', 'up', 'the', 'hill', 'Jill', 'followed', 'suit', 'i', 'woke', 'up', 'suddenly', 'it', 'was', 'a', 'really', 'bad', 'dream']
Let me explain how this works.
First, I use a regex to capture all the words, plus the spaces that separate them, stopping when a punctuation character is found. The regex is:
([\w\s]{0,})[^\w\s]
so the computation takes the words and spaces captured in the parentheses:
'(Mary had a little lamb),( Jack went up the hill, Jill followed suit),( i woke up suddenly),( it was a really bad dream)...'
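For example, the raw findall output on the paragraph above looks like this (note the two trailing empty strings produced by the '...'):

>>> import re
>>> re.findall(r'([\w\s]{0,})[^\w\s]', example)
['Mary had a little lamb', ' Jack went up the hill', ' Jill followed suit', ' i woke up suddenly', ' it was a really bad dream', '', '']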
The result is still not clean; it contains some empty strings, so I use this to remove them:
[x for x in regex_of_sentence if x != '']
Now the paragraph is tokenized into sentences, but they are not clean yet; the result is:
['Mary had a little lamb', ' Jack went up the hill', ' Jill followed suit', ' i woke up suddenly', ' it was a really bad dream']
As you can see, some sentences start with a space. To get clean sentences without the leading space, I use this regex:
^\s([\w\s]{0,})
It makes clean sentences like:
['Mary had a little lamb', 'Jack went up the hill', 'Jill followed suit', 'i woke up suddenly', 'it was a really bad dream']
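A quick check of that second pass on one of the offending sentences:

>>> re.search(r'^\s([\w\s]{0,})', ' Jack went up the hill').group(1)
'Jack went up the hill'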
So we need two passes to get a good result.
The answer to your question starts here...
To tokenize the sentences into words, I iterate over them and use a regex that captures just the words as it iterates:
([\w]{0,})
and clear out the empty strings again with:
[x for x in regex_of_word if x != '']
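For example, on a single sentence, the raw matches and the filtered result look like:

>>> re.findall(r'([\w]{0,})', 'Mary had a little lamb')
['Mary', '', 'had', '', 'a', '', 'little', '', 'lamb', '']
>>> [x for x in _ if x != '']
['Mary', 'had', 'a', 'little', 'lamb']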
So the result is really clean: only the list of words:
['Mary', 'had', 'a', 'little', 'lamb', 'Jack', 'went', 'up', 'the', 'hill', 'Jill', 'followed', 'suit', 'i', 'woke', 'up', 'suddenly', 'it', 'was', 'a', 'really', 'bad', 'dream']
In the future, to build a good NLP system, you will want your own phrase database: search each sentence for known phrases first, and whatever remains after extracting the phrases is plain words.
With this method I can build NLP tools for my own language (Bahasa Indonesia), which really lacks ready-made modules.
Edit:
I missed the part of your question about comparing the words. So you have another list of words to compare against... I'll give you a bonus: not just the comparison, but also how to count the matches.
mod_example = ["'Mary' 'had' 'a' 'little' 'lamb'" , 'Jack' 'went' 'up' 'the' 'hill']
In this case, the steps you must take are: 1. iterate over the words, 2. compare each one with the words from mod_example, 3. make some calculations.
So the script will be:
import re
from contextlib import redirect_stdout
from io import StringIO

example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'
mod_example = ['Mary', 'had', 'a', 'little', 'lamb', 'Jack', 'went', 'up', 'the', 'hill']

def token_to_sentence(text):
    f = StringIO()
    with redirect_stdout(f):
        # capture runs of word characters and spaces, stopping at punctuation
        regex_of_sentence = re.findall(r'([\w\s]{0,})[^\w\s]', text)
        regex_of_sentence = [x for x in regex_of_sentence if x != '']
        for i in regex_of_sentence:
            print(i)
    first_step_to_sentence = f.getvalue().splitlines()
    g = StringIO()
    with redirect_stdout(g):
        # strip the leading space left over from the first pass
        for i in first_step_to_sentence:
            try:
                regex_to_clear_sentence = re.search(r'^\s([\w\s]{0,})', i)
                print(regex_to_clear_sentence.group(1))
            except AttributeError:
                print(i)
    sentence = g.getvalue().splitlines()
    return sentence

def token_to_words(sentences):
    f = StringIO()
    with redirect_stdout(f):
        for i in sentences:
            # capture each run of word characters as a word
            regex_of_word = re.findall(r'([\w]{0,})', i)
            regex_of_word = [x for x in regex_of_word if x != '']
            for word in regex_of_word:
                print(word)
    words = f.getvalue().splitlines()
    return words

def convert_to_words(text):
    # convenience wrapper: paragraph -> sentences -> words
    sentences = token_to_sentence(text)
    return token_to_words(sentences)

def compare_list_of_words__to_another_list_of_words(from_strA, to_strB):
    fromA = list(set(from_strA))  # deduplicate the words to match
    totalB = len(to_strB)
    for word_to_match in fromA:
        number_of_match = to_strB.count(word_to_match)
        percent_of_match = number_of_match / totalB * 100
        print('words: -- ' + word_to_match + ' --' + '\n'
              '    number of match : ' + str(number_of_match) + ' from ' + str(totalB) + '\n'
              '    percent of match : ' + str(percent_of_match) + ' percent')

# the preparation is done, now we will use it; the process starts with the script below:
if __name__ == '__main__':
    # tokenize the paragraph in example into sentences:
    getsentences = token_to_sentence(example)
    # tokenize the sentences into words:
    getwords = token_to_words(getsentences)
    # compare the list of words in getwords with the list of words in mod_example:
    compare_list_of_words__to_another_list_of_words(getwords, mod_example)
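For each distinct word in getwords, this prints a block like the following (the order varies from run to run because set() is unordered):

words: -- Mary --
    number of match : 1 from 10
    percent of match : 10.0 percent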
Alternatively, you can get the unique words with plain split() and no extra libraries:

first_split = []
for i in example:
    # split each sentence on whitespace
    first_split.append(i.split())

second_split = []
for j in first_split:
    for k in j:
        # k is already a single word; split() just wraps it in a one-element list
        second_split.append(k.split())

final_list = []
for m in second_split:
    for n in m:
        # keep each word only once
        if n not in final_list:
            final_list.append(n)
print(final_list)
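This prints the deduplicated words; note that plain split() leaves the punctuation attached to 'dream...':

['Mary', 'had', 'a', 'little', 'lamb', 'Jack', 'went', 'up', 'the', 'hill', 'Jill', 'followed', 'suit', 'i', 'woke', 'suddenly', 'it', 'was', 'really', 'bad', 'dream...']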