
How to tokenize natural English text in an input file in python?

Tags:

python

nltk

I want to tokenize an input file in Python. Please suggest how; I am new to Python.

I have read a little about regular expressions, but I am still confused, so please suggest a link or a code overview for this.

asked Dec 09 '22 by Target


2 Answers

Try something like this:

import nltk

with open("myfile.txt") as f:
    file_content = f.read()

tokens = nltk.word_tokenize(file_content)
print(tokens)

The NLTK tutorial is also full of easy-to-follow examples: https://www.nltk.org/book/ch03.html
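Since the question also mentions regular expressions: a very simple tokenizer can be written with Python's built-in re module. This is only a rough sketch; NLTK's word_tokenize handles contractions, punctuation, and edge cases far better:

```python
import re

def simple_tokenize(text):
    # \w+ matches runs of word characters; [^\w\s] matches any
    # single punctuation character that is not whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Don't panic, it's easy!"))
# → ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'easy', '!']
```

Note that this splits contractions like "Don't" on the apostrophe, which is usually not what you want; that is exactly the kind of detail word_tokenize takes care of.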

answered Dec 11 '22 by del


Using NLTK

If your file is small:

  • Open the file with the context manager with open(...) as x,
  • then do a .read() and tokenize it with word_tokenize()

[code]:

from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())

If your file is larger:

  • Open the file with the context manager with open(...) as x,
  • read the file line by line with a for-loop
  • tokenize the line with word_tokenize()
  • output to your desired format (with the write flag set)

[code]:

from __future__ import print_function
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)
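As a follow-up to the "output to your desired format" step, a token frequency count is a common next thing to compute. Here is a minimal sketch using collections.Counter from the standard library (str.split stands in for word_tokenize here so the snippet runs without NLTK; swap it back in for real text):

```python
from collections import Counter

def token_counts(lines):
    # Tally token frequencies across an iterable of lines,
    # e.g. an open file handle; only one line is held in
    # memory at a time, so this also works for large files.
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # replace with word_tokenize(line)
    return counts

counts = token_counts(["the cat sat", "the mat"])
print(counts.most_common(2))
```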

Using SpaCy

from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        # A spaCy Tokenizer is called directly and returns a Doc
        doc = tokenizer(line)
        print(' '.join(token.text for token in doc), end='\n', file=fout)
answered Dec 11 '22 by alvas