
Advantages of creating your own corpus in NLTK

I have a large amount of text in MySQL tables. I want to do some statistical analysis and, later on, some NLP on my text using NLTK. I have two choices:

  1. Extract all the text at once from my DB table (maybe putting it in a file if needed) and use the NLTK functions.
  2. Extract the text and turn it into a "corpus" that can be used with NLTK.

The latter seems quite complicated, and I haven't found any articles that actually describe how to do it. I only found this: Creating a MongoDB backed corpus reader, which uses MongoDB as its database; the code is quite complicated and also requires knowing MongoDB. On the other hand, the former seems really straightforward but adds the overhead of extracting the texts from the DB.

Now the question is: what are the advantages of a corpus in NLTK? In other words, if I take the challenge and dig into overriding NLTK methods so that it can read from a MySQL database, would it be worth the hassle? Does turning my text into a corpus give me something that I cannot do (or can do only with a lot of difficulty) with ordinary NLTK functions?

Also, if you know something about connecting MySQL to NLTK, please let me know. Thanks.

Hossein asked Feb 15 '11 11:02



1 Answer

Well, after reading a lot I found the answer. There are several very useful functions, such as collocations, search, common_contexts and similar, that can be used on texts loaded into NLTK. Implementing them yourself takes quite some time. If I select my text from the database, put it in a file and use the nltk.Text function, then I can use all the functions mentioned above without writing many lines of code or overriding methods to connect to MySQL. Here is the link for more info: nltk.Text
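A minimal sketch of that approach, under a few assumptions: an in-memory SQLite database stands in for the MySQL table so the snippet runs on its own, the `documents` table and `body` column are hypothetical names, and the intermediate file is skipped by tokenizing the selected rows directly. With MySQL, only the connection setup should need to change.

```python
import sqlite3

from nltk.text import Text
from nltk.tokenize import wordpunct_tokenize

# Stand-in for the MySQL table: an in-memory SQLite database with a
# hypothetical documents(body) table (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (body TEXT)")
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [
        ("The quick brown fox jumps over the lazy dog.",),
        ("The quick red fox runs past the lazy cat.",),
        ("A slow brown bear walks past the lazy dog.",),
    ],
)

# Pull every row and join the text into one string.
rows = conn.execute("SELECT body FROM documents").fetchall()
raw = "\n".join(body for (body,) in rows)
conn.close()

# wordpunct_tokenize is regex-based, so it needs no downloaded NLTK data.
tokens = wordpunct_tokenize(raw)
text = Text(tokens)

fox_count = text.count("fox")         # occurrences of a single token
hits = text.concordance_list("lazy")  # keyword-in-context matches as a list

text.concordance("lazy")              # prints each match with its context
text.similar("dog")                   # prints words seen in similar contexts
text.common_contexts(["dog", "cat"])  # prints contexts shared by both words
```

`text.collocations()` is called the same way, but as far as I know it relies on NLTK's stopwords corpus, so `nltk.download('stopwords')` has to be run once beforehand. On a corpus this small most of these calls will find little or nothing; they become useful once the full table is loaded.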

Hossein answered Sep 20 '22 11:09
