
Python or Java for text processing (text mining, information retrieval, natural language processing) [closed]

I'm about to start a new project that involves a lot of text processing tasks such as searching, categorization/classification, clustering, and so on.

There will be a huge number of documents to process, probably millions. After the initial processing, the system also has to handle daily updates with multiple new documents.

Can I use Python to do this, or is Python too slow? Is it best to use Java?

If possible, I would prefer Python since that's what I have been using lately. Plus, I would finish the coding part much faster. But it all depends on Python's speed. I have used Python for some small-scale text processing tasks with only a couple of thousand documents, but I am not sure how well it scales up.

kga asked May 17 '11 11:05



3 Answers

Both are good. Java has a lot of momentum in text processing. Stanford's text processing system, OpenNLP, UIMA, and GATE seem to be the big players (I know I am missing some). You can literally run the StanfordNLP module on a large corpus after a few minutes of playing with it. However, it has major memory requirements (3 GB or so when I was using it).

NLTK, Gensim, Pattern, and many other Python modules are very good at text processing. Their memory usage and performance are very reasonable.

Python scales up because text processing is a very easily scalable problem. You can use multiprocessing very easily when parsing/tagging/chunking/extracting documents. Once you get your text into any sort of feature vector, you can use numpy arrays, and we all know how great numpy is...
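To illustrate the multiprocessing point, here is a minimal sketch using only the standard library. The function names (`count_terms`, `merge_counts`, `process_corpus`) and the toy regex tokenizer are my own placeholders, not anything from NLTK or the other libraries mentioned; a real pipeline would swap in its own parsing or tagging step for `count_terms`.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Toy tokenizer (an assumption for this sketch): runs of lowercase letters/digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def count_terms(text):
    """Count the tokens of one document; stands in for real parsing/tagging work."""
    return Counter(TOKEN_RE.findall(text.lower()))


def merge_counts(per_doc_counts):
    """Fold an iterable of per-document Counters into one corpus-wide Counter."""
    total = Counter()
    for counts in per_doc_counts:
        total.update(counts)
    return total


def process_corpus(docs, workers=2, chunksize=100):
    """Spread the per-document work over worker processes and merge the results."""
    with Pool(processes=workers) as pool:
        # Documents are independent, so the order of results does not matter.
        return merge_counts(pool.imap_unordered(count_terms, docs, chunksize=chunksize))


if __name__ == "__main__":
    docs = ["Python scales up", "Java scales up too"] * 1000
    totals = process_corpus(docs)
    print(totals.most_common(3))
```

Because each document is processed independently, the same shape should carry over to heavier per-document work; whether it actually speeds things up depends on how expensive that work is compared with the cost of passing documents between processes.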

I learned with NLTK, and Python has helped me greatly in reducing development time, so I suggest you give that a shot first. The NLTK project also has a very helpful mailing list, which I suggest you join.

If you have custom scripts, you might want to check out how well they perform with PyPy.

Chris answered Oct 15 '22 23:10


It's very difficult to answer questions like this without trying. So why don't you:

  1. Figure out what would be a difficult operation
  2. Implement that (and I mean the simplest, quickest hack that you can make work)
  3. Run it with a lot of data, and see how long it takes
  4. Figure out if it's too slow

I've done this in the past, and it's really the way to find out whether an approach performs well enough for your needs.
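The steps above could look something like the following sketch. Everything in it is an assumption chosen for illustration: the "difficult operation" is taken to be building a naive inverted index, the documents are synthetic random words, and the helper names are hypothetical. Real documents and a real workload will give different numbers.

```python
import random
import time


def make_documents(n_docs, words_per_doc=200, seed=0):
    """Generate synthetic documents; a placeholder for a sample of real data."""
    rng = random.Random(seed)
    vocab = [f"word{i}" for i in range(5000)]
    return [" ".join(rng.choices(vocab, k=words_per_doc)) for _ in range(n_docs)]


def build_inverted_index(docs):
    """Quick hack: map each term to the set of ids of documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def time_operation(operation, docs):
    """Return the wall-clock seconds one call of operation(docs) takes."""
    start = time.perf_counter()
    operation(docs)
    return time.perf_counter() - start


if __name__ == "__main__":
    sample = make_documents(10_000)
    elapsed = time_operation(build_inverted_index, sample)
    per_doc = elapsed / len(sample)
    print(f"{len(sample)} docs in {elapsed:.2f} s")
    # Linear extrapolation is only a rough guide; memory pressure can change the picture.
    print(f"naive estimate for 1,000,000 docs: {per_doc * 1_000_000 / 60:.1f} min")
```

Running it at a few different sample sizes shows whether the time grows roughly linearly, which is more informative than a single measurement.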

StackExchange saddens dancek answered Oct 16 '22 00:10


Just write it. The biggest flaw programmers have is premature optimization. Work on the project, write it out, and get it working. Then go back, fix the bugs, and make sure it's optimized. A number of people are going to harp on about the speed of x vs. y and how y is better than x, but at the end of the day it's just a language. It's not what a language is but how it does it.

Jakob Bowyer answered Oct 16 '22 00:10