What are good starting points for someone interested in natural language processing? [closed]

Question

So I've recently come up with some new possible projects that would have to deal with deriving 'meaning' from text submitted and generated by users.

Natural language processing is the field that deals with these kinds of issues, and after some initial research I found the OpenNLP Hub and university collaborations like the attempto project. And stackoverflow has this.

If anyone could link me to some good resources, from research papers and introductory texts to APIs, I'd be happier than a 6-year-old kid opening his Christmas presents!

Update

Through one of your recommendations I've found OpenCyc ('the world's largest and most complete general knowledge base and commonsense reasoning engine'). Even more amazing, there's a project called UMBEL that is a distilled version of OpenCyc. It features semantic data in RDF/OWL/SKOS N3 syntax.

I've also stumbled upon antlr, a parser generator for 'constructing recognizers, interpreters, compilers, and translators from grammatical descriptions'.

And there's a question on here by me, that lists tons of free and open data.

Thanks stackoverflow community!


asked Oct 17 '08 13:10

kitsune

1 Answer

Tough call, NLP is a much wider field than most people think it is. Basically, language can be split up into several categories, which will require you to learn totally different things.

Before I start, let me tell you that I doubt you'll have any notable success (as a professional, at least) without having a degree in some (closely related) field. There is a lot of theory involved, most of it is dry stuff and hard to learn. You'll need a lot of endurance and most of all: time.

If you're interested in the meaning of text, well, that's the Next Big Thing. Semantic search engines are predicted to usher in Web 3.0, but we're far from 'there' yet. Extracting logic from a text depends on several steps:

Tokenization, Chunking
Disambiguation on a lexical level (Time flies like an arrow, but fruit flies like a banana.)
Syntactic Parsing
Morphological analysis (tense, aspect, case, number, whatnot)
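To make the first two steps a bit more concrete, here is a minimal sketch in plain Python. The `LEXICON` table, the tag names and the helper functions are all made up for illustration; a real system would use a proper tokenizer and a much larger lexicon or a trained tagger. It only shows why "flies" and "like" need disambiguating in the example sentence.

```python
import re

# Hypothetical toy lexicon: each word maps to the parts of speech it may take.
LEXICON = {
    "time": {"NOUN", "VERB"},
    "fruit": {"NOUN"},
    "flies": {"NOUN", "VERB"},
    "like": {"VERB", "PREP"},
    "an": {"DET"},
    "a": {"DET"},
    "arrow": {"NOUN"},
    "banana": {"NOUN"},
    "but": {"CONJ"},
}


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def candidate_tags(tokens):
    """Pair each token with every part of speech the toy lexicon allows."""
    return [(tok, sorted(LEXICON.get(tok, {"UNKNOWN"}))) for tok in tokens]


def ambiguous(tokens):
    """Return the tokens that have more than one possible part of speech."""
    return [tok for tok, tags in candidate_tags(tokens) if len(tags) > 1]


if __name__ == "__main__":
    sentence = "Time flies like an arrow, but fruit flies like a banana."
    tokens = tokenize(sentence)
    for tok, tags in candidate_tags(tokens):
        print(tok, tags)
    print("needs disambiguation:", ambiguous(tokens))
```

Picking the right tag for each ambiguous token is the actual disambiguation problem; the sketch stops at listing the candidates.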

A small list, off the top of my head. There's more :-), and many more details to each point. For example, when I say "parsing", what does that mean? There are many different parsing algorithms, and there are just as many parsing formalisms. Among the most powerful are Tree-adjoining grammar and Head-driven phrase structure grammar, but both of them are hardly used in the field (for now). Usually, you'll be dealing with some half-baked generative approach, and will have to conduct morphological analysis yourself.
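As one illustration of what a parsing algorithm does, here is a rough sketch of CKY chart parsing that counts the parses of "time flies like an arrow". The grammar and lexicon below are hypothetical toys in Chomsky normal form, nowhere near TAG or HPSG, and they deliberately leave out the imperative reading ("time" as a verb).

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),   # noun compound, e.g. "time flies" as a kind of fly
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: the categories each word may have.
LEXICAL_RULES = {
    "time": ["N", "NP"],
    "flies": ["N", "V"],
    "like": ["V", "P"],
    "an": ["Det"],
    "arrow": ["N"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees for `words` using a CKY chart."""
    n = len(words)
    # chart[(i, j)][symbol] = number of ways `symbol` can span words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for symbol in LEXICAL_RULES.get(word, []):
            chart[(i, i + 1)][symbol] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    left_count = chart[(i, k)].get(left, 0)
                    right_count = chart[(k, j)].get(right, 0)
                    if left_count and right_count:
                        chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)].get(start, 0)


if __name__ == "__main__":
    print(count_parses("time flies like an arrow".split()))
```

With this toy grammar the sentence gets two analyses: "time" as the subject of "flies", and "time flies" as a compound noun that is the subject of "like". Choosing between them is exactly the ambiguity problem from the list above.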

Going from there to semantics is a big step. A syntax/semantics interface depends on both the syntactic and the semantic framework employed, and there is no single working solution yet. On the semantic side, there's classic generative semantics, then there is Discourse Representation Theory, dynamic semantics, and many more. Even the logical formalism everything is based on is still not well-defined. Some say one should use first-order logic, but that hardly seems sufficient; then there is intensional logic, as used by Montague, but that seems overly complex and computationally infeasible. There is also dynamic logic (Groenendijk and Stokhof have pioneered this stuff. Great stuff!), and very recently, this summer actually, Jeroen Groenendijk presented a new formalism, Inquisitive Semantics, which is also very interesting.
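To give a feel for what a syntax/semantics interface looks like in the simplest first-order setting, here is a small sketch that uses Python closures in place of lambda terms. The model (`DOMAIN`, `GANGSTER`, `LOVES`) and all function names are invented for the example, and this is not how any particular framework or textbook implements it. Each word denotes a function, and combining constituents is just function application, evaluated against a finite model.

```python
# Hypothetical toy model: a domain of individuals plus predicate extensions.
DOMAIN = {"vincent", "jules", "mia"}
GANGSTER = {"vincent", "jules"}
LOVES = {("vincent", "mia"), ("mia", "vincent")}


def name(individual):
    """A proper name, type-raised: it takes a property and applies it."""
    return lambda scope: scope(individual)


def noun(extension):
    """A common noun denotes a property of individuals."""
    return lambda x: x in extension


def every(restriction):
    """Universal quantifier over the finite domain."""
    return lambda scope: all(scope(x) for x in DOMAIN if restriction(x))


def some(restriction):
    """Existential quantifier over the finite domain."""
    return lambda scope: any(scope(x) for x in DOMAIN if restriction(x))


def transitive(relation):
    """A transitive verb combines with its object first, then its subject."""
    return lambda obj: lambda subj: subj(lambda x: obj(lambda y: (x, y) in relation))


loves = transitive(LOVES)
gangster = noun(GANGSTER)

if __name__ == "__main__":
    print(loves(name("mia"))(name("vincent")))   # "Vincent loves Mia"
    print(loves(name("mia"))(every(gangster)))   # "Every gangster loves Mia"
    print(loves(name("mia"))(some(gangster)))    # "Some gangster loves Mia"
```

This only covers extensional first-order meanings over a finite model; intensionality, discourse and dynamic effects are where the frameworks mentioned above start to differ.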

If you want to get started on a very simple level, read Blackburn and Bos (2005). It's great stuff, and the de facto introduction to Computational Semantics! I recently extended their system to cover the partition theory of questions (question answering is a beast!), as proposed by Groenendijk and Stokhof (1982), but unfortunately, the theory has a complexity of O(n²) over the domain of individuals. While doing so, I found B&B's implementation to be a bit, erhm… hackish in places. Still, it is going to really, really help you dive into computational semantics, and it is still a very impressive showcase of what can be done. Also, they deserve extra cool-points for implementing a grammar that is set in Pulp Fiction (the movie).

And while I'm at it, pick up Prolog. A lot of research in computational semantics is based on Prolog. Learn Prolog Now! is a good intro. I can also recommend "The Art of Prolog" and Covington's "Prolog Programming in Depth" and "Natural Language Processing for Prolog Programmers", the former of Covington's two books being available for free online.

answered Oct 12 '22 12:10

Aleksandar Dimitrov

