Ideas for Natural Language Processing project? [closed]

Tags:

parsing

nlp

ocaml

I have to do a final project for my computational linguistics class. We've been using OCaml the entire time, but I also have familiarity with Java. We've studied morphology, FSMs, collecting parse trees, CYK parsing, tries, pushdown automata, regular expressions, formal language theory, some semantics, etc.

Here are some ideas I've come up with. Do you have anything you think would be cool?

1. A script that scans Facebook threads for obnoxious* comments and silently hides them with JS (this would be run with the user's consent, obviously)
2. An analysis of a piece of writing using semantics, syntax, punctuation usage, and other metrics, to try to "fingerprint" the author. It could be used to determine if two works are likely written by the same author. Or, someone could put in a bunch of writing he's done over time, and get a sense of how his style has changed.
3. A chat bot (less interesting/original)

I may be permitted to use pre-existing libraries to do this. Do any exist for OCaml? Without a library/toolkit, the above three ideas are probably infeasible, unless I limit it to a very specific domain.

Lower level ideas:

1. Operations on finite state machines - minimizing, composing transducers, proving that an FSM is minimal. I am very interested in graph theory, so any overlap with FSMs could be a good avenue to explore. (What else can I do with FSMs?)
2. Something cool with regex?
3. Something cool with CYK?

Does anyone else have any cool ideas?

*obnoxious defined as following certain patterns typical of junior high schoolers. The vagueness of this term is not an issue; for the credit I could define whatever I want and target that.

505

asked Nov 24 '09 22:11

Nick Heiner

1 Answer

1. Obnoxious language filtering - I think this will reduce to a process very similar to spam email filtering, that is, counting the frequency of a set of more-or-less 'obnoxious' words. It doesn't sound like you will get the scope to do anything particularly clever, unless you also use other sources of information (e.g. the structure of the social links shared between the sender and recipient, perhaps). On the other hand, online bullying is a very serious thing and you can bet Facebook/Myspace and the other social networking sites care a lot about tackling it.
2. Stylistic Analysis - There has been some work done on this in various forms, often under the name authorship analysis. Shlomo Argamon does a lot of work in this area and you could probably discover a lot more from the references in his papers. One of the best ways to profile an author is to learn the distribution of their usage of a set of stopwords (a.k.a. function words), such as 'and', 'but', 'if', etc. I think there's a lot more scope to do something new and interesting in this area - authorship analysis on internet data is a hard problem - but also a lot more scope to fail.
3. Chat bot - You're right, this is a pretty standard project. It's also quite hard to measure success/failure. I think the project would be more compelling if it were a chat bot with some kind of purpose, like answering questions in a limited domain, but that's something that's very difficult to do well.
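
To make the stopword idea from the stylistic analysis item concrete, here is a rough sketch in Python (the same structure should port to OCaml or Java). The word list, the function names `profile` and `cosine_similarity`, and the choice of cosine similarity as the comparison are all my own assumptions for illustration, not a method taken from the authorship analysis literature.

```python
from collections import Counter
import math
import re

# Illustrative function-word list (an assumption); a real study would use
# a much longer list.
FUNCTION_WORDS = ["and", "but", "if", "the", "of", "to", "in", "that"]

def profile(text):
    """Return the relative frequency of each function word in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(p, q):
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    sample_a = "If the plan works, and I think that it will, the rest is easy."
    sample_b = "But the point of the talk was to show that the idea is old."
    print(cosine_similarity(profile(sample_a), profile(sample_b)))
```

A score close to 1 means the two texts use these words in similar proportions. With samples this short the number means very little; you would want long texts and many more function words before reading anything into it.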

The rest are really too vague to make any comments on, sorry.

I don't know of any NLP libraries for OCaml; it's just not a particularly popular programming language. I do know of a machine learning library written in OCaml, called MEGAM, by Hal Daumé III, who is a very good NLP researcher, and it has been used for NLP tasks. I get the feeling, however, that figuring out MEGAM and using it for some NLP task might be too big a project to take on.

Some other ideas:

- Sentiment Analysis - A very trendy area of research. You could make this task as easy or hard as you like, from scoring a document as positive/negative to extracting specific topics and generating a sentiment score for each one.
- Coreference/Anaphora resolution - A difficult task but a very important one. Some approaches use a graph representation (each mention is a node, with an edge between two mentions if they co-refer) to enforce things like transitivity.
- Document Classification - You could try to train a system on the StackOverflow data set to suggest tags for a given question. It's a fairly well-known problem with some established techniques, but it's an interesting data set and has an obvious and useful real-world application. You could also see if you can find specific features of a question (word choice, length, formatting, punctuation, etc.) that cause it to be voted highly.
- Haiku Generation - Kind of a silly one, but I always thought it was an interesting idea. Syllable counting could be done with the CMU Pronouncing Dictionary. Should be a lot of fun, if not particularly useful.
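
For the haiku idea, syllable counting from the CMU Pronouncing Dictionary comes down to counting vowel phonemes, since the dictionary marks each vowel with a stress digit (0, 1 or 2). Below is a rough Python sketch. The small `PRONUNCIATIONS` table is hand-written in the dictionary's ARPAbet style purely for illustration (the stress marks may not match the real entries exactly), and the function names are my own; a real project would load the full dictionary file instead.

```python
import re

# Hand-written excerpt in CMUdict's ARPAbet style (an assumption for this
# sketch). Vowel phonemes end in a stress digit; consonants do not.
PRONUNCIATIONS = {
    "a": ["AH0"],
    "an": ["AE1", "N"],
    "old": ["OW1", "L", "D"],
    "silent": ["S", "AY1", "L", "AH0", "N", "T"],
    "pond": ["P", "AA1", "N", "D"],
    "frog": ["F", "R", "AA1", "G"],
    "jumps": ["JH", "AH1", "M", "P", "S"],
    "into": ["IH0", "N", "T", "UW1"],
    "the": ["DH", "AH0"],
    "splash": ["S", "P", "L", "AE1", "SH"],
    "silence": ["S", "AY1", "L", "AH0", "N", "S"],
    "again": ["AH0", "G", "EH1", "N"],
}

def syllables(word):
    """Count vowel phonemes for a word, or return None if it is unknown."""
    phonemes = PRONUNCIATIONS.get(word.lower())
    if phonemes is None:
        return None
    return sum(1 for p in phonemes if p[-1].isdigit())

def line_syllables(line):
    """Total syllables in a line, or None if any word is not in the table."""
    counts = [syllables(w) for w in re.findall(r"[A-Za-z']+", line)]
    if None in counts:
        return None
    return sum(counts)

def is_haiku(lines):
    """True if the lines have exactly 5, 7 and 5 syllables."""
    return [line_syllables(line) for line in lines] == [5, 7, 5]

if __name__ == "__main__":
    poem = ["An old silent pond",
            "A frog jumps into the pond",
            "Splash! Silence again"]
    print([line_syllables(line) for line in poem], is_haiku(poem))
```

A generator could use the same check as a filter: produce candidate lines however you like, and keep only the combinations that pass `is_haiku`. Words missing from the dictionary are the main practical problem, which is why the sketch returns `None` rather than guessing.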

110

answered Oct 12 '22 01:10

Stompchicken
