
Basic NLP in CoffeeScript or JavaScript -- Punkt tokenization, simple trained Bayes models -- where to start? [closed]

My current web-app project calls for a little NLP:

  • Tokenizing text into sentences, via Punkt or similar;
  • Breaking longer sentences down by subordinate clause (often it’s on commas, except when it’s not);
  • A Bayesian model fit for chunking paragraphs with an even feel: no orphans or widows, and minimal awkward splits (maybe).

... much of which is a childishly easy task if you’ve got NLTK — which I do, sort of: the app backend is Django on Tornado, so you’d think doing these things would be a non-issue.
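To make the first bullet concrete: here is a deliberately naive regex-based sentence splitter in plain JavaScript. It is nothing like Punkt (no trained abbreviation model), and it exists only to show the shape of the client-side problem, including exactly where the naive approach falls over:

```javascript
// Naive sentence splitter: break after ., ! or ? when followed by
// whitespace and a capital letter. This is NOT Punkt -- it has no notion
// of abbreviations, so "Dr. Smith" wrongly splits after "Dr.", which is
// precisely the failure mode a trained Punkt model exists to fix.
function splitSentences(text) {
  return text
    .split(/(?<=[.!?])\s+(?=[A-Z])/)
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; });
}
```

The lookbehind assertion requires a reasonably modern JavaScript engine; the broader point is that the easy 90% fits in a dozen lines, and the hard 10% (abbreviations, initials, ellipses) is where the trained model earns its keep.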

However, I’ve got to provide the user with interactive feedback, which is what necessitates the tokenizers, so I need to tokenize the data client-side.

Right now I actually am using NLTK, via a REST API call to a Tornado process that wraps the NLTK function and little else. At the moment, things like latency and concurrency are obviously suboptimal w/r/t this ad-hoc service, to put it politely. What I should be doing, I think, is getting my hands on CoffeeScript/JavaScript versions of these functions, if not reimplementing them myself.
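For what it’s worth, even with the ad-hoc service, the interactive-feedback latency problem can be partly tamed on the client. The sketch below assumes a hypothetical `/tokenize` endpoint returning a JSON array of sentences (the endpoint name and response shape are stand-ins for the Tornado wrapper, not its actual API); the interesting part is the sequence guard, which drops stale responses when the user keeps typing:

```javascript
// Wrap a fetch-like function so that only the most recent request's
// result is ever surfaced. Out-of-order responses (slow earlier requests
// landing after a newer one) resolve to null instead of clobbering the UI.
function makeTokenizer(fetchImpl) {
  var latest = 0;
  return function tokenize(text) {
    var seq = ++latest;
    return fetchImpl("/tokenize", { method: "POST", body: text })
      .then(function (res) { return res.json(); })
      .then(function (sentences) {
        // A newer request has been issued since this one; discard.
        return seq === latest ? sentences : null;
      });
  };
}
```

In the browser you would pass `window.fetch.bind(window)` as `fetchImpl`; injecting it also makes the guard trivially testable.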

And but so then, from what I’ve seen, JavaScript hasn’t been considered cool long enough to have accumulated the not-just-web-specific, general-purpose library smorgasbord one can find in C or Python (or even Erlang). NLTK of course is a standout project by anyone’s measure, but I only need a few percent of what it’s packing.

But so now I am at a crossroads — I have to double down on either:

  • The “learning scientific JavaScript technique fit for reimplementing algorithms I am Facebook friends with at best” plan, or:
  • The less interesting but more deterministically doable “settle for tokenizing over the wire, but overcompensate for the dearth of speed and programming interestingness — ensure a beachball-free UX by elevating a function call into a robustly performant paragon of web-scale service architecture, making Facebook look like Google+” option.

Or something else entirely. What should I do? That, to start things off, is my question. I’m open to solutions involving an atypical approach — as long as your recommendation is not distasteful (e.g. “use Silverlight”) and/or a time vortex (e.g. “get a computational linguistics PhD, you troglodyte”), I am game. Thank you in advance.

fish2000 asked Mar 15 '12



1 Answer

I think that, as you wrote in the comment, the amount of data needed for efficient algorithms to run will eventually prevent you from doing things client-side. Even basic processing requires lots of data (bigram/trigram frequencies, for instance), and symbolic approaches likewise need significant resources (grammar rules, dictionaries, etc.). In my experience, you can't run a good NLP process with less than 3–5 MB of data, which I think is too big for today's clients.
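To see why the data footprint balloons, consider the simplest statistical ingredient mentioned above, bigram counts: every distinct word pair in the training corpus becomes an entry you would have to ship to the client. A toy version of the counting step, just to illustrate the growth:

```javascript
// Count adjacent word pairs. On a real training corpus the number of
// distinct keys grows rapidly with vocabulary size, which is why even a
// modest trained model means megabytes of frequency data on the client.
function bigramCounts(tokens) {
  var counts = {};
  for (var i = 0; i < tokens.length - 1; i++) {
    var key = tokens[i] + " " + tokens[i + 1];
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}
```

Five tokens already yield three distinct bigrams; scale that to a corpus large enough to train a useful model and the table dwarfs a typical page payload.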

So I would do things over the wire. For that I would recommend an asynchronous/push approach, maybe using Faye or Socket.io. I'm sure you can achieve a fluid UX as long as the user isn't stuck while the client waits for the server to process the text.

Blacksad answered Sep 23 '22