 

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

I wondered how you would go about tokenizing strings in English (or other Western languages) if the whitespace were removed.

The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'.

In the novel, the Sheep Man is translated as saying things like:

"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."

So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.

What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?

Specifically, Python-wise, how would you structure a (forgiving) translation flow? I'm not asking for a complete answer, just how your thought process would go about breaking the problem down.

I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers. Thanks!

asked Oct 03 '10 by craigs


2 Answers

I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter, matching whole words. It works well, but there are numerous ambiguities ("asshit" can be "ass hit" or "as shit"). Resolving those ambiguities would require much more sophisticated grammatical analysis.
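A minimal sketch of this approach: a Python `set` gives the O(1) average-time membership test, and a recursive scan enumerates every way to split the input into known words, which makes the ambiguity visible. The tiny word list here is illustrative only; a real run would load a full English dictionary.

```python
# Toy dictionary; in practice, load /usr/share/dict/words or similar.
WORDS = {"a", "as", "ass", "hit", "shit", "like", "we", "said"}

def segmentations(text, words=WORDS):
    """Return every possible split of `text` into dictionary words."""
    if not text:
        return [[]]  # one valid (empty) segmentation of the empty string
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in words:
            # Keep this prefix, then segment the remainder recursively.
            for rest in segmentations(text[i:], words):
                results.append([prefix] + rest)
    return results

print(segmentations("asshit"))      # both readings come back
print(segmentations("likewesaid"))
```

Exhaustive enumeration like this is exponential in the worst case; memoizing on the remaining suffix (e.g. with `functools.lru_cache`) keeps it tractable for longer strings.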

answered Nov 07 '22 by JoshD


First of all, I think you need a dictionary of English words -- you could try methods that rely solely on statistical analysis, but I think a dictionary has a better chance of producing good results.

Once you have the words, you have two possible approaches:

You could categorize the words into grammatical categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches. I'm not familiar with techniques for loosening the grammar rules when there is no match, but I'm sure there must be some.
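A toy version of the grammar idea can be sketched without a real parser: tag each word with a category from a small lexicon and accept a candidate split only if the tag sequence matches a (very loose) pattern. Both the lexicon and the pattern below are made-up assumptions for illustration, not real English grammar.

```python
import re

# Hypothetical mini-lexicon mapping words to part-of-speech tags.
LEXICON = {"we": "PRON", "said": "VERB", "like": "ADV",
           "you": "PRON", "gotta": "VERB", "work": "VERB", "too": "ADV"}

# Accept e.g. optional adverb, pronoun, one or more verbs, optional adverb.
PATTERN = re.compile(r"(ADV )?PRON (VERB )+(ADV )?$")

def grammatical(words):
    """Check a candidate segmentation against the toy tag pattern."""
    tags = " ".join(LEXICON.get(w, "X") for w in words) + " "
    return bool(PATTERN.match(tags))

print(grammatical(["like", "we", "said"]))  # plausible word order
print(grammatical(["we", "like", "said"]))  # rejected by the pattern
```

A serious implementation would use a real context-free grammar and a chart parser rather than a regular expression over tags, but the filtering principle is the same: grammar rules rank or reject the segmentations the dictionary step produced.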

On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other -- getting a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
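The statistical idea can be sketched with plain unigram log-probabilities instead of full pair/triple tables: dynamic programming over prefixes picks the single most probable split. The word counts below are invented for illustration; real counts would come from a corpus.

```python
import math

# Made-up corpus counts; in practice, count words in a large text corpus.
COUNTS = {"a": 50, "as": 30, "ass": 5, "shit": 8, "hit": 20,
          "like": 40, "we": 60, "said": 35}
TOTAL = sum(COUNTS.values())

def log_prob(word):
    # Unknown words get a heavy penalty rather than zero probability.
    return math.log(COUNTS.get(word, 0.01) / TOTAL)

def best_segmentation(text):
    """best[i] holds the most probable split of text[:i] and its score."""
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - 10), i):  # cap candidate words at 10 chars
            score, words = best[j]
            w = text[j:i]
            candidates.append((score + log_prob(w), words + [w]))
        best.append(max(candidates))
    return best[-1][1]

print(best_segmentation("asshit"))  # picks the statistically likelier reading
```

With these toy counts, "as shit" beats "ass hit" because "as" is far more frequent than "ass"; swapping in bigram probabilities (the pairs mentioned above) would make the choice sensitive to context rather than raw frequency.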

I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weights attached to its rules. It's a rich field for experimenting.

answered Nov 07 '22 by Radomir Dopieralski