Java Stanford NLP: Spell checking

Question

I'm trying to measure the spelling accuracy of text samples using Stanford NLP. It's just a metric of the text, not a filter or anything, so if it's off by a bit that's fine, as long as the error is uniform.

My first idea was to check if the word is known by the lexicon:

private static LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");

@Analyze(weight=25, name="Spelling")
public double spelling() {
    int result = 0;

    for (List<? extends HasWord> list : sentences) {
        for (HasWord w : list) {
            // Count every token the parser's lexicon has not seen
            if (!lp.getLexicon().isKnown(w.word())) {
                System.out.format("misspelled: %s\n", w.word());
                result++;
            }
        }
    }

    // Cast to double so the per-sentence average is not truncated by integer division
    return (double) result / sentences.size();
}

However, this produces quite a lot of false positives:

misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus
misspelled: Camus
misspelled: foandf
misspelled: foandf
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: Camus
misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus

Any ideas on how to do this better?

Christopher Manning · Accepted Answer

Using the isKnown(String) method of the parser's lexicon as a spellchecker isn't a viable use of the parser. The method is correct: "false" means that the word was not seen (with the given capitalization) in the approximately 1 million words of text the parser was trained on. But 1 million words just isn't enough text to train a comprehensive spellchecker in a data-driven manner. People would typically use at least two orders of magnitude more text, and might well add some cleverness to handle capitalization. The parser includes some of this cleverness to handle words that were unseen in the training data, but this isn't reflected in what the isKnown(String) method returns.
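To illustrate the kind of approach hinted at above, here is a minimal sketch of a word-list lookup with a simple capitalization fallback. It does not use Stanford NLP at all: the class WordListSpelling and its methods are hypothetical names made up for this example, and it assumes you supply your own, much larger word list (the three-word list in main is only for demonstration).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Stanford NLP.
class WordListSpelling {
    private final Set<String> knownWords = new HashSet<>();

    // Assumption: the caller provides a large word list, e.g. loaded from a dictionary file.
    WordListSpelling(Iterable<String> words) {
        for (String w : words) {
            knownWords.add(w);
        }
    }

    // A token is known if it is listed exactly as written, or if its
    // lowercase form is listed (covers sentence-initial capitals).
    boolean isKnown(String word) {
        return knownWords.contains(word)
            || knownWords.contains(word.toLowerCase(Locale.ROOT));
    }

    // Fraction of tokens not found in the word list; 0.0 for empty input.
    double misspelledRatio(List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int unknown = 0;
        for (String t : tokens) {
            if (!isKnown(t)) {
                unknown++;
            }
        }
        return (double) unknown / tokens.size();
    }

    public static void main(String[] args) {
        WordListSpelling checker =
            new WordListSpelling(Arrays.asList("sincerity", "gods", "Sisyphus"));
        // Only "babby" is missing from the list, so this prints 0.25
        System.out.println(
            checker.misspelledRatio(Arrays.asList("Sincerity", "gods", "babby", "Sisyphus")));
    }
}
```

The lowercase fallback is deliberately one-directional: "Sincerity" matches the listed "sincerity", but a lowercase "sisyphus" would still be counted as unknown because only the capitalized proper noun is listed. Whether that is the right policy for a rough metric is a judgment call.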

Tags: java, nlp, spell-checking, stanford-nlp

Nick Heiner
