
How to determine if a string is an English sentence or code?

Tags: java, string, nlp

Consider the following two strings: the first is code, the second is an English sentence (a phrase, to be precise). How can I detect that the first one is code and the second is not?

1. for (int i = 0; i < b.size(); i++) {
2. do something in English (not necessary to be a sentence).

I'm thinking about counting special characters (such as "=", ";", "++", etc.) and comparing the count to some threshold. Are there any better ways to do this? Any Java libraries?

Note that the code may not be parsable, because it is not necessarily a complete method, statement, or expression.

My assumption is that English sentences are fairly regular: they most likely contain only punctuation such as ",", ".", "_", "(", ")", etc. They do not contain something like this: write("the whole lot of text");
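
The character-counting idea above can be sketched roughly as follows. This is only an illustration, not code from the original post: the class name `SymbolDensity`, the list of "code-like" characters, and the 0.1 threshold are all assumptions that would need tuning on real data.

```java
public final class SymbolDensity {
    // Assumed list of characters that are common in code but rare in prose.
    private static final String CODE_CHARS = "=;{}[]<>+*/&|!%^~\\";

    // Assumed cut-off; pick a real value by measuring labelled samples.
    private static final double THRESHOLD = 0.1;

    private SymbolDensity() { }

    /** Fraction of non-whitespace characters that are code-like symbols. */
    public static double density(String s) {
        int symbols = 0;
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                continue; // whitespace does not count either way
            }
            total++;
            if (CODE_CHARS.indexOf(c) >= 0) {
                symbols++;
            }
        }
        return total == 0 ? 0.0 : (double) symbols / total;
    }

    /** True if the symbol density exceeds the assumed threshold. */
    public static boolean looksLikeCode(String s) {
        return density(s) > THRESHOLD;
    }
}
```

With these assumed settings, the `for` loop from the question is classified as code and the English phrase is not. A line such as `write("the whole lot of text");` contains only one listed symbol, so a pure density check is likely to miss it.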

Ryan asked Oct 21 '14 03:10



4 Answers

You can try the OpenNLP sentence parser. It returns the n best parses for a sentence. For most English sentences it returns at least one. I believe that for most code snippets it won't return any, so you can be fairly sure such input is not an English sentence.

Use this code for parsing:

    // Uses opennlp.tools.sentdetect.SentenceDetectorME, opennlp.tools.parser.Parser,
    // opennlp.tools.parser.Parse and opennlp.tools.cmdline.parser.ParserTool.
    // 'essay' is the input text to classify.

    // Initialize the sentence detector
    final SentenceDetectorME sdetector = EasyParserUtils
            .getOpenNLPSentDetector(Constants.SENTENCE_DETECTOR_DATA);

    // Initialize the parser
    final Parser parser = EasyParserUtils
            .getOpenNLPParser(Constants.PARSER_DATA_LOC);

    // Get sentences of the text
    final String[] sentences = sdetector.sentDetect(essay);

    // Go through the sentences and parse each
    for (final String sentence : sentences) {
        // Parse the sentence, requesting up to 10 parses
        final Parse[] parses = ParserTool.parseLine(sentence, parser, 10);
        if (parses.length == 0) {
            // Most probably this is code
        }
        else {
            // An English sentence
        }
    }

and these are the two helper methods (from EasyParserUtils) used in the code:

    // Uses opennlp.tools.parser.ParserModel and opennlp.tools.parser.ParserFactory.
    // Returns null if the model file cannot be read.
    public static Parser getOpenNLPParser(final String parserDataURL) {
        try (final InputStream isParser = new FileInputStream(parserDataURL)) {
            // Get model for the parser and initialize it
            final ParserModel parserModel = new ParserModel(isParser);
            return ParserFactory.create(parserModel);
        }
        catch (final IOException e) {
            e.printStackTrace();
            return null;
        }
    }

and

    // Uses opennlp.tools.sentdetect.SentenceModel.
    // Returns null if the model file cannot be read.
    public static SentenceDetectorME getOpenNLPSentDetector(
            final String sentDetDataURL) {
        try (final InputStream isSent = new FileInputStream(sentDetDataURL)) {
            // Get model for the sentence detector and initialize it
            final SentenceModel sentDetModel = new SentenceModel(isSent);
            return new SentenceDetectorME(sentDetModel);
        }
        catch (final IOException e) {
            e.printStackTrace();
            return null;
        }
    }

Augustin answered Sep 30 '22 12:09


Look into lexical analysis and parsing (same as if you were writing a compiler). You might not even need a parser if you're not requiring full statements.

Platinum Azure answered Sep 30 '22 12:09


The basic idea is to convert the string to a sequence of tokens. For example, the code line above may become "KEY,SEPARATOR,ID,ASSIGN,NUMBER,SEPARATOR,...". We can then use simple rules to separate code from English.

check out the code here
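
As a rough illustration of the token-based idea (this is not the linked code, which is not reproduced here), the sketch below maps a line to token types and applies one simple rule. The class name `TokenClassifier`, the keyword list, the operator set, and the rule itself are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class TokenClassifier {
    // Assumed, deliberately small keyword list.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "for", "while", "if", "else", "int", "return", "new", "class", "void"));

    // Multi-character operators are listed before "=" so "==" is not split.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<word>[A-Za-z_$][A-Za-z\\d_$]*)"
            + "|(?<number>\\d+(?:\\.\\d+)?)"
            + "|(?<op>\\+\\+|--|==|!=|<=|>=|&&|\\|\\||[+*<>])"
            + "|(?<assign>=)"
            + "|(?<sep>[;,.(){}\\[\\]])");

    private TokenClassifier() { }

    /** Maps a line to token types; unrecognized characters are skipped. */
    public static List<String> tokenize(String line) {
        List<String> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group("word") != null) {
                types.add(KEYWORDS.contains(m.group("word")) ? "KEY" : "ID");
            } else if (m.group("number") != null) {
                types.add("NUMBER");
            } else if (m.group("op") != null) {
                types.add("OPERATOR");
            } else if (m.group("assign") != null) {
                types.add("ASSIGN");
            } else {
                types.add("SEPARATOR");
            }
        }
        return types;
    }

    /** Assumed rule: any assignment or operator token marks the line as code. */
    public static boolean looksLikeCode(String line) {
        List<String> types = tokenize(line);
        return types.contains("ASSIGN") || types.contains("OPERATOR");
    }
}
```

The rule ignores `KEY` tokens on purpose, because words such as "for" and "if" are also ordinary English. Richer rules could look at pairs of adjacent token types instead of single tokens.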

user2250367 answered Sep 30 '22 14:09


You could use a Java parser, or create one from the BNF grammar, but the issue is that you said the code may not be parsable, so the parser will fail.

My advice: use custom regular expressions to detect patterns that are specific to code. Use as many as possible to get a good success rate.

Some examples :

  • for\s*\( (for loop)
  • while\s*\( (while loop)
  • [a-zA-Z_$][a-zA-Z\d_$]*\s*\( (method or constructor call)
  • \)\s*\{ (beginning of a block / method)
  • ...

Yes, it's a long shot, but given what you want, you don't have many options.
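
One possible way to combine the patterns listed above is to count how many distinct patterns match and require a minimum number of hits. This is only a sketch: the class name `CodePatternMatcher` and the cut-off of two hits are assumptions, and the pattern list is limited to the four examples given.

```java
import java.util.regex.Pattern;

public final class CodePatternMatcher {
    // The four example patterns from the list above.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("for\\s*\\("),                        // for loop
        Pattern.compile("while\\s*\\("),                      // while loop
        Pattern.compile("[a-zA-Z_$][a-zA-Z\\d_$]*\\s*\\("),   // call
        Pattern.compile("\\)\\s*\\{"),                        // block start
    };

    // Assumed cut-off: at least two distinct patterns must match.
    private static final int MIN_HITS = 2;

    private CodePatternMatcher() { }

    /** Number of distinct patterns found anywhere in the string. */
    public static int countMatches(String s) {
        int hits = 0;
        for (Pattern p : PATTERNS) {
            if (p.matcher(s).find()) {
                hits++;
            }
        }
        return hits;
    }

    /** True if enough distinct patterns match. */
    public static boolean looksLikeCode(String s) {
        return countMatches(s) >= MIN_HITS;
    }
}
```

The identifier-followed-by-parenthesis pattern also matches prose such as "English (not", which is why this sketch does not treat a single hit as sufficient. The trade-off is that a lone call like `write("...");` falls below the cut-off, so adding more patterns or weighting them would be needed in practice.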

ToYonos answered Sep 30 '22 12:09