Consider the following two strings: the first is code and the second is an English sentence (a phrase, to be precise). How can I detect that the first one is code and the second is not?
1. for (int i = 0; i < b.size(); i++) {
2. do something in English (not necessary to be a sentence).
I'm thinking about counting special characters (such as "=", ";", "++", etc.) and comparing the count to some threshold. Are there any better ways to do this? Are there any Java libraries for it?
Note that the code may not be parsable, because it may not be a complete method, statement, or expression.
My assumption is that English sentences are pretty regular: they most likely contain only punctuation such as ",", ".", "_", "(", ")", etc. They do not contain something like this: write("the whole lot of text");
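The character-counting idea from the question can be sketched as follows. This is only a rough illustration: the class name SymbolDensity, the particular set of symbols, and the 0.10 threshold are all assumptions made for the example, not tuned values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SymbolDensity {
    // Operators and characters that are common in code but rare in English prose.
    // This set is an illustrative guess, not a tuned list.
    private static final Pattern CODE_SYMBOL =
            Pattern.compile("\\+\\+|--|==|!=|<=|>=|&&|\\|\\||[=;{}\\[\\]<>]");

    // Hypothetical threshold: share of non-whitespace characters that are code symbols.
    static final double THRESHOLD = 0.10;

    // Returns the fraction of non-whitespace characters covered by code symbols.
    static double density(String line) {
        int nonSpace = line.replaceAll("\\s+", "").length();
        if (nonSpace == 0) {
            return 0.0;
        }
        Matcher m = CODE_SYMBOL.matcher(line);
        int symbolChars = 0;
        while (m.find()) {
            symbolChars += m.group().length();
        }
        return (double) symbolChars / nonSpace;
    }

    static boolean looksLikeCode(String line) {
        return density(line) >= THRESHOLD;
    }
}
```

With these assumed values the for-loop line above scores well over the threshold and the English phrase scores zero. A short call such as write("the whole lot of text"); contains only a single ";" from this symbol set, so it would fall under the threshold, which suggests that counting alone is probably not enough.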
You can use the contains(), indexOf() and lastIndexOf() methods to check whether one String contains another String in Java. If a String contains another String, the contained String is known as a substring. The indexOf() method accepts a String and returns the starting position of that string if it exists; otherwise, it returns -1.
If you have the string: String sample = "If you know what's good for you, you'll shut the door!"; and you want to find where a substring occurs in it, you can use the indexOf method. Any result other than -1 means the substring has been located.
We can use the regex ^[a-zA-Z]*$ to check whether a string consists only of letters. This can be done using the matches() method of the String class, which tells whether the whole string matches the given regex.
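A small sketch of the string checks described above, using only standard String methods. The class and method names (SubstringChecks, position, hasSubstring, lettersOnly) are made up for the illustration.

```java
final class SubstringChecks {
    // Returns the index of the first occurrence of part in text, or -1 if it is absent.
    static int position(String text, String part) {
        return text.indexOf(part);
    }

    // contains() is equivalent to checking that indexOf() did not return -1.
    static boolean hasSubstring(String text, String part) {
        return text.contains(part);
    }

    // matches() tests the whole string against the regex, so a space or digit makes it fail.
    static boolean lettersOnly(String text) {
        return text.matches("^[a-zA-Z]*$");
    }
}
```

Note that the letters-only regex rejects anything containing a space, so a multi-word English phrase fails it just as a line of code does. On its own it does not separate the two cases in the question.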
You can try the OpenNLP sentence parser. It returns the n best parses for a sentence. For most English sentences it returns at least one. I believe that for most code snippets it won't return any, so in that case you can be fairly sure the input is not an English sentence.
Use this code for parsing:
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.sentdetect.SentenceDetectorME;

// Initialize the sentence detector
final SentenceDetectorME sdetector = EasyParserUtils
        .getOpenNLPSentDetector(Constants.SENTENCE_DETECTOR_DATA);
// Initialize the parser
final Parser parser = EasyParserUtils
        .getOpenNLPParser(Constants.PARSER_DATA_LOC);
// Get the sentences of the text (essay is the input String)
final String[] sentences = sdetector.sentDetect(essay);
// Go through the sentences and parse each
for (final String sentence : sentences) {
    // Parse the sentence, requesting up to 10 parses
    final Parse[] parses = ParserTool.parseLine(sentence, parser, 10);
    if (parses.length == 0) {
        // Most probably this is code
    }
    else {
        // An English sentence
    }
}
and these are the two helper methods (from EasyParserUtils) used in the code:
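// Imports needed by both helper methods
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public static Parser getOpenNLPParser(final String parserDataURL) {
    try (final InputStream isParser = new FileInputStream(parserDataURL)) {
        // Get model for the parser and initialize it
        final ParserModel parserModel = new ParserModel(isParser);
        return ParserFactory.create(parserModel);
    }
    catch (final IOException e) {
        e.printStackTrace();
        return null;
    }
}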
and
public static SentenceDetectorME getOpenNLPSentDetector(
        final String sentDetDataURL) {
    try (final InputStream isSent = new FileInputStream(sentDetDataURL)) {
        // Get model for the sentence detector and initialize it
        final SentenceModel sentDetModel = new SentenceModel(isSent);
        return new SentenceDetectorME(sentDetModel);
    }
    catch (final IOException e) {
        e.printStackTrace();
        return null;
    }
}
Look into lexical analysis and parsing (same as if you were writing a compiler). You might not even need a parser if you're not requiring full statements.
The basic idea is to convert the string to a sequence of tokens. For example, the code line above may become "KEY,SEPARATOR,ID,ASSIGN,NUMBER,SEPARATOR,...". We can then use simple rules to separate code from English.
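A minimal sketch of that tokenizing idea, assuming Java 9+ for Set.of. The class name RoughLexer, the token classes, the keyword subset, and the "at least two operator tokens" rule are all assumptions chosen for illustration; a real lexer would need a fuller grammar and better rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RoughLexer {
    // A small, illustrative subset of Java keywords (not the full list).
    private static final Set<String> KEYWORDS = Set.of(
            "for", "while", "if", "else", "int", "return", "new", "class", "void");

    // One alternative per token class; characters matching none of them are skipped.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<STRING>\"[^\"]*\")"
            + "|(?<NUMBER>\\d+(?:\\.\\d+)?)"
            + "|(?<WORD>[A-Za-z_$][A-Za-z\\d_$]*)"
            + "|(?<OP>\\+\\+|--|==|!=|<=|>=|&&|\\|\\||[+\\-*/%<>!])"
            + "|(?<ASSIGN>=)"
            + "|(?<SEPARATOR>[;,.(){}\\[\\]])");

    // Converts a line into a list of token class names, in order of appearance.
    static List<String> tokenTypes(String line) {
        List<String> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String word = m.group("WORD");
            if (word != null) {
                types.add(KEYWORDS.contains(word) ? "KEY" : "ID");
            } else if (m.group("STRING") != null) {
                types.add("STRING");
            } else if (m.group("NUMBER") != null) {
                types.add("NUMBER");
            } else if (m.group("OP") != null) {
                types.add("OP");
            } else if (m.group("ASSIGN") != null) {
                types.add("ASSIGN");
            } else {
                types.add("SEPARATOR");
            }
        }
        return types;
    }

    // Hypothetical rule: two or more operator/assignment tokens suggest code.
    static boolean looksLikeCode(String line) {
        long operators = tokenTypes(line).stream()
                .filter(t -> t.equals("OP") || t.equals("ASSIGN"))
                .count();
        return operators >= 2;
    }
}
```

The threshold of two is there because a single hyphen or exclamation mark in prose would otherwise count as an operator. A line with no operators at all, such as a bare method call, is not caught by this rule and would need an additional one.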
check out the code here
You could use a Java parser or create one from the BNF, but the issue is that, as you said, the code may not be parsable, so the parser would fail.
My advice: use some custom regexes to detect special patterns in the code. Use as many as possible to get a good success rate.
Some examples:
for\s*\( (for loop)
while\s*\( (while loop)
[a-zA-Z_$][a-zA-Z\d_$]*\s*\( (method or constructor call)
\)\s*\{ (beginning of a block / method)
Yes, it's a long shot, but given what you want, you don't have many possibilities.
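The regex approach could be wired together roughly like this. The class name RegexCodeDetector and the "any single match means code" rule are assumptions for the sketch. The call pattern here deliberately drops the \s* before the parenthesis, because with it the pattern also matches ordinary prose such as "English (not"; that tightening is a choice made for this example, and it misses calls written with a space before the parenthesis.

```java
import java.util.List;
import java.util.regex.Pattern;

final class RegexCodeDetector {
    private static final List<Pattern> CODE_PATTERNS = List.of(
            Pattern.compile("\\bfor\\s*\\("),               // for loop
            Pattern.compile("\\bwhile\\s*\\("),             // while loop
            Pattern.compile("[a-zA-Z_$][a-zA-Z\\d_$]*\\("), // call: identifier directly followed by "("
            Pattern.compile("\\)\\s*\\{"));                 // beginning of a block / method

    // Counts how many of the patterns occur somewhere in the line.
    static long matchCount(String line) {
        return CODE_PATTERNS.stream()
                .filter(p -> p.matcher(line).find())
                .count();
    }

    // Hypothetical rule: a single matching pattern is enough to call the line code.
    static boolean looksLikeCode(String line) {
        return matchCount(line) >= 1;
    }
}
```

Unlike a pure symbol count, this also flags a bare call such as write("the whole lot of text"); because the identifier is directly followed by "(".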