Any idea how to detect source code (Java, C#, SQL and so on) in a text file using Java, without looking at the file extension or writing an extraordinarily long, self-made regular expression?
Maybe there are some tools doing this work already?
Linguist
We use this library at GitHub to detect blob languages, highlight code, ignore binary files, suppress generated files in diffs and generate language breakdown graphs.
Unfortunately it is written in Ruby. Maybe JRuby can handle it?
You should find a minimal set of keywords and define some logical rules. If you define the right rules, the regular expressions built from them will not be extraordinarily big. Note that the fewer keywords and rules you have, the higher the probability of a mistake: either a false positive (SourceCode = true for a file that is not source code) or a false negative (SourceCode = false for a file that is source code). Conversely, the more keywords and rules you have, the more time is needed to check whether a file is source code or not.
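A minimal sketch of this keyword-and-rules idea in Java might look like the following. The class name KeywordSourceDetector, the specific patterns, and the MIN_HITS threshold are all assumptions chosen for illustration, not a tested rule set; a real detector would need more languages and rules tuned against real files.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical keyword-based detector; the rules are illustrative only. */
public class KeywordSourceDetector {

    // A few short patterns per language instead of one huge expression.
    private static final Map<String, List<Pattern>> RULES = new LinkedHashMap<>();

    static {
        RULES.put("Java", List.of(
                Pattern.compile("\\bpublic\\s+(final\\s+|abstract\\s+)?class\\s+\\w+"),
                Pattern.compile("\\bimport\\s+java[\\w.]*;"),
                Pattern.compile("\\bSystem\\.out\\.println\\s*\\(")));
        RULES.put("C#", List.of(
                Pattern.compile("\\busing\\s+System[\\w.]*;"),
                Pattern.compile("\\bnamespace\\s+[\\w.]+"),
                Pattern.compile("\\bConsole\\.WriteLine\\s*\\(")));
        RULES.put("SQL", List.of(
                Pattern.compile("(?i)\\bselect\\b[\\s\\S]+?\\bfrom\\b"),
                Pattern.compile("(?i)\\binsert\\s+into\\b"),
                Pattern.compile("(?i)\\bcreate\\s+table\\b")));
    }

    // Assumed threshold: requiring two distinct rule hits reduces false
    // positives (plain prose rarely matches two rules of one language).
    private static final int MIN_HITS = 2;

    /** Returns the best-matching language, or empty if no language reaches MIN_HITS. */
    public static Optional<String> detect(String text) {
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, List<Pattern>> entry : RULES.entrySet()) {
            int hits = 0;
            for (Pattern pattern : entry.getValue()) {
                if (pattern.matcher(text).find()) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return bestHits >= MIN_HITS ? Optional.of(best) : Optional.empty();
    }
}
```

Calling KeywordSourceDetector.detect(text) on the contents of a file returns the language whose rules matched most often. The trade-off described above shows up directly: lowering MIN_HITS to 1 would classify an ordinary sentence containing "select ... from" as SQL, while adding more rules per language makes each check slower.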