Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detecting programming language from a snippet [closed]

People also ask

How do you identify what programming language was used?

Go to command line and run bundle install (must be in your application's directory). Then run linguist app/test. tpl (more generally linguist <path-to-code-snippet-file> ). It will tell you the type, mime type, and language.

Is there a website that can recognize and identify what programming language is being input pasted )?

There are many websites that can do this, but I think the best one (accuracy-wise) is Github.

What are closed source languages?

Closed source software refers to the computer software which source code is closes means public is not given access to the source code. In short it is referred as CSS. In closed source software the source code is protected. The only individual or organization who has created the software can only change it.

What is language snippet?

Language snippet scope#Single-language user-defined snippets are defined in a specific language's snippet file (for example javascript. json ), which you can access by language identifier through Preferences: Configure User Snippets. A snippet is only accessible when editing the language for which it is defined.


I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurences of these words with known snippets, and compute the probability that this snippet is written in language X for every language you're interested in.

http://en.wikipedia.org/wiki/Bayesian_spam_filtering

If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.

I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:

print "Hello"

Let me find the code.

I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:

def foo
   puts "hi"
end

is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen 1000x def in Python and 100x def in Ruby then it may still say Python even though puts and end is Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).

I hope it helps you:

class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1)
  end

  def words(code)
    code.split(/[^a-z]/).reject{|w| w.empty?}
  end

  def train(code,lang)
    @totals[lang] += 1
    @data[lang] ||= Hash.new(1)
    words(code).each {|w| @data[lang][w] += 1 }
  end

  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # We really want to multiply here but I use logs 
      # to avoid floating point underflow
      # (adding logs is equivalent to multiplication)
      Math.log(@totals[lang]) +
      ws.map{|w| Math.log(@data[lang][w])}.reduce(:+)
    end
  end
end

# Example usage

c = Classifier.new

# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)

# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)

Language detection solved by others:

Ohloh's approach: https://github.com/blackducksw/ohcount/

Github's approach: https://github.com/github/linguist


Guesslang is a possible solution:

http://guesslang.readthedocs.io/en/latest/index.html

There's also SourceClassifier:

https://github.com/chrislo/sourceclassifier/tree/master

I became interested in this problem after finding some code in a blog article which I couldn't identify. Adding this answer since this question was the first search hit for "identify programming language".


An alternative is to use highlight.js, which performs syntax highlighting but uses the success-rate of the highlighting process to identify the language. In principle, any syntax highlighter codebase could be used in the same way, but the nice thing about highlight.js is that language detection is considered a feature and is used for testing purposes.

UPDATE: I tried this and it didn't work that well. Compressed JavaScript completely confused it, i.e. the tokenizer is whitespace sensitive. Generally, just counting highlight hits does not seem very reliable. A stronger parser, or perhaps unmatched section counts, might work better.


First, I would try to find the specific keyworks of a language e.g.

"package, class, implements "=> JAVA
"<?php " => PHP
"include main fopen strcmp stdout "=>C
"cout"=> C++
etc...

It's very hard and sometimes impossible. Which language is this short snippet from?

int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
    j = j + 1000 / i;
    k = k + i * j;
}

(Hint: It could be any one out of several.)

You can try to analyze various languages and try to decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text it's likely that the language is Java etc. But I don't think you will get anything that is completely fool proof, as you could name for example a variable in C the same name as a keyword in Java, and the frequency analysis will be fooled.

If you take it up a notch in complexity you could look for structures, if a certain keyword always comes after another one, that will get you more clues. But it will also be much harder to design and implement.