lxndryng - a blog by a millennial with a job in IT

Mar 03, 2017

Building a Naive Bayesian Programming Language Classifier

If this post is useful to you, I'd greatly appreciate you giving me a tip over at PayPal or giving DigitalOcean's hosting services a try - you'll get 10 USD of credit for nothing

GitHub's Linguist is a very capable Ruby project for classifying the programming language(s) of a given file or repository, but it struggles when there's no file extension to give an initial hint as to which language is being used: without that hint, many of Linguist's clever heuristics can't be applied to the source code at all. As part of a project I'm working on at the moment, I have around 32,000 code snippets with no file extension information that I'd like to classify, with the added complication that some of these snippets may not be in a programming language at all - they could be natural language, or just encrypted or encoded pieces of text. Applying the Pythonic "if it quacks like a duck, it's a duck" approach, a naive Bayesian classifier - one that simply asks whether a snippet looks like something we've already seen in a given language - seems like it should work well enough.

So why a Bayesian method?

In the main: I'm lazy and not a particularly mathematically inclined person. I also wrote half of a dissertation on Bayesian methods as applied to the scientific method, so I've got enough previous in this space to at least pretend I have some background in the field. On top of that, a naive Bayesian classifier lets us make the simplifying assumption that each piece of evidence occurs independently of every other piece. That leaves us with a fairly simple equation for the probability of a given programming language given the elements of language present in a code snippet.

P(Language|Snippet n-grams) = P(Language) * P(Snippet n-gram(1)|Language) * P(Snippet n-gram(2)|Language) ... P(Snippet n-gram(n)|Language)
                              -------------------------------------------------------------------------------------------------------------
                                                P(Snippet n-gram(1)) * P(Snippet n-gram(2)) ... P(Snippet n-gram(n))

The products here end up being vanishingly small numbers - small enough to cause floating-point underflow. To avoid this, we can take the natural logarithms of the probabilities on the right-hand side and add them rather than multiplying.
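As a rough Python sketch of that scoring (the names and the additive smoothing here are my own illustration rather than code lifted from the project; the prior defaults to being ignored, which amounts to treating all languages as equally likely up front):

import math

def score_language(snippet_ngrams, language_counts, total_counts,
                   language_prior=1.0, alpha=1.0):
    """Sum log probabilities for one language rather than multiplying raw probabilities."""
    vocabulary_size = len(total_counts) or 1
    language_total = sum(language_counts.values())
    overall_total = sum(total_counts.values())
    # log(1) == 0, so the default prior contributes nothing to the score
    score = math.log(language_prior)
    for gram in snippet_ngrams:
        # P(gram | language), with additive smoothing so unseen n-grams
        # don't send the score to negative infinity
        p_gram_given_language = (language_counts.get(gram, 0) + alpha) / \
                                (language_total + alpha * vocabulary_size)
        # P(gram) across all of the training data, smoothed the same way
        p_gram = (total_counts.get(gram, 0) + alpha) / \
                 (overall_total + alpha * vocabulary_size)
        # log(a * b) == log(a) + log(b), so we add rather than multiply
        score += math.log(p_gram_given_language) - math.log(p_gram)
    return score

The scores stay negative, but they're still comparable across languages: the least negative one wins.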

How do we identify languages?

Linguist has six so-called "strategies" for identifying programming languages: the presence of Emacs/Vim modelines in a file, the filename, the file extension, the presence of a shebang (eg, #!/bin/bash) in a given file, some predefined heuristics, and a Bayesian classifier of its own - though with no persistence of the training data across runs of the tool. In this approach, we'll only be implementing the classifier, supplemented with heuristic-like methods to improve the model's ability to accurately identify certain languages.

The first element of the classification model will be based upon n-grams, where n will be between 1 and 4. I want to be able to classify on the basis of single keywords (eg, puts in Ruby), as well as strings of words (eg, the public static void main method signature in Java).

At the core of this is a very basic tokeniser: it only needs to produce enough tokens for us to build the n-grams from which the language of a code snippet can be inferred.
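Something along the following lines is the general idea - a crude regex split into identifiers and individual punctuation characters, plus a sliding window over the result to build the n-grams (treat this as a sketch rather than the exact code in the project):

import re

# Identifiers/keywords, or any single punctuation character; numeric literals are ignored
TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[^\sA-Za-z0-9_]")

def tokenise(source):
    """Split source code into identifiers/keywords and single punctuation characters."""
    return TOKEN_PATTERN.findall(source)

def ngrams(tokens, max_n=4):
    """Yield every n-gram of the token stream for n between 1 and max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

With max_n set to 4, the public static void main signature comes out both as a single 4-gram and as its constituent shorter grams, which is exactly the sort of feature we want the classifier to latch onto.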

A simple improvement on this would be to remove anything that would add plaintext to the mix: comments, docstrings and the like. As I said above, I'm not really concerned with 100% accuracy: something that quacks like a duck might be enough for us to say that it's a duck here.
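A deliberately naive version of that stripping might look like this - it will happily mangle comment markers that appear inside string literals, but that's in keeping with the quacks-like-a-duck spirit:

import re

COMMENT_PATTERNS = [
    re.compile(r"/\*.*?\*/", re.DOTALL),   # C-style block comments
    re.compile(r"//[^\n]*"),               # C++/Java/C# line comments
    re.compile(r'""".*?"""', re.DOTALL),   # Python docstrings
    re.compile(r"#[^\n]*"),                # Python/Ruby line comments
]

def strip_comments(source):
    """Crudely remove comments and docstrings so plaintext doesn't pollute the n-grams."""
    for pattern in COMMENT_PATTERNS:
        source = pattern.sub(" ", source)
    return source

Even this needs per-language care: a blanket # rule would also eat C preprocessor directives like #include, which are some of the strongest C and C++ signals available.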

Languages I have to deal with

From a cursory look at the 32,000 snippets, I know that I definitely have to be able to identify and distinguish between Python, Ruby, C#, C, C++, x86 assembly (I think - this could be a rabbit hole and a half to go down) and Java. We can reasonably expect that differentiating between Java and C#, and between C and C++, will be painful and error-prone until the model is refined, given how similar those languages are to one another.

To start, I will just be attempting to demonstrate that my broad approach works with at least Python, Ruby, Assembly, C# and Java before looking to incorporate more languages.

Persistence of the probability data

With a Bayesian approach, we need to be able to refer to a trained model - the probabilities of given features for each programming language - in order to predict which language a given snippet is written in. That means storing the probability information somewhere. For the sake of simplicity, I'll be doing this in MariaDB, with the basic schema below:

DROP DATABASE IF EXISTS bayesian_training;
CREATE DATABASE bayesian_training;
USE bayesian_training;
CREATE TABLE grams(
    id INT AUTO_INCREMENT PRIMARY KEY,
    gram VARCHAR(100) UNIQUE NOT NULL
);
CREATE TABLE languages(
    id INT AUTO_INCREMENT PRIMARY KEY,
    language VARCHAR(20) UNIQUE NOT NULL
);
CREATE TABLE occurences(
    gram_id INT NOT NULL,
    language_id INT NOT NULL,
    number INT NOT NULL,
    PRIMARY KEY(gram_id, language_id),
    FOREIGN KEY(gram_id) REFERENCES grams(id),
    FOREIGN KEY(language_id) REFERENCES languages(id)
);
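Updating the counts during training then amounts to an upsert against the occurences table. A minimal sketch using MySQL Connector/Python (the connection handling, credentials and helper name are my own illustration, not code from the project):

import mysql.connector

def record_occurrences(conn, language, gram_counts):
    """Upsert the n-gram counts for one language into the schema above."""
    cursor = conn.cursor()
    cursor.execute("INSERT IGNORE INTO languages (language) VALUES (%s)", (language,))
    # Grams longer than the VARCHAR(100) column would need truncating first
    cursor.executemany("INSERT IGNORE INTO grams (gram) VALUES (%s)",
                       [(gram,) for gram in gram_counts])
    cursor.executemany(
        """
        INSERT INTO occurences (gram_id, language_id, number)
        VALUES ((SELECT id FROM grams WHERE gram = %s),
                (SELECT id FROM languages WHERE language = %s),
                %s)
        ON DUPLICATE KEY UPDATE number = number + VALUES(number)
        """,
        [(gram, language, count) for gram, count in gram_counts.items()])
    conn.commit()

# Usage - credentials here are placeholders:
# conn = mysql.connector.connect(user="bayes", password="secret",
#                                database="bayesian_training")
# record_occurrences(conn, "python", {"def": 1021, "import os": 148})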

Training of the model

To train the model, I used the following codebases:

  • Python
    • Django
    • Twisted
  • Ruby
    • Sinatra
    • Discourse
  • Java
    • Jenkins
    • Lucene and Solr
  • Assembly
    • BareMetalOS
    • Floppy Bird
  • C#
    • GraphEngine
    • Json.Net
    • wtrace

In the context of real-world code, these are pretty small samples to be going on, but they should hopefully be enough to get a working system together.
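Training is then just a matter of checking out each codebase, walking its files and bumping the counts for that language. A sketch, reusing the hypothetical helpers from the earlier snippets (the real thing would also filter by file extension so that, say, Django's HTML templates and documentation don't end up counted as Python):

import os

def train_on_codebase(conn, language, root):
    """Walk a checked-out codebase and record its n-gram counts against one language."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    source = handle.read()
            except OSError:
                continue
            for gram in ngrams(tokenise(strip_comments(source)), max_n=4):
                counts[gram] = counts.get(gram, 0) + 1
    record_occurrences(conn, language, counts)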

How effective was our initial model?

In order to test how well we did, I ran the following files against the model: linguist.rb, ZipFile.cs, flask/app.py and tetros.asm.

The results:

linguist.rb: [(7953, 'asm', -6416.136889371387), (8002, 'c#', -6630.869975742312), (3931, 'java', -6849.643512121844), (1, 'python', -6302.470348917564), (1763, 'ruby', -5991.090879392727)]
ZipFile.cs: [(7953, 'asm', -164549.47730156567), (8002, 'c#', -144878.96700648475), (3931, 'java', -152243.66607448383), (1, 'python', -158673.75993403912), (1763, 'ruby', -159188.1657594956)]
flask/app.py: [(7953, 'asm', -189603.1365282128), (8002, 'c#', -195084.66248479518), (3931, 'java', -196401.08214636377), (1, 'python', -171435.95779745755), (1763, 'ruby', -183980.2802635695)]
tetros.asm: [(7953, 'asm', -17961.240272497354), (8002, 'c#', -28535.4183269716), (3931, 'java', -28894.289472605506), (1, 'python', -28315.088969821816), (1763, 'ruby', -27569.732161692715)]

For all of the tested files, the language with the maximum log score is the language we knew it to be: we're at least getting the right answers, for the most part.
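Picking the winner from one of those result rows is just an argmax over the log score - the last element of each tuple as printed above:

def best_language(results):
    """Each row is (..., language, log_score); return the language with the highest score."""
    return max(results, key=lambda row: row[-1])[1]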

Technical niggles

The way they're constructed, the database queries used in the training stage can become incredibly large - too large for the default max_allowed_packet value of 1MB in my.cnf. Setting this to 64MB was enough for all of my queries to go through.
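For reference, that's a one-line change in the [mysqld] section of my.cnf, followed by a restart of MariaDB:

[mysqld]
max_allowed_packet = 64M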

Code

The code used for this classifier can be found at GitLab. This may also be released to PyPI at some point.