lxndryng - a blog about nothing

Building a Docker Container for Taiga

Mar 11, 2017

It's no secret to people who know me that I am not the most organised person in the world when it comes to my personal life: far too often, things can wait until... well, until I forget about them. As part of a general bid to be more proactive about the things I want to get done in my free time, I had a look at the open-source project management software on offer (the people who use Jira extensively at work always seem to be the most organised, but I'm not paying for this experiment in examining my own sloth) and came out wanting to give Taiga a try: being a Python application, I'd be able to extend it with a minimum of effort if there were some piece of obscura I wished to contribute to it. Of course, my compulsion towards self-hosting all of the web-based tools I use meant that the second half of the question would be finding a means by which I could easily deploy, upgrade and manage it.

Enter Docker. I'd initially found some Docker images on Docker Hub that worked and, in a jovial fit of inattention, proceeded to use them without quite realising how old they were. Eventually, I noticed that they were last built nineteen months ago, for a project that has a fairly rapid release cadence. Fortunately, the creator of those images had published their Dockerfiles and configuration on GitHub; unfortunately, however, that configuration was itself out of date given recent changes in the supporting libraries for Taiga. The option of looking for other people's Docker containers, of course, did not occur to me, so I endeavoured to update and expand upon the work that had been done previously.

Taiga's architecture

Taiga consists of a frontend application written in Angular.js (I'm not a frontend person - I couldn't tell you if it was Angular 1 or Angular 2) and a backend application based on the Django framework. The database is a PostgreSQL database, nothing really fancy about it.

A half-done transformation

Looking at the code used to generate the Docker images, I noticed a discrepancy between several of the paths used in building the interface between the frontend and backend applications: in the backend application, everything seemed to point towards /usr/src/app/taiga-back/, whereas in the frontend application, references were made to /taiga. This dated from the backend application being built around the python base image, before being changed to python-onbuild. The -onbuild variety of the image gives some convenience methods around running pip install -r requirements.txt without manual intervention, which I can see as a worthwhile bit of effort in terms of making future updates to the image easier. Unfortunately, it does change the path of your application: something that hadn't been fixed up to now. Fortunately, a trivial change of the frontend paths to /usr/src/app/taiga-back solved the issue.

Le temps détruit tout

Some time between the last time the previous author pushed his git repository to GitHub and now, the version of Django used by Taiga changed, introducing some breaking module name changes. The Postgres database backend module changed from transaction_hooks.backends.postgresql to django.db.backends.postgresql, with the new value having to be declared in the settings file that was to be injected into the backend container.
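As a sketch, the relevant part of the injected settings file ends up looking like this (the database name and host here are illustrative; Taiga's real settings file carries many more keys):

```python
# settings/local.py - injected into the backend container
DATABASES = {
    "default": {
        # Was "transaction_hooks.backends.postgresql" before the Django upgrade
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "taiga",     # illustrative
        "HOST": "postgres",  # the linked Postgres container
    }
}
```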

Doing something sensible about data

Taiga allows users to upload files to support the user stories and features catalogued within the tool, putting these files in a subdirectory of the backend application's working directory. Now, if we're to take our containers to be immutable and replaceable, this just won't do: the deletion of the container would result in the deletion of all data therein. Given that the Postgres container was set up to store its data on the filesystem of the host, outside of the container, it's a little odd that the backend application didn't have the same consideration taken into account. Declaring the media and static directories within the application as VOLUMEs in the Dockerfile resolved this issue.
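In Dockerfile terms, the change amounts to a single instruction (the paths are those used elsewhere in the images):

```dockerfile
# Keep uploads and collected static files outside the container's writable
# layer, so they survive the container being torn down and replaced
VOLUME ["/usr/src/app/taiga-back/media", "/usr/src/app/taiga-back/static"]
```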

Don't make assumptions about how this will be deployed

In the original repository, the ports and whether HTTPS was being used for communication between the frontend and backend had been hard-coded into the JSON configuration for the frontend application: it was HTTP (rather than HTTPS) on port 8000. Now, if one were to deploy this onto a device running SELinux with the default policy, setting up a reverse proxy to terminate SSL would have been impossible because of the expectation that port 8000 would only be used by soundd - with anything else trying to bind to that port being told that it can't. To remedy this, I made the port and protocol configurable from environment variables at the time of container instantiation.
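A minimal sketch of how that can work at container start-up - http and 8000 are the values that were previously hard-coded, the localhost default is purely illustrative, and /api/v1/ is Taiga's API prefix:

```shell
#!/bin/sh
# Build the API URL for the frontend's configuration from environment
# variables, falling back to the previously hard-coded values
API_NAME="${API_NAME:-localhost}"    # illustrative default
API_PORT="${API_PORT:-8000}"
API_PROTOCOL="${API_PROTOCOL:-http}"
API_URL="${API_PROTOCOL}://${API_NAME}:${API_PORT}/api/v1/"
echo "$API_URL"
```

An entrypoint script along these lines can then substitute $API_URL into the frontend's JSON configuration before the web server starts.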


The repository put together previously contained, as well as the Dockerfiles for generation of the images, scripts to deploy the images together and have the application work. It did not, however, have any consideration of how an upgrade could work. With that in mind, I put together a script that would pull the latest versions of the images I'd put together, tear down the existing containers, stand up new ones and run any necessary database migrations. Nothing more complex than the below:


#!/bin/bash
if [[ -z "$API_NAME" ]]; then
    echo "API_NAME must be set" >&2
    exit 1
fi

if [[ -z "$API_PORT" ]]; then
    echo "API_PORT must be set" >&2
    exit 1
fi

if [[ -z "$API_PROTOCOL" ]]; then
    echo "API_PROTOCOL must be set" >&2
    exit 1
fi

docker pull lxndryng/taiga-back
docker pull lxndryng/taiga-front
docker stop taiga-back taiga-front
docker rm taiga-back taiga-front
docker run -d --name taiga-back -e API_NAME=$API_NAME -v /data/taiga-media:/usr/src/app/taiga-back/media --link postgres:postgres lxndryng/taiga-back
docker run -d --name taiga-front -e API_NAME=$API_NAME -e API_PORT=$API_PORT -e API_PROTOCOL=$API_PROTOCOL --link taiga-back:taiga-back --volumes-from taiga-back lxndryng/taiga-front
docker run -it --rm -e API_NAME=$API_NAME --link postgres:postgres lxndryng/taiga-back /bin/bash -c "cd /usr/src/app/taiga-back; python manage.py migrate --noinput; python manage.py compilemessages; python manage.py collectstatic --noinput"

GitHub repository

The Docker configuration for my spin on the Taiga Docker images can be found here.

Building a Naive Bayesian Programming Language Classifier

Mar 03, 2017

GitHub's Linguist is a very capable Ruby project for classifying the programming language(s) of a given file or repository, but struggles a little when there isn't a file extension present to give an initial hint as to which programming language may be in use: without that hint, none of the clever heuristics present within Linguist can be applied during analysis of the source code. As part of a project I'm working on at the moment, I have around 32,000 code snippets with no file extension information that I'd like to classify, with the further knowledge that some of these snippets may not be in a programming language at all, but rather in a natural language, or may just be encrypted or encoded pieces of text. Applying the Pythonic "if it quacks like a duck, it's a duck" approach, a naive Bayesian method whereby we just see if a snippet looks like something we've seen in another language seems like it might work well enough.

So why a Bayesian method?

In the main: I'm lazy and not a particularly mathematically inclined person. I also wrote half of a dissertation on Bayesian methods as applied to scientific method, so I've got enough previous in this space to at least pretend I've got some background in the field. On top of that, Bayesian classifiers give us an easy way to assume that the incidence of any evidence is independent of the incidence of any other. We end up with a fairly simple equation for finding the probability of a given programming language given the elements of language we have in a code snippet.

P(Language|Snippet n-grams) =

    P(Snippet n-gram 1|Language) * P(Snippet n-gram 2|Language) * ... * P(Snippet n-gram n|Language) * P(Language)
    -------------------------------------------------------------------------------------------------------------
                        P(Snippet n-gram 1) * P(Snippet n-gram 2) * ... * P(Snippet n-gram n)

(With no reason to prefer one language over another a priori, P(Language) is the same for every language and can be dropped when comparing scores.)

We end up with very small numbers here, so much that we get floating point underflow. To avoid this, we can use the natural logarithms of the probabilities on the right hand-side, and add rather than multiply them.
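As a sketch of that scoring step - the dicts here stand in for the trained model, and smoothing is ignored for brevity:

```python
import math

def score(language, snippet_grams, p_gram_given_lang, p_gram):
    # Sum of log-probabilities instead of a product of probabilities,
    # avoiding floating point underflow on long snippets
    total = 0.0
    for gram in snippet_grams:
        total += math.log(p_gram_given_lang[(gram, language)])
        total -= math.log(p_gram[gram])
    return total

# Toy model: "def" is far likelier to occur in Python than in Ruby here
p_gl = {("def", "python"): 0.5, ("def", "ruby"): 0.1}
p_g = {"def": 0.3}
```

The language with the greatest (least negative) total wins.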

How do we identify languages?

Linguist has six so-called "strategies" for identifying programming languages: the presence of emacs/vim modelines in a file, the filename, the file extension, the presence of a shebang (eg, #!/bin/bash) in a given file, some predefined heuristics and a Bayesian classifier of its own, though with no persistence of the training data across runs of the tool. In this approach, we'll only be implementing the classifier, but using heuristic-like methods to supplement the ability of the model to accurately identify certain languages.

The first element of the classification model will be based upon n-grams, where n will be between 1 and 4. I want to be able to classify on the basis of single keywords (eg, puts in Ruby), as well as strings of words (eg, the public static void main method signature in Java).

At the core of this, we have a very basic tokeniser that should give us enough information to put together some tokens that will give us enough to go on and create the n-grams that will give us the ability to infer the language code snippets are written in.
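The real tokeniser lives in the GitLab repository; a rough sketch of the idea (the regex and function names here are illustrative) might look like:

```python
import re

def tokenise(source):
    # Keep identifiers and keywords whole; emit punctuation as
    # single-character tokens so operators survive as evidence
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[^\sA-Za-z0-9_]", source)

def ngrams(tokens, n_max=4):
    # All n-grams for 1 <= n <= n_max, joined with spaces for storage
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]
```

Running a Java method signature through this yields both the single keywords and the full four-gram as features.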

A simple improvement on this would be to remove anything that would add plaintext to the mix: comments, docstrings and the like. As I said above, I'm not really concerned with 100% accuracy: something that quacks like a duck might be enough for us to say that it's a duck here.

Languages I have to deal with

From a cursory look at the 32,000 snippets, I know that I definitely have to be able to identify and distinguish between Python, Ruby, C#, C, C++, x86 assembly (I think - this could be a rabbit hole and a half to go down) and Java. We can reasonably expect that differentiation between Java and C#, and between C and C++, will be painful and prone to error until we refine the model, given the similarities that these languages have to one another.

To start, I will just be attempting to demonstrate that my broad approach works with at least Python, Ruby, Assembly, C# and Java before looking to incorporate more languages.

Persistence of the probability data

With a Bayesian approach, we need to be able to refer to a trained model of what the probabilities of given features are for a given programming language in order to make a prediction of which programming language a given snippet will be. In order to do this, we need to store this probability information somewhere. For the sake of simplicity, I'll be doing this in MariaDB, with the basic schema below:

DROP DATABASE IF EXISTS bayesian_training;
CREATE DATABASE bayesian_training;
USE bayesian_training;
CREATE TABLE languages(
    id INT AUTO_INCREMENT PRIMARY KEY,
    language VARCHAR(20) UNIQUE NOT NULL
);
CREATE TABLE grams(
    id INT AUTO_INCREMENT PRIMARY KEY,
    gram VARCHAR(255) UNIQUE NOT NULL
);
CREATE TABLE occurences(
    gram_id INT NOT NULL,
    language_id INT NOT NULL,
    number INT NOT NULL,
    PRIMARY KEY(gram_id, language_id),
    FOREIGN KEY(gram_id) REFERENCES grams(id),
    FOREIGN KEY(language_id) REFERENCES languages(id)
);

Training of the model

To train the model, I used the following codebases:

  • Python
    • Django
    • Twisted
  • Ruby
    • Sinatra
    • Discourse
  • Java
    • Jenkins
    • Lucene and Solr
  • Assembly
    • BareMetalOS
    • Floppy Bird
  • C#
    • GraphEngine
    • Json.Net
    • wtrace

These are, in the realms of real-world code usage, pretty small samples to be going on, but should hopefully give us enough to get a system together that works.

How effective was our initial model?

In order to test how well we did, I ran four files against the model: a Ruby file (linguist.rb), a C# file (ZipFile.cs), a Python file (flask/app.py) and an x86 assembly file (tetros.asm).

The results, as (language id, language, log-probability) tuples:

linguist.rb: [(7953, 'asm', -6416.136889371387), (8002, 'c#', -6630.869975742312), (3931, 'java', -6849.643512121844), (1, 'python', -6302.470348917564), (1763, 'ruby', -5991.090879392727)]
ZipFile.cs: [(7953, 'asm', -164549.47730156567), (8002, 'c#', -144878.96700648475), (3931, 'java', -152243.66607448383), (1, 'python', -158673.75993403912), (1763, 'ruby', -159188.1657594956)]
flask/app.py: [(7953, 'asm', -189603.1365282128), (8002, 'c#', -195084.66248479518), (3931, 'java', -196401.08214636377), (1, 'python', -171435.95779745755), (1763, 'ruby', -183980.2802635695)]
tetros.asm: [(7953, 'asm', -17961.240272497354), (8002, 'c#', -28535.4183269716), (3931, 'java', -28894.289472605506), (1, 'python', -28315.088969821816), (1763, 'ruby', -27569.732161692715)]

For all of the tested files, the language with the highest (least negative) log-probability is the one we knew the file to be written in: at least we're getting the right answers, for the most part.

Technical niggles

The way that they're constructed, the database queries used in the training stage can become incredibly large - too large for the default max_allowed_packet value of 1MB in my.cnf. Setting this to 64MB was sufficient to have all of my queries resolved.
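For reference, the change amounts to the following in my.cnf (the [mysqld] section is where the setting lives):

```ini
# my.cnf
[mysqld]
# Default is 1M; the large multi-row INSERTs from training need more headroom
max_allowed_packet = 64M
```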


GitLab repository

The code used for this classifier can be found at GitLab. This may also be released to PyPI at some point.

Flask, Safari and HTTP 206 Partial Media Requests

Jun 08, 2014

While working on my Python 3, Flask-based application Binbogami so that a friend would be able to put their rebirthed podcast online, a test scenario that I hadn't thought to check came to light: streaming MP3s in Safari on an iOS device. It turns out that attempting to do this resulted in an error in Safari along the lines of the below:

Safari iOS error

A little more investigation showed that this error was repeated in Safari on OS X. Given the unfortunate trinity of erroneous situations that Binbogami seemed to fall foul of, it seemed that the problem lay with how Safari - or QuickTime, as the interface for media streaming under Safari on these platforms - was attempting to fetch the file.

The problem

A cursory DuckDuckGo search revealed that while Firefox, Chrome, Internet Explorer and Opera all use a standard HTTP GET request for fetching media - even where this media could be considered to be streamed - Safari's dependency on QuickTime for media playback means that, upon attempting to fetch a file, an initial request for the first two bytes is made using the Range request header to determine the file's length and other header-type information, with further Range requests made subsequently.

By default, the method that I was making use of in Flask to serve static files does not issue the HTTP 206 response headers necessary to make this work, as well as not paying any heed to the Range of bytes that are requested in the request headers.


The solution

While it seemed apparent that implementing the correct headers in the HTTP response, along with some sort of custom method to send only the requested bytes of the file, would be the way around this, my head was not particularly in the space of implementation. Again, some internet searching turned up an instructive blog post that appeared to have a sensible answer. With a little bit of customisation to suit my own particularities:

import mimetypes
import os
import re

from flask import Response, current_app, request, send_from_directory


def send_file_206(path, safe_name):
    range_header = request.headers.get('Range', None)
    if not range_header:
        return send_from_directory(current_app.config["UPLOAD_FOLDER"], safe_name)

    size = os.path.getsize(path)
    byte1, byte2 = 0, None

    m = re.search(r'(\d+)-(\d*)', range_header)
    g = m.groups()

    if g[0]: byte1 = int(g[0])
    if g[1]: byte2 = int(g[1])

    length = size - byte1
    if byte2 is not None:
        length = byte2 - byte1

    data = None
    with open(path, 'rb') as f:
        f.seek(byte1)
        data = f.read(length)

    rv = Response(data, 206, mimetype=mimetypes.guess_type(path)[0],
                  direct_passthrough=True)
    rv.headers.add('Content-Range', 'bytes {0}-{1}/{2}'.format(byte1, byte1 + length - 1, size))
    return rv

A secondary issue

While the above did lead Safari to believe that it could indeed play the files, it would always treat them as "live broadcasts", rather than MP3 files of finite length. This is due to the way in which QuickTime establishes the length of a file through its initial requests for a few bytes at the head of the file: if it does not get the number of bytes that it expects, it ceases trying to issue Range requests and instead issues a request with an Icy-Metadata header, implying that it believes the file to be an IceCast stream (WireShark is a wonderful tool).

The issue in the above code is found in the byte1 + length - 1 statement in the issued Content-Range header: where Safari is requesting two bytes in its first request (so the Range header will look like Range: 0-1) this will evaluate to only sending the 0 + (1 - 0) - 1 = 0th byte - not the 0th and 1st byte as requested. The file still looks like a valid MP3 file, however, so Safari requests the whole file as a stream - therefore leading to the "Live Broadcast" designation.

A simple fix was to add +1 to the length declaration, to make it length = byte2 - byte1 + 1.


It's interesting to see how much the implementations of media downloading in mainstream browsers can differ based upon the technology underlying them. In the case of Safari's approach, however, it seems somewhat contrary to the major use case: most people using the browser to access a media file will be seeking to download it, rather than "stream" it in a traditional sense.

Safari's approach also has the downside of generating a lot of HTTP requests, which as a systems administrator can cause havoc if you're yet to set up log rotation for your webserver and application server container (Nginx and uWSGI in this case). It hadn't been long enough since I'd last seen a high wa% in top.