Time Series Accuracy - Graphite vs RRDTool

Back in my ISP days, we used data stored in RRDs to bill our customers. I wouldn’t try this with Graphite. In this write-up I try to explain why by comparing the method of recording time series used by Graphite with the one used by RRDTool.

Graphite uses Whisper to store data, which its FAQ portrays as a better alternative to RRDTool. This is potentially misleading, because the flexibility afforded by the design of Whisper comes at the price of inaccuracy.

On Time Series

Is it even a thing?

Time Series is on its way to becoming a buzzword in Information Technology circles. This has to do with the looming Internet of Things, which shall cause the Great Reversal of the Internet whereby the upstream flow of data produced by said Things is expected to exceed the downstream flow. Much of this data is expected to be of the Time Series kind.

This, of course, is a money-making opportunity of Big Data proportions all over again, and I predict we’re going to see a lot of Time Series support of various shapes and forms appearing in all manner of (mostly commercial) software.

How InfluxDB Stores Data

A nice, reliable, horizontally scalable database that is designed specifically to tackle the problem of Time Series data (and does not require you to stand up a Hadoop cluster) is very much missing from the Open Source Universe right now.

InfluxDB might be able to fill this gap; it certainly aims to.

I was curious about how it structures and stores data, and since there wasn’t much documentation on the subject, I ended up just reading the code. I figured I’d write up what I found. I only looked at the new version (currently 0.9.0, in RC stage); the previous versions are significantly different.

Ruby, HiveServer2 and Kerberos

Recently I found myself needing to connect to HiveServer2 with Kerberos authentication enabled from a Ruby app. As it turned out, the rbhive gem we were using did not support Kerberos authentication, so I had to roll my own.

This post is to document the experience of figuring out the details of a SASL/GSSAPI connection before it is lost forever in my neurons and synapses.

First, the terminology. The authentication system that Hadoop uses is Kerberos. Note that Kerberos is not a network protocol. It describes the method by which authentication happens, but not the wire format for sending Kerberos tickets and whatnot. For that, you need SASL and GSSAPI.
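To make the layering concrete, here is the one piece of it that is pure wire format: the Thrift SASL transport frames every negotiation message as a one-byte status code followed by a 4-byte big-endian payload length and the payload itself. This is my reading of the thrift_sasl code, shown in Python for brevity rather than the Ruby used in the post; treat the status values as an assumption to verify against your Thrift version.

```python
import struct

# SASL negotiation status codes as used by the Thrift SASL transport
# (assumed values, taken from my reading of thrift_sasl)
START, OK, BAD, ERROR, COMPLETE = 1, 2, 3, 4, 5

def pack_sasl_message(status, payload):
    """Frame one SASL negotiation message: status byte + big-endian
    4-byte payload length + payload bytes."""
    return struct.pack('>BI', status, len(payload)) + payload

# The very first message names the mechanism, e.g. GSSAPI:
msg = pack_sasl_message(START, b'GSSAPI')
```

Everything Kerberos-specific (the tickets themselves) travels opaquely inside these payloads; the framing never changes.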

Graceful restart in Golang

Update (Apr 2015): Florian von Bock has turned what is described in this article into a nice Go package called endless.

If you have a Golang HTTP service, chances are, you will need to restart it on occasion to upgrade the binary or change some configuration. And if you (like me) have been taking graceful restart for granted because the webserver took care of it, you may find this recipe very handy because with Golang you need to roll your own.
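The core of any graceful-restart recipe is keeping the listening socket open while the process is replaced: the parent hands the listener’s file descriptor to the child, which rebuilds a listener around it instead of binding anew (so no connections are refused in between). Here is a minimal sketch of just that fd handoff, in Python rather than Go and within a single process rather than across an exec, purely to illustrate the mechanism:

```python
import os
import socket

# "parent": create and bind the listening socket
lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
lsock.bind(('127.0.0.1', 0))       # port 0: let the OS pick a free port
lsock.listen(5)
port = lsock.getsockname()[1]

# handoff: duplicate the fd (as an inheritable fd survives exec)
# and rebuild a socket object around it in the "child"
fd = os.dup(lsock.fileno())
child_listener = socket.socket(fileno=fd)

# the rebuilt listener accepts connections on the same port
client = socket.create_connection(('127.0.0.1', port))
conn, _ = child_listener.accept()
conn.sendall(b'still listening')
received = client.recv(32)
```

In the real Go recipe the same idea is expressed with `(*net.TCPListener).File()`, `exec` with the fd in `ExtraFiles`, and `net.FileListener` on the child side.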

mod_python performance part 2: high(er) concurrency

Tl;dr

As is evident from the table below, mod_python 3.5 (in pre-release testing as of this writing) is currently the fastest tool when it comes to running Python in your web server, and second-fastest as a WSGI container.

| Server             | Version                  | Req/s   | % of httpd static | Notes                              |
|--------------------|--------------------------|---------|-------------------|------------------------------------|
| nxweb static file  | 3.2.0-dev                | 512,767 | 347.1 %           | "memcache":false (626,270 if true) |
| nginx static file  | 1.0.15                   | 430,135 | 291.1 %           | stock CentOS 6.3 rpm               |
| httpd static file  | 2.4.4, mpm_event         | 147,746 | 100.0 %           |                                    |
| mod_python handler | 3.5, Python 2.7.5        | 125,139 | 84.7 %            |                                    |
| uWSGI              | 1.9.18.2                 | 119,175 | 80.7 %            | -p 16 --threads 1                  |
| mod_python wsgi    | 3.5, Python 2.7.5        | 87,304  | 59.1 %            |                                    |
| mod_wsgi           | 3.4                      | 76,251  | 51.6 %            | embedded mode                      |
| nxweb wsgi         | 3.2.0-dev, Python 2.7.5  | 15,141  | 10.2 %            | possibly misconfigured?            |

The point of this test

I wanted to see how mod_python compares to other tools of similar purpose on high-end hardware and with relatively high concurrency. As I’ve written before, you’d be foolish to base your platform decision on these numbers, because speed in this case matters very little. So the point of this is just to make sure that mod_python is in the ballpark with the rest and that there isn’t anything seriously wrong with it. And surprisingly, mod_python is actually pretty fast — fastest, even, though in its own category (a raw mod_python handler).

Separate Request and Response or a single Request object?

Are you in favor of a single request object, or two separate objects: request and response? Could it be that the two options are not contradictory or even mutually exclusive?

I thought I had always been in favor of a single request object, a view I expressed in a Web-SIG mailing list thread dating back to October 2003 (ten years ago!). But it is only now that I’ve come to realize that the proponents of a single object and of two separate objects were both correct; they were just talking about different things.
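One way to see how both camps can be right: separate request and response objects map naturally onto the underlying HTTP exchange, while a single object is a convenience for application code — and the single object can simply bundle the two. A toy sketch (class and attribute names are mine, purely illustrative):

```python
class Request:
    """The incoming side of the exchange: read-only facts."""
    def __init__(self, method, path):
        self.method = method
        self.path = path

class Response:
    """The outgoing side: what the application fills in."""
    def __init__(self):
        self.status = '200 OK'
        self.body = b''

class HTTPExchange:
    """The 'single object' view: one handle bundling both halves."""
    def __init__(self, method, path):
        self.request = Request(method, path)
        self.response = Response()

ex = HTTPExchange('GET', '/index.html')
ex.response.body = b'hello'
```

Application code sees one object to pass around; the protocol layer still deals with two cleanly separated halves.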

My thoughts on WSGI

I’m not very fond of it. Here is why.

CGI Origins

WSGI is based on CGI, as the “GI” (Gateway Interface) suggests right there in the name.

CGI solved a very important problem using the very limited tools at hand available at the time. Though CGI wasn’t a standard, it was ubiquitous in the early days of the WWW, despite its inherent slowness and other limitations. It became popular because it worked with any language, was easy to turn on and provided such a thick wall of isolation that admins could turn it on for their users without too much concern for problems caused by user-generated CGI scripts.
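Part of why CGI was so easy to turn on is that the contract itself is tiny: the server passes request metadata in environment variables, and the script writes headers, a blank line, and the body to stdout. A minimal sketch of a CGI-style script (simulated here by passing the environment and output stream in explicitly, so it can run outside a web server):

```python
import io

def cgi_script(environ, stdout):
    # Per the CGI convention: response headers first,
    # then a blank line, then the body.
    query = environ.get('QUERY_STRING', '')
    stdout.write('Content-Type: text/plain\r\n')
    stdout.write('\r\n')
    stdout.write('query was: %s\n' % query)

# Simulate what the web server would do for GET /script?name=world
out = io.StringIO()
cgi_script({'QUERY_STRING': 'name=world'}, out)
output = out.getvalue()
```

The inherent slowness mentioned above comes from the process model around this contract — a fresh process per request — not from the contract itself.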

mod_python: the long story

This story started back in 1996. I was in my early twenties, working as a programmer at a small company specializing in on-line reporting of certain pharmaceutical data.

There was a web-based application (which was extremely cool considering how long ago this was), but unfortunately it was written in Visual Basic by a contractor, and I was determined to do something about it. As was very fashionable at the time, I was very pro Open Source, had Linux running on my home 386, and had recently heard Guido’s talk at the DC Linux user group presenting a new language he called Python. Python seemed like a perfect alternative to the VB monstrosity.

mod_python performance and why it matters not.

TL;DR: mod_python is faster than you think.

Tonight I thought I’d spend some time looking into how the new mod_python fares against other frameworks of similar purpose. In this article I am going to show the results of my findings, and then I will explain why it really does not matter.

I am particularly interested in the following:

  • a pure mod_python handler, because this is as fast as mod_python gets.
  • a mod_python wsgi app, because WSGI is so popular these days.
  • mod_wsgi, because it too runs under Apache and is written entirely in C.
  • uWSGI, because it claims to be super fast.
  • Apache serving a static file (as a point of reference).
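For reference, the application used in this kind of benchmark is typically the bare minimum — something like the following WSGI app (my reconstruction, not the exact script from the test) is enough to exercise each container:

```python
def application(environ, start_response):
    # The smallest useful WSGI app: fixed status, headers, and body.
    body = b'Hello, world!\n'
    start_response('200 OK', [
        ('Content-Type', 'text/plain'),
        ('Content-Length', str(len(body))),
    ])
    return [body]

# Calling it directly with a stub start_response shows the contract
# without needing any server at all:
captured = {}
def start_response(status, headers):
    captured['status'] = status
    captured['headers'] = dict(headers)

result = b''.join(application({}, start_response))
```

Because the app does almost no work, the numbers measure the container’s overhead rather than anything application-specific — which is exactly the point here.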

The Test

I am testing this on a CentOS instance running inside VirtualBox on an early 2011 MacBook Pro. The VirtualBox VM has 2 CPUs and 6GB of RAM allocated to it. Granted, this configuration can’t possibly be very performant [if there is such a word], but it should be enough for comparison.