Gregory Trubetskoy

Notes to self.

Mod_python Performance and Why It Matters Not.


TL;DR: mod_python is faster than you think.

Tonight I thought I’d spend some time looking into how the new mod_python fares against other frameworks of similar purpose. In this article I am going to show the results of my findings, and then I will explain why it really does not matter.

I am particularly interested in the following:

  • a pure mod_python handler, because this is as fast as mod_python gets.
  • a mod_python wsgi app, because WSGI is so popular these days.
  • mod_wsgi, because it too runs under Apache and is written entirely in C.
  • uWSGI, because it claims to be super fast.
  • Apache serving a static file (as a point of reference).

The Test

I am testing this on a CentOS instance running inside VirtualBox on an early 2011 MacBook Pro. The VirtualBox VM has 2 CPUs and 6GB of RAM allocated to it. Granted, this configuration can’t possibly be very performant [if there is such a word], but it should be enough to compare.

Real-life performance is very much affected by issues related to concurrency and load. I don’t have the resources or tools to comprehensively test such scenarios, so I’m just using a concurrency of 1 and seeing how fast each of the setups listed above can process small requests.

I’m using mod_python 3.4.1 (pre-release), revision 35f35dc, compiled against Apache 2.4.4 and Python 2.7.5. The mod_wsgi version is 3.4; for uWSGI I use 1.9.17.1.

The Apache configuration is pretty minimal (it could probably be trimmed even more, but this is good enough):

LoadModule unixd_module /home/grisha/mp_test/modules/mod_unixd.so
LoadModule authn_core_module /home/grisha/mp_test/modules/mod_authn_core.so
LoadModule authz_core_module /home/grisha/mp_test/modules/mod_authz_core.so
LoadModule authn_file_module /home/grisha/mp_test/modules/mod_authn_file.so
LoadModule authz_user_module /home/grisha/mp_test/modules/mod_authz_user.so
LoadModule auth_basic_module /home/grisha/mp_test/modules/mod_auth_basic.so
LoadModule python_module /home/grisha/src/mod_python/src/mod_python/src/mod_python.so

ServerRoot /home/grisha/mp_test
PidFile logs/httpd.pid
ServerName 127.0.0.1
Listen 8888
MaxRequestsPerChild 1000000

<Location />
      SetHandler mod_python
      PythonHandler mp
      PythonPath "sys.path+['/home/grisha/mp_test/htdocs']"
</Location>

I should note that <Location /> is there for a purpose: the latest mod_python forgoes the map_to_storage phase when inside a <Location> section, which makes it a little bit faster.

And the mp.py file referred to by the PythonHandler in the config above looks like this:

from mod_python import apache

def handler(req):

    req.content_type = 'text/plain'
    req.write('Hello World!')

    return apache.OK

As the benchmark tool, I’m using the good old ab, as follows:

$ ab -n 10  http://localhost:8888/

For each test in this article I run 500 requests first as a “warm up”, then another 500K for the actual measurement.

For the mod_python WSGI handler test I use the following config (relevant section):

<Location />
    PythonHandler mod_python.wsgi
    PythonPath "sys.path+['/home/grisha/mp_test/htdocs']"
    PythonOption mod_python.wsgi.application mp_wsgi
</Location>

And the mp_wsgi.py file looks like this:

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]

For mod_wsgi test I use the exact same file, and the config as follows:

LoadModule wsgi_module /home/grisha/mp_test/modules/mod_wsgi.so
WSGIScriptAlias / /home/grisha/mp_test/htdocs/mp_wsgi.py

For uWSGI (I am not an expert), I first used the following command:

/home/grisha/src/mp_test/bin/uwsgi \
   --http 0.0.0.0:8888 \
   -M -p 1 -w mysite.wsgi -z 30 -l 120 -L

This yielded a pretty dismal result, so I tried using a Unix socket (-s /home/grisha/mp_test/uwsgi.sock) and nginx as the front end as described here, which did make uWSGI come out on top (even if proxied uWSGI is an orange among the apples).

The results, requests per second, fastest at the top:

| Setup               | Requests/sec |
|---------------------|--------------|
| uWSGI/nginx         | 2391         |
| mod_python handler  | 2332         |
| static file         | 2312         |
| mod_wsgi            | 2143         |
| mod_python wsgi     | 1937         |
| uWSGI --http        | 1779         |

What’s interesting and unexpected at first is that uWSGI and the mod_python handler perform better than sending a static file, which I expected to be the fastest. On second thought, though, it does make sense once you consider that no (on average pretty expensive) filesystem operations are performed to serve the request.

Mod_wsgi performs better than the mod_python WSGI handler, which is expected: the mod_python handler is implemented mostly in Python, while mod_wsgi’s is written in C.

I think that with a little work the mod_python WSGI handler could perform on par with uWSGI, though I’m not sure the effort would be worth it. As we all know, premature optimization is the root of all evil.

Why It Doesn’t Really Matter

Having seen the above you may be tempted to jump on the uWSGI wagon, because after all, what matters more than speed?

But let’s imagine a more real-world scenario, because it’s unlikely that all your application does is send "Hello World!".

To illustrate the point a little better I created a very simple Django app, which also sends "Hello World!", only it does so using a template:

from django.http import HttpResponse
from django.template import Context
from django.template.loader import get_template

def hello(request):
    t = get_template("hello.txt")
    c = Context({'name': 'World'})
    return HttpResponse(t.render(c))

Using the mod_python WSGI handler (the slowest), we can process 455 req/s; using uWSGI (the fastest), 474. This means that by moving this "application" from mod_python to uWSGI we would improve performance by a measly 5%.

Now let’s add some database action to our so-called “application”. For every request I’m going to pull my user record from the Django auth_users table:

from django.contrib.auth.models import User

def hello(request):
    grisha = User.objects.get(username='grisha')
    t = get_template("hello.txt")
    c = Context({'name':str(grisha)[0:5]}) # world was 5 characters
    return HttpResponse(t.render(c))

Now we are down to 237 req/s for the mod_python WSGI handler and 245 req/s in uWSGI, and the difference between the two has shrunk to just over 3%.

Mind you, our "application" still has fewer than 10 lines of code. In a real-world situation the difference in performance is more likely to amount to less than a tenth of a percent.
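
A quick back-of-the-envelope calculation shows why. Assuming the per-request overhead difference between the two stays roughly at what the "Hello World!" numbers above imply, it simply gets diluted as the application does more real work per request. This is a minimal sketch; the req/s figures come from the table above, while the 100 ms of application time is a hypothetical figure, not a measurement:

# Back-of-the-envelope: a fixed per-request overhead difference gets diluted
# as the application does more real work. The req/s figures are from the table
# above; the 100 ms of application time is hypothetical.
mp_wsgi_rps, uwsgi_rps = 1937.0, 2391.0                 # "Hello World!" requests per second

overhead_diff = 1.0 / mp_wsgi_rps - 1.0 / uwsgi_rps     # seconds per request, roughly 0.0001
print("overhead difference: %.3f ms" % (overhead_diff * 1000))

app_time = 0.100                                        # hypothetical: 100 ms of real work per request
slowdown = overhead_diff / (app_time + overhead_diff)
print("relative difference: %.2f%%" % (slowdown * 100)) # roughly 0.1%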

Bottom line: it’s foolish to pick your web server based on speed alone. Factors such as your comfort level with using it, features, documentation, security, etc., are far more important than how fast it can crank out “Hello world!”.

Last, but not least, mod_python 3.4.1 (used in this article) is ready for pre-release testing, please help me test it!

Running a WSGI App on Apache Should Not Be This Hard


If I have a Django app in /home/grisha/mysite, then all I should need to do to run it under Apache is:

$ mod_python create /home/grisha/mysite_httpd \
    --listen=8888 \
    --pythonpath=/home/grisha/mysite \
    --pythonhandler=mod_python.wsgi \
    --pythonoption="mod_python.wsgi.application mysite.wsgi::application"

$ mod_python start /home/grisha/mysite_httpd/conf/httpd.conf

That’s all. There should be no need to become root, tweak various configurations, place files in the right place, check permissions, none of that.

Well… With mod_python 3.4.0 (alpha) that’s exactly how it is…

Please help me test it.

The Next Smallest Step Problem


“A journey of a thousand miles begins with a single step”

Most of my journeys never begin, or cannot continue because of that one single step, be it first or next. Because it is hard, at times excruciatingly so.

Here I speak of software, but this applies to many other aspects of my life just as well.

I reckon it’s because I do not think in steps. I think of a destination. I imagine the end-result. I can picture it with clarity and in great detail. I know where I need to be. But what is the next step to get there? And it doesn’t help that where I travel, there are no signs.

The problem of deciding what to do next is so common for me that I even have a name for it. I call it “The Next Smallest Step” problem. Whenever I find myself idling, clicking over to online time-wasters, I pause and ask myself “What is the Next Smallest Step? What’s the next smallest thing I can do, right now?”

It doesn’t matter how much further this small step moves me. A nanometer is better than standing still. It has to be something that is simple enough that I can just do. Right now.

I always plan to do big things that take days, weeks or months. But of all that, can I pick that one small simple and quick thing that I can do now?

Sometimes focusing on the next smallest step is so difficult that I pencil this question on a piece of paper, and sometimes I just type it on the command line or in the source code. My short version is:

$ WHAT NEXT?
bash: WHAT: command not found

(that’s right, in CAPS)

This simple question has taken me on some of the most fascinating and challenging journeys ever. In retrospect, I think I would not have been able to travel any of them without repeatedly asking it of myself, over and over again.

It has resulted in some of my most productive and gratifying days of work. Some of my greatest projects began with this question. In many instances it established what I had to do for months ahead (years, even?). All beginning with this one small question.

Conversely, not asking it often enough, if at all, has led to time gone by with no results to show for it and many a great opportunity lost.

I must also note that some of my next smallest steps took days of thinking to figure out. Nothing wrong with that.

And so I thought I’d share this with you, just in case you might find it helpful. Whenever you find yourself at an impasse and progress has stopped, ask yourself:

“What is the Next Smallest Step?”

Hacking on Mod_python (Again)


Nearly eight years after my last commit to Mod_python I’ve decided to spend some time hacking on it again.

Five years after active development stopped and thirteen since its first release, it still seems to me an entirely useful and viable tool. The code is exceptionally clean, the documentation is amazing, and the test suite is awesome, which is a real testament to the noble efforts of all the people who contributed to its development.

We live in this new c10k world now, where Apache Httpd no longer has the market dominance it once enjoyed, while the latest Python web frameworks run without requiring or recommending Mod_python. My hunch, however, is that given a thorough dusting it could be quite useful (perhaps more than ever) and applied in very interesting ways to solve new problems. I also think the Python language is at a very important inflection point. Python 3 is now mature, and is slowly but steadily becoming the preferred language of many interesting communities, such as data science.

The current status of Mod_python as an Apache project is that it’s in the attic. This means that the ASF isn’t providing much in the way of infrastructure support any longer, nor will you see an “official” ASF release any time soon. (If ever - Mod_python would have to re-enter as an incubator project and at this point it is entirely premature to even consider such an option).

For now the main goal is to re-establish the community, and as part of that I will have to sort out how to do issue tracking, discussion groups, etc. At this point the only thing I’ve decided is that the main repository will live on github.

The latest code is in the 4.0.x branch. My initial development goal is to bring it up to compatibility with Python 2.7 and Apache Httpd 2.4 (I’m nearly there already), then potentially move on to Python 3 support. I have rolled back a few commits (most notably the new importer) because I did not understand them. There are still a few changes in Apache 2.4 that need to be addressed, but they seem relatively minor at this point. Authentication has changed significantly in 2.4, though mod_python never had much coverage in that area.

Let’s see where this takes us. And if you like this, feel free to star and fork Mod_python on GitHub and follow it on Twitter.

Json2avro


As you embark on converting vast quantities of JSON to Avro, you soon discover that things are not as simple as they seem. Here is how it might happen.

A quick Google search eventually leads you to the avro-tools jar, and you find yourself attempting to convert some JSON, such as:

{"first":"John", "middle":"X", "last":"Doe"}
{"first":"Jane", "last":"Doe"}

Having read the Avro documentation and being the clever being that you are, you start out with:

java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson input.json --schema \
 '{"type":"record","name":"whatever",
   "fields":[{"name":"first", "type":"string"},
             {"name":"middle","type":"string"},
             {"name":"last","type":"string"}]}' > output.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
        at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
        at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
        ...

A brief moment of disappointment is followed by the bliss of enlightenment: Duh, the “middle” element needs a default! And so you try again, this time having tacked a default onto the definition of “middle”, so it looks like {"name":"middle","type":"string","default":""}:

java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson input.json --schema \
 '{"type":"record","name":"whatever",
   "fields":[{"name":"first", "type":"string"},
             {"name":"middle","type":"string","default":""},
             {"name":"last","type":"string"}]}' > output.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
        at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
        ...

Why doesn’t this work? Well… You don’t understand Avro, as it turns out. You see, JSON is not Avro, and therefore the wonderful Schema Resolution thing you’ve been reading about does not apply.

But do not despair. I wrote a tool just for you:

json2avro. It does exactly what you want:

json2avro input.json output.avro -s \
 '{"type":"record","name":"whatever",
   "fields":[{"name":"first", "type":"string"},
             {"name":"middle","type":"string","default":""},
             {"name":"last","type":"string"}]}'

No errors, and we have an output.avro file. Let’s see what’s in it using the aforementioned avro-tools:

java -jar ~/src/avro/java/avro-tools-1.7.4.jar tojson output.avro
{"first":"John","middle":"X","last":"Doe"}
{"first":"Jane","middle":"","last":"Doe"}

Let me also mention that json2avro is written in C and is fast. It supports the Snappy, Deflate and LZMA compression codecs, lets you pick a custom block size, and is smart enough to (optionally) skip over lines it cannot parse.
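
If you would rather do the same conversion from Python, here is a minimal sketch using the avro package (the 1.7-era Python API) instead of json2avro itself. The schema and file names mirror the example above, and the default for "middle" is applied by hand since, as noted, schema resolution does not apply to plain JSON input:

# Sketch: convert newline-delimited JSON to Avro in Python, filling in the
# default for the optional "middle" field by hand. Assumes the avro package
# (1.7-era API: schema.parse, DataFileWriter); this is not the json2avro tool.
import json
from avro import schema, datafile, io as avro_io

SCHEMA = schema.parse('''
{"type": "record", "name": "whatever",
 "fields": [{"name": "first",  "type": "string"},
            {"name": "middle", "type": "string", "default": ""},
            {"name": "last",   "type": "string"}]}
''')

with open('input.json') as src, open('output.avro', 'wb') as dst:
    writer = datafile.DataFileWriter(dst, avro_io.DatumWriter(), SCHEMA)
    for line in src:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        rec.setdefault('middle', '')   # what json2avro does for you via the schema default
        writer.append(rec)
    writer.close()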

Enjoy!

Avro Performance


Here are some unscientific results on how Avro performs with various codecs, as well as versus JSON-LZO files in Hive and Impala. The testing was done using a 100 million row table generated from two random strings and an integer.

| Format    | Codec          | Data Size     | Hive count(1) time | Impala count(1) time
|-----------|----------------|---------------|--------------------|----------------------
| JSON      | null           | 686,769,821   | not tested         | N/A                  
| JSON      | LZO            | 285,558,314   | 79s                | N/A                  
| JSON      | Deflate (gzip) | 175,878,038   | not tested         | N/A                  
| Avro      | null           | 301,710,126   | 40s                | .4s                  
| Avro      | Snappy         | 260,450,980   | 38s                | .9s                  
| Avro      | Deflate (gzip) | 156,550,144   | 64s                | 2.8s                 

So the winner appears to be Avro/Snappy or uncompressed Avro.

Apache Avro


Short version

  • Avro is better than Json for storing table data
  • Avro supports schema resolution so that the schema can evolve over time
  • Hive supports Avro and schema resolution nicely
  • Impala (1.0) can read Avro tables, but does not support schema resolution
  • Mixing compression codecs in the same table works in both Hive and Impala

The TL;DR version

Introduction

If you’re logging data into Hadoop to be analyzed, chances are you’re using JSON. JSON is great because it’s easy to generate in most any language, it’s human-readable, it’s universally supported and infinitely flexible.

It is also space-inefficient and prone to errors, and the standard has a few ambiguities, all of which eventually catch up with you. It only takes one bad record to spoil a potentially massive amount of data, and finding the bad record and figuring out the root cause of the problem is usually difficult and often even impossible.

So you might be considering a slightly more rigid and space efficient format, and in the Hadoop world it is Apache Avro. Avro is especially compelling because it is supported by Impala, while JSON isn’t (not yet, at least).

Named after a British aircraft maker, Avro is a schema-enforced format for serializing arbitrary data. It is in the same category as Thrift, only it seems like Thrift has found its niche in RPC, whereas Avro appears more compelling as the on-disk format (even though both Avro and Thrift were designed for both storage and RPC). Thrift seems more insistent on you using its code generator, whereas Avro does it the old-school way, by providing libraries you can use in your code. (It does code generation as well, if that’s your thing. I prefer to hand-write all my code.)

I actually don’t want to focus on the details of what Avro is, as there is plenty of information on that elsewhere. I want to share my findings regarding Avro’s suitability as an alternative to JSON used with Hive and the JSON SerDe.

Schema (and its Resolution)

Every Avro file contains a header with the schema describing (in JSON!) the contents of the file’s records. This is very nice, because the file contains all the knowledge necessary to be able to read it.

Avro was designed with the understanding that the schema may change over time (e.g. columns added or changed), and that software designed for a newer schema may need to read older schema files. To support this it provides something called Schema Resolution.

Imagine you’ve been storing people’s names in a file. Then later on you decided to add “age” as another attribute. Now you’ve got two schemas, one with “age” and one without. In the JSON world you’d have to adjust your program to be able to read old files, with some kind of an if/then statement to make sure that when “age” is not there the program knows what to do. In Avro, the new schema can specify a default for the age (e.g. 0), and whatever Avro lib you’re using should be able to convert a record of the old schema to the new schema automatically, without any code modifications necessary. This is called schema resolution.
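
To make this concrete, here is a hedged sketch of reading an old-schema file with a newer reader schema that adds an "age" field with a default, using the avro Python package (1.7-era API). The DatumReader is given the reader's schema and performs the resolution; the file and field names are made up for illustration:

# Sketch: Avro schema resolution on read. The file was written without "age";
# the reader schema below adds it with a default, and the library fills it in.
# Assumes the avro package (1.7-era API); file/field names are illustrative.
from avro import schema, datafile, io as avro_io

READER_SCHEMA = schema.parse('''
{"type": "record", "name": "person",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "int", "default": 0}]}
''')

# DataFileReader picks up the writer's schema from the file header;
# DatumReader resolves it against the reader's schema given here.
reader = datafile.DataFileReader(open('people_old.avro', 'rb'),
                                 avro_io.DatumReader(readers_schema=READER_SCHEMA))
for record in reader:
    print(record)   # e.g. {'name': 'John Doe', 'age': 0}
reader.close()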

Avro support in Hive

First we need an Avro file. Our schema is just one string column named “full_name”. Here’s a quick Ruby program to generate a file of a few random records (you’ll need to gem install avro):

require 'rubygems'
require 'avro'

schema = Avro::Schema.parse('{"name":"test_record", ' +
                            ' "type":"record", ' +
                            ' "fields": [' +
                            '   {"name":"full_name",  "type":"string"}]}')

writer = Avro::IO::DatumWriter.new(schema)
file = File.open('test.avro', 'wb')
dw = Avro::DataFile::Writer.new(file, writer, schema)
3.times do
  dw << {'full_name'=>"X#{rand(10000)} Y#{rand(10000)}"}
end
dw.flush
dw.close

Then we need to create a Hive table and load our file into it:

CREATE TABLE test_avro
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
 STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
 TBLPROPERTIES (
    'avro.schema.literal'='{"name":"test_record",
                            "type":"record",
                            "fields": [
                               {"name":"full_name", "type":"string"}]}');

LOAD DATA LOCAL INPATH 'test.avro' OVERWRITE INTO TABLE test_avro;

Note that the table definition needs its own schema definition, even though our file already contains a schema. This is not a mistake. This is the schema Hive will expect. And if the file that it’s reading is of a different schema, it will attempt to convert it using Avro schema resolution. Also noteworthy is that this table defines no columns. The entire definition is in the avro.schema.literal property.

Let’s make sure this is working:

hive> select * from test_avro;
OK
X1800 Y9002
X3859 Y8971
X6935 Y5523

Now we also happen to have Impala running, let’s see if it’s able to read this file:

> select * from test_avro;
Query: select * from test_avro
Query finished, fetching results ...
+-------------+
| full_name   |
+-------------+
| X1800 Y9002 |
| X3859 Y8971 |
| X6935 Y5523 |
+-------------+

So far so good! Now let’s create a second Avro file with one additional column, age, using the following Ruby:

schema = Avro::Schema.parse('{"name":"test_record", ' +
                            ' "type":"record", ' +
                            ' "fields": [' +
                            '   {"name":"full_name",  "type":"string"},' +
                            '   {"name":"age",        "type":"int"}]}')

writer = Avro::IO::DatumWriter.new(schema)
file = File.open('test2.avro', 'wb')
dw = Avro::DataFile::Writer.new(file, writer, schema)
3.times do
  dw << {'full_name'=>"X#{rand(10000)} Y#{rand(10000)}", 'age'=>rand(100)}
end
dw.flush
dw.close

Let’s load this into Hive and see if it still works. (No OVERWRITE keyword this time, we’re appending a second file to our table).

hive> LOAD DATA LOCAL INPATH 'test2.avro' INTO TABLE test_avro;
OK
hive> select * from test_avro;
OK
X1800 Y9002
X3859 Y8971
X6935 Y5523
X4720 Y1361
X4605 Y3067
X7007 Y7852

This is working exactly as expected. Hive has shown the 3 original records as before, and the 3 new ones got converted to Hive’s version of the schema, where the “age” column does not exist.

Let’s see what Impala thinks of this:

> select * from test_avro;
Query: select * from test_avro
Query finished, fetching results ...
+-------------+
| full_name   |
+-------------+
| X1800 Y9002 |
| X3859 Y8971 |
| X6935 Y5523 |
+-------------+

Alas, we’re only getting the 3 original rows. Bummer! What’s worrisome is that we got no indication that the 3 other rows were swallowed because Impala didn’t do schema resolution. (I’ve posted about this on the Impala users list and am awaiting a response.)

Now let’s alter the table schema so that age is part of it. (This is not your typical ALTER TABLE, we’re just changing avro.schema.literal).

hive> ALTER TABLE test_avro SET TBLPROPERTIES (
    'avro.schema.literal'='{"name":"test_record",
                            "type":"record",
                            "fields": [
                              {"name":"full_name", "type":"string"},
                              {"name":"age",  "type":"int"}]}');
OK
hive> select * from test_avro;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroTypeException: Found test_record, expecting test_record

This error should not be surprising. Hive is trying to provide a value for age for those records where it did not exist, but we neglected to specify a default. (The error message is a little cryptic, though). So let’s try again, this time with a default:

hive> ALTER TABLE test_avro SET TBLPROPERTIES (
    'avro.schema.literal'='{"name":"test_record",
                            "type":"record",
                            "fields": [
                              {"name":"full_name", "type":"string"},
                              {"name":"age",  "type":"int", "default":999}]}');
OK
hive> select * from test_avro;
OK
X1800 Y9002     999
X3859 Y8971     999
X6935 Y5523     999
X4720 Y1361     10
X4605 Y3067     34
X7007 Y7852     17

Woo-hoo!

Other notes

  • There is a nice Avro-C library, but it currently does not support defaults (version 1.7.4).
  • Some benchmarks
  • Avro’s integer encoding is the same as Lucene’s, with zigzag encoding on top of it. (It’s just like SQLite’s, only little-endian.) Curiously, it is used for all integers, including array indexes internal to Avro, which can never be negative and for which zigzag is therefore of no use. This is probably to keep all integer operations consistent. (See the sketch after this list.)
  • A curious post on how Lucene/Avro variable-integer format is CPU-unfriendly
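
For the curious, here is a small sketch (plain Python, no Avro library) of the variable-length zigzag encoding described above. It matches the int/long wire format the Avro spec describes (zigzag, then a little-endian base-128 varint), though the function names are mine:

# Zigzag maps signed integers onto unsigned ones (0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4),
# and the result is then written as a little-endian base-128 varint:
# 7 bits per byte, high bit set on every byte except the last.
import binascii

def zigzag_encode(n, bits=64):
    return (n << 1) ^ (n >> (bits - 1))

def write_varint(n):
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_long(n):
    return write_varint(zigzag_encode(n))

# Small absolute values take a single byte; 64 encodes as 80 01, 1000 as d0 0f.
for value in (0, -1, 1, -64, 64, 1000):
    print("%6d -> %s" % (value, binascii.hexlify(encode_long(value)).decode()))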

Simple Solution to Password Reuse


Here’s a KISS solution to all your password reuse problems. It requires remembering only *one* strong password, lets you have a virtually limitless number of passwords, and, most importantly, does NOT store anything anywhere or transfer anything over the network (100% browser-side Javascript).

Stupid Simple Password Generator

Step 1:

Think of a phrase you will always remember. Keep typing until the text on the right says “strong”. Punctuation, spaces, unusual words and mixed case, while not required, are generally a good idea. The most important thing is that the script considers it strong.

Make sure this passphrase is impossible to guess by people who know you, e.g. don’t pick quotes from your favorite song or movie. Don’t ever write it down or save it on your computer in any way or form!

Step 2:

Think of a short keyword describing a password, e.g. “amazon”, “gmail”, etc. This word has to be easy to remember and there is no need for it to be unique or hard to guess.

(Interactive form: a Passphrase field with a Strength indicator, a Verify field with a Correct indicator, and a Keyword / Password table.)

That’s it! You can regenerate any of the passwords above at any time by coming back to this page; all you need to know is the passphrase (and the keywords).

Fine print: This is a proof-of-concept, use at your own risk!

How does it work?

First, credit where it is due: this page uses Brian Turek’s excellent jsSHA JavaScript SHA lib and Dan Wheeler’s amazing zxcvbn password strength checking lib. All we are doing here is computing a SHA-512 of the passphrase + keyword, then selecting a substring of the result. (We also append a 0 and/or a ! to satisfy most password checkers’ requirements for numbers and punctuation.) If you don’t trust that the generated passwords are strong, just paste one into the passphrase field; I assure you, no password here will ever be weak (or, rather, it is extremely unlikely). Some improvements could be made, but the point here is that there is no reason to keep encrypted files with your passwords around, along with the software to open them; all that’s needed is one strong password and a well-established, easily available algorithm.
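
The same idea is easy to reproduce outside the browser. Here is a minimal sketch in Python of the scheme described above (SHA-512 of passphrase + keyword, a substring of the hex digest, plus a trailing "0!"); the substring length is an arbitrary choice, and it is not guaranteed to produce the same strings as the JavaScript on this page:

# Sketch of the scheme above: hash passphrase + keyword with SHA-512, take a
# substring of the hex digest, and append "0!" to satisfy digit/punctuation rules.
# Illustrative only; not byte-for-byte identical to the page's JavaScript.
import hashlib

def site_password(passphrase, keyword, length=12):
    digest = hashlib.sha512((passphrase + keyword).encode('utf-8')).hexdigest()
    return digest[:length] + '0!'

print(site_password('My passphrase, known only to me!', 'amazon'))
print(site_password('My passphrase, known only to me!', 'gmail'))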

Compiling Impala From Github


Apparently Impala has two versions of source code, one internal to Cloudera and the other available on GitHub. I presume that code gets released to GitHub once it has undergone some level of internal scrutiny, but I’ve not seen any documentation on how one could tie the publicly available code to the official Impala (binary) release, currently 1.0.

Anyway, I tried compiling the github code last night, and here are the steps that worked for me.

My setup:

  • Linux CentOS 6.2 (running inside a VirtualBox instance on an Early 2011 MacBook, Intel i7).

  • CDH 4.3 installed using Cloudera RPM’s. No configuration was done, I just ran yum install as described in the installation guide.

  • Impala checked out from Github, HEAD is 60cb0b75 (Mon May 13 09:36:37 2013).

  • Boost 1.42, compiled and installed manually (see below).

The steps I followed:

  • Check out Impala:
git clone git://github.com/cloudera/impala.git
<...>

git branch -v
* master 60cb0b7 Fixed formatting in README
  • Install Impala prerequisites as per the Impala README, except for Boost:
sudo yum install libevent-devel automake libtool flex bison gcc-c++ openssl-devel \
    make cmake doxygen.x86_64 glib-devel python-devel bzip2-devel svn \
    libevent-devel cyrus-sasl-devel wget git unzip
  • Install LLVM. Follow the precise steps in the Impala README, it works.

  • Make sure you have the Oracle JDK 6, not OpenJDK. I found this link helpful.

  • Remove the CentOS version of Boost (1.41) if you have it. Impala needs uuid, which is only supported in 1.42 and later:

# YMMV - this is how I did it, you may want to be more cautious
sudo yum erase `rpm -qa | grep boost`
  • Download and untar Boost 1.42 from http://www.boost.org/users/history/version1420.html

  • Compile and install Boost. Note that Boost must be compiled with multi-threaded support, and the layout matters too. I ended up using the following:

cd boost_1_42_0/
./bootstrap.sh
# not sure how necessary the --libdir=/usr/lib64 is, there was a post mentioning it, i followed this advice blindly
./bjam --libdir=/usr/lib64 threading=multi --layout=tagged
sudo ./bjam --libdir=/usr/lib64 threading=multi --layout=tagged install

SQLite DB Stored in a Redis Hash


In a recent post I explained how a relational database could be backed by a key-value store by virtue of B-Trees. This sounded great in theory, but I wanted to see that it actually works. And so last night I wrote a commit to Thredis, which does exactly that.

If you’re not familiar with Thredis, it’s something I hacked together last Christmas. Thredis started out as threaded Redis, but eventually evolved into SQLite + Redis. Thredis used to need a separate file to save SQLite data, but with this patch that’s no longer necessary: a SQLite DB is stored entirely in a Redis Hash object.

A very neat side-effect of this little hack is that it lets a SQLite database be automatically replicated using Redis replication.

I was able to code this fairly easily because SQLite provides a very nice way of implementing a custom Virtual File System (VFS).
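
To give a feel for the idea, here is a conceptual sketch in Python (using redis-py on a plain SQLite file) that chops a database file into fixed-size pages and stores them in a Redis hash keyed by page number. This only illustrates the storage layout, not the actual Thredis C implementation; the key name and page size are arbitrary choices:

# Conceptual illustration only: store an existing SQLite database file in a
# Redis hash, one fixed-size page per hash field, roughly the way a VFS might.
# Assumes redis-py and a local Redis; key name and page size are arbitrary.
import redis

PAGE_SIZE = 4096
KEY = 'sqlite_pages'

r = redis.Redis(host='localhost', port=6379)

with open('test.db', 'rb') as f:
    page_no = 0
    while True:
        page = f.read(PAGE_SIZE)
        if not page:
            break
        r.hset(KEY, page_no, page)   # HSET sqlite_pages <page_no> <raw page bytes>
        page_no += 1

print("%d pages stored" % r.hlen(KEY))
# The first page begins with the familiar "SQLite format 3" header,
# much like the HGET output shown at the end of this post.
print(r.hget(KEY, 0)[:16])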

Granted, this is only a proof of concept and not anything you should dare use anywhere near production, but it’s enough to get a little taste. So let’s start an empty Thredis instance and create a SQL table:

$ redis-cli
redis 127.0.0.1:6379> sql "create table test (a int, b text)"
(integer) 0
redis 127.0.0.1:6379> sql "insert into test values (1, 'hello')"
(integer) 1
redis 127.0.0.1:6379> sql "select * from test"
1) 1) 1) "a"
      2) "int"
   2) 1) "b"
      2) "text"
2) 1) (integer) 1
   2) "hello"
redis 127.0.0.1:6379> 

Now let’s start a slave on a different port and fire up another redis-cli to connect to it. (This means slaveof is set to localhost:6379 and slave-read-only is set to false; I won’t bore you with a paste of the config here.)

$ redis-cli -p 6380
redis 127.0.0.1:6380> sql "select * from test"
1) 1) 1) "a"
      2) "int"
   2) 1) "b"
      2) "text"
2) 1) (integer) 1
   2) "hello"
redis 127.0.0.1:6380> 

Here you go - the DB’s replicated!

You can also see what SQLite data looks like in Redis (not terribly exciting):

redis 127.0.0.1:6379> hlen _sql:redis_db
(integer) 2
redis 127.0.0.1:6379> hget _sql:redis_db 0
"SQLite format 3\x00 \x00\x01\x01\x00@  \x00\x00\x00\x02\x00\x00\x00\x02\x00\x00 ...

Another potential benefit of this approach is that with not too much more tinkering the database could be backed by Redis Cluster, which would give you a fully functional, horizontally scalable, clustered in-memory SQL database. Of course, only the storage would be distributed, not the query processing, so this would be no match for Impala and the like, which can process queries in a distributed fashion. But still, it’s pretty cool for some 300 lines of code, n’est-ce pas?