Running a WSGI app on Apache should not be this hard

If I have a Django app in /home/grisha/mysite, then all I should need to do to run it under Apache is:

$ mod_python create /home/grisha/mysite_httpd \
    --listen=8888 \
    --pythonpath=/home/grisha/mysite \
    --pythonhandler=mod_python.wsgi \
    --pythonoption="mod_python.wsgi.application mysite.wsgi::application"

$ mod_python start /home/grisha/mysite_httpd/conf/httpd.conf

That’s all. There should be no need to become root, tweak various configurations, place files in the right place, check permissions, none of that.

Well… With mod_python 3.4.0 (alpha) that’s exactly how it is…
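The `mod_python.wsgi.application mysite.wsgi::application` option above just names a standard WSGI callable. For illustration, a minimal stand-in for such a `wsgi.py` module might look like this (a real Django project would instead generate one calling `django.core.wsgi.get_wsgi_application()`):

```python
# A bare-bones stand-in for the "mysite.wsgi::application" target
# named in the mod_python.wsgi.application option above.

def application(environ, start_response):
    """A minimal WSGI callable; any conforming callable will do."""
    body = b"Hello from mod_python + WSGI\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```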

The Next Smallest Step Problem

“A journey of a thousand miles begins with a single step”

Most of my journeys never begin, or cannot continue, because that one single step, be it the first or the next, is hard, at times excruciatingly so.

Here I speak of software, but this applies to many other aspects of my life just as well.

I reckon it’s because I do not think in steps. I think of a destination. I imagine the end-result. I can picture it with clarity and in great detail. I know where I need to be. But what is the next step to get there? And it doesn’t help that where I travel, there are no signs.

Hacking on mod_python (again)

Nearly eight years after my last commit to mod_python, I’ve decided to spend some time hacking on it again.

After five years without active development, and thirteen since its first release, it still seems to me an entirely useful and viable tool. The code is exceptionally clean, the documentation is amazing, and the test suite is awesome. Which is a real testament to the noble efforts of all the people who contributed to its development.

json2avro

As you embark on converting vast quantities of JSON to Avro, you soon discover that things are not as simple as they seem. Here is how it might happen.

A quick Google search eventually leads you to the avro-tools jar, and you find yourself attempting to convert some JSON, such as:

{"first":"John", "middle":"X", "last":"Doe"}
{"first":"Jane", "last":"Doe"}

Having read Avro documentation and being the clever being that you are, you start out with:

java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson input.json --schema \
 '{"type":"record","name":"whatever",
   "fields":[{"name":"first", "type":"string"},
             {"name":"middle","type":"string"},
             {"name":"last","type":"string"}]}' > output.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
        at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
        at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
        ...

A brief moment of disappointment is followed by the bliss of enlightenment: Duh, the “middle” element needs a default! And so you try again, this time having tacked a default onto the definition of “middle”, so it looks like {"name":"middle","type":"string","default":""}.
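If the tool still balks at sparse records, one fallback is to normalize the JSON yourself before handing it to avro-tools, filling each missing field with the schema’s default. A rough stdlib Python sketch of that idea, using the schema and records from above:

```python
import json

# The writer schema from the avro-tools invocation above,
# with the default tacked onto "middle".
SCHEMA = {
    "type": "record", "name": "whatever",
    "fields": [
        {"name": "first", "type": "string"},
        {"name": "middle", "type": "string", "default": ""},
        {"name": "last", "type": "string"},
    ],
}

def fill_defaults(record, schema=SCHEMA):
    """Return a copy of record with schema defaults for absent fields."""
    out = dict(record)
    for field in schema["fields"]:
        if field["name"] not in out:
            if "default" not in field:
                raise ValueError("missing field with no default: %s" % field["name"])
            out[field["name"]] = field["default"]
    return out

lines = ['{"first":"John", "middle":"X", "last":"Doe"}',
         '{"first":"Jane", "last":"Doe"}']
normalized = [fill_defaults(json.loads(line)) for line in lines]
```

After this pre-processing pass, every record carries all three fields, so the strict writer schema is satisfied.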

Avro performance

Here are some unscientific results on how Avro performs with various codecs, as well as versus JSON-LZO files in Hive and Impala. This testing was done using a 100 million row table generated from two random strings and an integer.

| Format    | Codec          | Data Size (bytes) | Hive count(1) time | Impala count(1) time
|-----------|----------------|-------------------|--------------------|----------------------
| JSON      | null           | 686,769,821   | not tested         | N/A                  
| JSON      | LZO            | 285,558,314   | 79s                | N/A                  
| JSON      | Deflate (gzip) | 175,878,038   | not tested         | N/A                  
| Avro      | null           | 301,710,126   | 40s                | .4s                  
| Avro      | Snappy         | 260,450,980   | 38s                | .9s                  
| Avro      | Deflate (gzip) | 156,550,144   | 64s                | 2.8s                 

So the winner appears to be Avro/Snappy or uncompressed Avro.
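The Deflate numbers track what you’d expect from a general-purpose codec: smallest output, most CPU. A toy stdlib illustration of the size side of that trade-off, on rows shaped like the test table (Snappy has no stdlib binding, so only deflate/zlib is shown):

```python
import json, random, string, zlib

random.seed(42)

def rand_str(n=10):
    return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

# Rows shaped like the benchmark table: two random strings and an integer.
rows = [{"a": rand_str(), "b": rand_str(), "n": random.randint(0, 10**6)}
        for _ in range(10000)]
raw = "\n".join(json.dumps(r) for r in rows).encode()

# Deflate at max compression: smaller output, at a CPU cost.
deflated = zlib.compress(raw, 9)
print(len(raw), len(deflated))
```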

Apache Avro

Short version

  • Avro is better than Json for storing table data
  • Avro supports schema resolution so that the schema can evolve over time
  • Hive supports Avro and schema resolution nicely
  • Impala (1.0) can read Avro tables, but does not support schema resolution
  • Mixing compression codecs in the same table works in both Hive and Impala
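Schema resolution, in essence: records written under an old schema are read under a new one, with fields the writer didn’t know about taking the reader’s defaults and fields the reader dropped being discarded. A stripped-down sketch of that rule (real Avro applies it during decoding; this just illustrates the semantics):

```python
def resolve(record, reader_schema):
    """Project a record written under an older schema onto reader_schema:
    absent fields get the reader's default, extra fields are dropped."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError("no value and no default for %r" % field["name"])
    return out

# v1 writers knew only first/last; the v2 reader adds middle with a default.
v2 = {"type": "record", "name": "person",
      "fields": [{"name": "first", "type": "string"},
                 {"name": "middle", "type": "string", "default": ""},
                 {"name": "last", "type": "string"}]}

old_record = {"first": "Jane", "last": "Doe"}
resolved = resolve(old_record, v2)
```

This is why the schema can evolve over time without rewriting old data: the reader fills the gaps at read time.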

The TL;DR version

Introduction

If you’re logging data into Hadoop to be analyzed, chances are you’re using JSON. JSON is great because it’s easy to generate in most any language, it’s human-readable, it’s universally supported and infinitely flexible.

Simple Solution to Password Reuse

Here's a KISS solution to all your password reuse problems. It requires remembering only *one* strong password, lets you have a virtually limitless number of passwords, and, most importantly, does NOT store anything anywhere or transfer anything over the network (100% browser-side JavaScript).

Stupid Simple Password Generator

Step 1:

Think of a phrase you will always remember. Keep typing until the text on the right says "strong". Punctuation, spaces, unusual words and mixed case, while not required, are generally a good idea. The most important thing is that the script considers it strong.
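The post doesn't show the script's internals, but the general shape of such browser-side generators is a one-way hash of the master phrase, usually with the site name mixed in so each site gets its own password. A hypothetical Python sketch of that idea, not the actual script's algorithm:

```python
import base64, hashlib

def site_password(master_phrase, site, length=16):
    """Derive a per-site password from one master phrase.
    Hypothetical scheme: PBKDF2-SHA256 over the phrase, salted with the site."""
    digest = hashlib.pbkdf2_hmac("sha256",
                                 master_phrase.encode(),
                                 site.encode(),
                                 100_000)
    # Base64 keeps the result typeable; trim to the desired length.
    return base64.b64encode(digest).decode()[:length]

pw = site_password("correct horse battery staple!", "example.com")
```

Because the derivation is deterministic, nothing needs to be stored or transmitted: the same phrase and site always reproduce the same password.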

Compiling Impala from GitHub

Apparently Impala has two versions of source code: one internal to Cloudera, the other available on GitHub. I’m presuming that code gets released to GitHub once it has undergone some level of internal scrutiny, but I’ve not seen any documentation on how one could tie the publicly available code to the official Impala (binary) release, currently 1.0.

Anyway, I tried compiling the GitHub code last night, and here are the steps that worked for me.

SQLite DB stored in a Redis hash

In a recent post I explained how a relational database could be backed by a key-value store by virtue of B-Trees. This sounded great in theory, but I wanted to see it actually work. And so last night I committed a change to Thredis which does exactly that.

If you’re not familiar with Thredis - it’s something I hacked together last Christmas. Thredis started out as threaded Redis, but eventually evolved into SQLite + Redis. Thredis uses a separate file to save SQLite data. But with this patch it’s no longer necessary - a SQLite DB is entirely stored in a Redis Hash object.
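The mechanics are easy to picture: SQLite sees the database as a sequence of fixed-size pages, and each page can become one field of a Redis hash, keyed by page number. A toy stdlib sketch of that mapping, with a Python dict standing in for the Redis hash (the page size here is an assumption; the real patch would use SQLite's own page size):

```python
import os, sqlite3, tempfile

PAGE_SIZE = 4096  # assumed page size for illustration

# Build a small SQLite database in a temp file.
fd, path = tempfile.mkstemp()
os.close(fd)
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO t (name) VALUES (?)", [("a",), ("b",)])
conn.commit()
conn.close()

with open(path, "rb") as f:
    db_bytes = f.read()
os.unlink(path)

# "HSET db <pageno> <page bytes>" -- a dict plays the Redis hash here.
redis_hash = {i // PAGE_SIZE: db_bytes[i:i + PAGE_SIZE]
              for i in range(0, len(db_bytes), PAGE_SIZE)}

# Reassembling the fields in page order yields the original file.
restored = b"".join(redis_hash[k] for k in sorted(redis_hash))
```

The upshot is that anything that can store small binary values under keys, a Redis hash included, can hold the whole database.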

Checking out Cloudera Impala

I decided to check out Impala last week, and here are some notes on how that went.

First thoughts

I was very impressed with how easy it was to install, even considering our unusual set up (see below). In my simple ad-hoc tests Impala performed orders of magnitude faster than Hive. So far it seems solid down to the little details, like the shell prompt with a fully functional libreadline and column headers nicely formatted.