Gregory Trubetskoy

Notes to self.

Running a WSGI App on Apache Should Not Be This Hard


If I have a Django app in /home/grisha/mysite, then all I should need to do to run it under Apache is:

$ mod_python create /home/grisha/mysite_httpd \
    --listen=8888 \
    --pythonpath=/home/grisha/mysite \
    --pythonhandler=mod_python.wsgi \
    --pythonoption="mod_python.wsgi.application mysite.wsgi::application"

$ mod_python start /home/grisha/mysite_httpd/conf/httpd.conf

That’s all. There should be no need to become root, tweak various configurations, put files in the right places, check permissions, none of that.

Well… With mod_python 3.4.0 (alpha) that’s exactly how it is…

Please help me test it.
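For reference, the mysite.wsgi::application that the --pythonoption above points to is just a plain WSGI callable. Here is a minimal sketch of what such a module typically looks like in a Django project; the settings module name is an assumption based on the project name used in this example:

# mysite/wsgi.py
import os

from django.core.wsgi import get_wsgi_application

# Point Django at the project settings (assumed to be mysite.settings here).
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

# The WSGI callable that mod_python.wsgi.application is told to load.
application = get_wsgi_application()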

The Next Smallest Step Problem


“A journey of a thousand miles begins with a single step”

Most of my journeys never begin, or cannot continue because of that one single step, be it first or next. Because it is hard, at times excruciatingly so.

Here I speak of software, but this applies to many other aspects of my life just as well.

I reckon it’s because I do not think in steps. I think of a destination. I imagine the end result. I can picture it with clarity and in great detail. I know where I need to be. But what is the next step to get there? And it doesn’t help that where I travel, there are no signs.

The problem of deciding what to do next is so common for me that I even have a name for it. I call it “The Next Smallest Step” problem. Whenever I find myself idling, clicking over to online time-wasters, I pause and ask myself “What is the Next Smallest Step? What’s the next smallest thing I can do, right now?”

It doesn’t matter how much further this small step moves me. A nanometer is better than standing still. It has to be something that is simple enough that I can just do. Right now.

I always plan to do big things that take days, weeks or months. But of all that, can I pick that one small simple and quick thing that I can do now?

Sometimes focusing on the next smallest step is so difficult that I pencil this question on a piece of paper, and sometimes I just type it on the command line or in the source code. My short version is:

$ WHAT NEXT?
bash: WHAT: command not found

(that’s right, in CAPS)

This simple question has taken me on some of the most fascinating and challenging journeys ever. In retrospect, I think I would not have been able to travel any of them without repeatedly asking it of myself, over and over again.

It has resulted in some of my most productive and gratifying days of work. Some of my greatest projects began with this question. In many instances it established what I had to do for months ahead (years, even?). All beginning with this one small question.

Conversely, not asking it often enough, if at all, has led to time going by without any results to show for it and many a great opportunity lost.

I must also note that some of my next smallest steps took days of thinking to figure out. Nothing wrong with that.

And so I thought I’d share this with you, just in case you might find it helpful. Whenever you find yourself at an impasse and progress has stopped, ask yourself:

“What is the Next Smallest Step?”

Hacking on Mod_python (Again)


Nearly eight years after my last commit to Mod_python I’ve decided to spend some time hacking on it again.

Five years after active development stopped, and thirteen years since its first release, it still seems to me an entirely useful and viable tool. The code is exceptionally clean, the documentation is amazing, and the test suite is awesome, which is a real testament to the noble efforts of all the people who contributed to its development.

We live in a new c10k world now, where Apache Httpd no longer has the market dominance it once enjoyed and the latest Python web frameworks run without requiring or recommending Mod_python. My hunch, however, is that given a thorough dusting it could be quite useful (perhaps more than ever) and applied in very interesting ways to solve new problems. I also think the Python language is at a very important inflection point. Python 3 is now mature, and it is slowly but steadily becoming the preferred language of many interesting communities, data science for example.

The current status of Mod_python as an Apache project is that it’s in the attic. This means that the ASF isn’t providing much in the way of infrastructure support any longer, nor will you see an “official” ASF release any time soon. (If ever - Mod_python would have to re-enter as an incubator project and at this point it is entirely premature to even consider such an option).

For now the main goal is to re-establish the community, and as part of that I will have to sort out how to do issue tracking, discussion groups, etc. At this point the only thing I’ve decided is that the main repository will live on github.

The latest code is in the 4.0.x branch. My initial development goal is to bring it up to compatibility with Python 2.7 and Apache Httpd 2.4 (I’m nearly there already), then potentially move on to Python 3 support. I have rolled back a few commits (most notably the new importer) because I did not understand them. There are still a few changes in Apache 2.4 that need to be addressed, but they seem relatively minor at this point. Authentication has changed significantly in 2.4, though mod_python never had much coverage in that area.

Let’s see where this takes us. And if you like this, feel free to star and fork Mod_python on GitHub and follow it on Twitter.

Json2avro


As you embark on converting vast quantities of JSON to Avro, you soon discover that things are not as simple as they seem. Here is how it might happen.

A quick Google search eventually leads you to the avro-tools jar, and you find yourself attempting to convert some JSON, such as:

{"first":"John", "middle":"X", "last":"Doe"}
{"first":"Jane", "last":"Doe"}

Having read the Avro documentation and being the clever being that you are, you start out with:

java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson input.json --schema \
 '{"type":"record","name":"whatever",
   "fields":[{"name":"first", "type":"string"},
             {"name":"middle","type":"string"},
             {"name":"last","type":"string"}]}' > output.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
        at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
        at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
        ...

A brief moment of disappointment is followed by the bliss of enlightenment: Duh, the “middle” element needs a default! And so you try again, this time having tacked on a default to the definition of “middle”, so it looks like {"name":"middle","type":"string","default":""}:

java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson input.json --schema \
 '{"type":"record","name":"whatever",
   "fields":[{"name":"first", "type":"string"},
             {"name":"middle","type":"string","default":""},
             {"name":"last","type":"string"}]}' > output.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
        at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
        ...

Why doesn’t this work? Well… You don’t understand Avro, as it turns out. You see, JSON is not Avro, and therefore the wonderful Schema Resolution thing you’ve been reading about does not apply.

But do not despair. I wrote a tool just for you: json2avro. It does exactly what you want:

json2avro input.json output.avro -s \
 '{"type":"record","name":"whatever",
   "fields":[{"name":"first", "type":"string"},
             {"name":"middle","type":"string","default":""},
             {"name":"last","type":"string"}]}'

No errors, and we have an output.avro file. Let’s see what’s in it using the aforementioned avro-tools:

java -jar ~/src/avro/java/avro-tools-1.7.4.jar tojson output.avro
{"first":"John","middle":"X","last":"Doe"}
{"first":"Jane","middle":"","last":"Doe"}

Let me also mention that json2avro is written in C and is fast. It supports the Snappy, Deflate and LZMA compression codecs, lets you pick a custom block size, and is smart enough to (optionally) skip over lines it cannot parse.

Enjoy!

Avro Performance


Here are some unscientific results on how Avro performs with various codecs, as well as versus JSON-LZO files, in Hive and Impala. The testing was done using a 100 million row table generated from two random strings and an integer.

| Format | Codec          | Data Size   | Hive count(1) time | Impala count(1) time |
|--------|----------------|-------------|--------------------|----------------------|
| JSON   | null           | 686,769,821 | not tested         | N/A                  |
| JSON   | LZO            | 285,558,314 | 79s                | N/A                  |
| JSON   | Deflate (gzip) | 175,878,038 | not tested         | N/A                  |
| Avro   | null           | 301,710,126 | 40s                | .4s                  |
| Avro   | Snappy         | 260,450,980 | 38s                | .9s                  |
| Avro   | Deflate (gzip) | 156,550,144 | 64s                | 2.8s                 |

So the winner appears to be Avro/Snappy or uncompressed Avro.

Apache Avro


Short version

  • Avro is better than Json for storing table data
  • Avro supports schema resolution so that the schema can evolve over time
  • Hive supports Avro and schema resolution nicely
  • Impala (1.0) can read Avro tables, but does not support schema resolution
  • Mixing compression codecs in the same table works in both Hive and Impala

The longer version

Introduction

If you’re logging data into Hadoop to be analyzed, chances are you’re using JSON. JSON is great because it’s easy to generate in most any language, it’s human-readable, it’s universally supported and infinitely flexible.

It is also space-inefficient and prone to errors, and the standard has a few ambiguities, all of which eventually catch up with you. It only takes one bad record to spoil a potentially massive amount of data, and finding the bad record and figuring out the root cause of the problem is usually difficult, often even impossible.

So you might be considering a slightly more rigid and space efficient format, and in the Hadoop world it is Apache Avro. Avro is especially compelling because it is supported by Impala, while JSON isn’t (not yet, at least).

Named after a British aircraft maker, Avro is a schema-enforced format for serializing arbitrary data. It is in the same category as Thrift, only it seems that Thrift has found its niche in RPC, whereas Avro appears more compelling as an on-disk format (even though both Avro and Thrift were designed for both storage and RPC). Thrift seems more insistent on you using its code generator, whereas Avro does it the old-school way, by providing libraries you can use in your code. (It does code generation as well, if that’s your thing. I prefer to hand-write all my code.)

I actually don’t want to focus on the details of what Avro is, as there is plenty of information on that elsewhere. I want to share my findings regarding Avro’s suitability as an alternative to JSON used with Hive and a JSON SerDe.

Schema (and its Resolution)

Every Avro file contains a header with the schema describing (in JSON!) the contents of the file’s records. This is very nice, because the file contains all the knowledge necessary to be able to read it.

Avro was designed with the understanding that the schema may change over time (e.g. columns added or changed), and that software designed for a newer schema may need to read older schema files. To support this it provides something called Schema Resolution.

Imagine you’ve been storing people’s names in a file. Then later on you decided to add “age” as another attribute. Now you’ve got two schemas, one with “age” and one without. In the JSON world you’d have to adjust your program to be able to read old files, with some kind of an if/then statement to make sure that when “age” is not there the program knows what to do. In Avro, the new schema can specify a default for the age (e.g. 0), and whatever Avro lib you’re using should be able to convert a record of the old schema to the new schema automatically, without any code modifications necessary. This is called schema resolution.
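To make this concrete, here is a hypothetical sketch of the two schema versions (the record and field names are made up for illustration). The old schema:

{"name":"person", "type":"record",
 "fields":[{"name":"full_name", "type":"string"}]}

and the new one, where the added "age" field carries a default so that records written with the old schema can still be resolved:

{"name":"person", "type":"record",
 "fields":[{"name":"full_name", "type":"string"},
           {"name":"age", "type":"int", "default":0}]}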

Avro support in Hive

First we need an Avro file. Our schema is just one string column named "full_name". Here’s a quick Ruby program to generate a file of 3 random records (you’ll need to gem install avro):

require 'rubygems'
require 'avro'

schema = Avro::Schema.parse('{"name":"test_record", ' +
                            ' "type":"record", ' +
                            ' "fields": [' +
                            '   {"name":"full_name",  "type":"string"}]}')

writer = Avro::IO::DatumWriter.new(schema)
file = File.open('test.avro', 'wb')
dw = Avro::DataFile::Writer.new(file, writer, schema)
3.times do
  dw << {'full_name'=>"X#{rand(10000)} Y#{rand(10000)}"}
end
dw.flush
dw.close

Then we need to create a Hive table and load our file into it:

CREATE TABLE test_avro
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
 STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
 TBLPROPERTIES (
    'avro.schema.literal'='{"name":"test_record",
                            "type":"record",
                            "fields": [
                               {"name":"full_name", "type":"string"}]}');

LOAD DATA LOCAL INPATH 'test.avro' OVERWRITE INTO TABLE test_avro;

Note that the table definition needs its own schema definition, even though our file already contains a schema. This is not a mistake. This is the schema Hive will expect. And if the file that it’s reading is of a different schema, it will attempt to convert it using Avro schema resolution. Also noteworthy is that this table defines no columns. The entire definition is in the avro.schema.literal property.

Let’s make sure this is working:

hive> select * from test_avro;
OK
X1800 Y9002
X3859 Y8971
X6935 Y5523

Now we also happen to have Impala running, let’s see if it’s able to read this file:

> select * from test_avro;
Query: select * from test_avro
Query finished, fetching results ...
+-------------+
| full_name   |
+-------------+
| X1800 Y9002 |
| X3859 Y8971 |
| X6935 Y5523 |
+-------------+

So far so good! Now let’s create a second avro file, with one additional column age, using the following Ruby:

require 'rubygems'
require 'avro'

schema = Avro::Schema.parse('{"name":"test_record", ' +
                            ' "type":"record", ' +
                            ' "fields": [' +
                            '   {"name":"full_name",  "type":"string"},' +
                            '   {"name":"age",        "type":"int"}]}')

writer = Avro::IO::DatumWriter.new(schema)
file = File.open('test2.avro', 'wb')
dw = Avro::DataFile::Writer.new(file, writer, schema)
3.times do
  dw << {'full_name'=>"X#{rand(10000)} Y#{rand(10000)}", 'age'=>rand(100)}
end
dw.flush
dw.close

Let’s load this into Hive and see if it still works. (No OVERWRITE keyword this time, we’re appending a second file to our table).

hive> LOAD DATA LOCAL INPATH 'test2.avro' INTO TABLE test_avro;
OK
hive> select * from test_avro;
OK
X1800 Y9002
X3859 Y8971
X6935 Y5523
X4720 Y1361
X4605 Y3067
X7007 Y7852

This is working exactly as expected. Hive has shown the 3 original records as before, and the 3 new ones got converted to Hive’s version of the schema, where the “age” column does not exist.

Let’s see what Impala thinks of this:

> select * from test_avro;
Query: select * from test_avro
Query finished, fetching results ...
+-------------+
| full_name   |
+-------------+
| X1800 Y9002 |
| X3859 Y8971 |
| X6935 Y5523 |
+-------------+

Alas - we’re only getting the 3 original rows. Bummer! What’s worrisome is that no indication was given to us that 3 other rows got swallowed because Impala didn’t do schema resolution. (I’ve posted regarding this on the Impala users list, awaiting response).

Now let’s alter the table schema so that age is part of it. (This is not your typical ALTER TABLE, we’re just changing avro.schema.literal).

hive> ALTER TABLE test_avro SET TBLPROPERTIES (
    'avro.schema.literal'='{"name":"test_record",
                            "type":"record",
                            "fields": [
                              {"name":"full_name", "type":"string"},
                              {"name":"age",  "type":"int"}]}');
OK
hive> select * from test_avro;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroTypeException: Found test_record, expecting test_record

This error should not be surprising. Hive is trying to provide a value for age for those records where it did not exist, but we neglected to specify a default. (The error message is a little cryptic, though). So let’s try again, this time with a default:

hive> ALTER TABLE test_avro SET TBLPROPERTIES (
    'avro.schema.literal'='{"name":"test_record",
                            "type":"record",
                            "fields": [
                              {"name":"full_name", "type":"string"},
                              {"name":"age",  "type":"int", "default":999}]}');
OK
hive> select * from test_avro;
OK
X1800 Y9002     999
X3859 Y8971     999
X6935 Y5523     999
X4720 Y1361     10
X4605 Y3067     34
X7007 Y7852     17

Woo-hoo!

Other notes

  • There is a nice Avro-C library, but it currently does not support defaults (version 1.7.4).
  • Some benchmarks
  • Avro’s integer encoding is the same as Lucene’s, with zigzag encoding on top of it. (It’s just like SQLite’s, only little-endian.) Curiously, it is used for all integers, including array indexes internal to Avro, which can never be negative and thus gain nothing from zigzag. This is probably to keep all integer operations consistent. (See the sketch after this list.)
  • A curious post on how Lucene/Avro variable-integer format is CPU-unfriendly
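Since the note above mentions it, here is a small illustrative sketch of zigzag plus variable-length (base-128) integer encoding as Avro describes it, written in Python from the specification rather than taken from any particular library:

# ZigZag + variable-length (base-128) integer encoding, a sketch for
# illustration only, not production code.

def zigzag64(n):
    # Interleave positive and negative values so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ... (for 64-bit signed longs)
    return (n << 1) ^ (n >> 63)

def varint(n):
    # Emit 7 bits at a time, least significant group first; the high bit of
    # each byte means "more bytes follow".
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_long(n):
    return varint(zigzag64(n))

# A few sample values:
assert encode_long(0) == b"\x00"
assert encode_long(-1) == b"\x01"
assert encode_long(1) == b"\x02"
assert encode_long(64) == b"\x80\x01"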

Simple Solution to Password Reuse


Here’s a KISS solution to all your password reuse problems. It requires remembering only *one* strong password, lets you have a virtually limitless number of passwords, and, most importantly, does NOT store anything anywhere or transfer anything over the network (100% browser-side Javascript).

Stupid Simple Password Generator

Step 1:

Think of a phrase you will always remember. Keep typing until the text on the right says “strong”. Punctuation, spaces, unusual words and mixed case, while not required, are generally a good idea. The most important thing is that the script considers it strong.

Make sure this passphrase is impossible to guess by people who know you, e.g. don’t pick quotes from your favorite song or movie. Don’t ever write it down or save it on your computer in any way or form!

Step 2:

Think of a short keyword describing a password, e.g. “amazon”, “gmail”, etc. This word has to be easy to remember and there is no need for it to be unique or hard to guess.

[Interactive form: Passphrase field with a strength indicator, a Verify field with a match indicator, and a table of Keyword → Password results.]

That’s it! You can regenerate any of the passwords above at any time by coming back to this page; all you need to know is the passphrase (and the keywords).

Fine print: This is a proof-of-concept, use at your own risk!

How does it work?

First, credit where it is due: this page uses Brian Turek’s excellent jsSHA Javascript SHA lib and Dan Wheeler’s amazing zxcvbn password strength checking lib. All we are doing here is computing a SHA-512 of the passphrase + keyword, then selecting a substring of the result. (We also append a 0 and/or a ! to satisfy most password checkers’ requirements for numbers and punctuation.) If you don’t trust that the generated passwords are strong, just paste one into the passphrase field; I assure you, no password here will ever be weak (or, rather, it is extremely unlikely). Some improvements could be made, but the point here is that there is no reason to keep encrypted files with your passwords, along with software to open them, around: all that’s needed is one strong password and a well-established, easily available algorithm.
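For the curious, here is the same idea sketched in Python rather than browser-side JavaScript. The 12-character cut of the hex digest and the fixed "0!" suffix are assumptions for illustration; the page’s actual substring length and suffix rules may differ.

import hashlib

def derive_password(passphrase, keyword, length=12):
    # One strong passphrase + a per-site keyword -> a deterministic password.
    digest = hashlib.sha512((passphrase + keyword).encode("utf-8")).hexdigest()
    # Append a digit and punctuation to satisfy typical password checkers.
    return digest[:length] + "0!"

# The same inputs always produce the same password, so nothing needs storing:
print(derive_password("some long, hard to guess passphrase!", "amazon"))
print(derive_password("some long, hard to guess passphrase!", "gmail"))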

Compiling Impala From Github


Apparently Impala has two versions of source code, one internal to Cloudera, the other available on Github. I’m presuming that code gets released to Github once it has undergone some level of internal scrutiny, but I’ve not seen any documentation on how one could tie the publicly available code to the official Impala (binary) release, currently 1.0.

Anyway, I tried compiling the github code last night, and here are the steps that worked for me.

My setup:

  • Linux CentOS 6.2 (running inside a VirtualBox instance on an Early 2011 MacBook, Intel i7).

  • CDH 4.3 installed using Cloudera RPM’s. No configuration was done, I just ran yum install as described in the installation guide.

  • Impala checked out from Github, HEAD is 60cb0b75 (Mon May 13 09:36:37 2013).

  • Boost 1.42, compiled and installed manually (see below).

The steps I followed:

  • Check out Impala:
git clone git://github.com/cloudera/impala.git
<...>

git branch -v
* master 60cb0b7 Fixed formatting in README
  • Install Impala prerequisites as per the Impala README, except for Boost:
sudo yum install libevent-devel automake libtool flex bison gcc-c++ openssl-devel \
    make cmake doxygen.x86_64 glib-devel python-devel bzip2-devel svn \
    libevent-devel cyrus-sasl-devel wget git unzip
  • Install LLVM. Follow the precise steps in the Impala README, it works.

  • Make sure you have the Oracle JDK 6, not OpenJDK. I found this link helpful.

  • Remove the CentOS version of Boost (1.41) if you have it. Impala needs uuid, which is only supported in 1.42 and later:

# YMMV - this is how I did it, you may want to be more cautious
sudo yum erase `rpm -qa | grep boost`
  • Download and untar Boost 1.42 from http://www.boost.org/users/history/version1420.html

  • Compile and install Boost. Note that Boost must be compiled with multi-threaded support, and the layout matters too. I ended up using the following:

cd boost_1_42_0/
./bootstrap.sh
# not sure how necessary the --libdir=/usr/lib64 is, there was a post mentioning it, i followed this advice blindly
./bjam --libdir=/usr/lib64 threading=multi --layout=tagged
sudo ./bjam --libdir=/usr/lib64 threading=multi --layout=tagged install

SQLite DB Stored in a Redis Hash


In a recent post I explained how a relational database could be backed by a key-value store by virtue of B-Trees. This sounded great in theory, but I wanted to see that it actually works. And so last night I wrote a commit to Thredis, which does exactly that.

If you’re not familiar with Thredis - it’s something I hacked together last Christmas. Thredis started out as threaded Redis, but eventually evolved into SQLite + Redis. Thredis uses a separate file to save SQLite data. But with this patch it’s no longer necessary - a SQLite DB is entirely stored in a Redis Hash object.

A very neat side-effect of this little hack is that it lets a SQLite database be automatically replicated using Redis replication.

I was able to code this fairly easily because SQLite provides a very nice way of implementing a custom Virtual File System (VFS).

Granted, this is only a proof-of-concept and not anything you should dare use anywhere near production, but it’s enough to get a little taste. So let’s start an empty Thredis instance and create a SQL table:

$ redis-cli
redis 127.0.0.1:6379> sql "create table test (a int, b text)"
(integer) 0
redis 127.0.0.1:6379> sql "insert into test values (1, 'hello')"
(integer) 1
redis 127.0.0.1:6379> sql "select * from test"
1) 1) 1) "a"
      2) "int"
   2) 1) "b"
      2) "text"
2) 1) (integer) 1
   2) "hello"
redis 127.0.0.1:6379> 

Now let’s start a slave on a different port and fire up another redis-cli to connect to it. (This means slaveof is set to localhost:6379 and slave-read-only is set to false; I won’t bore you with a paste of the config here).

$ redis-cli -p 6380
redis 127.0.0.1:6380> sql "select * from test"
1) 1) 1) "a"
      2) "int"
   2) 1) "b"
      2) "text"
2) 1) (integer) 1
   2) "hello"
redis 127.0.0.1:6380> 

Here you go - the DB’s replicated!

You can also see what SQLite data looks like in Redis (not terribly exciting):

redis 127.0.0.1:6379> hlen _sql:redis_db
(integer) 2
redis 127.0.0.1:6379> hget _sql:redis_db 0
"SQLite format 3\x00 \x00\x01\x01\x00@  \x00\x00\x00\x02\x00\x00\x00\x02\x00\x00 ...

Another potential benefit to this approach is that with not too much more tinkering the database could be backed by Redis Cluster, which would give you a fully-functional, horizontally-scalable, clustered in-memory SQL database. Of course, only the storage would be distributed, not the query processing. So this would be no match for Impala and the like, which can process queries in a distributed fashion, but still, it’s pretty cool for some 300 lines of code, n’est-ce pas?

Checking Out Cloudera Impala


I decided to check out Impala last week, and here are some notes on how that went.

First thoughts

I was very impressed with how easy it was to install, even considering our unusual set up (see below). In my simple ad-hoc tests Impala performed orders of magnitude faster than Hive. So far it seems solid down to the little details, like the shell prompt with a fully functional libreadline and column headers nicely formatted.

Installing

The first problem I encountered was that we use Cloudera tarballs in our set up, but Impala is only available as a package (RPM in our case). I tried compiling it from source, but it’s not a trivial compile - it requires LLVM (which is way cool, BTW) and has a bunch of dependencies. It didn’t work out of the box for me, so I decided to take an alternative route (I will definitely get it compiled some weekend soon).

Retrieving the contents of an RPM is trivial (because it’s really a cpio archive), and then I’d just have to “make it work”.

$ curl -O http://archive.cloudera.com/impala/redhat/6/x86_64/impala/1.0/RPMS/x86_64/impala-server-1.0-1.p0.819.el6.x86_64.rpm
$ mkdir impala
$ cd impala
$ rpm2cpio ../impala-server-1.0-1.p0.819.el6.x86_64.rpm | cpio -idmv

I noticed that usr/bin/impalad is a shell script, and it appears to rely on a few environment vars for configuration, so I created a shell script that sets them, which looks approximately like this:

export JAVA_HOME=/usr/java/default
export IMPALA_LOG_DIR= # your log dir
export IMPALA_STATE_STORE_PORT=24000
export IMPALA_STATE_STORE_HOST= # probably namenode host or whatever
export IMPALA_BACKEND_PORT=22000

export IMPALA_HOME= # full path to usr/lib/impala from the RPM, e.g. /home/grisha/impala/usr/lib/impala
export IMPALA_CONF_DIR= # config dir, e.g. /home/grisha/impala/etc/impala
export IMPALA_BIN=${IMPALA_HOME}/sbin-retail
export LIBHDFS_OPTS=-Djava.library.path=${IMPALA_HOME}/lib
export MYSQL_CONNECTOR_JAR= # full path to a mysql-connector jar

export HIVE_HOME= # your hive home - note: every Impala node needs it, just the config, not the whole Hive install
export HIVE_CONF_DIR= # this seems redundant, my guess is HIVE_HOME is enough, but whatever
export HADOOP_CONF_DIR= # path to the hadoop config, the dir that has hdfs-site.xml, etc.

export IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}"
export IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -state_store_port=${IMPALA_STATE_STORE_PORT} \
    -use_statestore \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -be_port=${IMPALA_BACKEND_PORT}"

With the above environment vars set, starting Impala should amount to the following (you probably want to run these in separate windows; also note that the state store needs to be started first):

$ ./usr/bin/statestored ${IMPALA_STATE_STORE_ARGS} # do this on IMPALA_STATE_STORE_HOST only
$ ./usr/bin/impalad ${IMPALA_SERVER_ARGS} # do this on every node

The only problem that I encountered was that Impala needed short-circuit access enabled, so I had to add the following to the hdfs-site.xml:

 <property>
   <name>dfs.client.read.shortcircuit</name>
   <value>true</value>
 </property>
 <property>
   <name>dfs.domain.socket.path</name>
<!-- adjust this to your set up: -->
   <value>/var/run/dfs_domain_socket_PORT.sock</value>
 </property>
 <property>
   <name>dfs.client.file-block-storage-locations.timeout</name>
   <value>3000</value>
 </property>
 <property>
<!-- adjust this too: -->
   <name>dfs.block.local-path-access.user</name>
   <value><!-- user name --></value>
 </property>

Once the above works, we need impala-shell to test it. Again, I pulled it out of the RPM:

$ curl -O http://archive.cloudera.com/impala/redhat/6/x86_64/impala/1.0/RPMS/x86_64/impala-shell-1.0-1.p0.819.el6.x86_64.rpm
$ mkdir shell ; cd shell
$ rpm2cpio ../impala-shell-1.0-1.p0.819.el6.x86_64.rpm | cpio -idmv

I was then able to start the shell and connect. You can connect to any Impala node (read the docs):

$ ./usr/bin/impala-shell
[localhost:21000] > connect some_node;
Connected to some_node:21000
Server version: impalad version 1.0 RELEASE (build d1bf0d1dac339af3692ffa17a5e3fdae0aed751f)
[some_node:21000] > select count(*) from your_favorite_table;
Query: select count(*) from your_favorite_table
Query finished, fetching results ...
+-----------+
| count(*)  |
+-----------+
| 302052158 |
+-----------+
Returned 1 row(s) in 2.35s

Ta-da! The above query takes a good few minutes in Hive, BTW.

Other Notes

  • Impala does not support custom SerDe’s so it won’t work if you’re relying on JSON. It does support Avro.
  • There is no support for UDF’s, so our HiveSwarm is of no use.
  • INSERT OVERWRITE works, which is good.
  • LZO support works too.
  • Security Warning: Everything Impala does will appear in HDFS as the user under which Impala is running. Be careful with this if you’re relying on HDFS permissions to prevent an accidental “INSERT OVERWRITE”, as you might inadvertently give your users superuser privileges on HDFS via Hue, for example. (Oh, did I mention Hue completely supports Impala too?) From what I can tell there is no way to set a username; this is a bit of a show-stopper for us, actually.