# Checking Out Cloudera Impala

I decided to check out Impala last week, and here are some notes on how that went.

## First thoughts

I was very impressed with how easy it was to install, even considering our unusual setup (see below). In my simple ad-hoc tests, Impala ran orders of magnitude faster than Hive. So far it seems solid down to the little details, like a shell prompt with fully functional libreadline support and nicely formatted column headers.

## Installing

The first problem I encountered was that we use Cloudera tarballs in our setup, but Impala is only available as a package (RPM in our case). I tried compiling it from source, but it’s not a trivial compile: it requires LLVM (which is way cool, BTW) and has a bunch of dependencies. It didn’t work out of the box for me, so I decided to take an alternative route (I will definitely get it compiled some weekend soon).

Retrieving the contents of an RPM is trivial (its payload is really a cpio archive), and then I’d just have to “make it work”.
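
For reference, unpacking an RPM without installing it can be sketched like this. `extract_rpm` is a hypothetical helper name, and the sketch assumes `rpm2cpio` and `cpio` are installed:

```shell
# Unpack an RPM's payload into a directory without installing the package.
# Usage: extract_rpm path/to/package.rpm [destination-dir]
extract_rpm() {
    rpm="$1"
    dest="${2:-.}"
    # Make the RPM path absolute, since we change directory below.
    case "$rpm" in
        /*) ;;
        *) rpm="$PWD/$rpm" ;;
    esac
    mkdir -p "$dest" || return 1
    # Run in a subshell so the caller's working directory is untouched.
    # cpio: -i extract, -d create directories as needed, -m keep mtimes.
    ( cd "$dest" && rpm2cpio "$rpm" | cpio -idm )
}
```

The files land under the destination with their packaged paths, so a wrapper shipped as `/usr/bin/impalad` ends up at `usr/bin/impalad` relative to it.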

I noticed that usr/bin/impalad is a shell script that appears to rely on a few environment vars for configuration, so I created a shell script that sets them.
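
One way to find out which variables such a wrapper reads is to grep it for uppercase shell variable references. A rough sketch, where `list_env_vars` is a hypothetical helper and the pattern is only a heuristic (it misses lowercase names and picks up anything that merely looks like a variable):

```shell
# Print the distinct uppercase variable names referenced as $NAME or ${NAME}
# in a shell script, one per line, sorted.
# Usage: list_env_vars usr/bin/impalad
list_env_vars() {
    grep -oE '\$[{]?[A-Z_][A-Z0-9_]*' "$1" \
        | tr -d '${' \
        | sort -u
}
```

Each name it prints is a candidate for an `export NAME=value` line in the environment script.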

With the above environment vars set, starting Impala amounts to launching the state store and then impalad. You probably want to run them in separate windows, and note that the state store needs to be started first.
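
The start order can be sketched as a small launcher. This is an assumption-heavy illustration: `start_impala` and `STATESTORE_WAIT` are hypothetical names, the fixed sleep is a crude stand-in for checking that the state store is actually up, and any daemon flags your setup needs are left out:

```shell
# Start the state store first, give it a moment, then start impalad.
# Usage: start_impala /path/to/unpacked/rpm
start_impala() {
    root="${1:-.}"
    # The state store has to be running before impalad comes up.
    "$root/usr/bin/statestored" &
    # Crude wait; STATESTORE_WAIT (seconds) is a hypothetical knob.
    sleep "${STATESTORE_WAIT:-5}"
    "$root/usr/bin/impalad" &
    # Block until both background daemons exit.
    wait
}
```

Backgrounding both daemons in one terminal interleaves their output, which is why separate windows are more pleasant while you are still debugging the setup.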

The only problem I encountered was that Impala needed short-circuit access enabled, so I had to add the corresponding setting to hdfs-site.xml.
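
A quick check that short-circuit reads are switched on might look like the following. It assumes the standard HDFS client property `dfs.client.read.shortcircuit` and a conventionally laid-out hdfs-site.xml, with the value within a couple of lines of the name. Your Hadoop version may need additional properties, and `short_circuit_enabled` is a hypothetical helper:

```shell
# Succeed (exit 0) if hdfs-site.xml sets dfs.client.read.shortcircuit to true.
# The property is expected to look roughly like:
#   <property>
#     <name>dfs.client.read.shortcircuit</name>
#     <value>true</value>
#   </property>
# Usage: short_circuit_enabled /path/to/hdfs-site.xml
short_circuit_enabled() {
    conf="$1"
    # Take the <name> line plus the two lines after it, then look for
    # <value>true</value>. This is a text match, not real XML parsing.
    grep -F -A2 '<name>dfs.client.read.shortcircuit</name>' "$conf" \
        | grep -q '<value>true</value>'
}
```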

Once the above works, we need impala-shell to test it. Again, I pulled it out of the RPM.

I was then able to start the shell and connect. You can connect to any Impala node (read the docs).
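
Connecting to a given node can be wrapped like this. `impala_connect` is a hypothetical helper; it assumes your impala-shell accepts `-i host:port` and that impalad listens on 21000, the usual default for shell connections:

```shell
# Open impala-shell against one node.
# Usage: impala_connect hostname [port]
impala_connect() {
    host="$1"
    # 21000 is assumed to be the port impalad serves shell clients on.
    port="${2:-21000}"
    impala-shell -i "$host:$port"
}
```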

Ta-da! The above query takes a good few minutes in Hive, BTW.

## Other Notes

• Impala does not support custom SerDes, so it won’t work if you’re relying on JSON. It does support Avro.
• There is no support for UDFs, so our HiveSwarm is of no use.
• INSERT OVERWRITE works, which is good.
• LZO support works too.
• Security Warning: Everything Impala does will appear in HDFS as the user under which Impala is running. Be careful with this if you’re relying on HDFS permissions to prevent an accidental “INSERT OVERWRITE”, as you might inadvertently give your users superuser privileges on HDFS via Hue, for example. (Oh, did I mention Hue completely supports Impala too?) From what I can tell, there is no way to set a username. This is a bit of a show-stopper for us, actually.