Most of my journeys never begin, or cannot continue because of that one
single step, be it first or next. Because it is hard, at times
excruciatingly so.
Here I speak of software, but this applies to many other aspects of my
life just as well.
I reckon it’s because I do not think in steps. I think of a
destination. I imagine the end-result. I can picture it with clarity
and in great detail. I know where I need to be. But what is the next
step to get there? And it doesn’t help that where I travel, there are
no signs.
The problem of deciding what to do next is so common for me that I
even have a name for it. I call it “The Next Smallest Step”
problem. Whenever I find myself idling, clicking over to online
time-wasters, I pause and ask myself “What is the Next Smallest Step?
What’s the next smallest thing I can do, right now?”
It doesn’t matter how much further this small step moves me. A
nanometer is better than standing still. It has to be something that
is simple enough that I can just do. Right now.
I always plan to do big things that take days, weeks or months. But of
all that, can I pick that one small simple and quick thing that I can
do now?
Sometimes focusing on the next smallest step is so difficult that I
pencil this question on a piece of paper, and sometimes I just type it
on the command line or in the source code. My short version is:
$ WHAT NEXT?
bash: WHAT: command not found
(that’s right, in CAPS)
This simple question has taken me on some of the most fascinating and
challenging journeys ever. In retrospect, I think I would not be able
to travel any of them without repeatedly asking it of myself, over and
over again.
It has resulted in some of my most productive and gratifying days of work. Some
of my greatest projects began with this question. In many instances it
established what I had to do for months ahead (years, even?). All
beginning with this one small question.
Conversely, not asking it often enough, if at all, has led to time
going by without any results to show for it, and many a great
opportunity lost.
I must also note that some of my next smallest steps took days of
thinking to figure out. Nothing wrong with that.
And so I thought I’d share this with you, just in case you might find
it helpful. Whenever you find yourself at an impasse and progress
has stopped, ask yourself: What is the Next Smallest Step?
Nearly eight years after my last commit
to Mod_python I’ve decided to spend some time hacking on it again.
After five years without active development and thirteen since its first
release, it still seems to me an entirely useful and viable tool. The
code is exceptionally clean, the documentation is amazing, and the
test suite is awesome. Which is a real testament to the noble efforts
of all the people who contributed to its development.
We live in this new
c10k world now where
Apache Httpd no longer has the market dominance it once enjoyed, while
the latest Python web frameworks run without requiring or recommending
Mod_python. My hunch, however, is that given a thorough dusting it
could be quite useful (perhaps more than ever) and applied in very
interesting ways to solve the new problems. I also think the Python
language is at a very important inflection point. Python 3 is now
mature, and is slowly but steadily becoming the preferred language of
many interesting communities, data science being one example.
The current status of Mod_python as an Apache project is that it’s
in the attic. This means that the
ASF isn’t providing much in the way of
infrastructure support any longer, nor will you see an “official” ASF
release any time soon. (If ever - Mod_python would have to re-enter as
an incubator project and at this point
it is entirely premature to even consider such an option).
For now the main goal is to re-establish the community, and as part of
that I will have to sort out how to do issue tracking, discussion
groups, etc. At this point the only thing I’ve decided is that the
main repository will live on
github.
The latest code is in the 4.0.x branch. My initial
development goal is to bring it up to compatibility with Python 2.7
and Apache Httpd 2.4 (I’m nearly there already), then potentially move
on to Python 3 support. I have rolled back a few commits (most notably
the new importer) because I did not understand them. There are still a
few changes in Apache 2.4 that need to be addressed, but they seem
relatively minor at this point. Authentication has been changed
significantly in 2.4, though mod_python never had much coverage in that
area.
Let’s see where this takes us. And if you like this, feel free to
star and fork Mod_python on github and follow it on Twitter:
As you embark on converting vast quantities of JSON to Avro, you soon
discover that things are not as simple as they seem. Here is how it might happen.
A quick Google
search eventually leads you to the
avro-tools
jar, and you find yourself attempting to convert some JSON, such as:
Having read Avro documentation and being the clever being that you are, you start out with:
java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson input.json --schema \
'{"type":"record","name":"whatever",
  "fields":[{"name":"first", "type":"string"},
            {"name":"middle","type":"string"},
            {"name":"last","type":"string"}]}' > output.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
        at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
        at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
        ...
A brief moment of disappointment is followed by the bliss of
enlightenment: Duh, the “middle” element needs a default! And so you try
again, this time having tacked on a default to the definition of “middle”, so it looks like {"name":"middle","type":"string","default":""}:
java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson input.json --schema \
'{"type":"record","name":"whatever",
  "fields":[{"name":"first", "type":"string"},
            {"name":"middle","type":"string","default":""},
            {"name":"last","type":"string"}]}' > output.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
        at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
        ...
Why doesn’t this work? Well… You don’t understand Avro, as it turns
out. You see, JSON is not Avro, and therefore the wonderful Schema
Resolution thing you’ve been reading about does not apply.
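To see what fromjson does accept, here is a sketch with a hypothetical input file in which every field in the schema is spelled out explicitly; this converts without complaint, because the JSON now matches the schema field-for-field:

echo '{"first":"John","middle":"","last":"Doe"}' > input.json
java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson input.json --schema \
'{"type":"record","name":"whatever",
  "fields":[{"name":"first","type":"string"},
            {"name":"middle","type":"string"},
            {"name":"last","type":"string"}]}' > output.avro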
Let me also mention that json2avro is written in C and is fast; it
supports Snappy, Deflate and LZMA compression codecs, lets you pick a
custom block size, and is smart enough to (optionally) skip over lines
it cannot parse.
Here are some un-scientific results on how Avro performs with various
codecs, as well as versus JSON-lzo files in Hive and Impala. This testing
was done using a 100 million row table that was generated using two
random strings and an integer.
The TL;DR version
Avro supports schema resolution so that the schema can evolve over time
Hive supports Avro and schema resolution nicely
Impala (1.0) can read Avro tables, but does not support schema resolution
Mixing compression codecs in the same table works in both Hive and Impala
Introduction
If you’re logging data into Hadoop to be analyzed, chances are you’re
using JSON. JSON is great because it’s easy to generate in most any
language, it’s human-readable, it’s universally supported and
infinitely flexible.
It is also space-inefficient and prone to errors, and the standard has a
few ambiguities, all of which eventually catch up with you. It only takes
one bad record to spoil a potentially massive amount of data, and
finding the bad record and figuring out the root cause of the problem
is usually difficult and often even impossible.
So you might be considering a slightly more rigid and space-efficient
format, and in the Hadoop world that format is Apache Avro.
Avro is especially
compelling because it is supported by
Impala,
while JSON isn’t (not yet, at least).
Named after a British aircraft maker, Avro is a schema-enforced
format for serializing arbitrary data. It is in the same category as
Thrift, only it seems like Thrift has found its niche in
RPC, whereas Avro appears more compelling as the on-disk
format (even though both Avro and Thrift were designed for both storage
and RPC). Thrift seems more insistent on you using its code
generator, whereas Avro does it the old-school way, by providing
libraries you can use in your code. (It does code generation as
well, if that’s your thing; I prefer to hand-write all my code.)
I actually don’t want to focus on the details of what Avro is, as there
is plenty of information on that elsewhere. I want to share my findings
regarding Avro’s suitability as an alternative to JSON used with Hive
and JSON SerDe.
Schema (and its Resolution)
Every Avro file contains a header with the schema describing (in
JSON!) the contents of the file’s records. This is very nice, because
the file contains all the knowledge necessary to be able to read it.
Avro was designed with the understanding that the schema may change
over time (e.g. columns added or changed), and that software designed
for a newer schema may need to read older schema files. To support
this it provides something called Schema Resolution.
Imagine you’ve been storing people’s names in a file. Then later on
you decide to add “age” as another attribute. Now you’ve got two
schemas, one with “age” and one without. In JSON world you’d have to
adjust your program to be able to read old files with some kind of an
if/then statement to make sure that when “age” is not there the
program knows what to do. In Avro, the new schema can specify a
default for the age (e.g. 0), and whatever Avro lib you’d be using
should be able to convert a record of the old schema to the new schema
automatically, without any code modifications necessary. This is
called schema resolution.
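As an illustration, the old and new schema versions might look something like this (record and field names here are made up; the important bit is the “default” on the new field):

{"type":"record","name":"person",
 "fields":[{"name":"name","type":"string"}]}

{"type":"record","name":"person",
 "fields":[{"name":"name","type":"string"},
           {"name":"age","type":"int","default":0}]}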
Avro support in Hive
First we need an Avro file. Our schema is just one string column named
“test”. Here’s a quick Ruby program to generate a file of 10 random
records (you’ll need to gem install avro):
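(The original Ruby snippet isn’t reproduced here; as a rough equivalent, a file with the same one-column schema can be produced with the avro-tools jar used earlier. File and record names below are just placeholders:)

for i in $(seq 1 10); do echo "{\"test\": \"row $RANDOM\"}"; done > test.json
java -jar ~/src/avro/java/avro-tools-1.7.4.jar fromjson test.json --schema \
'{"type":"record","name":"test_record",
  "fields":[{"name":"test","type":"string"}]}' > test.avro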
Note that the table definition needs its own schema definition, even
though our file already contains a schema. This is not a mistake. This
is the schema Hive will expect. And if the file that it’s reading is of
a different schema, it will attempt to convert it using Avro schema
resolution. Also noteworthy is that this table defines no
columns. The entire definition is in the avro.schema.literal
property.
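For reference, a Hive table of this shape would be declared roughly as follows (the table name and location are my assumptions; the SerDe and container format classes are the standard Hive Avro ones):

CREATE EXTERNAL TABLE avro_test
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/hive/warehouse/avro_test'
  TBLPROPERTIES ('avro.schema.literal'=
    '{"type":"record","name":"test_record",
      "fields":[{"name":"test","type":"string"}]}');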
This is working exactly as expected. Hive has shown the 3 original
records as before, and the 3 new ones got converted to Hive’s version
of the schema, where the “age” column does not exist.
Alas - we’re only getting the 3 original rows. Bummer! What’s
worrisome is that no indication was given to us that 3 other rows got
swallowed because Impala didn’t do schema resolution. (I’ve posted
regarding this on the Impala users list, awaiting response).
Now let’s alter the table schema so that age is part of it. (This is
not your typical ALTER TABLE, we’re just changing avro.schema.literal).
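With the assumed table name from above, that amounts to something along these lines (note that there is no default on “age” yet, which is what leads to the error discussed next):

ALTER TABLE avro_test SET TBLPROPERTIES ('avro.schema.literal'=
  '{"type":"record","name":"test_record",
    "fields":[{"name":"test","type":"string"},
              {"name":"age","type":"int"}]}');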
This error should not be surprising. Hive is trying to provide a value
for age for those records where it did not exist, but we neglected
to specify a default. (The error message is a little cryptic,
though). So let’s try again, this time with a default:
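(a sketch of the corrected schema literal, again with the assumed names:)

ALTER TABLE avro_test SET TBLPROPERTIES ('avro.schema.literal'=
  '{"type":"record","name":"test_record",
    "fields":[{"name":"test","type":"string"},
              {"name":"age","type":"int","default":0}]}');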
Avro’s integer encoding is the same as Lucene’s, with zigzag encoding
on top of it. (It’s just like SQLite, only little-endian). Curiously,
it is used for all integers, including array indexes internal to
Avro, which can never be negative and thus zigzag is of no use. This
is probably to keep all integer operations consistent.
A curious post on how Lucene/Avro variable-integer format is CPU-unfriendly
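For the curious, zigzag simply interleaves negative and positive values so that small magnitudes encode in few bytes; a quick shell sanity check of the mapping (relying on 64-bit signed arithmetic):

$ for n in 0 -1 1 -2 2 -64; do echo "$n -> $(( (n << 1) ^ (n >> 63) ))"; done
0 -> 0
-1 -> 1
1 -> 2
-2 -> 3
2 -> 4
-64 -> 127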
Here’s a KISS solution to all your password reuse
problems. It requires remembering only *one* strong password, lets you
have a virtually limitless number of passwords, and, most importantly,
does NOT store anything anywhere or transfer anything over the
network (100% browser-side Javascript).
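The general idea behind schemes like this (I am sketching the concept here, not this particular tool’s exact algorithm) is to deterministically derive each site password from the master password plus the site name, for example:

$ echo -n "my-master-password:example.com" | openssl dgst -sha256 -binary | base64 | cut -c1-16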
Apparently Impala has two versions of source code, one internal to
Cloudera, the other available on Github. I’m presuming that code gets
released to Github once it has undergone some level of internal scrutiny,
but I’ve not seen any documentation on how one could tie the publicly
available code to the official Impala (binary) release, currently 1.0.
Anyway, I tried compiling the github code last night, and here are the
steps that worked for me.
My setup:
Linux CentOS 6.2 (running inside a VirtualBox instance on an Early 2011 MacBook, Intel i7).
CDH 4.3 installed using Cloudera RPM’s. No configuration was done, I
just ran yum install as described in the installation guide.
Impala checked out from Github, HEAD is 60cb0b75 (Mon May 13 09:36:37 2013).
Boost 1.42, compiled and installed manually (see below).
Install LLVM. Follow the precise steps in the Impala README, it works.
Make sure you have the Oracle JDK 6, not OpenJDK. I found this link helpful.
Remove the CentOS version of Boost (1.41) if you have it. Impala
needs uuid, which is only supported in 1.42 and later:
# YMMV - this is how I did it, you may want to be more cautious
sudo yum erase `rpm -qa | grep boost`
Download and untar Boost 1.42 from [http://www.boost.org/users/history/version_1_42_0.html](http://www.boost.org/users/history/version_1_42_0.html)
Compile and install Boost. Note that Boost must be compiled with multi-threaded support and the layout matters too. I ended up using the following:
cd boost_1_42_0/
./bootstrap.sh
# not sure how necessary the --libdir=/usr/lib64 is, there was a post mentioning it, I followed this advice blindly
./bjam --libdir=/usr/lib64 threading=multi --layout=tagged
sudo ./bjam --libdir=/usr/lib64 threading=multi --layout=tagged install
In a recent post
I explained how a relational database could be backed by a key-value
store by virtue of B-Trees. This sounded great in theory, but I wanted
to see that it actually works. And so last night I wrote a
commit to
Thredis, which does exactly that.
If you’re not familiar with Thredis - it’s something I hacked together
last Christmas. Thredis started out as threaded Redis, but eventually
evolved into SQLite + Redis. Thredis uses a separate file to save
SQLite data. But with this patch it’s no longer necessary - a SQLite
DB is entirely stored in a Redis Hash object.
A very neat side-effect of this little hack is that it lets a SQLite
database be automatically replicated using Redis replication.
I was able to code this fairly easily because SQLite provides a very nice way of
implementing a custom Virtual File System (VFS).
Granted, this is only a proof-of-concept and not anything you should dare
use anywhere near production, but it’s enough to get a little taste. So
let’s start an empty Thredis instance and create a SQL table:
$ redis-cli
redis 127.0.0.1:6379> sql "create table test (a int, b text)"
(integer) 0
redis 127.0.0.1:6379> sql "insert into test values (1, 'hello')"
(integer) 1
redis 127.0.0.1:6379> sql "select * from test"
1) 1) 1) "a"
2) "int"
2) 1) "b"
2) "text"
2) 1) (integer) 1
2) "hello"
redis 127.0.0.1:6379>
Now let’s start a slave on a different port and fire up another
redis-cli to connect to it. (This means slaveof is set to
localhost:6379 and slave-read-only is set to false; I won’t bore you
with a paste of the config here).
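Assuming the slave listens on port 6380 (my assumption, any free port will do), the replicated table should be queryable there right away:

$ redis-cli -p 6380
redis 127.0.0.1:6380> sql "select * from test"

which should come back with the same “hello” row that was inserted on the master.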
Another potential benefit to this approach is that with not too much
more tinkering the database could be backed by
Redis Cluster, which would give you a
fully-functional horizontally-scalable clustered in-memory SQL
database. Of course, only the store would be distributed, not the
query processing, so this would be no match for Impala and the like,
which can process queries in a distributed fashion. But still, it’s
pretty cool for some 300 lines of code, n’est-ce pas?
I decided to check out
Impala
last week, and here are some notes on how that went.
First thoughts
I was very impressed with how easy it was to install, even considering
our unusual set up (see below). In my simple ad-hoc tests Impala
performed orders of magnitude faster than Hive. So far it seems solid
down to the little details, like the shell prompt with a fully
functional libreadline and column headers nicely formatted.
Installing
The first problem I encountered was that we use Cloudera
tarballs
in our set up, but Impala is only available as a package (RPM in our
case). I tried compiling it from
source, but it’s not a trivial
compile - it requires LLVM (which is way cool,
BTW) and has a bunch of dependencies; it didn’t work out-of-the-box
for me, so I decided to take an alternative route (I will definitely get it compiled some weekend soon).
Retrieving the contents of an RPM is trivial (because it’s really a cpio
archive), and then I’d just have to “make it work”.
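(For reference, extracting an RPM without installing it goes something like this; the package filename below is just a placeholder:)

$ rpm2cpio impala-1.0-1.el6.x86_64.rpm | cpio -idmv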
I noticed that usr/bin/impalad is a shell script, and it appears to
rely on a few environment vars for configuration, so I created a shell
script that sets them which looks approximately like this:
export JAVA_HOME=/usr/java/default
export IMPALA_LOG_DIR=        # your log dir
export IMPALA_STATE_STORE_PORT=24000
export IMPALA_STATE_STORE_HOST=   # probably namenode host or whatever
export IMPALA_BACKEND_PORT=22000
export IMPALA_HOME=           # full path to usr/lib/impala from the RPM, e.g. /home/grisha/impala/usr/lib/impala
export IMPALA_CONF_DIR=       # config dir, e.g. /home/grisha/impala/etc/impala
export IMPALA_BIN=${IMPALA_HOME}/sbin-retail
export LIBHDFS_OPTS=-Djava.library.path=${IMPALA_HOME}/lib
export MYSQL_CONNECTOR_JAR=   # full path to a mysql-connector jar
export HIVE_HOME=             # your hive home - note: every impala node needs it, just config, not the whole Hive install
export HIVE_CONF_DIR=         # this seems redundant, my guess HIVE_HOME is enough, but whatever
export HADOOP_CONF_DIR=       # path to the hadoop config, the dir that has hdfs-site.xml, etc.
export IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}"
export IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -state_store_port=${IMPALA_STATE_STORE_PORT} \
    -use_statestore \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -be_port=${IMPALA_BACKEND_PORT}"
With the above environment vars set, starting Impala should amount to
the following (you probably want to run those in separate windows, also note that
the state store needs to be started first):
$ ./usr/bin/statestored ${IMPALA_STATE_STORE_ARGS}   # do this on IMPALA_STATE_STORE_HOST only
$ ./usr/bin/impalad ${IMPALA_SERVER_ARGS}            # do this on every node
The only problem that I encountered was that Impala needed
short-circuit access enabled, so I had to add the following to the hdfs-site.xml:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <!-- adjust this to your set up: -->
  <value>/var/run/dfs_domain_socket_PORT.sock</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>
<property>
  <!-- adjust this too: -->
  <name>dfs.block.local-path-access.user</name>
  <value><!-- user name --></value>
</property>
Once the above works, we need impala-shell to test it. Again, I pulled it out of the RPM:
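(Same rpm2cpio trick as before; the impala-shell package name is a placeholder:)

$ rpm2cpio impala-shell-1.0-1.el6.x86_64.rpm | cpio -idmv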
I was then able to start the shell and connect. You can connect to any Impala node (read the docs):
$ ./usr/bin/impala-shell
[localhost:21000] > connect some_node;
Connected to some_node:21000
Server version: impalad version 1.0 RELEASE (build d1bf0d1dac339af3692ffa17a5e3fdae0aed751f)
[some_node:21000] > select count(*) from your_favorite_table;
Query: select count(*) from your_favorite_table
Query finished, fetching results ...
+-----------+
| count(*) |
+-----------+
| 302052158 |
+-----------+
Returned 1 row(s) in 2.35s
Ta-da! The above query takes a good few minutes in Hive, BTW.
Other Notes
Impala does not support custom SerDe’s so it won’t work if you’re relying on JSON. It does support Avro.
There is no support for UDF’s, so our HiveSwarm is of no use.
INSERT OVERWRITE works, which is good.
LZO support works too.
Security Warning: Everything Impala does will appear in HDFS as
the user under which Impala is running. Be careful with this if
you’re relying on HDFS permissions to prevent an accidental “INSERT
OVERWRITE”, as you might inadvertently give your users superuser
privs on HDFS via Hue, for example. (Oh, did I mention Hue completely
supports Impala too?) From what I can tell there is no way to set a
username; this is a bit of a show-stopper for us, actually.