Relational database on top of key-value store explained (or why B-trees are cool)

This post attempts to explain how a relational database can be implemented atop a key/value store, a subject that I’ve long found rather mysterious.

Every once in a while I would come across a mention that a relational database can be implemented using a key/value store (aka dictionary, hash table or hash map - for brevity I’ll be using map from here on).

Whenever I thought about it, it just didn’t make sense. A relational database needs to store rows in order, and that’s one feature that maps do not provide. Imagine we have a table keyed by employee id stored in a map and we need to traverse it by id in ascending order. A hypothetical keys() method would return us a list of ids ordered randomly. It’s impossible to iterate over a hash map in order. So how would a relational database work then?

MapJoin: a simple way to speed up your Hive queries

Mapjoin is a little-known feature of Hive. It allows a table to be loaded into memory so that a (very fast) join could be performed entirely within a mapper without having to use a Map/Reduce step. If your queries frequently rely on small table joins (e.g. cities or countries, etc.) you might see a very substantial speed-up from using mapjoins.

There are two ways to enable it. First is by using a hint, which looks like /*+ MAPJOIN(aliasname), MAPJOIN(anothertable) */. This C-style comment should be placed immediately following the SELECT. It directs Hive to load aliasname (which is a table or alias of the query) into memory.

Linus on understanding pointers

A while back Linus Torvalds answered questions on Slashdot.

One of the answers was on the subject of understanding of pointers:

At the opposite end of the spectrum, I actually wish more people understood the really core low-level kind of coding. Not big, complex stuff like the lockless name lookup, but simply good use of pointers-to-pointers etc. For example, I’ve seen too many people who delete a singly-linked list entry by keeping track of the “prev” entry, and then to delete the entry, doing something like

Storm Notes

Some random thoughts on having tinkered with Storm over the past few weeks.

It took me some time to understand what Storm is, and I am still not clear I have found a perfect use for it. (This is not a criticism of Storm, the point is that the concepts it introduces are new, somewhat diffuclt and will need some time so sync in). The best way to get the basic understanding of Storm concepts is to watch Nathan Marz’s excellent presentation.

on prioritization - important vs urgent

Every item on a TODO list can be classified as urgent, important or neither. We should act on important items first, urgent second and ignore the rest.

Sometimes an item lands on our TODO list described as extremely urgent without any explanation of importance. In this case the important item (and thus to be done first) becomes determining the importance of the extremely urgent item in question, even if it means delaying it.

On keeping lots of integers in memory

Once upon a time (over a year ago) I found myself needing to store large numbers of integers in memory. The goal was to store a graph of all our purchasers and items purchased, so that we could quickly identify like-minded purchasers based on common purchases and make real-time recommendations of the form “people like you also bought”. This approach is commonly known as collaborative filtering, and exactly how we did it would be a subject of some future post (perhaps).