Update: there is now a better write up of the PostgreSQL schema. This post was rather half-baked as much was still not understood when I wrote it.
In a previous post I described a simplistic schema to store the Bitcoin blockchain in PostgreSQL. In this post I’m investigating pushing the envelope with a bit of C programming.
The Missing Functionality
Postgres cannot do certain things required to fully handle transactions. The missing functionality is (at least):
-
Support for Variable Length Integer used in the blockchain and more generally the binary encoding of a transaction or its components.
-
Elliptic Curve Signature. Even though postgres integrates with OpenSSL, which has that functionality, there is no way to call the EC functions.
-
Ability to parse and evaluate Bitcoin script. This is a biggie, as transaction verification requires it, and it is one of the more complex and bug-prone aspects of Bitcoin.
It is also important that all of the above be performant. Even though varints, script and even elliptic curve could be implemented in plain PL/pgSQL, it probably wouldn’t be fast enough for practical use. Which leaves us with the only possible option: a C extension.
Avoid Reinventing the Wheel
Anything is possible in C, but can we avoid having to reimplement it from scratch? Are there libraries that could be leveraged?
As it is now, the Bitcoin protocol is primarily specified by its source code, and the source of all truth is the Bitcoin Core. It is possible to use C++ in PG extensions, which means at least in theory the Bitcoin Core code could be leveraged somehow.
My initial conclusion is that this would be a daunting task. Bitcoin Core code requires at least C++11, as well as Boost. It also seems that the core code assumes its own specific storage and caching mechanism and isn’t easily abstracted away from it. Not to mention that using C++ libs from Postgres has complexities of its own.
I looked around for a plain C implementation of Bitcoin and found a few rather incomplete ones. The most functional one seems to be Jeff Garzik’s picocoin. With the looming Segwit2x fork and all the controversy surrounding it this may seem like an odd choice of a library, but for the purpose of what we are doing, I think it’s fine. It also seems like Picocoin isn’t actively developed, which is not great. I would very much appreciate opinions/advice on this, if you know of a better C lib, do leave a comment.
The C extension
Thanks to this excellent series of posts and Postgres’ superb documentation, I was able to put together a proof-of-concept extension, available at https://github.com/blkchain/pg_blkchain. While the C internals of it would be subject for a whole separate post (or few), suffice it to say that it is fairly rudimentary and all the heavy lifting is delegated to the picocoin lib.
As of now, the extension provides a handful of functions:
-
get_vin(tx bytea)
This is a Set Returning Function (SRF), which returns the transaction inputs as rows. -
get_vout(tx bytea)
Similarly to get_vin(), an SRF that returns outputs. -
parse_script(script bytea)
An SRF which parses a Bitcoin script and returns (more or less) human-readable rows. -
verify_sig(tx bytea, previous_tx bytea, n int)
Verifies a specific input of a transaction (denoted byn
), given a the previous transaction to which the input refers. Returns a boolean.
This is hardly enough to support all of what would be required by a full node, but this is sufficient to do some interesting stuff.
Note that the function names and signatures are not final, this is a work in progress and I expect this all to evolve and change. For example, initially I implemented get_vout() which returned an array, but in the end an SRF seemed like a more flexible approach.
The Schema
In the last post I used separate tables for the transaction, inputs and outputs. With the ability to serialize/deserialize transactions at our disposal, there are more interesting options.
The most compact way to store transactions is to just use the serialized binary form in a binary (bytea) column. We can get at any particulars of it by using our functions.
The examples below are based on a single table created as
1 2 3 4 |
|
I imported the first 100K blocks or so into this table, how it was done I might describe in a separate post.
I’ll introduce the extension with my favorite example: the decoding of the signature of the genesis block input:
1 2 3 4 5 6 7 8 9 10 |
|
Expression Indexes
One neat feature of PostgreSQL is ability to index expressions. For example, we know that we can compute a transaction hash with
1 2 3 4 |
|
Note that this is little-endian, while conventionally transaction id’s are represented with bytes reversed (big-endian): fd6445c8eb3c4ac147ad06d314a9efe153e9752bab5f994483309e024ab0296e
Now if we want to be able to look up transactions quickly by the transaction hash, as is the convention, we can create an expression index like so:
1
|
|
When we do this, PostgreSQL scans the entire table, computes the hash and stores it in the index. An index, after all, is just another table (of sorts), and there is nothing wrong with indexes containing values that do not exist in the table to which the index refers.
Once we do this, any time the expression digest(digest(tx, 'sha256'), 'sha256')
is used in reference to the rtxs
table, PostgreSQL will not execute
the digest()
function, but would instead use the value stored in
the index.
We can attest to this with
1 2 3 4 5 6 7 8 9 10 |
|
This is pretty clever - even though we do not have an actual “transaction hash” column in our table, we do have the value and an index in the database.
Views
But what if we wanted to have a better readable representation of transactions, for example something that includes the transaction hash?
The best way to do this is via a view:
1 2 3 |
|
Postgres is clever enough to use the above index for the view:
1 2 3 4 5 6 7 |
|
A similar technique can applied to inputs and outputs, for example for outputs we could create a view like so:
1 2 3 |
|
The outputs are now easily accessibly as:
1 2 3 4 5 6 7 |
|
Want to know the most popular opcode used in scripts?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
Anyway, that’s it for now. Please comment your questions/comments below, or via twitter, I am very curious on what people think on this approach!