
How InfluxDB Stores Data


A nice, reliable, horizontally scalable database that is designed specifically to tackle the problem of Time Series data (and does not require you to stand up a Hadoop cluster) is very much missing from the Open Source Universe right now.

InfluxDB might be able to fill this gap; it certainly aims to.

I was curious about how it structures and stores data, and since there wasn’t much documentation on the subject, I ended up just reading the code and figured I’d write this up. I only looked at the new version (currently 0.9.0, in the RC stage); the previous versions are significantly different.

First of all, InfluxDB is distributed. You can run one node, or a bunch, it seems like a more typical number may be 3 or 5. The nodes use Raft to establish consensus and maintain data consistency.

InfluxDB feels a little like a relational database in some aspects (e.g. it has a SQL-like query language) but not in others.

The top level container is a database. An InfluxDB database is very much like a database in MySQL: it’s a collection of other things.

“Other things” are called data points, series, measurements, tags and retention policies. Under the hood (i.e. you never deal with them directly) there are shards and shard groups.

The very first thing you need to do in InfluxDB is create a database and at least one retention policy for this database. Once you have these two things, you can start writing data.

A retention policy is the time period after which the data expires. It can be set to be infinite. A data point, which is a measurement consisting of any number of values and tags associated with a particular point in time, must be associated with a database and a retention policy. A retention policy also specifies the replication factor for the data point.
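For example, both can be created with InfluxQL statements. Here is a minimal sketch in Go, assuming a node listening on the default port 8086 and the HTTP /query endpoint; the database “foo” and retention policy “bar” match the example below:

package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// runQuery sends an InfluxQL statement to the (assumed) /query endpoint.
func runQuery(q string) {
	resp, err := http.Get("http://localhost:8086/query?q=" + url.QueryEscape(q))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(q, "->", resp.Status)
}

func main() {
	// Create the database, then a retention policy that keeps data for
	// one day with a replication factor of 1 and makes it the default.
	runQuery("CREATE DATABASE foo")
	runQuery("CREATE RETENTION POLICY bar ON foo DURATION 1d REPLICATION 1 DEFAULT")
}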

Let’s say we are tracking disk usage across a whole bunch of servers. Each server runs some sort of agent which periodically reports the usage of each disk to InfluxDB. Such a report might look like this (in JSON):

{"database" : "foo", "retentionPolicy" : "bar",
 "points" : [
   {"name" : "disk",
    "tags" : {"server" : "bwi23", "unit" : "1"},
    "timestamp" : "2015-03-16T01:02:26.234Z",
    "fields" : {"total" : 100, "used" : 40, "free" : 60}}]}

In the above example, “disk” is a measurement. Thus we can operate on anything “disk”, regardless of what “server” or “unit” it applies to. The data point as a whole belongs to a (time) series identified by the combination of the measurement name and the tags.

There is no need to create series or measurements; they are created on the fly.
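Writing a point is just an HTTP POST of JSON like the one above; a minimal sketch in Go, again assuming the default port 8086 and the 0.9 /write endpoint:

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// The same data point as in the JSON example above.
	payload := `{"database" : "foo", "retentionPolicy" : "bar",
 "points" : [
   {"name" : "disk",
    "tags" : {"server" : "bwi23", "unit" : "1"},
    "timestamp" : "2015-03-16T01:02:26.234Z",
    "fields" : {"total" : 100, "used" : 40, "free" : 60}}]}`

	resp, err := http.Post("http://localhost:8086/write", "application/json",
		strings.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("write:", resp.Status)
}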

To list the measurements, we can use SHOW MEASUREMENTS:

> show measurements
name            tags    name
----            ----    ----
measurements            disk

We can use SHOW SERIES to list the series:

> show series
name    tags    id      server   unit
----    ----    --      -------  ----
disk            1       bwi23    1

If we send a record that contains different tags, we automatically create a different series (or so it seems). For example, if we send this (note we replaced the “unit” tag with “foo”):

{"database" : "foo", "retentionPolicy" : "bar",
 "points" : [
   {"name" : "disk",
    "tags" : {"server" : "bwi23", "foo" : "bar"},
    "timestamp" : "2015-03-16T01:02:26.234Z",
    "fields" : {"total" : 100, "used" : 40, "free" : 60}}]}

we get

> show series
name    tags    id      foo     server  unit
----    ----    --      ---     ------  ----
disk            1               bwi23   1
disk            2       bar     bwi23

This is where the distinction between measurement and series becomes a little confusing to me. In actuality (from looking at the code and the actual files InfluxDB created), there is only one series here, called “disk”. I understand the intent, but I’m not sure that “series” is the right terminology here. I think I’d prefer it if measurements were simply called series, and to get the equivalent of SHOW SERIES you’d use something like SHOW SERIES TAGS. (Maybe I’m missing something.)

Under the hood the data is stored in shards, which are grouped into shard groups, which in turn are grouped by retention policies, and finally by databases.
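Loosely, the hierarchy could be pictured with types like these (purely illustrative - these are not InfluxDB’s actual structures):

package sketch

import "time"

// A purely illustrative sketch of the storage hierarchy described above.
type Database struct {
	Name              string
	RetentionPolicies []RetentionPolicy
}

type RetentionPolicy struct {
	Name        string
	Duration    time.Duration // how long data is kept before it expires
	ReplicaN    int           // replication factor
	ShardGroups []ShardGroup
}

type ShardGroup struct {
	ID     uint64
	Shards []Shard
}

type Shard struct {
	ID      uint64
	NodeIDs []uint64 // the nodes that hold a copy of this shard
}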

A database contains one or more retention policies. Somewhat surprisingly a retention policy is actually a bucket. It makes sense if you think about the problem of having to expire data points - you can remove them all by simply dropping the entire bucket.

If we declare a retention policy of 1 day, then we can logically divide the timeline into a sequence of single days from the beginning of the epoch. Any incoming data point falls into its corresponding segment, which is a retention policy bucket. When clean-up time comes around, we can delete all days except for the most current day.
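For example, the 1-day bucket a point falls into can be found by simply truncating its timestamp to a day boundary (a sketch of the idea, not InfluxDB’s actual code):

package main

import (
	"fmt"
	"time"
)

func main() {
	bucket := 24 * time.Hour // the retention policy duration: one day

	// Timestamp from the example point above.
	ts, _ := time.Parse(time.RFC3339Nano, "2015-03-16T01:02:26.234Z")

	// Truncate to the day boundary (midnight UTC) to find the segment
	// of the timeline this point belongs to.
	start := ts.Truncate(bucket)
	end := start.Add(bucket)
	fmt.Println("point falls into bucket", start, "to", end)

	// Expiring data is then just a matter of dropping the buckets that
	// are older than the retention period.
}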

To better understand the following paragraphs, consider that having multiple nodes provides the option for two things: redundancy and distribution. Redundancy gives you the ability to lose a node without losing any data. The number of copies of the data is controlled by the replication factor specified as part of the retention policy. Distribution spreads the data across nodes which allows for concurrency: data can be written, read and processed in parallel. For example if we become constrained by write performance, we can solve this by simply adding more nodes. InfluxDB favors redundancy over distribution when having to choose between the two.

Each retention policy bucket is further divided into shard groups, one shard group per series. The purpose of a shard group is to balance series data across the nodes of the cluster. If we have a cluster of 3 nodes, we want the data points to be evenly distributed across these nodes. InfluxDB will create 3 shards, one on each of the nodes. The 3 shards comprise the shard group. This is assuming the replication factor is 1.

But if the replication factor were 2, then there would need to be 2 identical copies of every shard. The shard copies must be on separate nodes. With 3 nodes and a replication factor of 2, it is impossible to do any distribution across the nodes - the shard group will have a size of 1, and contain 1 shard, replicated across 2 nodes. In this setup, the third node will have no data for this particular retention policy.

If we had a cluster of 5 nodes and a replication factor of 2, then the shard group could have a size of 2, for 2 shards, replicated across 2 nodes each. Shard one replicas could live on nodes 1 and 3, while shard two replicas on nodes 2 and 4. Now the data is distributed as well as redundant. Note that the 5th node doesn’t do anything. If we up the replication factor to 3, then just like before, the cluster is too small to have any distribution; we only have enough nodes for redundancy.
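To make the balancing idea concrete: you can think of a point being assigned to a shard within its group by hashing its series key (the measurement plus the tags) modulo the number of shards in the group. This is just a sketch of the concept, not InfluxDB’s actual code:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// seriesKey builds an identifier for a series out of the measurement
// name and its tag pairs, e.g. "disk,server=bwi23,unit=1".
func seriesKey(measurement string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort so the same tag set always yields the same key
	key := measurement
	for _, k := range keys {
		key += "," + k + "=" + tags[k]
	}
	return key
}

func main() {
	key := seriesKey("disk", map[string]string{"server": "bwi23", "unit": "1"})

	// Pick one of the shards in the group by hashing the series key.
	h := fnv.New64a()
	h.Write([]byte(key))
	numShards := uint64(2) // e.g. a shard group of size 2
	shard := h.Sum64() % numShards

	fmt.Printf("series %q -> shard %d of %d\n", key, shard, numShards)
}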

As of RC15 distributed queries are not yet implemented, so you will always get an error if you have more than one shard in a group.

The shards themselves are instances of Bolt - a simple-to-use key/value store written in Go. There is also a separate Bolt file called meta which stores the metadata, i.e. information about databases, retention policies, measurements, series, etc.
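Bolt itself is very easy to use. Here is a tiny example of writing and reading a key with the github.com/boltdb/bolt package (the bucket name and key below are made up for illustration - they are not what InfluxDB actually stores):

package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// Open (or create) a Bolt database file.
	db, err := bolt.Open("example.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a key/value pair inside a read-write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("points"))
		if err != nil {
			return err
		}
		return b.Put([]byte("2015-03-16T01:02:26.234Z"), []byte(`{"used": 40}`))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back in a read-only transaction.
	db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("points")).Get([]byte("2015-03-16T01:02:26.234Z"))
		fmt.Printf("value: %s\n", v)
		return nil
	})
}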

I couldn’t quite figure out the process for typical cluster operations, such as recovery from node failure, what happens (or should happen) when nodes are added to an existing cluster, whether there is a way to decommission a node or re-balance the cluster similar to the Hadoop balancer, etc. I think as of this writing this has not been fully implemented yet, and there is no documentation, but hopefully it’s coming soon.
