Deploying a Golang App to AWS ECS With Terraform

I’ve put together a basic example of a “Hello World” Go program running in Amazon AWS EC2 Container Service (ECS), which runs applications in Docker containers and can scale on demand.

I initially wanted to write about the components of this system and the tools used to deploy the application, but soon realized that it would make for an extremely long post: the number of components required for a simple “Hello World” is mind-boggling. However problematic it may seem, it’s par for the course; this is what it takes to run an application in our cloudy times.

I used Terraform to build all the AWS infrastructure. Initially I was skeptical about how well it could accomplish such a tedious task, but I have to say my confidence in Terraform grew the more I used it.

The main top-level tool for everything is good old make, a tool that has stood the test of time.

Here is the code for the example; read the README. I hope you find it useful:

https://github.com/grisha/hello-go-ecs-terraform

Holt-Winters Forecasting for Dummies - Part III

If you haven’t read Part I and Part II you probably should, or the following will be hard to make sense of.

In Part I we learned how to forecast one point, and in Part II how to forecast two. In this part we’ll learn how to forecast many points.

More Terminology

Season

If a series appears to be repetitive at regular intervals, such an interval is referred to as a season, and the series is said to be seasonal. Seasonality is required for the Holt-Winters method to work; non-seasonal series (e.g. stock prices) cannot be forecasted using this method (it would be nice if they could, though).

Season Length

Season length is the number of data points after which a new season begins. We will use $L$ to denote season length.

Seasonal Component

The seasonal component is an additional deviation from level + trend that repeats itself at the same offset into the season. There is a seasonal component for every point in a season, i.e. if your season length is 12, there are 12 seasonal components. We will use $s$ to denote the seasonal component.

Triple Exponential Smoothing a.k.a. Holt-Winters Method

The idea behind triple exponential smoothing is to apply exponential smoothing to the seasonal components in addition to level and trend. The smoothing is applied across seasons, e.g. the seasonal component of the 3rd point into the season would be exponentially smoothed with the one from the 3rd point of last season, the 3rd point two seasons ago, etc. In math notation we now have four equations:
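$$\ell_x = \alpha(y_x - s_{x-L}) + (1-\alpha)(\ell_{x-1} + b_{x-1}) \qquad\text{(level)}$$

$$b_x = \beta(\ell_x - \ell_{x-1}) + (1-\beta)b_{x-1} \qquad\text{(trend)}$$

$$s_x = \gamma(y_x - \ell_x) + (1-\gamma)s_{x-L} \qquad\text{(seasonal component)}$$

$$\hat{y}_{x+m} = \ell_x + mb_x + s_{x-L+1+((m-1) \bmod L)} \qquad\text{(forecast)}$$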

What’s new:

• We now have a third Greek letter, $\gamma$ (gamma), which is the smoothing factor for the seasonal component.
• The expected value index is $x+m$, where $m$ can be any integer, meaning we can forecast any number of points into the future (woo-hoo!).
• The forecast equation now consists of level, trend and the seasonal component.

The index of the seasonal component of the forecast, $s_{x-L+1+((m-1) \bmod L)}$, may appear a little mind-boggling, but it’s just the offset into the last set of seasonal components computed from observed data. (I.e. if we are forecasting the 3rd point into the season 45 seasons into the future, we cannot use seasonal components from the 44th season in the future, since that season is also forecasted; we must use the last set of seasonal components from observed points, or from “the past” if you will.) It looks much simpler in Python, as you’ll see shortly.

Initial Values

Before we can discuss initial values, let me introduce to you a new tiny series (okay, not as tiny):

This is what it looks like:

You can see that this series is seasonal; there are 6 clearly visible seasons. Although perhaps not easily apparent from the picture, the season length for this series is 12, i.e. it “repeats” every 12 points. In order to apply triple exponential smoothing we need to know what the season length is. (There do exist methods for detecting seasonality in series, but that is way beyond the scope of this text.)

Initial Trend

For double exponential smoothing we simply used the first two points for the initial trend. With seasonal data we can do better than that, since we can observe many seasons and can extrapolate a better starting trend. The most common practice is to compute the average of trend averages across seasons.
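$$b_0 = \frac{1}{L}\left(\frac{y_{L+1}-y_1}{L} + \frac{y_{L+2}-y_2}{L} + \cdots + \frac{y_{2L}-y_L}{L}\right)$$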

Good news - this looks simpler in Python than in math notation:
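Something along these lines (a sketch, with series holding the observed values and slen the season length):

    def initial_trend(series, slen):
        # average of the point-to-point trends, taken one season apart
        total = 0.0
        for i in range(slen):
            total += float(series[i + slen] - series[i]) / slen
        return total / slen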

Initial Seasonal Components

The situation is even more complicated when it comes to initial values for the seasonal components. Briefly: we need to compute the average level of every observed season, subtract that average from every value in the season, and finally average these deviations at each seasonal offset across our observed seasons. If you want more detail, here is one thorough description of this process.

I will forgo the math notation for initial seasonal components, but here it is in Python. The result is a season-length array of seasonal components.
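A sketch of it, under the same assumptions as above; the result maps each offset into the season to its initial component:

    def initial_seasonal_components(series, slen):
        seasonals = {}
        n_seasons = int(len(series) / slen)
        # average level of each observed season
        season_averages = [sum(series[slen*j:slen*j+slen]) / float(slen)
                           for j in range(n_seasons)]
        # average the deviations from the season average at each offset
        for i in range(slen):
            sum_of_deviations = 0.0
            for j in range(n_seasons):
                sum_of_deviations += series[slen*j+i] - season_averages[j]
            seasonals[i] = sum_of_deviations / n_seasons
        return seasonals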

The Algorithm

And finally, here is the additive Holt-Winters method in Python. The arguments to the function are the series of observed values, the season length, alpha, beta, gamma and the number of points we want forecasted:
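A sketch, building on the two helpers above:

    def triple_exponential_smoothing(series, slen, alpha, beta, gamma, n_preds):
        result = []
        seasonals = initial_seasonal_components(series, slen)
        for i in range(len(series) + n_preds):
            if i == 0:  # initial values
                smooth = series[0]
                trend = initial_trend(series, slen)
                result.append(series[0])
                continue
            if i >= len(series):  # we are forecasting
                m = i - len(series) + 1
                result.append((smooth + m * trend) + seasonals[i % slen])
            else:
                val = series[i]
                last_smooth, smooth = smooth, \
                    alpha * (val - seasonals[i % slen]) + (1 - alpha) * (smooth + trend)
                trend = beta * (smooth - last_smooth) + (1 - beta) * trend
                seasonals[i % slen] = gamma * (val - smooth) + (1 - gamma) * seasonals[i % slen]
                result.append(smooth + trend + seasonals[i % slen])
        return result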

And here is what this looks like if we were to plot the original series, followed by the last 24 points from the result of the triple_exponential_smoothing() call:

A Note on α, β and γ

You may be wondering how I came up with 0.716, 0.029 and 0.993 for $\alpha$, $\beta$ and $\gamma$, respectively. To make a long story short, it was done by way of trial and error: simply running the algorithm over and over again and selecting the values that give you the smallest SSE. As I mentioned before, this process is known as fitting.

To compute the smoothing factors to three decimal points we might have to run through 1,000,000,000 iterations, but luckily there are more efficient methods of zooming in on the best values. Unfortunately, describing them would take a whole other very long post. One good algorithm for this is Nelder-Mead, which is what tgres uses.
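If you would rather not implement Nelder-Mead yourself, scipy can do that part; a sketch, assuming the triple_exponential_smoothing() above and holding out the last n_test points of the series to score against:

    from scipy.optimize import minimize

    def sse(params, series, slen, n_test):
        alpha, beta, gamma = params
        train, test = series[:-n_test], series[-n_test:]
        forecast = triple_exponential_smoothing(train, slen, alpha, beta, gamma, n_test)
        return sum((f - y) ** 2 for f, y in zip(forecast[-n_test:], test))

    # note: nothing here constrains the factors to [0, 1]
    res = minimize(sse, x0=[0.3, 0.1, 0.1], args=(series, 12, 24), method='Nelder-Mead')
    alpha, beta, gamma = res.x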

Conclusion

Well, here you have it: the Holt-Winters method explained the way I wish it had been explained to me when I needed it. If you think I missed something, found an error or have a suggestion, please do not hesitate to comment!

Holt-Winters Forecasting for Dummies - Part II

If you haven’t read Part I you probably should, or the following will be hard to make sense of.

All the forecasting methods we covered so far, including single exponential smoothing, were only good at predicting a single point. We can do better than that, but first we need to agree on a couple more terms.

More terminology

Level

Expected value has another name which, again, varies depending on who wrote the textbook: baseline, intercept (as in Y-intercept) or level. We will stick with “level” here.

So level is that one predicted point that we learned how to calculate in Part I. But because now it’s going to be only part of the calculation of the forecast, we can no longer refer to it as $\hat{y}$ and will instead use $\ell$.

Trend or Slope

You should be familiar with slope from your high school algebra class. What you might be a little rusty on is how to calculate it, which is important, because the slope of a series has an interesting characteristic. Slope is:
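$$m = \dfrac{\Delta{y}}{\Delta{x}}$$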

where $\Delta{y}$ is the difference in the $y$ coordinates and $\Delta{x}$ is the difference in the $x$ coordinates between two points. While in real algebraic problems $\Delta{x}$ could be anything, in a series, from one point to the next, it is always 1. This means that for a series the slope between two adjacent points is simply $\dfrac{\Delta{y}}{1}$, or $\Delta{y}$, or:
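$$b = y_x - y_{x-1}$$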

where $b$ is trend. To the best of my understanding, the terms “trend” and “slope” are interchangeable. In forecasting parlance “trend” is more common, and in math notation forecasters refer to it as $b$ rather than $m$.

Quick Review

We’ve learned some history and basic terminology (series and how it knows no time, method, error, SSE, MSE and fitting). And we’ve learned some basic forecasting methods: naive, simple average, moving average, weighted moving average and, finally, single exponential smoothing.

One very important characteristic of all of the above methods is that, remarkably, they can only forecast a single point. That’s correct, just one.

In Part II we will focus on methods that can forecast more than one point.

Storing Time Series in PostgreSQL Efficiently

With the latest advances in PostgreSQL (and other databases), a relational database begins to look like a very viable time series (TS) storage platform. In this write-up I attempt to show how to store TS in PostgreSQL.

A TS is a series of [timestamp, measurement] pairs, where measurement is typically a floating point number. These pairs (aka “data points”) usually arrive at a high and steady rate. As time goes on, detailed data usually becomes less interesting and is often consolidated into larger time intervals until ultimately it is expired.

The obvious approach

The “naive” approach is a three-column table, like so:
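A minimal sketch (names and types are illustrative):

    CREATE TABLE ts (
      id    SERIAL,
      time  TIMESTAMPTZ,
      value DOUBLE PRECISION
    );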

(Let’s gloss over some details such as an index on the time column and the choice of data type for time and value, as they’re not relevant to this discussion.)

One problem with this is the inefficiency of appending data. An insert requires a lookup of the new id, locking and (usually) blocks until the data is synced to disk. Given the “firehose” nature of TS data, the database can get overwhelmed quite quickly.

This approach also does not address consolidation and eventual expiration of older data points.

Round-robin database

A better alternative is something called a round-robin database. An RRD is a circular structure with a separately stored pointer denoting the last element and its timestamp.

An everyday-life example of an RRD is a week. Imagine a structure of 7 slots, one for each day of the week. If you know today’s date and day of the week, you can easily infer the date for each slot. For example, if today is Tuesday, April 1, 2008, then the Monday slot refers to March 31st, Sunday to March 30th and (most notably) Wednesday to March 26th.

Here’s what a 7-day RRD of average temperature might look like as of Tuesday, April 1:
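(The temperatures are illustrative; the slots are numbered with Sunday first.)

    slot 1  Sun Mar 30: 71
    slot 2  Mon Mar 31: 65
    slot 3  Tue Apr  1: 75  <- last entry
    slot 4  Wed Mar 26: 61
    slot 5  Thu Mar 27: 55
    slot 6  Fri Mar 28: 72
    slot 7  Sat Mar 29: 68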

Come Wednesday, April 2nd, our RRD now looks like this:
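    slot 1  Sun Mar 30: 71
    slot 2  Mon Mar 31: 65
    slot 3  Tue Apr  1: 75
    slot 4  Wed Apr  2: 92  <- last entry (overwrote Mar 26)
    slot 5  Thu Mar 27: 55
    slot 6  Fri Mar 28: 72
    slot 7  Sat Mar 29: 68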

Note how little has changed, and that the update required no allocation of space: all we did to record 92F on Wednesday is overwrite one value. Even more remarkably, the previous value automatically “expired” when we overwrote it, thus solving the eventual expiration problem without any additional operations.

RRDs are also very space-efficient. In the above example we specified the date of every slot for clarity. In an actual implementation only the date of the last slot needs to be stored, thus the RRD can be kept as a sequence of 7 numbers plus the position of the last entry and its timestamp. In Python syntax it’d look like this:
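(A sketch, with the same illustrative temperatures:)

    import datetime

    temps     = [71, 65, 75, 61, 55, 72, 68]   # the 7 slots, Sunday first
    last_pos  = 3                              # Tuesday's slot
    last_date = datetime.date(2008, 4, 1)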

Round-robin in PostgreSQL

Here is a naive approach to having a round-robin table. Carrying on with our 7 day RRD example, it might look like this:
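Something like this, with Sunday as week_day 1:

    CREATE TABLE rrd (
      week_day INT,   -- 1 (Sunday) through 7 (Saturday)
      temp     INT
    );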

Somewhere separately we’d also need to record that the last entry is week_day 3 (Tuesday) and that its date is 2008-04-01. Come April 2, we could record the temperature using:
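    -- Wednesday is week_day 4 in this numbering
    UPDATE rrd SET temp = 92 WHERE week_day = 4;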

This might be okay for a 7-slot RRD, but a more typical TS might have a slot per minute going back 90 days, which would require 129,600 rows. For recording data points one at a time it might be fast enough, but to copy the whole RRD would require 129,600 UPDATE statements, which is not very efficient.

This is where using PostgreSQL arrays becomes very useful.

Using PostgreSQL arrays

An array would allow us to store the whole series in a single row. Sticking with the 7-day RRD example, our table would be created as follows:
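A sketch; the dp array holds all 7 slots:

    CREATE TABLE ts (
      dp DOUBLE PRECISION[] NOT NULL DEFAULT '{}'
    );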

(Never mind that there is no id column for now.)

We could populate the whole RRD in a single statement:
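    INSERT INTO ts (dp) VALUES ('{71,65,75,61,55,72,68}');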

…or record 92F for Wednesday like so:
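    -- Wednesday is the 4th slot when the week starts on Sunday
    UPDATE ts SET dp[4] = 92;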

(In PostgreSQL, arrays are 1-based, not 0-based as in most programming languages.)

But it could be even more efficient

Under the hood, PostgreSQL data is stored in pages of 8K. It would make sense to keep the chunks in which our RRD is written to disk aligned with the page size, or at least smaller than one page. (PostgreSQL provides configuration parameters for how much of a page is used, etc., but this is way beyond the scope of this article.)

Having the series split into chunks also paves the way for some kind of a caching layer: we could have a server which waits for one row’s worth of data points to accumulate, then flushes them all at once.

For simplicity, let’s take the above example and expand the RRD to 4 weeks, while keeping 1 week per row. In our table definition we need to provide a way of keeping the rows of a TS in order, with a column named n, and while we’re at it, we might as well introduce the notion of an id, so as to be able to store multiple TS in the same table.

Let’s start with two tables, one called rrd where we would store the last position and date, and another called ts which would store the actual data.
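A sketch:

    CREATE TABLE rrd (
      id        SERIAL PRIMARY KEY,
      last_date DATE,
      last_pos  INT   -- position across the whole 28-slot timeline
    );

    CREATE TABLE ts (
      rrd_id INT,                  -- which RRD this row belongs to
      n      INT,                  -- order of this row within the series
      dp     DOUBLE PRECISION[]    -- one week of slots
    );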

We could then populate the TS with fictitious data like so:
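For instance (fictitious temperatures; the last row is the current week, with Tuesday, April 1 at position 24):

    INSERT INTO rrd (id, last_date, last_pos) VALUES (1, '2008-04-01', 24);

    INSERT INTO ts (rrd_id, n, dp)
    VALUES (1, 0, '{64,67,70,71,72,69,67}'),
           (1, 1, '{65,60,58,59,62,68,70}'),
           (1, 2, '{61,63,66,64,60,59,62}'),
           (1, 3, '{71,65,75,61,55,72,68}');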

To update the data for April 2, we would:
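    -- April 2 falls in row n = 3, slot 4 (Wednesday)
    UPDATE ts SET dp[4] = 92 WHERE rrd_id = 1 AND n = 3;
    UPDATE rrd SET last_date = '2008-04-02', last_pos = 25 WHERE id = 1;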

The last_pos of 25 is n * 7 + 4 with n = 3: row n holds positions n * 7 + 1 through n * 7 + 7 (arrays being 1-based), and April 2 falls on the 4th slot of the last row.

This article omits a lot of detail such as having resolution finer than one day, but it does describe the general idea. For an actual implementation of this you might want to check out a project I’ve been working on: timeriver

Time Series Accuracy - Graphite vs RRDTool

Back in my ISP days, we used data stored in RRDs to bill our customers. I wouldn’t try this with Graphite. In this write up I try to explain why it is so by comparing the method of recording time series used by Graphite, with the one used by RRDTool.

Graphite uses Whisper to store data, which in the FAQ is portrayed as a better alternative to RRDTool, but this is potentially misleading, because the flexibility afforded by the design of Whisper comes at the price of inaccuracy.

A time series is most often described as a sequence of (time, value) tuples [1]. The most naive method of recording a time series is to store the timestamps as is. Since the data points might arrive at arbitrary and inexact intervals, correlating the series with a particular point in time might be tricky. If data points are arriving somewhere in between one-minute boundaries (as they always naturally would), answering the question of what happened during a particular minute requires specifying a range, which is not as clean as being able to specify a precise value. Joining two series on a range is even more problematic.

One way to improve upon this is to divide time into equal intervals and assign data points to the intervals. We could then use the beginning of the interval instead of the actual data point timestamp, thereby giving us more uniformity. For example, if our interval size is 10 seconds (I may sometimes refer to it as the step), we could divide the entire timeline starting from the beginning of the epoch and until the end of universe into 10 second slots. Since the first slot begins at 0, any 10-second-step time series will have slots starting at the exact same times. Now correlation across series or other time values becomes much easier.

Calculating the slot is trivially easy: time - time % step (% being the modulo operator). There is, however, a subtle complexity lurking when it comes to storing the datapoint with the adjusted (or aligned) timestamp. Graphite simply changes the timestamp of the data point to the aligned one. If multiple data points arrive in the same step, then the last one “wins”.

On the surface there is little wrong with Graphite’s approach. In fact, under the right circumstances, there is absolutely nothing wrong with it. Consider the following example:
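Say three data points arrive within two 10-second steps (timestamps and values are illustrative; the trinket arithmetic below reuses them):

    time        value   aligned slot
    1430701282  50      1430701280
    1430701288  10      1430701280   <- overwrites the 50
    1430701292  30      1430701290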

Let’s pretend those values are some system metric like the number of open files. The consequence of the 50 being dropped is that we will never know it existed, but towards the end of the 10-second interval the value went down to 10, which is still a true fact. If we really wanted to know about the variations within a 10-second interval, we should have chosen a smaller step, e.g. 1 second. By deciding that the step is going to be 10 seconds, we declared that variations within a smaller period are of no interest to us, and from this perspective, Graphite is correct.

But what if those numbers are the price of a stock: there may be hundreds of thousands of trades within a 10-second interval, yet we do not want to (or cannot, for technical reasons) record every single one of them? In this scenario having the last value override all previous ones doesn’t exactly seem correct.

Enter RRDTool, which uses a different method. RRDTool keeps track of the last update timestamp and calculates a weight for every incoming data point based on the time since the last update (or the beginning of the step) and the step length. Here is what the same sequence of points looks like in RRDTool. The lines marked with a * are not actual data points, but the last value for the preceding step; it is used for computing the value for the remainder of the step after a new one has begun.
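(A rough sketch of the bookkeeping; the exact layout is illustrative.)

    1430701282   50      step 1430701270-1430701280 closes at 50
    1430701288   10
    1430701292   30      step 1430701280-1430701290 closes at
                         (2*50 + 6*10 + 2*30) / 10 = 22
    1430701290  *30      the 2s remainder carried into the next step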

Note, by the way, that the Whisper FAQ says that “RRD will store your updates in a temporary workspace area and after the minute has passed, aggregate them and store them in the archive”, which to me sounds like there is some sort of temporary storage area holding all the unsaved updates. In fact, to be able to compute the weighted average, RRD only needs to store the time of the last update and the current sum, i.e. exactly two variables, regardless of the number of updates in a single step. This is evident from the above figure.

So to compare the results of the two tools:
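    slot          Graphite   RRDTool
    1430701270       -          50
    1430701280      10          22
    1430701290      30          30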

Before you say “so what, I don’t really understand the difference”, let’s pretend that those numbers were actually the rate of sale of trinkets from our website (per second). Here is a horizontal ascii-art rendition of our timeline, 0 is 1430701270.
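    sec:  0         10        20        30
          |---------|---------|---------|
          <----50/s--->
                      <-10/s>
                            <----30/s--->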

At 12 seconds we recorded selling 50 trinkets per second. Assuming we started selling at the beginning of our timeline, i.e. 12 seconds earlier, we can state that during the first step we sold exactly 500 trinkets. Then 2 seconds into the second step we sold another 100 (we’re still selling at 50/s). Then for the next 6 seconds we were selling at 10/s, thus another 60 trinkets, and for the last 2 seconds of the slot we sold another 60 at 30/s. In the third step we were selling steadily at 30/s, thus exactly 300 were sold.

Comparing RRDTool and Graphite side-by-side, the stories are quite different:
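    step                      Graphite                RRDTool
    1430701270-1430701280     (no data)               50/s -> 500 trinkets
    1430701280-1430701290     10/s -> 100 trinkets    22/s -> 220 trinkets
    1430701290-1430701300     30/s -> 300 trinkets    30/s -> 300 trinkets
    total                     400                     1020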

Two important observations here:

1. The totals are vastly different.
2. The rate recorded by RRDTool for the second slot (22/s) yields exactly the number of trinkets sold during that period: 220.

Last, but hardly the least, consider what happens when we consolidate data points into larger intervals by averaging the values. Let’s say 20 seconds, twice our step. If we consolidate the second and the third steps, we would get:
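    Graphite: (10 + 30) / 2 = 20/s  ->  20 * 20s = 400 trinkets
    RRDTool:  (22 + 30) / 2 = 26/s  ->  26 * 20s = 520 trinkets (exactly 220 + 300)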

Since the Graphite numbers were off to begin with, we have no reason to trust the 400 trinkets number. But using the RRDTool data, the new number happens to still be 100% accurate even after the data points have been consolidated. This is a very useful property of rates in time series. It also explains why RRDTool does not permit updating data prior to the last update: RRD is always accurate.

As an exercise, try seeing it for yourself: pretend the value of 10 in the second step never arrived, which should make the final value of the second slot 34. If the 10 arrived some time later, averaging it in will not give you the correct 22.

Whisper allows past updates, but it is quasi-accurate to begin with; I’m not sure which is better: inaccurate data with a data point missing, or inaccurate data with all the points present. RRD could accomplish the same thing by adding some --inaccurate flag, though it would seem like more of a bug than a feature to me.

If you’re interested in learning more about this, I recommend reading the documentation for rrdtool create, in particular the “It’s always a Rate” section, as well as this post by Alex van den Bogaerdt.

P.S. After this post was written, someone suggested that instead of storing a rate, we could store a count delta. In other words, instead of recording that we’re selling 10 trinkets per second for the past 6 seconds, we would store the total count of trinkets sold, i.e. 60. At first this seems like the solution to being able to update historical data accurately: if we later found out that we sold another 75 trinkets in the second time slot, we could just add them to the total and all would be well and, most importantly, accurate.

Here is the problem with this approach: note that in the previous sentence I had to specify that the additional trinkets were sold in the second time slot, a small but crucial detail. If a time series data point is a timestamp and a value, there isn’t even a way to relay this information in a single data point; we’d need two timestamps. On the other hand, if every data point arrived with two timestamps, i.e. as a duration, then which to store, rate or count, becomes a moot point: we can infer one from the other.

So perhaps another way of explaining the historical update problem is that it is possible, but the data point must specify a time interval. This is something that neither RRDTool nor Graphite currently supports, even though it would be a very useful feature in my opinion.

[1] Perhaps the biggest misconception about a time series is that it is a series of data points. What a time series represents is continuous rather than discrete: it’s the line that connects the points that matters, not the specific points themselves; they are just samples at semi-random intervals that help define the line. And as we know, a line cannot be defined by a single point.

On Time Series

Is it even a thing?

Time Series is on its way to becoming a buzzword in Information Technology circles. This has to do with the looming Internet of Things, which shall cause the Great Reversal of the Internet whereby the upstream flow of data produced by said Things is expected to exceed the downstream flow. Much of this data is expected to be of the Time Series kind.

This, of course, is a money-making opportunity of Big Data proportions all over again, and I predict we’re going to see a lot of Time Series support of various shapes and forms appearing in all manner of (mostly commercial) software.

But is there really such a thing as a problem specifically inherent to Time Series data which warrants a specialized solution? I’ve been pondering this for some time now, and I am still undecided. This here is my attempt at arguing that TS is not a special problem and that it can be handled with a general-purpose database like PostgreSQL.

Influx of data and write speeds

One frequently cited issue with time series data is that it arrives in large volumes at a steady pace, which renders buffered writes useless. The number of incoming data streams can also be large, typically causing a disk seek per stream and further complicating the write situation. TS data also has the property that more data is often written than read: a data point may be collected and examined only once, if ever. In short, TS is very write-heavy.

But is this unique? Logs, for example, have almost identical properties. The real question here is whether our tried-and-true databases such as PostgreSQL are so ill-equipped to deal with large volumes of incoming data that an alternative solution is required.

When considering incoming data I am tempted to imagine every US household sending it, which, of course, would require massive infrastructure. But this (unrealistic) scenario is not a TS data problem; it is one of scale, the same one from which the Hadoops and Cassandras of this world were born. What is really happening here is that TS happens to be yet another thing that requires the difficult-to-deal-with “big data” infrastructure, and it reiterates the need for an easy-to-set-up, horizontally scalable database (which PostgreSQL isn’t).

The backfill problem

This is the problem of having to import vast amounts of historical data. For example OpenTSDB goes to great lengths to optimize back-filling by structuring it in specific ways and storing compressed blobs of data.

But just like the write problem, it’s not unique to TS. It is another problem that is becoming more and more pertinent as our backlogs of data going back to when we stopped using paper keep growing and growing.

Downsampling

Very often TS data is used to generate charts. This is an artifact of the human brain being spectacularly good at interpreting a visual representation of a relationship between streams of numbers, while being nearly incapable of making sense of data in tabular form. When plotting, no matter how much data is being examined, the end result is limited to however many pixels are available on the display. And plotting aside, most any use of time series data is in aggregated form.

The process of consolidating datapoints into a smaller number (e.g. the pixel width of the chart), sometimes called downsampling, involves aggregation around a particular time interval or simply picking every Nth datapoint.

As an aside, selecting every Nth row of a table is an interesting SQL challenge; in PostgreSQL it might look like this (for every 100th row, assuming a table ts(time, value)):
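    SELECT time, value
      FROM (SELECT time, value,
                   row_number() OVER (ORDER BY time) AS n
              FROM ts) sub
     WHERE n % 100 = 0;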

Aggregation over a time interval, similar to how InfluxDB does it with the GROUP BY time(1d) syntax, can easily be achieved via date_trunc('day', time):
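    SELECT date_trunc('day', time) AS day, avg(value)
      FROM ts
     GROUP BY 1
     ORDER BY 1;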

Another aspect of downsampling is that since TS data is immutable, there is no need to repeatedly recompute the consolidated version. It makes more sense to downsample immediately upon receipt of the data and to store it permanently in this form. RRDTool’s round-robin database is based entirely on this notion. InfluxDB’s continuous queries are another way persistent downsampling is addressed.

Again, there is nothing TS-specific here. Storing data in summary form is quite common in the data analytics world, and a “continuous query” is easily implemented via a trigger.
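A sketch of such a trigger, assuming a summary table ts_daily(day, value_sum, value_count) with a unique index on day (the ON CONFLICT clause requires PostgreSQL 9.5+):

    CREATE FUNCTION rollup_daily() RETURNS trigger AS $$
    BEGIN
      INSERT INTO ts_daily (day, value_sum, value_count)
      VALUES (date_trunc('day', NEW.time), NEW.value, 1)
      ON CONFLICT (day) DO UPDATE
        SET value_sum   = ts_daily.value_sum + NEW.value,
            value_count = ts_daily.value_count + 1;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER ts_rollup AFTER INSERT ON ts
      FOR EACH ROW EXECUTE PROCEDURE rollup_daily();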

Derivatives

Sometimes the data from various devices exists in the form of a counter, which requires the database to derive a rate by comparing with a previous data point. An example of this is the number of bytes sent over a network interface: only the rate of change of this value is relevant, not the number itself. The rate of change is the difference from the previous value divided by the time interval elapsed.

Referring to a previous row is also a bit tricky, but perfectly doable in SQL. It can be accomplished using window functions such as lag():
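For example (same hypothetical ts table, with value being the counter):

    SELECT time,
           (value - lag(value) OVER (ORDER BY time))
           / extract(epoch FROM time - lag(time) OVER (ORDER BY time)) AS rate
      FROM ts;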

Expiration

It is useful to downsample data to a less granular form as it ages, aggregating over an ever larger period of time and possibly purging records eventually. For example we might want to store minutely data for a week, hourly for 3 months, daily for 3 years and drop all data beyond 3 years.

Databases do not expire rows “natively” the way Cassandra or Redis do, but it shouldn’t be too hard to accomplish via some sort of periodic cron job, or possibly even just triggers.

Heartbeat and Interval Filling

It is possible for a time series stream to pause, and this can be interpreted in different ways: we can attempt to fill in missing data, or treat it as unknown. More likely we’d want to start treating it as unknown after some period of silence. RRDTool addresses this by introducing the notion of a heartbeat and the number of missed beats before data is treated as unknown.

Regardless of whether the value is unknown, it is useful to be able to fill a gap (missing rows) in the data. In PostgreSQL this can be accomplished by a join with a result set from the generate_series() function:
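A sketch, generating one row per minute for a day:

    SELECT g.time, ts.value   -- value is NULL where a row is missing
      FROM generate_series('2015-05-01 00:00'::timestamptz,
                           '2015-05-02 00:00', '1 minute') AS g(time)
      LEFT JOIN ts ON ts.time = g.time;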

Data Seclusion

With many specialized Time Series tools the TS data ends up being secluded in a separate system not easily accessible from the rest of the business data. You cannot join your customer records with data in RRDTool or Graphite or InfluxDB, etc.

Conclusion

If there is a problem with using PostgreSQL or some other relational database for Time Series data, it is mainly that of having to use advanced SQL syntax, and possibly needing some cookie-cutter method for managing Time Series, especially when there is a large number of series and a high volume of data.

There is also complexity in horizontally scaling a relational database, because it involves setting up replication, sharding, methods for recovery from failure and balancing the data. But these are not TS-specific problems; they are scaling problems.

Having written this up, I’m inclined to think that perhaps there is no need for a specialized “Time Series Database”. Instead, it can be accomplished by an application which uses a database for storage and abstracts the users from the complexities of SQL, and potentially even scaling, while still allowing direct access to the data via the rich set of tools that a database like PostgreSQL provides.

How InfluxDB Stores Data

A nice, reliable, horizontally scalable database that is designed specifically to tackle the problem of Time Series data (and does not require you to stand up a Hadoop cluster) is very much missing from the Open Source Universe right now.

InfluxDB might be able to fill this gap; it certainly aims to.

I was curious about how it structures and stores data, and since there wasn’t much documentation on the subject, I ended up just reading the code and figured I’d write this up. I only looked at the new (currently 0.9.0, in RC stage) version; the previous versions are significantly different.

First of all, InfluxDB is distributed. You can run one node, or a bunch, it seems like a more typical number may be 3 or 5. The nodes use Raft to establish consensus and maintain data consistency.

InfluxDB feels a little like a relational database in some aspects (e.g. it has a SQL-like query language) but not in others.

The top level container is a database. An InfluxDB database is very much like what a database is in MySQL, it’s a collection of other things.

“Other things” are called data points, series, measurements, tags and retention policies. Under the hood (i.e. you never deal with them directly) there are shards and shard groups.

The very first thing you need to do in InfluxDB is create a database and at least one retention policy for this database. Once you have these two things, you can start writing data.

A retention policy is the time period after which the data expires. It can be set to be infinite. A data point, which is a measurement consisting of any number of values and tags associated with a particular point in time, must be associated with a database and a retention policy. A retention policy also specifies the replication factor for the data point.

Let’s say we are tracking disk usage across a whole bunch of servers. Each server runs some sort of an agent which periodically reports the usage of each disk to InfluxDB. Such a report might look like this (in JSON):
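Something like this (names and values are illustrative):

    {
      "database": "mydb",
      "retentionPolicy": "default",
      "points": [
        {
          "name": "disk",
          "tags": { "server": "bwi23", "unit": "1" },
          "timestamp": "2015-03-16T01:02:26Z",
          "fields": { "total": 100, "used": 40, "free": 60 }
        }
      ]
    }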

In the above example, “disk” is a measurement. Thus we can operate on anything “disk”, regardless of what “server” or “unit” it applies to. The data point as a whole belongs to a (time) series identified by the combination of the measurement name and the tags.

There is no need to create series or measurements, they are created on the fly.

To list the measurements, we can use SHOW MEASUREMENTS:

We can use SHOW SERIES to list the series:

If we send a record that contains different tags, we automatically create a different series (or so it seems). For example, if we send this (note we changed “unit” to “foo”):
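    {
      "database": "mydb",
      "retentionPolicy": "default",
      "points": [
        {
          "name": "disk",
          "tags": { "server": "bwi23", "foo": "1" },
          "timestamp": "2015-03-16T01:02:27Z",
          "fields": { "total": 100, "used": 40, "free": 60 }
        }
      ]
    }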

we get

This is where the distinction between measurement and series becomes a little confusing to me. In actuality (from looking at the code and the actual files InfluxDB created) there is only one series here, called “disk”. I understand the intent, but I am not sure that series is the right terminology here. I think I’d prefer it if measurements were simply called series, and to get the equivalent of SHOW SERIES you’d use something like SHOW SERIES TAGS. (Maybe I’m missing something.)

Under the hood the data is stored in shards, which are grouped by shard groups, which in turn are grouped by retention policies, and finally databases.

A database contains one or more retention policies. Somewhat surprisingly a retention policy is actually a bucket. It makes sense if you think about the problem of having to expire data points - you can remove them all by simply dropping the entire bucket.

If we declare a retention policy of 1 day, then we can logically divide the timeline into a sequence of single days from beginning of the epoch. Any incoming data point falls into its corresponding segment, which is a retention policy bucket. When clean up time comes around, we can delete all days except for the most current day.

To better understand the following paragraphs, consider that having multiple nodes provides the option for two things: redundancy and distribution. Redundancy gives you the ability to lose a node without losing any data. The number of copies of the data is controlled by the replication factor specified as part of the retention policy. Distribution spreads the data across nodes which allows for concurrency: data can be written, read and processed in parallel. For example if we become constrained by write performance, we can solve this by simply adding more nodes. InfluxDB favors redundancy over distribution when having to choose between the two.

Each retention policy bucket is further divided into shard groups, one shard group per series. The purpose of a shard group is to balance series data across the nodes of the cluster. If we have a cluster of 3 nodes, we want the data points to be evenly distributed across these nodes. InfluxDB will create 3 shards, one on each of the nodes. The 3 shards comprise the shard group. This is assuming the replication factor is 1.

But if the replication factor was 2, then there needs to be 2 identical copies of every shard. The shard copies must be on separate nodes. With 3 nodes and replication factor of 2, it is impossible to do any distribution across the nodes - the shard group will have a size of 1, and contain 1 shard, replicated across 2 nodes. In this set up, the third node will have no data for this particular retention policy.

If we had a cluster of 5 nodes and the replication factor of 2, then the shard group can have a size of 2, for 2 shards, replicated across 2 nodes each. Shard one replicas could live on nodes 1 and 3, while shard two replicas on nodes 2 and 4. Now the data is distributed as well as redundant. Note that the 5th node doesn’t do anything. If we up the replication factor to 3 then just like before, the cluster is too small to have any distribution, we only have enough nodes for redundancy.

As of RC15 distributed queries are not yet implemented, so you will always get an error if you have more than one shard in a group.

The shards themselves are instances of Bolt DB, a simple-to-use key/value store written in Go. There is also a separate Bolt DB file called meta which stores the metadata, i.e. information about databases, retention policies, measurements, series, etc.

I couldn’t quite figure out the process for typical cluster operations, such as recovery from node failure, what happens (or should happen) when nodes are added to an existing cluster, whether there is a way to decommission a node or re-balance the cluster similar to the Hadoop balancer, etc. I think as of this writing this has not been fully implemented yet, and there is no documentation, but hopefully it’s coming soon.

Ruby, HiveServer2 and Kerberos

Recently I found myself needing to connect to HiveServer2 with Kerberos authentication enabled from a Ruby app. As it turned out, the rbhive gem we were using did not have support for Kerberos authentication, so I had to roll my own.

This post is to document the experience of figuring out the details of a SASL/GSSAPI connection before it is lost forever in my neurons and synapses.

First, the terminology. The authentication system that Hadoop uses is Kerberos. Note that Kerberos is not a network protocol: it describes the method by which authentication happens, but not the format of sending Kerberos tickets and whatnot over the wire. For that, you need SASL and GSSAPI.

SASL is a generic protocol designed to be able to wrap just about any authentication handshake. It’s very simple: the client sends a START followed by some payload, and expects an OK, BAD or COMPLETE from the server. OK means that there are more steps to this conversation, BAD is self-explanatory, and COMPLETE means “I’m satisfied”. The objective is to go from START, via a series of OKs, to each side sending the other a COMPLETE.

SASL doesn’t define the payload of each message; the payload is specified by the GSSAPI protocol. GSSAPI is another generic protocol. Unlike SASL, it is actually very complex and covers a variety of authentication methods, including Kerberos.

The combination of SASL and GSSAPI and what happens at the network layer is documented in RFC4752.

Bottom line: you need to read at least four RFCs to understand every detail of this process: RFC4120, RFC2222, RFC2743 and RFC4752. Fun!

The Handshake in Ruby

First, you’ll need some form of binding to the GSSAPI libraries. I’ve been using the most excellent GSSAPI gem by Dan Wanek, which wraps the MIT GSSAPI library.

If you follow the code in sasl_client_transport.rb, you’ll see the following steps are required to establish a connection.

First, we instantiate a GSSAPI object, passing it the remote host and the remote principal. Note that there is no TCP port number to be specified anywhere, because this isn’t about establishing a TCP connection; it is only for Kerberos host authentication. (Kerberos requires that not only the client authenticates itself to the host, but also that the host authenticates itself to the client.)
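A sketch (host and service names are illustrative):

    require 'gssapi'

    @gsscli = GSSAPI::Simple.new('hive.example.com', 'hive')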

The rest of the action takes place in the initiate_hand_shake_gssapi() method.

First, we call @gsscli.init_context() with no arguments. This call creates a token based on our current Kerberos credentials. (If there are no credentials in our cache, this call will fail.)

Next we compose a SASL message which consists of START (0x01) followed by payload length, followed by the actual payload, which is the SASL mechanism name: ‘GSSAPI’. Without waiting for response, we also send an OK (0x02) and the token returned from init_context().
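Roughly, with @sock standing in for whatever raw socket wrapper is in use (lengths are 4-byte big-endian):

    token = @gsscli.init_context

    mech = 'GSSAPI'
    @sock.write([0x01, mech.bytesize].pack('CN') + mech)    # START + mechanism
    @sock.write([0x02, token.bytesize].pack('CN') + token)  # OK + GSSAPI token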

Next we read 5 bytes of response. The first byte is the status returned from the server, which hopefully is OK, followed by the length of the payload, and then we read the payload itself:
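    status, len = @sock.read(5).unpack('CN')
    payload = @sock.read(len)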

The payload is a challenge created for us by the server. We can verify it by calling init_context() a second time, passing in the challenge:
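    # passing the challenge back in verifies the server's identity
    @gsscli.init_context(payload)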

If the challenge verifies, then it is our turn to send an OK (with an empty payload this time):
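    @sock.write([0x02, 0].pack('CN'))  # OK, empty payload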

At this point in the SASL ‘conversation’ we have verified that the server is who they claim to be.

Next the server sends us another challenge, this one is so that we can authenticate ourselves to the server and at the same time agree on the protection level for the communication channel.

We need to decrypt (“unwrap” in GSSAPI terminology) the challenge, examine the protection level and, if it is acceptable, encrypt it on our side and send it back to the server in a SASL COMPLETE message. In this particular case we’re agreeable to any level of protection (which is none in the case of HiveServer2, i.e. the conversation is not encrypted). Otherwise there are additional steps, described in RFC4752, whereby the client can select an acceptable protection level.
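Along these lines:

    status, len = @sock.read(5).unpack('CN')
    challenge = @sock.read(len)

    unwrapped = @gsscli.unwrap_message(challenge)
    # the first byte of the unwrapped payload is the protection level offered;
    # we accept it as is, wrap it back up and send it as COMPLETE (0x05)
    wrapped = @gsscli.wrap_message(unwrapped)
    @sock.write([0x05, wrapped.bytesize].pack('CN') + wrapped)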

The server should then respond with COMPLETE as well, at which point we’re done with the authentication process and can start sending whatever we want over this connection:
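    status, len = @sock.read(5).unpack('CN')
    raise 'SASL handshake failed' unless status == 0x05  # COMPLETE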

Graceful Restart in Golang

Update (Apr 2015): Florian von Bock has turned what is described in this article into a nice Go package called endless.

If you have a Golang HTTP service, chances are you will need to restart it on occasion to upgrade the binary or change some configuration. And if you (like me) have been taking graceful restarts for granted because the webserver took care of it, you may find this recipe very handy, because with Golang you need to roll your own.

There are actually two problems that need to be solved here. First is the UNIX side of the graceful restart, i.e. the mechanism by which a process can restart itself without closing the listening socket. The second problem is ensuring that all in-progress requests are properly completed or timed-out.

Restarting without closing the socket

• Fork a new process which inherits the listening socket.
• The child performs initialization and starts accepting connections on the socket.
• Immediately after, the child sends a signal to the parent, causing the parent to stop accepting connections and terminate.

Forking a new process

There is more than one way to fork a process with the Golang standard library, but for this particular case exec.Command is the way to go. This is because the Cmd struct this function returns has an ExtraFiles member, which specifies open files (in addition to stdin, stdout and stderr) to be inherited by the new process.

Here is what this looks like:
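A sketch (the File() helper on our listener is defined towards the end of this post):

    file := netListener.File() // this returns a dup(2) of the descriptor

    path := "/path/to/executable"
    args := []string{"-graceful"}

    cmd := exec.Command(path, args...)
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    cmd.ExtraFiles = []*os.File{file}

    err := cmd.Start()
    if err != nil {
        log.Fatalf("gracefulRestart: failed to launch, error: %v", err)
    }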

In the above code netListener is our net.Listener listening for HTTP requests. The path variable should contain the path to the new executable if you’re upgrading (which may be the same as the path of the currently running one).

An important point in the above code is that netListener.File() returns a dup(2) of the file descriptor. The duplicated file descriptor will not have the FD_CLOEXEC flag set, which would otherwise cause the file to be closed in the child (not what we want).

You may come across examples that pass the inherited file descriptor number to the child via a command line argument, but the way ExtraFiles is implemented makes it unnecessary. The documentation states that “If non-nil, entry i becomes file descriptor 3+i.” This means that in the above code snippet, the inherited file descriptor in the child will always be 3, thus there is no need to pass it explicitly.

Finally, the args array contains a -graceful option: your program will need some way of informing the child that this is part of a graceful restart and that it should re-use the socket rather than try opening a new one. Another way to do this might be via an environment variable.

Child initialization

Here is part of the program startup sequence:
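A sketch; the -graceful flag tells the child to use the inherited descriptor 3:

    var gracefulChild bool
    flag.BoolVar(&gracefulChild, "graceful", false, "listen on fd 3 (internal use only)")
    flag.Parse()

    server := &http.Server{Addr: "0.0.0.0:8888"}

    var l net.Listener
    var err error
    if gracefulChild {
        log.Print("main: listening on the inherited file descriptor 3")
        f := os.NewFile(3, "")
        l, err = net.FileListener(f)
    } else {
        log.Print("main: listening on a new socket")
        l, err = net.Listen("tcp", server.Addr)
    }
    if err != nil {
        log.Fatal(err)
    }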

Signal parent to stop

At this point we’re ready to accept requests, but just before we do, we need to tell our parent to stop accepting requests and exit, which could look something like this:
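    if gracefulChild {
        parent := syscall.Getppid()
        log.Printf("main: killing parent pid: %v", parent)
        syscall.Kill(parent, syscall.SIGTERM)
    }

    server.Serve(l)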

In-progress requests completion/timeout

For this we will need to keep track of open connections with a sync.WaitGroup. We will need to increment the wait group on every accepted connection and decrement it on every connection close.
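At the package level that is simply:

    var httpWg sync.WaitGroup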

At first glance, the Golang standard http package does not provide any hooks to take action on Accept() or Close(), but this is where the interface magic comes to the rescue. (Big thanks and credit to Jeff R. Allen for this post).

Here is an example of a listener which increments a wait group on every Accept(). First, we “subclass” net.Listener (you’ll see why we need stop and stopped below):
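    type gracefulListener struct {
        net.Listener
        stop    chan error
        stopped bool
    }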

Next we “override” the Accept method. (Never mind gracefulConn for now; it will be introduced later.)
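    func (gl *gracefulListener) Accept() (c net.Conn, err error) {
        c, err = gl.Listener.Accept()
        if err != nil {
            return
        }
        c = gracefulConn{Conn: c} // wrap the connection (see below)
        httpWg.Add(1)             // count it in the wait group
        return
    }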

We also need a “constructor”:
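    func newGracefulListener(l net.Listener) *gracefulListener {
        gl := &gracefulListener{Listener: l, stop: make(chan error)}
        go func() {
            <-gl.stop                      // wait for the stop signal
            gl.stopped = true
            gl.stop <- gl.Listener.Close() // close the fd, unblocking Accept()
        }()
        return gl
    }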

The reason the function above starts a goroutine is that this cannot be done in our Accept() above, since it will block on gl.Listener.Accept(). The goroutine will unblock it by closing the file descriptor.

Our Close() method simply sends a nil to the stop channel for the above goroutine to do the rest of the work:
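    func (gl *gracefulListener) Close() error {
        if gl.stopped {
            return syscall.EINVAL
        }
        gl.stop <- nil
        return <-gl.stop // the goroutine replies with the result of Close()
    }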

Finally, this little convenience method extracts the file descriptor from the net.TCPListener:
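    func (gl *gracefulListener) File() *os.File {
        tl := gl.Listener.(*net.TCPListener)
        fl, _ := tl.File() // a dup(2) without FD_CLOEXEC set
        return fl
    }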

And, of course, we also need a variant of net.Conn which decrements the wait group on Close():
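    type gracefulConn struct {
        net.Conn
    }

    func (w gracefulConn) Close() error {
        httpWg.Done() // one less connection to wait for
        return w.Conn.Close()
    }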

To start using the above graceful version of the Listener, all we need is to change the server.Serve(l) line to:
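    netListener = newGracefulListener(l)
    server.Serve(netListener)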

And there is one more thing: you should avoid hanging connections that the client has no intention of closing (or not this week). It is better to create your server as follows:
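    server := &http.Server{
        Addr:         "0.0.0.0:8888",
        ReadTimeout:  10 * time.Second,
        WriteTimeout: 10 * time.Second,
    }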