I’ve put together a basic example of a “Hello World” Go program which
runs in Amazon AWS Elastic Compute Service (ECS), which allows running
applications in Docker containers and has the ability to scale on
demand.

I initially wanted to write about the components of this system and
the tools you use to deploy your application, but soon realized that
this would make for an extremely long post, as the number of
components required for a simple “Hello World” is mind
boggling. However problematic it may seem, it’s par for the course,
this is what takes to run an application in our cloudy times.

I used Terraform to build
all the AWS infrastructure. Initially I was skeptical on how well it
could accomplish such a tedious task, but I have say my confidence in
Terraform grew the more I used it.

The main top level tool for everything is the good old
make, a tool that
stood the test of time.

Here is the code of the example, read the README, I hope you find it
useful:

If you haven’t read Part I and
Part II
you probably should, or the following will be hard to make sense of.

In Part I we’ve learned how to forceast one point, in Part II we’ve
learned how to forecast two points. In this part we’ll learn how to
forecast many points.

More Terminology

Season

If a series appears to be repetitive at regular intervals, such an
interval is referred to as a season, and the series is said to be
seasonal. Seasonality
is required for the Holt-Winters method to work, non-seasonal series
(e.g. stock prices) cannot be forecasted using this method (would be
nice though if they could be).

Season Length

Season length is the number of data points after which a new season
begins. We will use $L$ to denote season length.

Seasonal Component

The seasonal component is an additional deviation from level + trend
that repeats itself at the same offset into the season. There is a
seasonal component for every point in a season, i.e. if your season
length is 12, there are 12 seasonal components. We will use $s$ to
denote the seasonal component.

The idea behind triple exponential smoothing is to apply exponential
smoothing to the seasonal components in addition to level and
trend. The smoothing is applied across seasons, e.g. the seasonal
component of the 3rd point into the season would be exponentially
smoothed with the the one from the 3rd point of last season, 3rd point
two seasons ago, etc. In math notation we now have four equations:

What’s new:

We now have a third greek letter, $\gamma$ (gamma) which is the smoothing
factor for the seasonal component.

The expected value index is $x+m$ where $m$ can be any integer meaning
we can forecast any number of points into the future (woo-hoo!)

The forecast equation now consists of level, trend and the seasonal
component.

The index of the seasonal component of the forecast
$s_{x-L+1+(m-1)modL}$ may appear a little mind boggling, but it’s
just the offset into the list of seasonal components from the last set
from observed data. (I.e. if we are forecasting the 3rd point into the
season 45 seasons into the future, we cannot use seasonal components
from the 44th season in the future since that season is also
forecasted, we must use the last set of seasonal components from
observed points, or from “the past” if you will.) It looks much
simpler in Python as you’ll see shortly.

Initial Values

Before we can discuss initial values, let me introduce to you a new
tiny series (okay, not as tiny):

You can see that this series is seasonal, there are clearly visible 6
seasons. Although perhaps not easily apparent from the picture, the
season length for this series is 12, i.e. it “repeats” every 12
points. In order to apply triple exponential smoothing we need to know
what the season length is. (There do exist methods for detecting
seasonality in series, but this is way beyond the scope of this text).

Initial Trend

For double exponential smoothing we simply used the first two points
for the initial trend. With seasonal data we can do better than that,
since we can observe many seasons and can extrapolate a better
starting trend. The most common practice is to compute the average of
trend averages across seasons.

Good news - this looks simpler in Python than in math notation:

The situation is even more complicated when it comes to initial values
for the seasonal components. Briefly, we need to compute the average
level for every observed season we have, divide every observed value
by the average for the season it’s in and finally average each of
these numbers across our observed seasons. If you want more detail, here is
one thorough description of this process.

I will forgo the math notation for initial seasonal components, but
here it is in Python. The result is a season-length array of seasonal components.

And finally, here is the additive Holt-Winters method in Python. The
arguments to the function are the series of observed values, the
season length, alpha, beta, gamma and the number of points we want
forecasted.:

1234567891011121314151617181920212223

deftriple_exponential_smoothing(series,slen,alpha,beta,gamma,n_preds):result=[]seasonals=initial_seasonal_components(series,slen)foriinrange(len(series)+n_preds):ifi==0:# initial valuessmooth=series[0]trend=initial_trend(series,slen)result.append(series[0])continueifi>=len(series):# we are forecastingm=len(series)-i+1result.append((smooth+m*trend)+seasonals[i%slen])else:val=series[i]last_smooth,smooth=smooth,alpha*(val-seasonals[i%slen])+(1-alpha)*(smooth+trend)trend=beta*(smooth-last_smooth)+(1-beta)*trendseasonals[i%slen]=gamma*(val-smooth)+(1-gamma)*seasonals[i%slen]result.append(smooth+trend+seasonals[i%slen])returnresult# # forecast 24 points (i.e. two seasons)# >>> triple_exponential_smoothing(series, 12, 0.716, 0.029, 0.993, 24)# [30, 20.052662759277222, 28.15220274074342, 30.249475572455367, 39.45956689718808, ...

And here is what this looks like if we were to plot the original
series, followed by the last 24 points from the result of the
triple_exponential_smoothing() call:

A Note on α, β and γ

You may be wondering how I came up with 0.716, 0.029 and 0.993 for
$\alpha$, $\beta$ and $\gamma$, respectively. To make long story short, it
was done by way of trial and error: simply running the algorithm over and
over again and selecting the values that give you the smallest
SSE. As I
mentioned before, this process is known as fitting.

To compute the smothing factors to three decimal points
we may have to run through 1,000,000,000 iterations, but luckily
there are more efficient methods at zooming in on best
values. Unfortunately this would take a whole other very long post to
describe this process. One good algorithm for this is
Nelder-Mead,
which is what tgres uses.

Conclusion

Well - here you have it, Holt-Winters method explained the way I wish
it would have been explained to me when I needed it. If you think I
missed something, found an error or a suggestion, please do not
hesitate to comment!

If you haven’t read Part I
you probably should, or the following will be hard to make sense of.

All the forecasting methods we covered so far, including single
exponential smoothing, were only good at predicting a single
point. We can do better than that, but first we need to agree
on a couple of more terms.

More terminology

Level

Expected value has another name, which, again varies depending on who wrote the
text book: baseline, intercept (as in
Y-intercept) or
level. We will stick with “level” here.

So level is that one predicted point that we learned how to calculate
in Part I. But because now it’s going to be only part of calculation
of the forcast, we can no longer refer to it as $\hat{y}$ and will instead
use $\ell$.

Trend or Slope

You should be familiar with
slope from your high school algebra
class. What you might be a little rusty on is how to calculate it,
which is important, because a series slope has an interesting
characteristic. Slope is:

where $\Delta{y}$ is the difference in the $y$ coordinates and
$\Delta{x}$ is the difference in the $x$ coordinates, respectively,
between two points. While in real algebraic problems $\Delta{x}$ could
be anything, in a series, from one point to the next, it is always
1. Which means that for a series, slope between two adjacent points
is simply $\dfrac{\Delta{y}} {1}$ or $\Delta{y}$, or:

Where $b$ is trend. To the best of my understanding terms “trend”
and “slope” are interchangeable. In forecasting parlance “trend” is
more common, and in math notation forecasters refer to it as $b$
rather than $m$.

Additive vs Multiplicative

Another thing to know about trend is that instead of subtracting
$y_{x-1}$ from $y_x$, we could divide one by the other thereby
getting a ratio. The difference between these two approaches is
similar to how we can say that something costs $20 more or 5%
more. The variant of the method based on subtraction is known as
additive, while the one based on division is known as
multiplicative.

Practice shows that a ratio (i.e. multiplicative) is a more stable
predictor. The additive method, however is simpler to understand, and
going from additive to multiplicative is trivial once you understand
this whole thing. For this reason we will stick with the additive
method here, leaving the multiplicative method an exercise for the
reader.

Double Exponential Smoothing

So now we have two components to a series: level and trend. In Part I
we learned several methods to forecast the level, and it should follow
that every one of these methods can be applied to the trend
just as well. E.g. the naive method would assume that trend between
last two points is going to stay the same, or we could average all
slopes between all points to get an average trend, use a moving trend
average or apply exponential smoothing.

Double exponential smoothing then is nothing more than exponential
smoothing applied to both level and trend. To express this in
mathematical notation we now need three equations: one for level, one
for the trend and one to combine the level and trend to get the
expected $\hat{y}$.

The first equation is from Part I, only now we’re using $\ell$ instead
of $\hat{y}$ and on the right side the expected value becomes the sum
of level end trend.

The second equation introduces $\beta$, the trend factor (or
coefficient). As with $\alpha$, some values of ${\beta}$ work better
than others depending on the series.

Similarly to single exponential smoothing, where we used the first
observed value as the first expected, we can use the first observed
trend as the first expected. Of course we need at least two points to
compute the initial trend.

Because we have a level and a trend, this method can forecast not one,
but two data points. In Python:

1234567891011121314151617

# given a series and alpha, return series of smoothed pointsdefdouble_exponential_smoothing(series,alpha,beta):result=[series[0]]forninrange(1,len(series)+1):ifn==1:level,trend=series[0],series[1]-series[0]ifn>=len(series):# we are forecastingvalue=result[-1]else:value=series[n]last_level,level=level,alpha*value+(1-alpha)*(level+trend)trend=beta*(level-last_level)+(1-beta)*trendresult.append(level+trend)returnresult# >>> double_exponential_smoothing(series, alpha=0.9, beta=0.9)# [3, 17.0, 15.45, 14.210500000000001, 11.396044999999999, 8.183803049999998, 12.753698384500002, 13.889016464000003]

And here is a picture of double exponential smoothing in action (the
green dotted line).

Quick Review

We’ve learned that a data point in a series can be represented as a
level and a trend, and we have learned how to appliy exponential
smoothing to each of them to be able to forecast not one, but two
points.

In Part III
we’ll finally talk about triple exponential smoothing.

This three part write up [Part IIPart III]
is my attempt at a down-to-earth explanation (and Python code) of the
Holt-Winters method for those of us who while hypothetically might be
quite good at math, still try to avoid it at every opportunity. I had
to dive into this subject while tinkering on
tgres (which features a Golang implementation). And
having found it somewhat complex (and yet so brilliantly
simple), figured that it’d be good to share this knowledge, and
in the process, to hopefully solidify it in my head as well.

Triple Exponential Smoothing,
also known as the Holt-Winters method, is one of the many methods or
algorithms that can be used to forecast data points in a series,
provided that the series is “seasonal”, i.e. repetitive over some
period.

A little history

Еxponential smoothing in some form or another dates back to the work
of Siméon Poisson (1781-1840),
while its application in forecasting appears to have been pioneered over a century later in 1956 by
Robert Brown (1923–2013)
in his publication
Exponential Smoothing for Predicting Demand,
(Cambridge, Massachusetts). [Based on the URL it seems Brown was working on forecasting tobacco demand?]

In 1957 an MIT and University of Chicago
graduate, professor Charles C Holt
(1921-2010) was working at CMU (then known as CIT) on forecasting trends in production,
inventories and labor force.
It appears that Holt and Brown worked independently and knew not of each-other’s work.
Holt published a paper “Forecasting trends
and seasonals by exponentially weighted moving averages” (Office of Naval Research Research
Memorandum No. 52, Carnegie Institute of Technology) describing
double exponential smoothing. Three years later, in 1960, a student of
Holts (?) Peter R. Winters improved the algorithm by adding seasonality and
published
Forecasting sales by exponentially weighted moving averages
(Management Science 6, 324–342), citing Dr. Holt’s 1957 paper as earlier work on the same subject.
This algorithm became known as triple exponential smoothing or the Holt-Winters method,
the latter probably because it was described in a 1960 Prentice-Hall book “Planning Production, Inventories, and Work Force”
by Holt, Modigliani, Muth,
Simon,
Bonini and Winters - good luck finding a copy!

Curiously, I’ve not been able to find any personal information on Peter R. Winters online. If you find anything, please let me
know, I’ll add a reference here.

In 2000 the Holt-Winters method became well known in the ISP
circles at the height of the .com boom when Jake D. Brutlag (then of WebTV) published
Aberrant Behavior Detection in Time Series for Network Monitoring
(Proceedings of the 14th Systems Administration Conference, LISA
2000). It described how an open source C
implementation [link to the actual commit]
of a variant of the Holt-Winters seasonal method, which he contributed as a feature
to the very popular at ISPs RRDTool, could be used to
monitor network traffic.

In 2011 the RRDTool implementation contributed by Brutlag was
ported
to Graphite by Matthew Graham thus making it even more popular in the
devops community.

So… how does it work?

Forecasting, Baby Steps

The best way to explain triple exponential smoothing is to gradually
build up to it starting with the simplest forecasting methods. Lest
this text gets too long, we will stop at triple exponential smoothing,
though there are quite a few other methods known.

I used mathematical notation only where I thought it made best sense, sometimes
accompanied by an “English translation”, and where appropriate
supplemented with a bit of Python code.
In Python I refrain from using any non-standard packages, keeping the
examples plain. I chose not to use generators
for clarity. The objective here is to explain
the inner working of the algorithm so that you can implement it
yourself in whatever language you prefer.

I also hope to demonstrate that this is simple enough that you do not
need to resort to SciPy or whatever
(not that there is anything wrong with that).

But First, Some Terminology

Series

The main subject here is a series. In the real world we are most
likely to be applying this to a time series, but for this discussion
the time aspect is irrelevant. A series is merely an ordered sequence
of numbers. We might be using words that are chronological in nature
(past, future, yet, already, time even!), but only because it makes it easer to
understand. So forget about time, timestamps, intervals,
time does not exist,
the only property each data point has (other than the value) is its order: first,
next, previous, last, etc.

It is useful to think of a series as a list of two-dimensional $x,y$
coordinates, where $x$ is order (always going up by 1), and $y$ is
value. For this reason in our math formulas we will be sticking to $y$
for value and $x$ for order.

Observed vs Expected

Forecasting is estimating values that we do not yet know based on the
the values we do know. The values we know are referred to as
observed while the values we forecast as expected. The math
convention to denote expected values is with the
circumflex a.k.a. “hat”: $\hat{y}$

For example, if we have a series that looks like [1,2,3], we might
forecast the next value to be 4. Using this terminology, given
observed series [1,2,3] the next expected value ${\hat{y}_4}$ is 4.

Method

We may have intuited based on [1,2,3] that in this series each value
is 1 greater than the previous, which in math notation can
be expressed as and $\hat{y}_{x + 1} = y_x + 1$. This equation, the
result of our intuition, is known as a forecast method.

If our method is correct then the next observed value would indeed be
4, but if [1,2,3] is actually part of a
Fibonacci sequence, then where we
expected ${\hat{y}_4 = 4}$, we would observe $y_4 = 5$. Note the hatted
${\hat{y}}$ (expected) in the former and $y$ (observed) in the latter expression.

Error, SSE and MSE

It is perfectly normal to compute expected values where we already
have observed values. Comparing the two lets you compute the error,
which is the difference between observed and expected and is an
indispensable indication of the accuracy of the method.

Since difference can be negative or positive, the common convention is
to use the absolute value or square the error so that the number is always
positive. For a whole series the squared errors are typically summed
resulting in Sum of Squared Errors (SSE).
Sometimes you may come across Mean Squared Error
(MSE) which is simply $\sqrt{SSE}$.

And Now the Methods (where the fun begins!)

In the next few examples we are going to be using this tiny series:

1

series=[3,10,12,13,12,10,12]

(Feel free to paste it and any of the following code snippets into your Python
repl)

Naive Method

This is the most primitive forecasting method. The premise of the
naive method is that the expected point is equal to the last
observed point:

Using this method we would forecast the next point to be 12.

Simple Average

A less primitive method is the arithmetic average
of all the previously observed data points. We take all the values we
know, calculate the average and bet that that’s going to be the next value. Of course it won’t be it exactly,
but it probably will be somewhere in the ballpark, hopefully you can see the reasoning behind this
simplistic approach.

(Okay, this formula is only here because I think the capital Sigma
looks cool. I am sincerely hoping that the average requires no explanation.) In Python:

123456

defaverage(series):returnfloat(sum(series))/len(series)# Given the above series, the average is:# >>> average(series)# 10.285714285714286

As a forecasting method, there are actually situations where it’s spot
on. For example your final school grade may be the average of all the
previous grades.

Moving Average

An improvement over simple average is the average of $n$ last
points. Obviously the thinking here is that only the recent values
matter. Calculation of the moving average involves what is sometimes
called a “sliding window” of size $n$:

12345678

# moving average using n last pointsdefmoving_average(series,n):returnaverage(series[-n:])# >>> moving_average(series, 3)# 11.333333333333334# >>> moving_average(series, 4)# 11.75

A moving average can actually be quite effective, especially if you
pick the right $n$ for the series. Stock analysts adore it.

Also note that simple average is a variation of a moving average, thus
the two functions above could be re-written as a single recursive one
(just for fun):

A weighted moving average is a moving average where within the
sliding window values are given different weights, typically so that
more recent points matter more.

Instead of selecting a window size, it requires a list of weights
(which should add up to 1). For example if we picked [0.1,
0.2, 0.3, 0.4] as weights, we would be giving 10%, 20%, 30% and 40%
to the last 4 points respectively. In Python:

1234567891011

# weighted average, weights is a list of weightsdefweighted_average(series,weights):result=0.0weights.reverse()forninrange(len(weights)):result+=series[-n-1]*weights[n]returnresult# >>> weights = [0.1, 0.2, 0.3, 0.4]# >>> weighted_average(series, weights)# 11.5

Weighted moving average is fundamental to what follows, please take a
moment to understand it, give it a think before reading on.

I would also like to stretch the importance of the weights adding up
to 1. To demonstrate why, let’s say we pick weights [0.9, 0.8, 0.7,
0.6] (which add up to 3.0). Watch what happens:

12

>>> weighted_average(series, [0.9, 0.8, 0.7, 0.6])
>>> 35.5 # <--- this is clearly bogus

Picture time!

Here is a picture that demonstrates our tiny series and all of the above
forecasts (except for naive).

It’s important to understand that which of the above methods is better
very much depends on the nature of the series. The order in which I
presented them was from simple to complex, but “more complex” doesn’t
necessarily mean “better”.

Single Exponential Smoothing

Here is where things get interesting. Imagine a weighted average where
we consider all of the data points, while assigning exponentially
smaller weights as we go back in time. For example if we started with
0.9, our weights would be (going back in time):

…eventually approaching the big old zero. In some way this is very
similar to the weighted average above, only the weights are dictated
by math, decaying uniformly. The smaller the starting weight, the
faster it approaches zero.

Only… there is a problem: weights do not add up to 1. The sum of
the first 3 numbers alone is already 2.439! (Exercise for the reader: what number
does the sum of the weights approach and why?)

What earned Poisson, Holts or Roberts a permanent place in the history
of Mathematics is solving this with a succinct and elegant formula:

If you stare at it just long enough, you will see that the expected
value $\hat{y}_x$ is the sum of two products: $\alpha \cdot y_x$ and
$(1-\alpha) \cdot \hat{y}_{x-1}$. You can think of $\alpha$ (alpha)
as a sort of a starting weight 0.9 in the above (problematic)
example. It is called the smoothing factor or smoothing
coefficient (depending on who wrote your text book).

So essentially we’ve got a weighted moving average with two weights:
$\alpha$ and $1-\alpha$. The sum of $\alpha$ and $1-\alpha$ is 1, so
all is well.

Now let’s zoom in on the right side of the sum. Cleverly, $1-\alpha$
is multiplied by the previous expected value
$\hat{y}_{x-1}$. Which, if you think about it, is the result of the
same formula, which makes the expression recursive (and programmers
love recursion), and if you were to write it all out on paper you would
quickly see that $(1-\alpha)$ is multiplied by itself again and again
all the way to beginning of the series, if there is one, infinitely
otherwise. And this is why this method is called
exponential.

Another important thing about $\alpha$ is that its value dictates how
much weight we give the most recent observed value versus the last
expected. It’s a kind of a lever that gives more weight to the left
side when it’s higher (closer to 1) or the right side when it’s lower
(closer to 0).

Perhaps $\alpha$ would be better referred to as memory decay rate: the
higher the $\alpha$, the faster the method “forgets”.

Why is it called “smoothing”?

To the best of my understanding this simply refers to the effect these
methods have on a graph if you were to plot the values: jagged lines
become smoother. Moving average also has the same effect, so it
deserves the right to be called smoothing just as well.

Implementation

There is an aspect of this method that programmers would appreciate
that is of no concern to mathematicians: it’s simple and efficient to
implement. Here is some Python. Unlike the previous examples, this
function returns expected values for the whole series, not just one
point.

1234567891011

# given a series and alpha, return series of smoothed pointsdefexponential_smoothing(series,alpha):result=[series[0]]# first value is same as seriesforninrange(1,len(series)):result.append(alpha*series[n]+(1-alpha)*result[n-1])returnresult# >>> exponential_smoothing(series, 0.1)# [3, 3.7, 4.53, 5.377, 6.0393, 6.43537, 6.991833]# >>> exponential_smoothing(series, 0.9)# [3, 9.3, 11.73, 12.873000000000001, 12.0873, 10.20873, 11.820873]

The figure below shows exponentially smoothed version of our series
with $\alpha$ of 0.9 (red) and $\alpha$ of 0.1 (orange).

Looking at the above picture it is apparent that the $\alpha$ value of 0.9
follows the observed values much closer than 0.1. This isn’t going to
be true for any series, each series has its best $\alpha$ (or
several). The process of finding the best $\alpha$ is referred to as
fitting and we will discuss it later separately.

Quick Review

We’ve learned some history, basic terminology (series and how it knows
no time, method, error SSE, MSE and fitting). And we’ve learned some
basic forecasting methods: naive, simple average, moving average,
weighted moving average and, finally, single exponential smoothing.

One very important characteristic of all of the above methods is that
remarkably, they can only forecast a single point. That’s correct,
just one.

In Part II we will focus on methods that can forecast more than
one point.

With the latest advances in PostgreSQL (and other db’s), a relational
database begins to look like a very viable TS storage platform. In
this write up I attempt to show how to store TS in PostgreSQL.

A TS is a series of [timestamp, measurement] pairs, where measurement
is typically a floating point number. These pairs (aka “data points”)
usually arrive at a high and steady rate. As time goes on, detailed
data usually becomes less interesting and is often consolidated into
larger time intervals until ultimately it is expired.

The obvious approach

The “naive” approach is a three-column table, like so:

1

CREATETABLEts(idINT,timeTIMESTAMPTZ,valueREAL);

(Let’s gloss over some details such as an index on the time column and
choice of data type for time and value as it’s not relevant to this
discussion.)

One problem with this is the inefficiency of appending data. An insert
requires a look up of the new id, locking and (usually) blocks until
the data is synced to disk. Given the TS’s “firehose” nature, the
database can quite quickly get overwhelmed.

This approach also does not address consolidation and eventual
expiration of older data points.

Round-robin database

A better alternative is something called a round-robin database. An
RRD is a circular structure with a separately stored pointer denoting
the last element and its timestamp.

A everyday life example of an RRD is a week. Imagine a structure of 7
slots, one for each day of the week. If you know today’s date and day
of the week, you can easily infer the date for each slot. For example
if today is Tuesday, April 1, 2008, then the Monday slot refers to
March 31st, Sunday to March 30th and (most notably) Wednesday to March
26.

Here’s what a 7-day RRD of average temperature might look as of
Tuesday, April 1:

Note how little has changed, and that the update required no
allocation of space: all we did to record 92F on Wednesday is
overwrite one value. Even more remarkably, the previous value
automatically “expired” when we overwrote it, thus solving the
eventual expiration problem without any additional operations.

RRD’s are also very space-efficient. In the above example we specified
the date of every slot for clarity. In an actual implementation only
the date of the last slot needs to be stored, thus the RRD can be kept
as a sequence of 7 numbers plus the position of the last entry and
it’s timestamp. In Python syntax it’d look like this:

1

[[79,82,90,92,75,80,81],2,1207022400]

Round-robin in PostgreSQL

Here is a naive approach to having a round-robin table. Carrying on
with our 7 day RRD example, it might look like this:

Somewhere separately we’d also need to record that the last entry is
week_day 3 (Tuesday) and it’s 2008-04-01. Come April 2, we could
record the temperature using:

1

UPDATEseriesSETtemp_f=92WHEREweek_day=4;

This might be okay for a 7-slot RRD, but a more typical TS might have
a slot per minute going back 90 days, which would require 129600
rows. For recording data points one at a time it might be fast enough,
but to copy the whole RRD would require 129600 UPDATE statements which
is not very efficient.

This is where using PostgrSQL arrays become very useful.

Using PostgreSQL arrays

An array would allow us to store the whole series in a single
row. Sticking with the 7-day RRD example, our table would be created
as follows:

(In PostgreSQL arrays are 1-based, not 0-based like in most
programming languages)

But it could be even more efficient

Under the hood, PostgreSQL data is stored in pages of 8K. It would
make sense to keep chunks in which our RRD is written to disk in line
with page size, or at least smaller than one page. (PostgreSQL provides
configuration parameters for how much of a page is used, etc, but this
is way beyond the scope of this article).

Having the series split into chunks also paves the way for some kind
of a caching layer, we could have a server which waits for one row
worth of data points to accumulate, then flushes then all at once.

For simplicity, let’s take the above example and expand the RRD to 4
weeks, while keeping 1 week per row. In our table definition we need
provide a way for keeping the order of every row of the TS with a
column named n, and while we’re at it, we might as well introduce a
notion of an id, so as to be able to store multiple TS in the same
table.

Let’s start with two tables, one called rrd where we would store the
last position and date, and another called ts which would store the
actual data.

The last_pos of 25 is n * 7 + 1 (since arrays are 1-based).

This article omits a lot of detail such as having resolution finer
than one day, but it does describe the general idea. For an actual
implementation of this you might want to check out a project I’ve been
working on: timeriver

Back in my ISP days, we used data stored in RRDs to bill our
customers. I wouldn’t try this with Graphite. In this write up I try
to explain why it is so by comparing the method of recording time series
used by
Graphite,
with the one used by RRDTool.

Graphite uses
Whisper to store data, which in
the FAQ is portrayed as a better alternative to RRDTool, but
this is potentially misleading, because the flexibility afforded by the
design of Whisper comes at the price of inaccuracy.

A time series is most often described as a sequence of (time, value)
tuples [1]. The most naive method of recording a time series is to
store timestamps as is. Since the data points might arrive at
arbitrary and inexact intervals, to correlate the series with a
particular point in time might be tricky. If data points are arriving
somewhere in between one minute bounaries (as they always naturally
would), to answer the question of what happened during a particular
minute would require specifying a range, which is not as clean as
being able to specify a precise value. To join two series on a range
is even more problematic.

One way to improve upon this is to divide time into equal intervals
and assign data points to the intervals. We could then use the
beginning of the interval instead of the actual data point timestamp,
thereby giving us more uniformity. For example, if our interval size
is 10 seconds (I may sometimes refer to it as the step), we could
divide the entire timeline starting from the
beginning of the epoch
and until the end of
universe into 10 second slots. Since the first slot begins at 0, any
10-second-step time series will have slots starting at the exact same
times. Now correlation across series or other time values becomes much
easier.

Calculating the slot is trivially easy: time - time % step (% being
the modulo operator).
There is, however, a subtle complexity lurking when it comes to
storing the datapoint with the adjusted (or aligned) timestamp.
Graphite simply changes the timestamp of the data point to the
aligned one. If multiple data points arrive in the same
step, then the last one “wins”.

On the surface there is little wrong with Graphite’s approach. In fact,
under right circumstances, there is absolutely nothing wrong with
it. Consider the following example:

1234567

Graphite, 10 second step.
Actual Time Aligned Time
1430701282 1430701280 50 <-- This data point is lost
1430701288 1430701280 10
1430701293 1430701290 30
1430701301 1430701300 30

Let’s pretend those values are some system metric like the number of
files open. The consequence of the 50 being dropped is that we will
never know it existed, but towards the end of the 10 second interval
it went down to 10, which is still a true fact. If we really wanted to
know about the variations within a 10 second interval, we should have
chosen a smaller step, e.g. 1 second. By deciding that the step is
going to be 10 seconds, we thus declared that variations within a
smaller period are of no interest to us, and from this perspective,
Graphite is correct.

But what if those numbers are the price of a stock: there may be
hundreds of thousand of trades within a 10 second interval, yet we do
not want to (or cannot, for technical reasons) record every single one
of them? In this scenario having the last value override all previous
ones doesn’t exactly seem correct.

Enter RRDTool which uses a different method. RRDTool keeps track of
the last timestamp and calculates a weight for every incoming
data point based on time since last update or beginning of the step and
the step length. Here is what the same sequence of points looks like
in RRDTool. The lines marked with a * are not actual data points,
but are the last value for the preceding step, it’s used for
computing the value for the remainder of the step after a new one has
begun.

123456789101112131415

RRDTool, 10 second step.
Time Value Time since Weight Adjusted Recorded
last value value
1430701270 0 N/A
* 1430701280 50 10s 1 50* 1= 50
50
1430701282 50 2s .2 50*.2= 10
1430701288 10 6s .6 10*.6= 6
* 1430701290 30 2s .2 30*.2= 6
10+6+6= 22
1430701293 30 3s .3 30*.3= 9
* 1430701300 30 7s .7 30*.7= 21
9+21= 30
1430701301 30 # this data point is incomplete

Note, by the way, that the Whisper FAQ says that “RRD will store your
updates in a temporary workspace area and after the minute has passed,
aggregate them and store them in the archive”, which to me sounds like
there is some sort of a temporary storage area holding all the unsaved
updates. In fact, to be able to compute the weighted average, RRD only
needs to store the time of the last update and the current sum, i.e.
exactly just two variables, regardless of the number of updates in a
single step. This is evident from the above figure.

Before you say “so what, I don’t really understand the difference”,
let’s pretend that those numbers were actually the rate of sale of
trinkets from our website (per second). Here is a horizontal ascii-art
rendition of our timeline, 0 is 1430701270.

1234

0 10 20 30 time (seconds)
+.........+.........+.........+.....
| | | | |
0 50 10 30 30 data points

At 12 seconds we recorded selling 50 trinkets per second. Assuming we
started selling at the beginning of our timeline, i.e. 12 seconds
earlier, we can state that during the first step we sold exactly 500
trinkets. Then 2 seconds into the second step we sold another 100
(we’re still selling at 50/s). Then for the next 6 seconds we were
selling at 10/s, thus another 60 trinkets, and for the last 2 seconds
of the slot we sold another 60 at 30/s. In the third step we were
selling steadily at 30/s, thus exactly 300 were sold.

Comparing RRDTool and Graphite side-by-side, the stories are quite different:

12345678

Trinkets per second and sold:
Time Slot Graphite Trinkets RRDTool Trinkets
1. 1430701270 N/A N/A 50 500
2. 1430701280 10 100 22 220 (100+60+60)
3. 1430701290 30 300 30 300
4. 1430701300 30 N/A N/A N/A
----- -----
TOTAL SOLD: 400 1020

Two important observations here:

The totals are vastly different.

The rate recorded by RRDTool for the second slot (22/s), yields
exactly the number of trinkets sold during that period: 220.

Last, but hardly the least, consider what happens when we consolidate
data points into larger intervals by averaging the values. Let’s say
20 seconds, twice our step. If we consolidate the second and the third
steps, we would get:

12

Graphite: average(10,30) = 20 => 400 trinkets in 20 seconds
RRDTool: average(22,30) = 26 => 520 trinkets in 20 seconds

Since the Graphite numbers were off to begin with, we have no reason
to trust the 400 trinkets number. But using the RRDTool data, the new
number happens to still be 100% accurate even after the data points
have been consolidated. This is a very useful property of rates in
time series. It also explains why RRDTool does not permit updating
data prior to the last update: RRD is always accurate.

As an exercise, try seeing it for yourself: pretent the value of 10 in
the second step never arrived, which should make the final value of
the second slot 34. If the 10 arrived some time later, averaging it in
will not give you the correct 22.

Whisper allows past updates, but is quasi-accurate to begin with - I’m
not sure I understand which is better - inaccurate data with a data
point missing, or the whole inaccurate data. RRD could accomplish
the same thing by adding some --inaccurate flag, though it would
seem like more of a bug than a feature to me.

If you’re interested in learning more about this, I recommend reading
the documentation for
rrdtool create, in
particular the “It’s always a Rate” section, as well as
this post
by Alex van den Bogaerdt.

P.S. After this post was written, someone suggested that instead of
storing a rate, we coud store a count delta. In other words, instead
of recording that we’re selling 10 trinkets per second for the past 6
seconds, we would store the total count of trinkets sold, i.e. 60. At
first this seems like the solution to being able to update historical
data accurately: if later we found out that we sold another 75
trinkets in the second time slot, we could just add it to the total
and all would be well and most importantly accurate.

Here is the problem with this approach: note that in the previous
sentence I had to specify that the additional trinkets were sold in
the second time slot, a small, but crucial detail. If time series
data point is a timestamp and a value, then there isn’t even a way to
relay this information in a single data point - we’d need two
timestamps. On the other hand if every data point arrived with two
timestamps, i.e. as a duration, then which to store, rate or count,
becomes a moot point, we can infer one from the other.

So perhaps another way of explaining the historical update problem is
that it is possible, but the datapoint must specify a time
interval. This is something that neither RRDTool or Graphite
currently support, even though it’d be a very useful feature in my
opinion.

[1] Perhaps the biggest misconception about time series is that it is
a series of data points. What time series represent is continuous
rather than descrete, i.e. it’s the line that connects the points
that matters, not the specific points themselves, they are just
samples at semi-random intervals that help define the line. And as we
know, a line cannot be defined by a single point.

Time Series is on its way to becoming a buzzword in the Information
Technology circles. This has to do with the looming Internet of Things
which shall cause the Great Reversal of Internet whereby upstream flow
of data produced by said Things is expected to exceed the downstream
flow. Much of this data is expected to be of the Time Series kind.

This, of course, is a money-making opportunity of the Big Data
proportions all over again, and I predict we’re going to see a lot of
Time Series support of various shapes and forms appearing in all
manners of (mostly commercial) software.

But is there really such a thing as the problem specifically inherent
to Time Series data which warrants a specialized solution? I’ve been
pondering this for some time now, and I am still undecided. This
here is my attempt at arguing that TS is not a special problem and
that it can be done by using a database like PostgreSQL.

Influx of data and write speeds

One frequently cited issue with time series data is that it arrives in
large volumes at a steady pace which renders buffered writes
useless. The number of incoming data streams can also be large
typically causing a disk seek per stream and further complicating the
write situation. TS data also has a property where often more data is
written than read because it’s possible for a datapoint to be
collected and examined only once, if ever. In short, TS is very
write-heavy.

But is this unique? For example logs have almost identical
properties. The real question here is whether our tried and true
databases such as PostgreSQL are ill-equipped to deal with large
volumes of incoming data requiring an alternative solution.

When considering incoming data I am tempted to imagine every US
household sending it, which, of course, would require massive
infrastructure. But this (unrealistic) scenario is not a TS data
problem, it’s one of scale, the same one from which the Hadoops and
Cassandras of this world were born. What is really happening here is
that TS happens to be yet another thing that requires the difficult to
deal with “big data” infrastructure and reiterates the need for an
easy-to-setup horizontally scalable database (which PostgreSQL isn’t).

The backfill problem

This is the problem of having to import vast amounts of historical
data. For example OpenTSDB goes to great lengths to optimize
back-filling by structuring it in specific ways and storing compressed
blobs of data.

But just like the write problem, it’s not unique to TS. It
is another problem that is becoming more and more pertinent as our
backlogs of data going back to when we stopped using paper keep
growing and growing.

Downsampling

Very often TS data is used to generate charts. This is an artifact of
the human brain being spectacularly good at interpreting a visual
representation of a relationship between streams of numbers while
nearly incapable of making sense of data in tabular form. When
plotting, no matter how much data is being examined, the end result is
limited to however many pixels are available on the display. Even
plotting aside, most any use of time series data is in an aggregated
form.

The process of consolidating datapoints into a smaller number (e.g.
the pixel width of the chart), sometimes called downsampling, involves
aggregation around a particular time interval or simply picking every
Nth datapoint.

As an aside, selecting every Nth row of a table is an interesting SQL
challenge, in PostgreSQL it looks like this (for every 100th row):

123

SELECT time, data FROM
(SELECT *, row_number() OVER (ORDER BY time) as n FROM data_points) dp
WHERE dp.n % 100 = 0 ORDER BY time

Aggregation over a time interval similar to how InfluxDB does it with
the GROUP BY time(1d) syntax can be easily achieved via the
date_trunc('day', time).

Another aspect of downsampling is that since TS data is immutable,
there is no need to repeatedly recompute the consolidated version. It
makes more sense to downsample immediately upon the receipt of the
data and to store it permanently in this form. RRDTool’s Round-Robin
database is based entirely on this notion. InfluxDB’s continuous
queries is another way persistent downsampling is addressed.

Again, there is nothing TS-specific here. Storing data in summary form
is quite common in the data analytics world and a “continuous query”
is easily implemented via a trigger.

Derivatives

Sometimes the data from various devices exists in the form of a
counter, which requires the database to derive a rate by comparing
with a previous datapoint. An example of this is number of bytes sent
over a network interface. Only the rate of change of this value is
relevant, not the number itself. The rate of change is the difference
with the previous value divided over the time interval passed.

Referring to a previous row is also a bit tricky but perfectly doable
in SQL. It can accomplished by using windowing functions such as
lag().

123456

SELECT time,
(bytes - lag(bytes, 1) OVER w) / extract(epoch from (time - lag(time, 1) OVER w))::numeric
AS bytes_per_sec
FROM data_points
WINDOW w AS (ORDER BY time)
ORDER BY time

Expiration

It is useful to downsample data to a less granular form as it ages,
aggregating over an ever larger period of time and possibly purging
records eventually. For example we might want to store minutely data
for a week, hourly for 3 months, daily for 3 years and drop all data
beyond 3 years.

Databases do not expire rows “natively” like Cassandra or Redis, but it
shouldn’t be too hard to accomplish via some sort of a periodic cron
job or possibly even just triggers.

Heartbeat and Interval Filling

It is possible for a time series stream to pause, and this can be
interpreted in different ways: we can attempt to fill in missing data,
or treat it as unknown. More likely we’d want to start treating it as
unknown after some period of silence. RRDTool addresses this by
introducing the notion of a heartbeat and the number of missed beats
before data is treated as unknown.

Regardless of whether the value is unknown, it is useful to be able to
fill in a gap (missing rows) in data. In PostgreSQL this can be
accomplished by a join with a result set from the generate_series()
function.

Data Seclusion

With many specialized Time Series tools the TS data ends up being
secluded in a separate system not easily accessible from the rest of
the business data. You cannot join your customer records with data in
RRDTool or Graphite or InfluxDB, etc.

Conclusion

If there is a problem with using PosgreSQL or some other database for
Time Series data, it is mainly that of having to use advanced SQL
syntax and possibly requiring some cookie-cutter method for managing
Time Series, especially when it is a large number or series and high
volume.

There is also complexity in horizontally scaling a relational database
because it involves setting up replication, sharding, methods for
recovery from failure and balancing the data. But these are not
TS-specific problems, they are scaling problems.

Having written this up, I’m inclined to think that perhaps there is
no need for a specialized “Time Series Database”, instead it can be
accomplished by an application which uses a database for storage and
abstracts the users from the complexities of SQL and potentially even
scaling, while still allowing for direct access to the data via the
rich set of tools that a database like PostgreSQL provides.

A nice, reliable, horizontally scalable database that is designed
specifically to tackle the problem of Time Series data (and does not
require you to stand up a Hadoop cluster) is very much missing from the
Open Source Universe right now.

InfluxDB might be able to fill this gap, it certainly aims to.

I was curious about how it structures and stores data and since there
wasn’t much documentation on the subject and I ended up just reading
the code, I figured I’d write this up. I only looked at the new
(currently 0.9.0 in RC stage) version, the previous versions are
significantly different.

First of all, InfluxDB is distributed. You can run one node, or a
bunch, it seems like a more typical number may be 3 or 5. The nodes
use Raft to establish consensus and maintain data consistency.

InfluxDB feels a little like a relational database in some aspects
(e.g. it has a SQL-like query language) but not in others.

The top level container is a database. An InfluxDB database is very
much like what a database is in MySQL, it’s a collection of other
things.

“Other things” are called data points, series, measurements,
tags and retention policies. Under the hood (i.e. you never deal
with them directly) there are shards and shard groups.

The very first thing you need to do in InfluxDB is create a database
and at least one retention policy for this database. Once you have
these two things, you can start writing data.

A retention policy is the time period after which the data expires. It
can be set to be infinite. A data point, which is a measurement
consisting of any number of values and tags associated with a
particular point in time, must be associated with a database and a
retention policy. A retention policy also specifies the replication
factor for the data point.

Let’s say we are tracking disk usage across a whole bunch of
servers. Each server runs some sort of an agent which periodically
reports the usage of each disk to InfluxDB. Such a report might look
like this (in JSON):

In the above example, “disk” is a measurement. Thus we can operate on
anything “disk”, regardless of what “server” or “unit” it applies
to. The data point as a whole belongs to a (time) series identified by
the combination of the measurement name and the tags.

There is no need to create series or measurements, they are created on
the fly.

To list the measurements, we can use SHOW MEASUREMENTS:

1234

> show measurements
name tags name
---- ---- ----
measurements disk

We can use SHOW SERIES to list the series:

1234

> show series
name tags id server unit
---- ---- -- ------- ----
disk 1 bw123 1

If we send a record that contains different tags, we automatically
create a different series (or so it seems), for example if we send
this (note we changed “unit” to “foo”):

> show series
name tags id foo server unit
---- ---- -- --- ------ ----
disk 1 bwi23 1
disk 2 bar bwi23

This is where the distinction between measurement and series becomes a
little confusing to me. In actuality (from looking at the code and the
actual files InfluxDB created) there is only one series here called
“disk”. I understand the intent, but not sure that series is the
right terminology here. I think I’d prefer if measurements were simply
called series, and to get the equivalent of SHOW SERIES you’d use
something like SHOW SERIES TAGS. (May be I’m missing something.)

Under the hood the data is stored in shards, which are grouped by
shard groups, which in turn are grouped by retention policies, and
finally databases.

A database contains one or more retention policies. Somewhat
surprisingly a retention policy is actually a bucket. It makes sense
if you think about the problem of having to expire data points - you
can remove them all by simply dropping the entire bucket.

If we declare a retention policy of 1 day, then we can logically
divide the timeline into a sequence of single days from beginning of
the epoch. Any incoming data point falls into its corresponding
segment, which is a retention policy bucket. When clean up time comes
around, we can delete all days except for the most current day.

To better understand the following paragraphs, consider that having
multiple nodes provides the option for two things: redundancy and
distribution. Redundancy gives you the ability to lose a node
without losing any data. The number of copies of the data is
controlled by the replication factor specified as part of the
retention policy. Distribution spreads the data across nodes which
allows for concurrency: data can be written, read and processed in
parallel. For example if we become constrained by write performance,
we can solve this by simply adding more nodes. InfluxDB favors
redundancy over distribution when having to choose between the two.

Each retention policy bucket is further divided into shard groups, one
shard group per series. The purpose of a shard group is to balance
series data across the nodes of the cluster. If we have a cluster of 3
nodes, we want the data points to be evenly distributed across these
nodes. InfluxDB will create 3 shards, one on each of the nodes. The 3
shards comprise the shard group. This is assuming the replication
factor is 1.

But if the replication factor was 2, then there needs to be 2
identical copies of every shard. The shard copies must be on separate
nodes. With 3 nodes and replication factor of 2, it is impossible to
do any distribution across the nodes - the shard group will have a
size of 1, and contain 1 shard, replicated across 2 nodes. In this set
up, the third node will have no data for this particular retention
policy.

If we had a cluster of 5 nodes and the replication factor of 2, then
the shard group can have a size of 2, for 2 shards, replicated across
2 nodes each. Shard one replicas could live on nodes 1 and 3, while
shard two replicas on nodes 2 and 4. Now the data is distributed as
well as redundant. Note that the 5th node doesn’t do anything. If we
up the replication factor to 3 then just like before, the cluster is
too small to have any distribution, we only have enough nodes for
redundancy.

As of RC15 distributed queries are not yet implemented, so you will
always get an error if you have more than one shard in a group.

The shards themselves are instances of Bolt db - a simple to use key/value store
written in Go. There is also a separate Bolt db file called meta which
stores the metadata, i.e. information about databases, retention
policies, measurements, series, etc.

I couldn’t quite figure out the process for typical cluster operations
such as recovery from node failure or what happens (or should happen)
when nodes are added to existing cluster, whether there is a way to
decommission a node or re-balance the cluster similar to the Hadoop
balancer, etc. I think as of this writing this has not been fully
implemented yet, and there is no documentation, but hopefully it’s
coming soon.

Recently I found myself needing to connect to HiveServer2 with
Kerberos authentication enabled from a Ruby app. As it turned out
rbhive gem we were using did not have
support for Kerberos authentication. So I had to
roll my own.

This post is to document the experience of figuring out the details of
a SASL/GSSAPI connection before it is lost forever in my neurons and synapses.

First, the terminology. The authentication system that Hadoop uses is
Kerberos. Note that Kerberos is not a
network protocol. It describes the method by which
authentication happens, but not the format of how to send Kerberos
tickets and what not over the wire. For that, you need SASL and
GSSAPI.

SASL is a generic protocol
designed to be able to wrap just about any authentication
handshake. It’s very simple: the client sends a START followed by some
payload, and expects an OK, BAD or COMPLETE from the server. OK means
that there are more steps to this conversation, BAD is
self-explanatory and COMPLETE means “I’m satisfied”. The objective is
to go from START via a series of OK’s to each side sending the other a
COMPLETE.

SASL doesn’t define the payload of each message. The payload is
specified by GSSAPI
protocol. GSSAPI is another generic protocol. Unlike SASL it is
actually very complex and covers a variety of authentication methods,
including Kerberos.

The combination of SASL and GSSAPI and what happens at the network
layer is documented in
RFC4752.

Bottom line is you need to read at least four RFC’s to be able to
understand every detail of this process:
RFC4120,
RFC2222,
RFC2743 and
RFC4752. Fun!

The Handshake in Ruby

First, you’ll need some form of binding to the GSSAPI libraries. I’ve
been using the most excellent GSSAPI gem
by Dan Wanek which wraps the MIT GSSAPI library.

If you follow the code in
sasl_client_transport.rb,
you’ll see the following steps are required to establish a connection.

First, we instantiate a GSSAPI object passing it the remote host and
the remote principal. Note that there is no TCP port number to be
specifies anywhere, because this isn’t to establish a TCP connection,
but only for Kerberos host authentication. (Kerberos requires that
not only the client authenticates itself to the host, but also that
the host authenticates itself to the client.)

The rest of the action takes place in the
initiate_hand_shake_gssapi() method.

First, we call @gsscli.init_context() with no arguments. This call
creates a token based on our current Kerberos credentials. (If there
are no credentials in our cache, this call will fail).

1

token=@gsscli.init_context

Next we compose a SASL message which consists of START (0x01)
followed by payload length, followed by the actual payload, which is
the SASL mechanism name: ‘GSSAPI’. Without waiting for response, we
also send an OK (0x02) and the token returned from init_context().

Next we read 5 bytes of response. The first byte is the status
returned from the server, which hopefully is OK, followed by the
length of the payload, and then we read the payload itself:

12345678

status,len=@transport.read(STATUS_BYTES+PAYLOAD_LENGTH_BYTES).unpack('cl>')casestatuswhenNEGOTIATION_STATUS[:BAD],NEGOTIATION_STATUS[:ERROR]raise@transport.to_io.read(len)whenNEGOTIATION_STATUS[:COMPLETE]raise"Not expecting COMPLETE at initial stage"whenNEGOTIATION_STATUS[:OK]challenge=@transport.to_io.readlen

The payload is a challenge created for us by the server. We can
verify this challenge by calling init_context() a second time, this
time passing in the challenge to verify it:

1234

challenge=@transport.to_io.readlenunless@gsscli.init_context(challenge)raise"GSSAPI: challenge provided by server could not be verified"end

If the challenge verifies, then it is our turn to send an OK (with an
empty payload this time):

At this point in the SASL ‘conversation’ we have verified that the
server is who they claim to be.

Next the server sends us another challenge, this one is so that we can
authenticate ourselves to the server and at the same time agree on the
protection level for the communication channel.

We need to decrypt (“unwrap” in the GSSAPI terminology) the challenge,
examine the protection level and if it is acceptable, encrypt it on
our side and send it back to the server in a SASL COMPLETE message. In
this particular case we’re agreeable to any level of protection (which
is none in case of HiveServer2, i.e. the conversation is not
encrypted). Otherwise there are additional steps that RFC4752
describes whereby the client can select an acceptable protection
level.

123456789101112

status2,len=@transport.read(STATUS_BYTES+PAYLOAD_LENGTH_BYTES).unpack('cl>')casestatus2whenNEGOTIATION_STATUS[:BAD],NEGOTIATION_STATUS[:ERROR]raise@transport.to_io.read(len)whenNEGOTIATION_STATUS[:COMPLETE]raise"Not expecting COMPLETE at second stage"whenNEGOTIATION_STATUS[:OK]challenge=@transport.to_io.readlenunwrapped=@gsscli.unwrap_message(challenge)rewrapped=@gsscli.wrap_message(unwrapped)header=[NEGOTIATION_STATUS[:COMPLETE],rewrapped.length].pack('cl>')@transport.writeheader+rewrapped

The server should then respond with COMPLETE as well, at which point
we’re done with the authentication process and cat start sending
whatever we want over this connection:

12345678910

status3,len=@transport.read(STATUS_BYTES+PAYLOAD_LENGTH_BYTES).unpack('cl>')casestatus3whenNEGOTIATION_STATUS[:BAD],NEGOTIATION_STATUS[:ERROR]raise@transport.to_io.read(len)whenNEGOTIATION_STATUS[:COMPLETE]@transport.to_io.readlen@sasl_complete=truewhenNEGOTIATION_STATUS[:OK]raise"Failed to complete GSS challenge exchange"end

Update (Apr 2015): Florian von Bock has
turned what is described in this article into a nice Go package called
endless.

If you have a Golang HTTP service, chances are, you will need to restart it
on occasion to upgrade the binary or change some configuration. And if
you (like me) have been taking graceful restart for granted because
the webserver took care of it, you may find this recipe very handy
because with Golang you need to roll your own.

There are actually two problems that need to be solved here. First is
the UNIX side of the graceful restart, i.e. the mechanism by which a
process can restart itself without closing the listening socket. The
second problem is ensuring that all in-progress requests are properly
completed or timed-out.

Restarting without closing the socket

Fork a new process which inherits the listening socket.

The child performs initialization and starts accepting connections on
the socket.

Immediately after, child sends a signal to the parent causing the
parent to stop accepting connecitons and terminate.

Forking a new process

There is more than one way to fork a process using the Golang lib, but
for this particular case
exec.Command is the way to
go. This is because the Cmd struct this function returns has
this ExtraFiles member, which specifies open files (in addition to
stdin/err/out) to be inherited by new process.

Here is what this looks like:

1234567891011121314

file:=netListener.File()// this returns a Dup()path:="/path/to/executable"args:=[]string{"-graceful"}cmd:=exec.Command(path,args...)cmd.Stdout=os.Stdoutcmd.Stderr=os.Stderrcmd.ExtraFiles=[]*os.File{file}err:=cmd.Start()iferr!=nil{log.Fatalf("gracefulRestart: Failed to launch, error: %v",err)}

In the above code netListener is a pointer to
net.Listener listening for HTTP
requests. The path variable should contain the path to the new
executable if you’re upgrading (which may be the same as the currently
running one).

An important point in the above code is that netListener.File()
returns a
dup(2)
of the file descriptor. The duplicated file descriptor will not have
the FD_CLOEXEC flag set,
which would cause the file to be closed in the child (not what we want).

You may come across examples that pass the inherited file descriptor
number to the child via a command line argument, but the way
ExtraFiles is implemented makes it unnecessary. The documentation
states that “If non-nil, entry i becomes file descriptor 3+i.” This
means that in the above code snippet, the inherited file descriptor in
the child will always be 3, thus no need to explicitely pass it.

Finally, args array contains a -graceful option: your program will
need some way of informing the child that this is a part of a graceful
restart and the child should re-use the socket rather than try opening
a new one. Another way to do this might be via an environment
variable.

Child initialization

Here is part of the program startup sequence

12345678910111213141516

server:=&http.Server{Addr:"0.0.0.0:8888"}vargracefulChildboolvarlnet.Listevervarerrerrorflag.BoolVar(&gracefulChild,"graceful",false,"listen on fd open 3 (internal use only)")ifgracefulChild{log.Print("main: Listening to existing file descriptor 3.")f:=os.NewFile(3,"")l,err=net.FileListener(f)}else{log.Print("main: Listening on a new file descriptor.")l,err=net.Listen("tcp",server.Addr)}

Signal parent to stop

At this point we’re ready to accept requests, but just before we do
that, we need to tell our parent to stop accepting requests and exit,
which could be something like this:

For this we will need to keep track of open connections with a
sync.WaitGroup. We will need
to increment the wait group on every accepted connection and decrement
it on every connection close.

1

varhttpWgsync.WaitGroup

At first glance, the Golang standard http package does not provide any
hooks to take action on Accept() or Close(), but this is where the
interface magic comes to the rescue. (Big thanks and credit to Jeff R. Allen
for this post).

Here is an example of a listener which increments a wait group on
every Accept(). First, we “subclass” net.Listener (you’ll see why we
need stop and stopped below):

The reason the function above starts a goroutine is because this
cannot be done in our Accept() above since it will block on
gl.Listener.Accept(). The goroutine will unblock it by closing file
descriptor.

Our Close() method simply sends a nil to the stop channel for the
above goroutine to do the rest of the work.

And there is one more thing. You should avoid hanging connections that
the client has no intention of closing (or not this week). It is
better to create your server as follows: