Edit: there is now a part iii in this series of articles.
I have previously written how time series can be stored in PostgreSQL efficiently using arrays.
As a continuation of that article, I shall attempt to describe in detail the inner workings of an SQL view that Tgres uses to make an array of numbers appear as a regular table (link to code).
In short, I will explain how incomprehensible data like this:
1 2 3 4 5 6 7 |
|
… can be transformed in an SQL view to appear as so:
1 2 3 4 5 6 7 8 |
|
This write up will make a lot more sense if you read the previous post first. To recap, Tgres stores series in an array broken up over multiple table rows each containing an array representing a segment of the series. The series array is a round-robin structure, which means that it occupies a fixed amount of space and we do not need to worry about expiring data points: the round-robin nature of the array takes care of it by overwriting old data with new on assignment.
An additional benefit of such a fixed interval round-robin structure is that we do not need to store timestamps for every data point. If we know the timestamp of the latest entry along with the series step and size, we can extrapolate the timestamp of any point in the series.
Tgres creates an SQL view which takes care of this extrapolation and makes this data easy to query. Tgres actually uses this view as its only source of time series information when reading from the database thus delegating all the processing to the database server, where it is close to the data and most efficient.
If you would like to follow along on the Postgres command line, feel free to create and populate the tables with the following SQL, which is nearly identical to the schema used by Tgres:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
And finally create the view:
1 2 3 4 5 6 7 8 |
|
Now give it a whirl with a SELECT * FROM tv ORDER BY t
. Impressive? So how does it work?
First let’s go over the columns of the rra table.
step_s
: the minimal unit of time expressed in seconds (60 or 1 minute in the above data).steps_per_row
: the number of thestep_s
intervals in one slot of our time series. In our example it is 1440, which is the number of minutes in a day, thus making our time series resolution one day.size
: number of slots in the series. Ours is 28, i.e. four weeks.width
: size of a segment which will be stored in a single row, which in our case is 7 (one week).latest
: the timestamp of the last data point in the series.
Next, let’s look at the UNNEST
keyword in the SQL of the view. UNNEST
takes an array and turns it into row, e.g.:
1 2 3 4 5 6 |
|
UNNEST
works in conjunction with the generate_subscripts
PostgreSQL function which generates index values:
1 2 3 4 5 6 |
|
Let us now zoom in on the very long expression in the view, here it is again:
1 2 3 |
|
A perhaps not immediately apparent trick to how all this works is that all our series are aligned on the beginning of the epoch. This means that at UNIX time 0, any series’ slot index is 0. From then on it increments sequentially until the series size is reached, at which point it wraps-around to 0 (thus “round-robin”). Armed with this information we can calculate the index for any point in time.
The formula for calculating the index i
for a given time t
is:
1
|
|
We need time to be expressed as a UNIX time which is done
with EXTRACT(EPOCH FROM rra.latest)::BIGINT
. Now you should recognize
the above formula in the more verbose expression
1
|
|
where rra.step_s * rra.steps_per_row
is the size of our series in seconds.
Next, we need to compute the distance between the current slot and the
last slot (for which we know the timestamp). I.e. if the last slot is i
and the slot we need the
timestamp for is j
, the distance between them is i-j
, but with a
caveat: it is possible for j
to be greater than i
if the series
wraps around, in which case the distance is the sum of the distance from
j
to the end of the series and the distance from the beginning to
i
. If you ponder over it with a pencil and paper long enough, you
will arrive at the following formula for distance between two slots
i
and j
in a wrap-around array:
1
|
|
Another thing to consider is that we’re splitting our series across
multiple rows, thus the actual index of any point is the subscript
into the current segment plus the index of the segment itself (the n
column) multiplied by the wdith
of the segment: generate_subscripts(dp,1) + n * width
.
Which pieced together in SQL now looks like this:
1 2 |
|
Astute readers should notice an unexplained + 1
. This is because
PostgreSQL arrays are 1-based.
Now we need to convert the distance expressed in array slots into
a time interval, which we do by multiplying it by
INTERVAL '1 SECOND' * rra.step_s * rra.steps_per_row
.
And finally, we need to subtract the above time interval from the latest stamp which yields (ta-da!) the timestamp of the current slot:
1 2 3 |
|
That’s it! And even though this may look complicated, from the computational view point it is very efficient, and PostgreSQL can handle it easily.
As an exercise, try setting latest
to various timestamps and observe
how it affects the output of the view and see if you can explain how
and why it happens.