Metrics Maven: Calculating a Moving Average in PostgreSQL

In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll look at how to calculate a moving average in PostgreSQL.

This article builds on our previous two articles about Window Functions and Window Frames in PostgreSQL. We'll take advantage of the windowing features we've previously discussed to calculate a moving average and we'll also look at an alternative method.

What is a moving average?

A moving average is just what it sounds like - an average that is continually moving based on changing input. For example, you may want to take the average of some value for the top 100 entries or for the previous 30 days. Because you will be getting new entries into your database or because each new day is another date, the average will change. The term "moving average" is also synonymous with "rolling average" or "running average", but there are a few different kinds of moving averages. In this article we're going to focus on the "simple" moving average to get our feet wet and we'll also briefly review the "cumulative" moving average at the end of the article. Future articles will cover weighted moving averages and exponential moving averages.

The reason to use a moving average for your metrics is to make it easier to spot trends. It's a commonly-used technique in finance and business analytics to smooth out dips and spikes which may occur in the data so that true trends can be identified over the changing series. Figuring out how to perform the calculation as the data changes can be a bit daunting, however, if you've never done it. Once you learn a method you like (we'll cover two), though, it's easy to do and you'll find many uses for it in your tracking and reports. Let's get to it.

The data

First things first: we'll need a table that contains the values we want to average.

In practice at Compose, we often find that the base data we need is not already neatly defined in one table. For that reason we have a few aggregate tables that pull the data we need together. These are the base tables to which we will apply more advanced calculations, like a moving average. In some cases these are derived tables that exist temporarily for the execution of the main query. In other cases, we might use a view or a materialized view. So, however you get at it, you'll need a table containing the values you want to average and whatever dimension(s) you want to order the data by.
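For instance, here's a minimal sketch of how such a base table could be built as a materialized view (the raw event table "raw_app_downloads" and its "created_at" column are hypothetical; substitute your own source data):

-- Hypothetical: roll up a raw event table into one row per date.
CREATE MATERIALIZED VIEW app_downloads_by_date AS
SELECT created_at::date AS date,
       COUNT(*) AS downloads
FROM raw_app_downloads
GROUP BY created_at::date
ORDER BY created_at::date;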

For our example, let's say we've been asked to create a 30-day rolling average for app downloads from Example Co. The app download data is populated daily into a table named "app_downloads_by_date" and the most recent portion of it looks like this:

date          | downloads  
---------------------------
2016-05-26    | 35  
2016-05-27    | 30  
2016-05-28    | 7  
2016-05-29    | 14  
2016-05-30    | 22  
2016-05-31    | 20  
2016-06-01    | 24  
2016-06-02    | 27  
2016-06-03    | 24  
2016-06-04    | 35  
2016-06-05    | 20  
2016-06-06    | 21  
2016-06-07    | 29  
2016-06-08    | 26  
2016-06-09    | 23  
2016-06-10    | 20  
2016-06-11    | 13  
2016-06-12    | 8  
2016-06-13    | 16  
2016-06-14    | 25  
2016-06-15    | 21  
2016-06-16    | 34  
2016-06-17    | 15  
2016-06-18    | 15  
2016-06-19    | 15  
2016-06-20    | 26  
2016-06-21    | 26  
2016-06-22    | 17  
2016-06-23    | 25  
2016-06-24    | 16  
2016-06-25    | 13  
2016-06-26    | 14  
2016-06-27    | 17  
2016-06-28    | 21  
2016-06-29    | 26  
2016-06-30    | 32  
2016-07-01    | 26  
2016-07-02    | 16  

In this example, ordering by date will be important since we want to calculate a 30-day rolling average over the preceding series of dates. Because of this, it's important that we have a row for each date. In our case we do, but if you have gaps in your data where there are no values for certain dates, you can use generate_series when constructing your base table to ensure you've got all the rows you'll need.
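For example, here's a quick sketch of filling date gaps with generate_series (the date range here is just our sample's, and treating a missing date as zero downloads is our assumption; adjust both to your data):

-- Generate one row per day, giving dates with no downloads a 0.
SELECT days.day::date AS date,
       COALESCE(ad.downloads, 0) AS downloads
FROM generate_series('2016-01-01'::date, '2016-07-02'::date, interval '1 day') AS days(day)
LEFT JOIN app_downloads_by_date ad ON ad.date = days.day::date;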

Notice how this range of dates contains fluctuating app download totals, anywhere from 7 to 35. It's very difficult to see a trend from this data:

[Chart: daily app downloads by date, showing no clear trend]

Enter the moving average.

Using window frames for a simple moving average

If you remember from our previous article in this series, window frames are used to indicate the number of rows around the current row the window function should include. They create a subset of data for the window function to operate on. Depending on your data and your needs, your moving average calculation may include rows both preceding and following the current row, but for our purposes, our moving average will use preceding rows and the current row because we want to generate a new moving average value for each new date.

Our query looks like this:

SELECT ad.date,  
       AVG(ad.downloads)
            OVER(ORDER BY ad.date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_downloads
FROM app_downloads_by_date ad  
;

We're using ORDER BY on our date field to ensure our data will be in the order we expect and we've specified ROWS BETWEEN 29 PRECEDING AND CURRENT ROW to set the window frame for the AVG calculation. As the window frame advances for each date, only the preceding 29 rows and the current one (30 total days) are used for the calculation:

date          | avg_downloads  
-----------------------------------
. . . .
2016-05-26    | 21.2333333333333333  
2016-05-27    | 21.4000000000000000  
2016-05-28    | 21.1000000000000000  
2016-05-29    | 21.2000000000000000  
2016-05-30    | 21.7333333333333333  
2016-05-31    | 22.0000000000000000  
2016-06-01    | 22.3000000000000000  
2016-06-02    | 22.3666666666666667  
2016-06-03    | 22.3000000000000000  
2016-06-04    | 22.8000000000000000  
2016-06-05    | 22.7000000000000000  
2016-06-06    | 22.9666666666666667  
2016-06-07    | 23.4000000000000000  
2016-06-08    | 23.2666666666666667  
2016-06-09    | 22.7333333333333333  
2016-06-10    | 22.4000000000000000  
2016-06-11    | 21.9000000000000000  
2016-06-12    | 21.3000000000000000  
2016-06-13    | 21.2666666666666667  
2016-06-14    | 21.6666666666666667  
2016-06-15    | 21.4666666666666667  
2016-06-16    | 21.8333333333333333  
2016-06-17    | 21.7000000000000000  
2016-06-18    | 20.9666666666666667  
2016-06-19    | 20.6333333333333333  
2016-06-20    | 21.2666666666666667  
2016-06-21    | 21.9000000000000000  
2016-06-22    | 21.6000000000000000  
2016-06-23    | 21.6666666666666667  
2016-06-24    | 21.6333333333333333  
2016-06-25    | 20.9000000000000000  
2016-06-26    | 20.3666666666666667  
2016-06-27    | 20.7000000000000000  
2016-06-28    | 20.9333333333333333  
2016-06-29    | 21.0666666666666667  
2016-06-30    | 21.4666666666666667  
2016-07-01    | 21.5333333333333333  
2016-07-02    | 21.1666666666666667  

Since we aren't showing you dates in our base table before May 26 for this example, let's focus our review on a date whose 29 preceding rows are all visible. Take June 30: our window frame focuses the AVG aggregation on the app downloads from June only, this portion of our base table:

date          | downloads  
---------------------------
. . . .
2016-06-01    | 24  
2016-06-02    | 27  
2016-06-03    | 24  
2016-06-04    | 35  
2016-06-05    | 20  
2016-06-06    | 21  
2016-06-07    | 29  
2016-06-08    | 26  
2016-06-09    | 23  
2016-06-10    | 20  
2016-06-11    | 13  
2016-06-12    | 8  
2016-06-13    | 16  
2016-06-14    | 25  
2016-06-15    | 21  
2016-06-16    | 34  
2016-06-17    | 15  
2016-06-18    | 15  
2016-06-19    | 15  
2016-06-20    | 26  
2016-06-21    | 26  
2016-06-22    | 17  
2016-06-23    | 25  
2016-06-24    | 16  
2016-06-25    | 13  
2016-06-26    | 14  
2016-06-27    | 17  
2016-06-28    | 21  
2016-06-29    | 26  
2016-06-30    | 32  
. . . .

So, now, if we chart the rolling average we've calculated, we can see that the data is smoothed out and there is an upward trend through the first week of June, then a more volatile downward trend after that:

[Chart: 30-day moving average of app downloads]

Since this is showing only about a month of data, it's not very telling for an analytical report, but hopefully it helps you understand how calculating a moving average can be useful for business analysis.

A tip on not including the current row

If for some reason you don't want to include the current row in your window function and you're using only PRECEDING or only FOLLOWING settings for your window frame, an easy way to do that is to use x PRECEDING or y FOLLOWING twice in your ROWS BETWEEN ... clause. For example, say we wanted to use the 30 rows preceding our current row, but not include the current row in the window frame. We could write that clause like this: ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING. Similarly, we could exclude the current row but use the 30 rows following it, like this: ROWS BETWEEN 1 FOLLOWING AND 30 FOLLOWING.
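Applied to our running example, the first variation would look like this (same table and query style as before, just a different frame):

SELECT ad.date,
       AVG(ad.downloads)
            OVER(ORDER BY ad.date ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS avg_downloads_prior
FROM app_downloads_by_date ad
;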

An alternative method for a simple moving average

Before PostgreSQL 9.0, the x PRECEDING and y FOLLOWING window frame options weren't available to us. To calculate a moving average without using a window frame, we can instead join our base table to itself using two aliases, with one alias operating over the other via a date interval. Check it out:

SELECT ad_1.date, AVG(ad_2.downloads) AS avg_downloads
FROM app_downloads_by_date ad_1  
JOIN app_downloads_by_date ad_2 ON ad_2.date >= ad_1.date - interval '29 days'  
  AND ad_2.date <= ad_1.date
GROUP BY ad_1.date  
ORDER BY ad_1.date  
;

Using this method we can achieve the same results as described above with the window frame. If you're operating over large amounts of data, the window frame option is going to be more efficient, but this alternative exists if you want to use it.
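If you do go the self-join route over a sizable table, it's worth making sure the join column is indexed (a hypothetical example, assuming no index on "date" exists yet):

-- Hypothetical: index the join column to speed up the self-join.
CREATE INDEX app_downloads_by_date_date_idx
    ON app_downloads_by_date (date);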

Calculating a cumulative moving average

Now that we've reviewed a couple of methods for calculating a simple moving average, we'll switch up our window frame example to show how you can also do a cumulative moving average. The same principles apply, but rather than a window frame that continually shifts over a fixed interval, the window frame simply extends. For example, instead of doing a 30-day rolling average, we're going to calculate a year-to-date moving average. For each new date, its value is simply included in the average calculation along with all the previous dates. Let's have a look at this example:

SELECT ad.date,  
       AVG(ad.downloads)
            OVER(ORDER BY ad.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_downloads_ytd
FROM app_downloads_by_date ad  
;

Because our base table starts at January 1st for the current year, we're using UNBOUNDED PRECEDING to set our window frame. The results we get back for this cumulative calculation look like this:

date          | avg_downloads_ytd  
-----------------------------------
. . . .
2016-05-26    | 20.2585034013605442  
2016-05-27    | 20.3243243243243243  
2016-05-28    | 20.2348993288590604  
2016-05-29    | 20.1933333333333333  
2016-05-30    | 20.2052980132450331  
2016-05-31    | 20.2039473684210526  
2016-06-01    | 20.2287581699346405  
2016-06-02    | 20.2727272727272727  
2016-06-03    | 20.2967741935483871  
2016-06-04    | 20.3910256410256410  
2016-06-05    | 20.3885350318471338  
2016-06-06    | 20.3924050632911392  
2016-06-07    | 20.4465408805031447  
2016-06-08    | 20.4812500000000000  
2016-06-09    | 20.4968944099378882  
2016-06-10    | 20.4938271604938272  
2016-06-11    | 20.4478527607361963  
2016-06-12    | 20.3719512195121951  
2016-06-13    | 20.3454545454545455  
2016-06-14    | 20.3734939759036145  
2016-06-15    | 20.3772455089820359  
2016-06-16    | 20.4583333333333333  
2016-06-17    | 20.4260355029585799  
2016-06-18    | 20.3941176470588235  
2016-06-19    | 20.3625730994152047  
2016-06-20    | 20.3953488372093023  
2016-06-21    | 20.4277456647398844  
2016-06-22    | 20.4080459770114943  
2016-06-23    | 20.4342857142857143  
2016-06-24    | 20.4090909090909091  
2016-06-25    | 20.3672316384180791  
2016-06-26    | 20.3314606741573034  
2016-06-27    | 20.3128491620111732  
2016-06-28    | 20.3166666666666667  
2016-06-29    | 20.3480662983425414  
2016-06-30    | 20.4120879120879121  
2016-07-01    | 20.4426229508196721  
2016-07-02    | 20.4184782608695652  

If we chart these results, you can see that the advantage of the cumulative moving average is a further smoothing out of the data so that only significant data changes show up as trends. We see now that there is a slight upward trend year-to-date:

[Chart: year-to-date cumulative moving average of app downloads]

Wrapping up

Now that you know a couple different kinds of moving averages you can use and a couple different methods for calculating them, you can perform more insightful analysis and create more effective reports.

In our next Metrics Maven article, we'll look at some options for how to make data pretty so that instead of values like "20.4184782608695652", we'll see "20.42". See you next time!

Image by: extrabrandt
