Metrics Maven: Making Data Pretty in PostgreSQL

Published

In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll look at how you can make your reports more understandable (aka "pretty") using built-in functions and operators in PostgreSQL.

The first step in producing any report is to know the audience. Will the report be used in the VP's quarterly slide deck to the C-level executives? Is it intended for use by the product marketing team to understand the impact of a recent ad campaign? Or will it be used as input to analyses by fellow data scientists? Depending on who will be using your reports (and for what purposes), you may need to make the output more high-level and human-friendly (or focus more on precision and detail). For our use case, we're going to continue looking at app downloads, but we want to add some additional human-friendly detail to the report and make the data easy-to-understand.

In our previous article about how to calculate a moving average, we successfully calculated both a simple and a cumulative moving average of app downloads by day. But the results left a little something to be desired since the average values were calculated to 16 decimal places. Here's a sample of what our output looked like for the simple moving average over 30 days:

date          | avg_downloads  
-----------------------------------
. . . .
2016-06-07    | 23.4000000000000000  
2016-06-08    | 23.2666666666666667  
2016-06-09    | 22.7333333333333333  
2016-06-10    | 22.4000000000000000  
2016-06-11    | 21.9000000000000000  
2016-06-12    | 21.3000000000000000  
. . . .

That kind of precision is undoubtedly important in some scenarios, but for us it's overkill (read: "ugly"). In this article, we're going to use some built-in Postgres functions and operators to make it more human-friendly by adding some detail in some areas and simplifying in others. We're calling this "making the data pretty". We'll use this data from the simple 30 day rolling average calculation to walk through some options for our use case here.

Mathematical functions

Using PostgreSQL mathematical functions, we have a few different options to tackle the "too many decimal places" problem. We can round our results, truncate our results, or find the ceiling or floor. Any of these functions will make our data easier to read. Each of these functions has its own nuance, though, so let's compare them.

Before we dive in, an important thing to note is that the mathematical functions described below cannot be used by the OVER clause for window frames since it expects window functions (like RANK() or ROW_NUMBER()) or aggregations (like SUM() or AVG()). That puts a minor hitch in our plan, but we can easily handle it by using WITH to create a common table expression (CTE) or by using our existing query as a derived table wrapped with a new query. Since CTEs help make queries easier to read, we're going to use that option. Our CTE for the examples in this article, then, looks like this and we're calling it "avg_app_downloads_30_days_rolling_by_day":

WITH avg_app_downloads_30_days_rolling_by_day AS (  
    SELECT ad.date,
           AVG(ad.downloads)
               OVER(ORDER BY ad.date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_downloads
    FROM app_downloads_by_date ad
)

SELECT ....  

To this base, we'll add our new SELECT query for creating our final report.

Rounding numbers

The first function we're going to look at to simplify our moving average values is round. The round function is going round the value up or down, but we can choose whether we want to round to a certain number of decimal points or just to the nearest integer value.

To round to a certain number of decimal points, we provide the number of decimal points we want to see as the second input to the function; the first input to the function is the value we want to round (in our case the value from the "avg_downloads" field). Let's say we want to round to just 2 decimal places. Our query will look like this:

WITH avg_app_downloads_30_days_rolling_by_day AS (  
    SELECT ad.date,
           AVG(ad.downloads)
               OVER(ORDER BY ad.date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_downloads
    FROM app_downloads_by_date ad
)

SELECT date,  
       round(avg_downloads, 2) as avg_downloads_rounded_decimal
FROM avg_app_downloads_30_days_rolling_by_day  
;

And our output will look like this:

date          | avg_downloads_rounded_decimal  
---------------------------------------------
. . . .
2016-06-07    | 23.40  
2016-06-08    | 23.27  
2016-06-09    | 22.73  
2016-06-10    | 22.40  
2016-06-11    | 21.90  
2016-06-12    | 21.30  
. . . .

We don't want to deal with partial downloads, though. It makes more sense for our use case to round to the nearest whole number. For that reason we'll apply round to just the value from the "avg_downloads" field like we show below without specifying a number of decimal points. Also, from here on out in the examples in this article, since we know the CTE isn't changing, we'll focus on just the new SELECT query we're performing on the CTE:

 -- CTE goes here

SELECT date,  
       round(avg_downloads) as avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day  
;

Now our output will look like this:

date          | avg_downloads_rounded  
-------------------------------------
. . . .
2016-06-07    | 23  
2016-06-08    | 23  
2016-06-09    | 23  
2016-06-10    | 22  
2016-06-11    | 22  
2016-06-12    | 21  
. . . .

Notice how June 9th and 11th got rounded up to the next whole number while the other dates were rounded down.

Truncating numbers

Another option is to truncate the values. trunc can be used to truncate the value to a certain number of decimal places or to the integer value. The difference with trunc as compared to round is that there is no rounding; any characters past the specified position are simply chopped off. trunc follows the same syntax as round so here's what a query looks like for truncating to 2 decimal places:

 -- CTE goes here

SELECT date,  
       trunc(avg_downloads, 2) as avg_downloads_truncated_decimal
FROM avg_app_downloads_30_days_rolling_by_day  
;

And the output looks like this:

date          | avg_downloads_truncated_decimal  
-----------------------------------------------
. . . .
2016-06-07    | 23.40  
2016-06-08    | 23.26  
2016-06-09    | 22.73  
2016-06-10    | 22.40  
2016-06-11    | 21.90  
2016-06-12    | 21.30  
. . . .

For this data, the results are almost the same as when we were using round, except that June 8th didn't get rounded up. If we truncate to just the integer without specifying any decimal points, our results will look like this:

date          | avg_downloads_truncated  
---------------------------------------
. . . .
2016-06-07    | 23  
2016-06-08    | 23  
2016-06-09    | 22  
2016-06-10    | 22  
2016-06-11    | 21  
2016-06-12    | 21  
. . . .

June 9th and June 11th are now one whole number difference from the results we got when we used round. While this may seem insignificant for this data set, it can be very impactful for other data sets and could skew the results one way or another if this report were used for further calculations down the line. Understanding the nuances of how different functions affect the data is key to selecting the right ones for your purposes.

Number ceilings and floors

Before we leave mathematical functions, let's have a look at ceiling and floor. The ceiling function (you can also used the shortened form ceil) will round up to the nearest integer value. Conversely, floor will round down to the nearest integer value. These functions don't support decimal places; they deal in whole numbers only. Let's compare them to the functions we already reviewed and to each other.

We'll start with ceiling:

 -- CTE goes here

SELECT date,  
       ceiling(avg_downloads) as avg_downloads_ceiling
FROM avg_app_downloads_30_days_rolling_by_day  
;

The value ceilings are returned as:

date          | avg_downloads_ceiling  
-------------------------------------
. . . .
2016-06-07    | 24  
2016-06-08    | 24  
2016-06-09    | 23  
2016-06-10    | 23  
2016-06-11    | 22  
2016-06-12    | 22  
. . . .

All of the values are rounded up to the next whole number. That's a pretty big difference from what we've already seen with the other functions.

And here's floor:

 -- CTE goes here

SELECT date,  
       floor(avg_downloads) as avg_downloads_floor
FROM avg_app_downloads_30_days_rolling_by_day  
;

The value floors are returned as:

date          | avg_downloads_floor  
-----------------------------------
. . . .
2016-06-07    | 23  
2016-06-08    | 23  
2016-06-09    | 22  
2016-06-10    | 22  
2016-06-11    | 21  
2016-06-12    | 21  
. . . .

For this data, because floor rounds down to the nearest integer, the results are the same as if we'd used trunc, but each of them is one whole number difference from the results of ceiling. That can be significant so make sure to choose your functions wisely.

Now that we have an idea how different mathematical functions can affect our results, we're going to stick with round to the nearest integer for our use case. That way we don't have to deal with the partial downloads represented by decimal points and we can approximate the closest whole number the moving average is indicating.

Let's turn our attention to formatting the date next.

Formatting functions

Our date field is pretty simple and easy-to-read, but for the final report we want to generate, we'd like to have a more human-friendly textual label for the date. We can use formatting functions to play with the "date" field in our results. Specifically, we're going to use the to_char function to list out some elements of the date that will be useful for reporting and analysis, including day of the week, month name, and quarter of the year. The to_char function takes any numeric data type, as well as timestamps and intervals, and converts them to a string using a pattern that we can specify. Though our "date" field is not, strictly-speaking, a timestamp, it will be automatically converted to a timestamp by the function for processing.

Let's start with a simple example to get the quarter using the to_char function:

 -- CTE goes here

SELECT date,  
       to_char(date, 'Q') as quarter,
       round(avg_downloads) as avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day  
;

Here, the to_char pattern we're using to convert the date to a string is "Q" for quarter. Pretty straightforward, but we've kept the original date value for reference. The result we get back is the quarter number of the year. A "1" represents months from January through March, "2" for April through June, "3" for July through September, and "4" for October through December. Since we've been using dates in June in our examples, our results look like this:

date       | quarter | avg_downloads_rounded  
--------------------------------------------
. . . .
2016-06-07 | 2       | 23  
2016-06-08 | 2       | 23  
2016-06-09 | 2       | 23  
2016-06-10 | 2       | 22  
2016-06-11 | 2       | 22  
2016-06-12 | 2       | 21  
. . . .

Similarly, we can get other values form our "date" field based on the patterns we specify. Here's a handful of example, each represented as a different column:

 -- CTE goes here

SELECT date,  
       to_char(date, 'YYYY') as year,
       to_char(date, 'Q') as quarter,
       to_char(date, 'MON') as month,
       to_char(date, 'W') as week_of_month,
       to_char(date, 'DD') as day,
       to_char(date, 'Day') as day_of_week,
       round(avg_downloads) as avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day  
;

Based on the patterns we've specified in our query, we're going to get a 4-digit year, the quarter number of the year (which we looked at above), the 3-character shortened form of the month in all caps, the week number of the month, the day as a 2-digit number, and also the day of the week with the first letter captialized.

The results look like this:

date       | year | quarter | month | week_of_month | day | day_of_week | avg_downloads_rounded  
-----------------------------------------------------------------------------------------------
. . . .
2016-06-07 | 2016 | 2       | JUN   | 1             | 07  | Tuesday     | 23  
2016-06-08 | 2016 | 2       | JUN   | 2             | 08  | Wednesday   | 23  
2016-06-09 | 2016 | 2       | JUN   | 2             | 09  | Thursday    | 23  
2016-06-10 | 2016 | 2       | JUN   | 2             | 10  | Friday      | 22  
2016-06-11 | 2016 | 2       | JUN   | 2             | 11  | Saturday    | 22  
2016-06-12 | 2016 | 2       | JUN   | 2             | 12  | Sunday      | 21  
. . . .

Patterns can also be combined into a single request. Now that we have a sense of what some of the different patterns return, let's combine a few patterns, as an example, to produce a new "day_of_week" column:

 -- CTE goes here

SELECT date,  
       to_char(date, 'Month DD, Day') as day_of_week,
       round(avg_downloads) as avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day  
;

In the revised query above, we've indicated that the "day_of_week" column should have the full month name with a capitalized first letter followed by the 2-digit day and a comma then followed by the day of the week fully spelled out with the first letter capitalized.

The results look like this:

date       | day_of_week             | avg_downloads_rounded  
-----------------------------------------------------------------------------------------------------------------
. . . .
2016-06-07 | June      07, Tuesday   | 23  
2016-06-08 | June      07, Wednesday | 23  
2016-06-09 | June      07, Thursday  | 23  
2016-06-10 | June      07, Friday    | 22  
2016-06-11 | June      07, Saturday  | 22  
2016-06-12 | June      07, Sunday    | 21  
. . . .

Notice that full name of the month and the full name of the day of the week in the "day_of_week" column are padded to 9 characters. If you use a shortened form of the name (such as if we'd used "MON" to get the capitalized shortened form of the month instead), the padding is not used.

The to_char formatting function is going to be useful for making the dates in our final report "pretty". They're not the only options for dates, though. There are also some handy date functions to be aware of.

Date functions

There are two date functions in particular we'll look at here briefly: date_part and date_trunc. These can be used to extract elements from or simplify timestamps (dates are converted to timestamps automatically for processing, same as with the to_char function).

Extracting date parts

date_part can be used to pull individual elements from a timestamp, time, or interval. Note that the extract function does the same thing as date_part. extract is SQL-standard compliant whereas date_part is based on Ingres, but as the extract function in PostgreSQL uses the date_part syntax under the hood, you can just call date_part directly. If you're concerned about the portability of your code to other SQL systems, then choose extract instead of date_part.

The difference between date_part and the to_char function we looked at above is that date_part returns a numeric value. Because of this, there are no formatting patterns for date_part. If you specify "month", you'll get back the numeric value of the month only. The to_char function can also return the number of the month, but has the option of returning the 3-character short form of the month (with various captialization options) or the complete name of the month spelled out (with various captialization options).

Here's an example of extracting the month using date_part (notice how the syntax is different... the part of the date to return is the first input; the value to process is the second input):

 -- CTE goes here

SELECT date,  
       date_part('month', date) as month,
       round(avg_downloads) as avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day  
;

Here's our "month" date_part:

date       | month | avg_downloads_rounded  
------------------------------------------
. . . .
2016-06-07 | 6       | 23  
2016-06-08 | 6       | 23  
2016-06-09 | 6       | 23  
2016-06-10 | 6       | 22  
2016-06-11 | 6       | 22  
2016-06-12 | 6       | 21  
. . . .

Truncating dates

date_trunc is similar to the trunc function we looked at for number values in the section above about mathematical functions, but it operates on timestamps and intervals. There are different levels of truncation that can be applied. Since our dates don't go more granular than the daily level, we could truncate at the month level, quarter level, year level... all the way up to the millennium if that was necessary for the use case.

Let's look and see how date_trunc works for the month level (you'll notice it uses the same syntax as date_part):

 -- CTE goes here

SELECT date,  
       date_trunc('month', date) as month,
       round(avg_downloads) as avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day  
;

In return, we get the timestamp truncated at the month:

date       | month                  | avg_downloads_rounded  
------------------------------------------------------------
. . . .
2016-06-07 | 2016-06-01 00:00:00+00 | 23  
2016-06-08 | 2016-06-01 00:00:00+00 | 23  
2016-06-09 | 2016-06-01 00:00:00+00 | 23  
2016-06-10 | 2016-06-01 00:00:00+00 | 22  
2016-06-11 | 2016-06-01 00:00:00+00 | 22  
2016-06-12 | 2016-06-01 00:00:00+00 | 21  
. . . .

As we mentioned, our date field gets automatically converted to a timestamp for processing. Then, at the month level, the time components of the timestamp are zeroed out (if we had any) and the day is set to "01" as a default. If we had other months represented in our example, they'd look similar... July would look like this, for example: 2016-07-01 00:00:00+00. August would look like this: 2016-08-01 00:00:00+00. As you can see there are some important differences between extracting date parts, truncating dates, and formatting using to_char.

String functions

Since we're going to use the to_char formating function for dates in our report (rather than using date_part or date_trunc) and those values will therefore be converted to strings, let's have a look at string functions.

String functions allow you to manipulate string values in a variety of ways. There are options for case, trimming, substrings, length, and more. One commonly-used function that we're going to look at here is concatenation. Even though we were able to do some interesting formatting using just the to_char function on our dates, concatenation will allow us to make our report more human-friendly. For example, let's use concatenation to create an easier-to-understand column for quarter:

 -- CTE goes here

SELECT date,  
       to_char(date, 'YYYY') || '-Q' || to_char(date, 'Q') as quarter,
       round(avg_downloads) as avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day  
;

Here we're using the concatenation operator "||" to prepend the quarter number with the 4-digit year, a hyphen, and the letter "Q". We can't do that using just the to_char function because "Q" means quarter number and we'd end up duplicating the number (using to_char(date, 'QQ'), we'd get "22"). By concatenating, though, we can make it clear to the report user that values for the specified dates happened in "Q2" as opposed to the more ambiguous "2".

date       | quarter       | avg_downloads_rounded  
--------------------------------------------------
. . . .
2016-06-07 | 2016-Q2       | 23  
2016-06-08 | 2016-Q2       | 23  
2016-06-09 | 2016-Q2       | 23  
2016-06-10 | 2016-Q2       | 22  
2016-06-11 | 2016-Q2       | 22  
2016-06-12 | 2016-Q2       | 21  
. . . .

Let's go through another example using the concatenation operator to clean up the "day_of_week" column we created as well as rearrange some of the elements. To do that we'll also use the trim function to remove the extra padding around the month and day of week names. Here's how:

 -- CTE goes here

SELECT date,  
       trim(trailing ' ' from to_char(date, 'Day')) ||
          ', ' ||
          trim(trailing ' ' from to_char(date, 'Month')) ||
          ' ' ||
          to_char(date, 'DD') ||
          ', ' ||
          to_char(date, 'YYYY') as display_date,
       round(avg_downloads) as avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day  
;

We are using the trim function on the trailing spaces for the full name of the weekday by wrapping our to_char function inside trim. Next we're concatenating a comma and a space for readability, then the month name also wrapped in a trim for the trailing spaces. We're then concatenating another space then the 2-digit day then another comma and whitespace and finally the 4-digit year.

It looks like this when it's returned:

date       | display_date             |avg_downloads_rounded  
------------------------------------------------------------
. . . .
2016-06-07 | Tuesday, June 07, 2016   | 23  
2016-06-08 | Wednesday, June 08, 2016 | 23  
2016-06-09 | Thursday, June 09, 2016  | 23  
2016-06-10 | Friday, June 10, 2016    | 22  
2016-06-11 | Saturday, June 11, 2016  | 22  
2016-06-12 | Sunday, June 12, 2016    | 21  
. . . .

Now that we've reviewed some options to make our data "pretty", let's make a final report.

Final report

In our final report, we're going to use some of the functions we reviewed in this article, including round, to_char, trim, and the concatenation operator "||". Our report will have a column for quarter, one for month, and one for the display date to provide detail for the 30 day rolling average of app downloads. We're also going to make it clear that the average app downloads is a 30 day rolling average in the field name. Here's our query:

 -- CTE goes here

SELECT  
       to_char(date, 'YYYY') || '-Q' || to_char(date, 'Q') as quarter,
       to_char(date, 'YYYY-MM') ||
          ' (' ||
          trim(trailing ' ' from to_char(date, 'Month')) ||
          ')' as month,
       trim(trailing ' ' from to_char(date, 'Day')) ||
          ', ' ||
          trim(trailing ' ' from to_char(date, 'Month')) ||
          ' ' ||
          to_char(date, 'DD') ||
          ', ' ||
          to_char(date, 'YYYY') as display_date,
       round(avg_downloads) as avg_downloads_30_days_rolling
FROM avg_app_downloads_30_days_rolling_by_day  
;

The final report now looks like this:

quarter | month          | display_date             |avg_downloads_30_days_rolling  
-----------------------------------------------------------------------------------
. . . .
2016-Q2 | 2016-06 (June) | Tuesday, June 07, 2016   | 23  
2016-Q2 | 2016-06 (June) | Wednesday, June 08, 2016 | 23  
2016-Q2 | 2016-06 (June) | Thursday, June 09, 2016  | 23  
2016-Q2 | 2016-06 (June) | Friday, June 10, 2016    | 22  
2016-Q2 | 2016-06 (June) | Saturday, June 11, 2016  | 22  
2016-Q2 | 2016-06 (June) | Sunday, June 12, 2016    | 21  
. . . .

Next steps

In this article we covered how to make our reports more suited to the audience and the use case. For us, that meant making our report a little more human-friendly. We looked at a variety of mathematical, formatting, date, and string functions to add some detail and make our data easy-to-understand.

There are a slew of additional options for how to present data in your reports. Depending on your data and the needs of your audience, you could look at further grouping and ordering of the data, perhaps even using the new group by options for sets, rollups and cubes that we wrote about when PostgreSQL 9.5 was released. You could even go a step further and use a data visualization tool for charting and dashboarding like we demonstrated in our article about how we visualize data using Leftronic. There are a number of 3rd party visualization tools on the market with a variety of features. Regardless of what additional steps you may want to take with your reports beyond what we discussed here, it all starts with the data you have at hand. By making that data "pretty", your reports will be easy-to-understand and therefore more useful to your audience.

In our next Metrics Maven article, we'll look at how to use crosstab to effectively create pivot tables in PostgreSQL so we can see our data in a different way.

Image by: Romi

Conquer the Data Layer

Spend your time developing apps, not managing databases.