Store Result Sets with Materialized Views in PostgreSQL

In this new Write Stuff article, Lucero Del Alba talks about using PostgreSQL's materialized views to process large amounts of data without slowing down your database, showing how to efficiently create, access, update, and delete them.

In this article, we'll learn how to cache result sets for queries that require heavy processing — as in the case of performing aggregate functions (SUM, AVG, etc.) and JOINs over millions of records — so that the results are processed just once and stored as a materialized view for quick retrieval later.

We'll describe the scope of the problem and some possible use cases before we get our hands on the actual code and implementation of materialized views; if you are either familiar with the use case or just want to go straight to the implementation details, feel free to jump ahead to The Solution section.

The Use Case: a Financial Application

We'll be expanding on our previous article on Building OHLC Data in PostgreSQL to see how we can further improve the responsiveness of the application by immediately retrieving result sets that have already been queried.

Here's a quick summary of the previous article:

We went all the way from downloading ticks for a financial instrument to generating OHLC data that's typically used to display and analyze prices for such instruments. The visualization was a plus. What was interesting was that we let the database do most of the heavy work, using PostgreSQL features to generate the OHLC data for which we would otherwise have needed to write a separate program.

Other Use Cases

Keep in mind that even if you're not in the finance field, the following principles will be useful whenever you need to store the results of a query in a programmatic way, that is, a defined method to query the DB, get the results, and store them for later.

In general, if you're performing operations on records that are unlikely to be changed once inserted in the DB and where no new records for that group or period are likely to be introduced, then you have a perfect scenario for applying the techniques we'll cover here.

Some examples: balances and reports for closed accounting periods (think of a balance_2015 table), aggregated sales or analytics for past months, or archived logs and sensor readings that won't change once recorded.

In many of these cases there's really no need to write a program to do all these calculations and pass the results to a caching system; oftentimes all you'll need is some SQL savvy.

The Problem: Heavy and Recurrent Queries

Specifically, in our case, the problem we face is that generating OHLC from the ticks database is computationally expensive.

Consider this: the file we downloaded (EURUSD_Ticks_08.11.2016-08.11.2016.csv) in the previous article has 69,384 records in it, and those are the ticks for just one day. If we want to generate OHLC data for one-hour bars (or any other period, for that matter) for a week, assuming the same volatility for the rest of the days, we'll need to go over ~347,000 records (69,384 x 5 days). If we wanted to generate the same data over a period of a month, we're already over a million records (347,000 x 4 weeks). Also, the aggregation functions that we need for our OHLC data, like maximums, minimums, averages, and counts, aren't particularly expensive for one-hour bars, but minute bars will be, because there's a lot more computation involved when grouping and aggregating at a finer granularity.

More grouping means more operations. Consider the following for any given hour bar:

2 times more grouping for 30-minute bars
4 times more grouping for 15-minute bars
...
60 times more grouping for 1-minute bars

... and the processing you did for one-hour bars is of no use for the processing you'll need for two-hour or thirty-minute bars. So you'll have to go over millions of records, depending on the different kinds of bars you want to generate, for just one symbol (EUR/USD).

And there aren't many insights you can get from a single week of financial data on a single instrument; you'll need months, if not years, from multiple symbols, and occasionally from multiple data sources as well when there isn't a centralized exchange, as in forex markets. To make things worse, you'll need to visualize/analyze bars more than once, meaning you'll need to query the DB for the same data several times.

The Solution: Storing the Result Sets

We know you can't change the past: historical data will remain the same, and a one-hour bar chart for a given instrument from a given exchange or broker during a given period of time will always look the same. If we use this to our advantage, it means we only need to calculate things once and somehow store the results for future reference.

We could get the results, populate a new table in the database for future reference, and batch some scripts to handle the process; but there is a more programmatic, elegant way around this using the database alone.

What Is a Materialized View?

A materialized view is a database object that contains the results of a query. (...) This is a form of caching the results of a query, similar to memoization of the value of a function in functional languages, and it is sometimes described as a form of precomputation. As with other forms of precomputation, database users typically use materialized views for performance reasons.

https://en.wikipedia.org/wiki/Materialized_view

Some of you may be thinking — why not just use a cache? That is, query the database and store the results in some object. Yes, that will work, but it will also require a cache solution, possibly some extra maintenance, and some added logic in your app. Unless you come up with a very good cache, materialized views are better than that.

With materialized views, you can not only retrieve the result set that was stored, but also query the materialized view itself as if it were a regular table: you can use conditional statements, joins, and ranges as you would with any other table, effectively as if you were changing the original query that generated the materialized view. And it's a whole lot faster, with no need for a cache or extra maintenance, since you're relying on something you already have and maintain — your DB.
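
For instance, once the OHLC view we'll create below is in place, you could pull out just a slice of it without re-running the expensive aggregation (a minimal sketch; dt, o, h, l, and c are the columns that view defines):

SELECT dt, o, h, l, c
FROM mymatview
WHERE dt BETWEEN '2016-11-08 08:00' AND '2016-11-08 12:00'
ORDER BY dt;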

In the following code snippets we're going to be focusing on materialized views in PostgreSQL, which have been available since version 9.3.

Creating a Materialized View

Creating a materialized view is super easy — you precede the query statement you want to materialize with CREATE MATERIALIZED VIEW mymatview AS, where mymatview will be the name of the materialized view.

For example, if you want to create a materialized view for SELECT * FROM mytable:

CREATE MATERIALIZED VIEW mymatview AS SELECT * FROM mytable;  

Transparently Create/Access a Materialized View

In order to make the process as transparent as possible, what we want to do is create the materialized view if it doesn't exist and access it right after.

PostgreSQL 9.5 and Above

This is the easy way, and it's pretty much doing what we just described, but in SQL:

CREATE MATERIALIZED VIEW IF NOT EXISTS mymatview AS  
    SELECT
        date_trunc('second', dt) dt,
        (array_agg(bid ORDER BY dt ASC))[1] o,
        MAX(bid) h,
        MIN(bid) l,
        (array_agg(bid ORDER BY dt DESC))[1] c,
        SUM(bid_vol) bid_vol,
        SUM(ask_vol) ask_vol,
        COUNT(*) ticks
    FROM "EUR/USD"
    WHERE dt BETWEEN '2016-11-08' AND '2016-11-09'
    GROUP BY date_trunc('second', dt)
    ORDER BY dt;
SELECT * FROM mymatview;  

The first statement won't return any result set, so we won't get any unexpected output whether the materialized view previously existed or not.

PostgreSQL 9.3 and 9.4

IF NOT EXISTS for creating a materialized view was introduced in 9.5, so we'll need a way around this issue for earlier versions. We'll use Procedural Language/PostgreSQL (PL/pgSQL) with a DO statement wrapping the check and the creation in a BEGIN/END block.

DO $$  
BEGIN  
    IF
        (SELECT COUNT(*) FROM pg_matviews WHERE matviewname='mymatview') = 0
    THEN
        CREATE MATERIALIZED VIEW "mymatview" AS
            SELECT
                date_trunc('second', dt) dt,
                (array_agg(bid ORDER BY dt ASC))[1] o,
                MAX(bid) h,
                MIN(bid) l,
                (array_agg(bid ORDER BY dt DESC))[1] c,
                SUM(bid_vol) bid_vol,
                SUM(ask_vol) ask_vol,
                COUNT(*) ticks
            FROM "EUR/USD"
            WHERE dt BETWEEN '2016-11-08' AND '2016-11-09'
            GROUP BY date_trunc('second', dt)
            ORDER BY dt;
    END IF;
END  
$$ LANGUAGE plpgsql;
SELECT * FROM mymatview;  

Note that pg_matviews is the name of the system catalog where the materialized views are stored.
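
You can query that catalog directly to see which materialized views exist and whether they're populated; besides the columns below, it also exposes the owner, tablespace, and the defining query:

SELECT schemaname, matviewname, hasindexes, ispopulated
FROM pg_matviews;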

All Cases

We'll again focus on the process since the query you'll materialize can be anything that requires you to do heavy processing.

You'll need to replace mymatview with something that accurately describes your view. That may be obvious and trivial for cases such as balance_2015, whole_results, etc. For our example, we'll need to include the symbol name, the periodicity of the OHLC bars, and possibly the date range, so you'll have to come up with a naming template such as {symbol}_{date_start}_{date_end}_{periodicity}. If you don't normally need to access these tables manually, you could just use an 8-character hash that you compare against a hash table.
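
For instance, under that template the one-hour EUR/USD view for our sample day could be created along these lines (just a sketch to illustrate the naming convention; the trimmed-down column list isn't the full OHLC query from above):

CREATE MATERIALIZED VIEW "eurusd_20161108_20161109_1h" AS
    SELECT
        date_trunc('hour', dt) dt,
        MAX(bid) h,
        MIN(bid) l,
        COUNT(*) ticks
    FROM "EUR/USD"
    WHERE dt BETWEEN '2016-11-08' AND '2016-11-09'
    GROUP BY date_trunc('hour', dt)
    ORDER BY dt;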

Notice that while you can get away with just executing queries with SELECT statements, when creating materialized views within your programs you'll need to commit these queries in order to actually save the newly created view in the DB. In Python, this is just a matter of adding a conn.commit() (replacing conn with the name of your connection object) right after sending the query execution to the cursor. Don't forget this step, otherwise you'll be creating (and not saving) a materialized view every time.

Yes, you'll add an extra query every time since you have to check if the materialized view exists, and even an extra procedure for PostgreSQL 9.3 and 9.4. However, this is extremely fast, and it's something that shouldn't worry you since you don't access balances and heavy queries all that often. Now, if you're making these queries by the thousands in a matter of seconds, then yes, go ahead and batch the creation of materialized views and query them as simply as you possibly can; or even forget about materialized views altogether and load the results into an HDF5 store that you can keep in memory and access on demand.

Updating a Materialized View

A materialized view isn't just a virtual table representing the result of a query (a "view"); the results are stored as an actual ("materialized") table which can be updated from the original tables when needed.

Right, we said before that you can't change the past, except when you do. It may happen that new or corrected historical data becomes available, either filling in gaps or fixing inaccuracies. This is the time when you'll need to re-generate your views.

Since materialized views are stored along with the details of the very query that generated them, handling an update is incredibly easy. You don't need to delete the materialized view and recreate it; this simple SQL statement will do the trick:

REFRESH MATERIALIZED VIEW mymatview;

That simple!
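
As a side note, if you need the view to remain readable while it's being refreshed, PostgreSQL 9.4 and above also support a CONCURRENTLY option (it requires a unique index on the materialized view):

REFRESH MATERIALIZED VIEW CONCURRENTLY mymatview;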

Changing and Deleting a Materialized View

As mentioned, these aren't just views but actual tables that you can operate on in pretty much the same way as regular tables. Just use the MATERIALIZED keyword to tell the engine what kind of operation you want to perform.
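
For instance, renaming and then dropping the mymatview view from our earlier examples looks like this:

ALTER MATERIALIZED VIEW mymatview RENAME TO mymatview_old;
DROP MATERIALIZED VIEW mymatview_old;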

Other DBMS

Materialized views are supported by several databases, though the implementation details may vary from one engine to another, so you are advised to read the documentation.

Actual Support

Oracle was the first database to implement materialized views, adding support in version 8i back in 1998.

Here's what creating a materialized view in Oracle looks like:

CREATE MATERIALIZED VIEW MV_MY_VIEW  
REFRESH FAST START WITH SYSDATE  
  NEXT SYSDATE + 1
    AS SELECT * FROM <table_name>;

For more information, see Managing Read-Only Materialized Views in the Oracle Database Administrator's Guide.

Sybase SQL Anywhere, at least since version 10, has supported materialized views. The latest version as of now, 17, has extensive documentation covering this topic.

Partial Support (or with a Workaround)

IBM DB2, since version 10, has a close-enough implementation called materialized query tables (MQTs). Implementation details vary from what we covered, so check out an introduction to materialized query tables on IBM developerWorks.

Microsoft SQL Server, since version 2005, ships a similar feature called "indexed views". See Improving Performance with SQL Server 2005 Indexed Views on Microsoft TechNet and Create Indexed Views (for SQL Server 2016 and later).

MySQL doesn't quite support materialized views; however, you can work around this by using triggers and stored procedures. See Materialized Views with MySQL for an example of this. There's also a third-party application called Flexviews which implements materialized views on MySQL, though the project hasn't been maintained lately.

Wrap-Up

Materialized views are incredibly useful for saving computation time and are trivially easy to create and maintain: all you need is your DB! They mix the immediate accessibility of a view with the flexibility of an actual table, meaning that you can alter them, run queries against them, and filter their results. Also, whenever new data becomes available, you'll be all set with a simple REFRESH. So keep materialized views in mind whenever some queries are taking you valuable time and you happen to run them somewhat regularly.

Finally, have a look at Robert M. Wysocki's It's a view, it's a table... no, it's a materialized view! where he goes into more technical details such as table locking, which we didn't get to cover here.

Lucero dances, plays music, writes about random topics, leads projects to varying and doubtful degrees of success, and keeps trying to be real and he keeps failing.

This article is licensed with CC-BY-NC-SA 4.0 by Compose.

Image by Mike Wilson