Compose PostgreSQL for Your Segment Warehouse

Learn how to set up your Segment warehouse on Compose PostgreSQL. This article walks you through the process - it's as easy as 1-2-3.

Segment is an analytics API with integrations to many marketing and reporting tools. At Compose we use Segment to hook together our web analytics, email, and social tracking, so we were very excited to see them announce warehouses for all that tracking data. Segment warehouses let you set up one or more warehouses for each of your Segment workspaces using either PostgreSQL or Amazon Redshift. We opted for PostgreSQL, of course, since that's what we already use in-house for our account and financial reporting. Now we can use SQL, with all the power of Postgres, to query our Segment data and create custom reports.

We've got our Segment warehouse up and running on one of our instances of Compose PostgreSQL. Follow along to see how you can do it, too.

Step 1: Set Up PostgreSQL

If you don't yet have an account with Compose, sign up and select the PostgreSQL database to get started. If you're already on Compose but don't yet have a PostgreSQL instance, you can add one from the Deployments page in our management console by clicking "Create Deployment" and then selecting PostgreSQL, or just click here to get a PostgreSQL deployment added to your account.

Once your PostgreSQL deployment is spun up, you may want to create a user to be the owner of the database you'll use for Segment. An admin user role is generated when your deployment is initialized, but since that's the superuser for your deployment, you may want to create additional users with more specific privileges. You may also want to manually scale up your deployment for the initial load of Segment data, since Segment loads the past 2 months of data by default (learn more in the section below called "An important note on usage"); you can then scale back down according to your data needs after the initial load. Our easy-to-use management console lets you perform these tasks, monitor your deployments, configure security settings, manage backups, and more.
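
If you prefer SQL to the console for creating that user, it's a one-liner. Here's a minimal sketch; the role name and password below are placeholders we've invented, so substitute your own (as you'll see, we simply reused our existing "compose" user):

 -- Create a login role to own the Segment database (name and password are placeholders)
 CREATE ROLE segment_owner WITH LOGIN PASSWORD 'choose-a-strong-password';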

Now, all you need to do is create a database where your Segment data will live. You can create a database directly from the Data Browser interface in the Compose management console, with a tool such as the pgAdmin GUI, or programmatically from code you've written. For simplicity, we named our database "segment" and made our "compose" user its owner. Here is our SQL statement to create the database for Segment data, using the default PostgreSQL arguments (set yours appropriately for your requirements):

 -- Create DB for Segment
 CREATE DATABASE segment
   WITH OWNER = compose
        ENCODING = 'SQL_ASCII'
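        -- Note: if your deployment's default encoding differs from SQL_ASCII,
        -- you may also need TEMPLATE = template0 for this statement to succeed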
        TABLESPACE = pg_default
        LC_COLLATE = 'C'
        LC_CTYPE = 'C'
        CONNECTION LIMIT = -1;

And that's it! You don't even need to create any tables - Segment will handle that for you.

Step 2: Set Up Segment

If you're not already using Segment tracking, you can sign up on the Segment website.

Adding a warehouse

If you're like us and you're already using Segment, you can add a warehouse directly from your workspace. There's a new link in the left-hand navigation called "warehouses". This is where you "add warehouse":

add a warehouse

Select "Postgres" then click to "connect to warehouse":

select Postgres

Fill in the details for your connection. You can get your host and port from the "Connection Info" section of your Compose deployment Overview page. We recommend using the username and associated password for the user you assigned as owner of the database when you created it. In our case that's the "compose" user for our new "segment" database:

connection info

After you click "save and connect", you should see that the connection was made successfully to your PostgreSQL database. You'll be asked to give the warehouse connection a nickname in Segment. We kept things simple and just used "Postgres". You can change this nickname later if you choose. After that, click "save warehouse":

connection success

If there is a problem with the connection, you'll get an error screen like this (we generated this one on purpose, to illustrate connection errors, by connecting to an instance that did not have our "segment" database):

connection error

Once you have successfully added a warehouse, you'll see it listed on the "warehouses" page:

warehouses

Configuring and monitoring the warehouse

At this point, you may want to adjust the warehouse settings. If you click on the warehouse you just created, you'll see a "settings" item in the left-hand navigation that is specific to the warehouse. In our case, we have 3 projects in our workspace, but we only want one of them (our production project) to be actively synced to our PostgreSQL DB:

syncing

Configure warehouse settings appropriate for your environment.

In the meantime, syncing will be underway. You can check the progress of this on the "run history" page:

run history

You'll also be able to see here, for each project you're syncing, all the previous syncs and when the next one is scheduled. Syncing happens according to the plan level you eventually sign up for. When you're just getting started like this, you'll get the "free" level for 14 days, which promises updates every 24 hours. Our experience, however, shows that syncing is happening for us every couple of hours on average.

If you click into one of the syncs listed under "run history", you'll get the details for that sync, such as the amount of time the sync took, how long ago it ran, how many updates were made for each collection (a collection corresponds to a table that has been created in your PostgreSQL database), and whether there were any warnings or errors. Here's what our initial sync looked like:

initial sync

An important note on usage

Note that the "free" level you'll get during the trial period only allows 50,000 rows to be added or updated each month. You can check your usage under the "billing" area of your workspace. Here's what ours looked like before any data had synced:

initial usage

But... be aware: the initial data load will load the past 2 months of data in one fell swoop. Depending on your implementation of Segment tracking and integrations, that might be a lot of data. For us, that automatically put us over the "free" limit for the month, even though we are still in the trial period. Based on how many rows we see loaded for each of the past two months, we know we will need to sign up for one of the non-free plan levels at the conclusion of our trial. This is not a deterrent for us, but we wanted to mention it so you are not surprised by your usage during your trial. Here's what our usage looks like now, a couple of days after adding the warehouse:

usage now

So, if you have a lot of Segment data from the past two months, you should manually scale up your PostgreSQL deployment on Compose before you add the warehouse, so it can handle the initial load from Segment. You can then scale it back down as appropriate for a regular month's load. To scale your deployment manually, go to the Overview page for your deployment in the Compose management console and select, in the "Scaling" section, an amount appropriate for how much data you expect from the past 2 months. For us this was not an issue, since our 1 million-ish rows add up to less than 1 GB, but you may have many more tables, more columns per table, or larger data values than we do. We just want to make sure you load data with your eyes wide open! Segment does not tell you the size of your data, but you can get a ballpark idea of how many rows you'll be adding in one month by checking the number of API calls for your integrations:

integrations usage
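
Once the initial sync has run, you can also check from the Postgres side roughly how many rows landed per table, using PostgreSQL's statistics views. A quick sketch, assuming your synced project schema is named "production" like ours:

SELECT relname AS table_name, n_live_tup AS approx_rows
FROM pg_stat_user_tables
WHERE schemaname = 'production'
ORDER BY n_live_tup DESC;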

Your Segment warehouse is now set up.

Step 3: Browse & Query

And now the fun part - browsing and querying the data!

You'll notice in your PostgreSQL database that a new schema has been created for each project that was synced. We initially synced all our projects (dev, marketing-prod, and production) before we went back and configured syncing to run only for our production project. Under our production project schema we'll see that a whole bunch of tables were created. Here are our tables in the Compose data browser "Tables" view:

Compose data browser

When the PostgreSQL database is loaded, several tables are created by default: aliases, groups, identifies, pages, screens, and tracks. You may also have accounts and users tables, which hold the unique records from your group and identify calls. To learn more about these default tables and their fields, see the Segment schema documentation.
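
If you'd rather list those tables with a query than browse them, the information schema has you covered. A quick sketch, assuming your project schema is named "production" as ours is:

SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'production'
ORDER BY table_name;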

All of the other tables will be event-specific, according to the event names and properties you use in your track calls. The number of tables will depend on the number of unique events you're tracking. For example, we use a track call when customers view their deployments, such as:

analytics.track('deployments_show', {  
  deployment_name: 'heroic-rabbitmq-62', 
  deployment_type: 'RabbitMQ'
});

In our Postgres Segment database, we'll see a table named "deployments_show" which we can query for that deployment to see how many times it was viewed:

SELECT COUNT(id)  
 -- Don't forget the schema: FROM <project>.<table>
FROM production.deployments_show  
WHERE deployment_name = 'heroic-rabbitmq-62';  

The result is 18 times in the past 2 months. That was me - I was writing an article about configuring RabbitMQ in November. We can verify that by joining to the identifies table, which contains user data, through the user_id foreign key:

SELECT DISTINCT i.name  
FROM production.identifies i  
JOIN production.deployments_show ds ON ds.user_id = i.user_id  
WHERE ds.deployment_name = 'heroic-rabbitmq-62';  

A more interesting query for us, however, might be to see how many deployments were created in November using our "deployments_new" event:

SELECT COUNT(DISTINCT id)  
FROM production.deployments_new  
WHERE original_timestamp >= '2015-11-01'  
AND original_timestamp < '2015-12-01';  

In this way we can create custom reports and perform analysis on our tracking data, using SQL as simple or as complex as we need, to expose insights that our Segment-integrated tracking tools may not be able to surface easily.
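
For example, a month-by-month breakdown of new deployments by type is only a GROUP BY away. This sketch assumes the deployments_new table carries a deployment_type column, mirroring the deployments_show property shown above; adjust it to whatever properties your own events track:

SELECT date_trunc('month', original_timestamp) AS month,
       deployment_type,
       COUNT(DISTINCT id) AS deployments
FROM production.deployments_new
GROUP BY 1, 2
ORDER BY 1, 2;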

All the power of PostgreSQL for Segment data - it's that easy. Segment has written an article on useful queries that you may want to read to get started on pulling data and creating reports from your warehouse.

Once you've got the data in Postgres, you can do so much more with it, too. One idea might be to develop an app that performs various functions based on the different events being loaded to the database, potentially using RabbitMQ as your asynchronous message broker. An example could be that you want a banner to display on your website once your 1000th customer has signed up. The data is at your fingertips; you just need to decide how you want to use it.
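
As a trivial sketch of that last idea, a scheduled job could poll the identifies table and flip the banner on once the distinct-user count crosses a threshold; the threshold and schema name here are purely illustrative:

SELECT COUNT(DISTINCT user_id) >= 1000 AS show_banner
FROM production.identifies;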

Kick Off Your Warehouse

Now you're ready to kick off your Segment warehouse using Compose hosted PostgreSQL. As always, we advise you to check out the official Segment warehouses documentation for more information, along with their warehouse-specific support section, but as you can see from this article, it's a simple process using Compose PostgreSQL.
