Tutorial: Building a Modern Delivery Data Warehouse with Location Data

By Carston Hernke

Photo by Maarten van den Heuvel on Unsplash

Logistics and last-mile delivery companies have large amounts of assets and people in the real world, getting goods where they need to go. Location is a key part of their business, and gaining an understanding of the location data they already collect can drive efficiency and performance.

Data modeling is a crucial step in the analytics process - a well-planned data model which fits the unique processes and entities of a business will make analytics much easier and more valuable for the organization. This article will present a data model and ELT process for an online retailer which handles its own deliveries.

We will specifically focus on incorporating delivery location data into the data model so it can be used to gain insights into delivery performance. In addition to building the data model, this article will cover how to implement the model into a data warehouse with dbt, DuckDB, and Snowflake.

All of the data and files discussed in this tutorial are available on GitHub. They are open-source and released under the MIT License. Feel free to incorporate them into your work.

Introduction to Delivery Superstore

Readers who have worked in the data industry for a few years may remember working with the classic ‘Superstore’ dataset from Tableau. The Superstore dataset contained data about the sales performance of a large big-box retailer, including tables like ‘customers’, ‘orders’, ‘products’, etc. By joining these tables together, you could build dashboards that surfaced insights such as the most profitable products or sales by region. It provided an excellent way to learn about dimensions and measures in a BI system.

Fast-forward to 2024. Due to lasting changes in consumer behavior driven by the COVID-19 pandemic, Superstore Inc. has seen a dramatic fall in in-store sales. In response to this change in behavior, senior leadership made the decision to switch to a delivery-only business model. While the migration itself was carried out successfully during the pandemic, the business is now struggling with the unit economics of deliveries. Many delivery failures are occurring, and the cost of labor and equipment for deliveries keeps rising. As the VP of Data, you have been asked to build a series of data products which provide insight into delivery performance.

Overview of the Process

There are many ways that this problem can be solved. In this article, we will present a data pipeline that ingests data from source systems into a data warehouse and transforms the data into tables that are ready for analysis, and then create an interactive map that can be shared with leadership.


This process follows the "ELT" (Extract, Load, Transform) pattern: data is first extracted from source systems, loaded into our data warehouse, and then transformed so it is ready for analysis. The article is structured around these steps. However, before we dive into the ELT process, we will build our data model.

Step 0: Building the Data Model

Discovery Process

Before we start setting up any systems or writing code, we first need to understand the business and the types of data that we will be working with in this system, as well as the types of questions that the business wants answered. This upfront ‘discovery’ step is often neglected in technical guides, but a small investment of time and effort upfront will significantly reduce problems later in the process. While specific steps for conducting a successful discovery process are beyond the scope of this article, the following resources may be helpful:

After interviewing various teams across Superstore, Inc., it has emerged that there are 3 main sources of data within the company:

  • A CRM system
  • An Ecommerce system
  • An operational delivery system, including a Driver App

Conceptual Data Model

Data from each of these will need to be incorporated into the data platform. After some more interviews to better understand how the business operates, we can put together the following conceptual data model, which describes how various entities interact with each other to accomplish the business objective of delivering orders to customers.

Superstore Conceptual Data Model

While it may seem like a waste of time to develop such a diagram, it’s much easier to address problems and inconsistencies that you spot at this point in the process, rather than trying to make changes after the data platform is implemented and data is flowing through it.

Logical Data Model

Once we're satisfied with our conceptual data model, we can build a logical data model. A logical data model moves from abstract entities towards tables with fields and relationships between them. At this point, we'll learn what fields are available by talking to the source system owners and reviewing documentation.

Incorporating H3 Indexes into the data model

One of the goals of this project is to provide insights into delivery performance through an interactive map that can be shared with leadership. For location data, we currently have delivery_lat and delivery_lon in the orders table, representing the latitude and longitude of the delivery location, respectively. To make it easier to analyze data by location, let’s add a new field called ‘delivery_h3’ which will be based on the delivery latitude and longitude. H3 is a spatial index system which will ‘bin’ our locations into hexagon-shaped cells. This will effectively transform our latitude and longitude measure columns into a spatial dimension column. For more discussion of the conceptual basis for using H3 to analyze location data, read the previous article here.
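
To make the binning concrete, here is a minimal sketch using the DuckDB h3 community extension (the example coordinates are arbitrary and not part of the Superstore data):

-- Sketch: two nearby example points binned into H3 cells at resolution 10
INSTALL h3 FROM community;
LOAD h3;

SELECT
    h3_latlng_to_cell(52.5200, 13.4050, 10) AS cell_a,  -- example point A
    h3_latlng_to_cell(52.5201, 13.4052, 10) AS cell_b,  -- example point B, a few metres away
    h3_latlng_to_cell(52.5200, 13.4050, 10)
        = h3_latlng_to_cell(52.5201, 13.4052, 10) AS same_cell;  -- nearby points usually share a cell,
                                                                 -- unless they straddle a cell boundary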

Here is the logical data model with the new ‘delivery_h3’ field included in the ‘Orders’ table.

Superstore Logical Data Model

This logical data model will give us a good goal to shoot for as an endpoint of the ELT process. Now that we know what we’re trying to build, let’s go ahead and build it!

A note on technologies: in this article, we are using dbt to run SQL statements that will create different tables in our database (either DuckDB if we are running locally, or Snowflake if we are running in the cloud). The goal is to provide a tangible example of one way it could be implemented. The approach itself is tool-agnostic and can be adapted to many different environments.

Step 1: Extract data from source systems

The first step of the process is extracting data from the source systems. The goal is to get data out of the source system and get it into the data warehouse in raw form - at this step we’re not worried about transforming it yet. In the real world, this is heavily dependent on the source systems themselves and what kinds of interfaces they support. Here are some possible approaches:

  • A built-in integration. For example, Salesforce supports writing data to Snowflake tables.
  • A managed data integration tool, like Fivetran or Airbyte. They will typically support a set of ‘sources’ and ‘sinks’ and run regularly to keep data up-to-date.
  • A custom integration tool, like a Python script which calls an API that the source system provides, generates a JSON file from the response, and then saves it to a staging S3 bucket that Snowflake can access. This could be scheduled to run automatically using tools like cron or Airflow.
  • A file dump - sometimes you don’t need to over-engineer things to start. For data that doesn’t change often, such as a list of products sold, the source system team can often provide an exported CSV or JSON file that can be manually loaded into the data warehouse.

In the case of Delivery Superstore, we will be relying on CSV files that are generated by a Python script. You can view these raw files on GitHub: https://github.com/honeycomb-maps/superstore-delivered

  • crm_customers.csv
  • delivery_data.csv
  • ecommerce_products.csv
  • order_info.csv

Step 2: Load the data into a data warehouse

For this example, we’ll use dbt’s Sources functionality to reference the raw CSV data and build staging tables on top of it. These staging tables will preserve the same structure as the CSV files - one table for each file.
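
Before the staging models can select from these sources, the raw CSV data needs to be reachable from the warehouse. As a minimal sketch (assuming DuckDB as the local warehouse, a local data/ directory, and a raw_data schema - the actual repository may wire this up differently):

-- Sketch: load the raw CSVs into a raw_data schema in a local DuckDB database
-- (file paths and schema name are illustrative assumptions)
CREATE SCHEMA IF NOT EXISTS raw_data;

CREATE OR REPLACE TABLE raw_data.crm_customers AS
    SELECT * FROM read_csv_auto('data/crm_customers.csv');
CREATE OR REPLACE TABLE raw_data.delivery_data AS
    SELECT * FROM read_csv_auto('data/delivery_data.csv');
CREATE OR REPLACE TABLE raw_data.ecommerce_products AS
    SELECT * FROM read_csv_auto('data/ecommerce_products.csv');
CREATE OR REPLACE TABLE raw_data.order_info AS
    SELECT * FROM read_csv_auto('data/order_info.csv');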

Here is the dbt file that defines the staging table for the ‘crm_customers’ table.

WITH source AS (
    SELECT * FROM {{ source('raw_data', 'crm_customers') }}
)

SELECT
    CAST(customer_id AS INTEGER) AS customer_id,
    customer_name,
    email,
    phone,
    segment
FROM source
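
The source('raw_data', 'crm_customers') call refers to a source declared in a dbt YAML file. A minimal sketch of that declaration might look like this (the file name is an assumption; the source and table names follow the source() calls used in the models):

# Sketch: models/staging/sources.yml (file name assumed)
version: 2

sources:
  - name: raw_data
    tables:
      - name: crm_customers
      - name: delivery_data
      - name: ecommerce_products
      - name: order_info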

All of the dbt staging models can be found in the GitHub Repository.

Step 3: Transform the data to fit our data model

Creating Normalized Fact and Dimension Tables

In this step, we will create tables that match the logical data model that we put together when we started this process. We’ll do this by selecting data from our staging tables and performing aggregations and other transformations where needed. In this step we will also add the ‘delivery_h3’ column using the h3_latlng_to_cell() function. In this example we are using DuckDB’s hex() function to convert the integer cell id into a hexadecimal string; this step is optional, as H3 cells can be represented either way.
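
For example (a quick sketch in DuckDB; the coordinates are arbitrary):

-- Sketch: the same H3 cell as an integer id and as a hexadecimal string
SELECT
    h3_latlng_to_cell(40.7128, -74.0060, 10)      AS cell_as_integer,     -- unsigned integer cell id
    hex(h3_latlng_to_cell(40.7128, -74.0060, 10)) AS cell_as_hex_string;  -- same cell, hexadecimal text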

One important question is which H3 resolution to aggregate to. Here is a table of H3 resolutions along with their average cell area and a frame of reference.

Res | Area (m²) | Area (ft²) | Comparison
0 | 4,357,449,416,078 | 46,903,149,769,722 | About 9 times the area of France
1 | 609,788,441,794 | 6,563,701,808,626 | Approximately the land area of Ukraine
2 | 86,801,780,398 | 934,325,684,026 | Slightly larger than Austria
3 | 12,393,434,655 | 133,401,691,283 | Roughly the size of Jamaica
4 | 1,770,347,654 | 19,055,845,113 | Approximately twice the size of New York City
5 | 252,903,858 | 2,722,231,837 | About the size of Amsterdam, Netherlands
6 | 36,129,062 | 388,889,610 | Approximately the size of Central Park in New York City
7 | 5,161,293 | 55,555,642 | Close to the size of Vatican City
8 | 737,327 | 7,936,514 | About 100 soccer/football fields
9 | 105,332 | 1,133,783 | About 15 soccer/football fields
10 | 15,047 | 161,964 | About the size of two soccer/football fields
11 | 2,149 | 23,132 | About the size of an Olympic swimming pool
12 | 307 | 3,305 | Approximately the size of a tennis court
13 | 43.8 | 471 | The area of a small apartment
14 | 6.3 | 68 | About the area of a king-size bed
15 | 0.895 | 10 | Approximately the size of a small coffee table

For most mobility and last-mile use cases in urban areas, a resolution in the 10-12 range is a good choice; this tutorial uses resolution 10, which covers roughly the area of two soccer fields per cell. As long as you store the original lat/lon within the data warehouse, higher resolutions can always be computed at a later point.

Here are the SQL statements for each of the tables in the data model:

dimension_customers.sql

-- dimension_customers.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    customer_name,
    segment
FROM {{ ref('stg_customers') }}

dimension_products.sql

-- dimension_products.sql
{{ config(materialized='table') }}

SELECT
    product_id,
    product_name,
    category,
    sub_category
FROM {{ ref('stg_products') }}

fact_deliveries.sql

-- fact_deliveries.sql
{{ config(materialized='table') }}

SELECT
    delivery_id,
    order_id,
    estimated_delivery_date,
    actual_delivery_date,
    delivery_status
FROM {{ ref('stg_deliveries') }}

fact_order_details.sql

-- fact_order_details.sql
{{ config(materialized='table') }}

SELECT
    order_id,
    product_id,
    sales,
    quantity,
    discount,
    profit
FROM {{ ref('stg_orders') }}

Note that our simulated ecommerce platform only provides a single ‘orders’ file, while our target data model has both fact_orders and fact_order_details. Therefore we need to transform the data, collapsing the order lines to one row per order and aggregating the sales, quantity, and profit fields to create fact_orders, as shown below:

fact_orders.sql

-- fact_orders.sql
{{ config(materialized='table') }}

WITH order_info AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        ship_date,
        ship_mode,
        delivery_address,
        delivery_lat,
        delivery_lon,
        hex(h3_latlng_to_cell(delivery_lat, delivery_lon, 10)) as delivery_h3,
        SUM(sales) as total_sales,
        SUM(quantity) as total_quantity,
        SUM(profit) as total_profit
    FROM {{ ref('stg_orders') }}
    GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
)

SELECT
    order_id,
    customer_id,
    order_date,
    ship_date,
    ship_mode,
    delivery_address,
    delivery_lat,
    delivery_lon,
    delivery_h3,
    total_sales,
    total_quantity,
    total_profit
FROM order_info

Creating Denormalized Tables for Visualization

In the last step, we transformed the raw data into normalized fact and dimension tables. Modeling data and storing it in a normalized form is a great way to avoid duplicate data.

In this step, we want to get the data ready for visualization: a map that gives insight into failed package deliveries. To do that, let's create two queries which will assemble the data we need to make this map.

A table with lat/lon information and delivery status

First, we want to create a map of delivery points which are colored based on whether the delivery was successful or not. To do this, we’ll need to write a query that joins the orders table with the deliveries table and selects the relevant fields.

Here’s the query:

-- models/order_locations_delivery_success.sql

{{ config(materialized='table') }}

SELECT
    o.order_id,
    o.delivery_lat AS latitude,
    o.delivery_lon AS longitude,
    CASE
        WHEN d.delivery_status = 'Delivered' THEN TRUE
        WHEN d.delivery_status = 'Failed' THEN FALSE
        ELSE NULL  -- for any other status or if status is unknown
    END AS delivery_successful
FROM {{ ref('fact_orders') }} o
LEFT JOIN {{ ref('fact_deliveries') }} d ON o.order_id = d.order_id

We will name this table ‘order_locations_delivery_success’. In the next article, we’ll make a map with this data.

However, this approach will produce a single row for each order. In cases where we have a huge amount of data, or we don’t want to give our viewers access to individual latitude/longitude data (which could be used to identify the precise location of a person), we can take a different approach where we pre-aggregate metrics for each h3 cell.

A pre-aggregated table with statistics for each h3 cell

To do that, we’ll use the delivery_h3 field that we created in the transformation step as a dimension to group our data by and calculate metrics for.

For each h3 cell, we’ll calculate the following metrics from the data:

  • Number of orders
  • Total sales
  • Number of failed deliveries
  • Number of total deliveries
  • Percentage of failed deliveries

Here is a SQL query which will generate the metrics above for each h3 cell:

-- metrics_by_h3.sql
{{ config(materialized='table') }}

WITH order_delivery_data AS (
    SELECT
        o.delivery_h3,
        o.order_id,
        o.total_sales,
        d.delivery_status
    FROM {{ ref('fact_orders') }} o
    LEFT JOIN {{ ref('fact_deliveries') }} d ON o.order_id = d.order_id
),

aggregated_data AS (
    SELECT
        delivery_h3,
        COUNT(DISTINCT order_id) AS number_of_orders,
        SUM(total_sales) AS total_sales,
        SUM(CASE WHEN delivery_status = 'Failed' THEN 1 ELSE 0 END) AS failed_deliveries,
        COUNT(delivery_status) AS total_deliveries
    FROM order_delivery_data
    GROUP BY delivery_h3
)

SELECT
    delivery_h3,
    number_of_orders,
    total_sales,
    failed_deliveries,
    CASE
        WHEN total_deliveries > 0 THEN CAST(failed_deliveries AS FLOAT) / total_deliveries * 100
        ELSE 0
    END AS pct_failed_deliveries
FROM aggregated_data

We can materialize this query as a table (or, on warehouses that support it, as a materialized view), or create it as a view. If we create a table, we need to run dbt on a schedule so the data is refreshed as new deliveries occur. If we create a view, the query runs each time the view is queried, ensuring that the returned data is up-to-date. Whether to use a view or a table depends on the amount of data being processed, the cost of processing it, and the freshness required.
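
Switching between the two is a one-line change in the model's config block (shown here as a sketch of the view variant):

-- Sketch: build metrics_by_h3 as a view instead of a table
{{ config(materialized='view') }}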

By pre-aggregating metrics by h3 cell, we reduce the number of rows that need to be processed by our visualization tool, greatly reducing the loading time and making it possible to visualize datasets with billions of rows, while maintaining interactivity.

In the next article, we will visualize both the point and h3 metrics tables that we created in Honeycomb.


Working on solving similar problems? Feel free to reach out to discuss your project. We've done this before and are happy to help advise you.