
Developer Blog | dbt Developer Hub

Find tutorials, product updates, and developer insights in the dbt Developer Blog.

Start here

· 5 min read
Joel Labes
The Bottom Line:

You should split your Jobs across Environments in dbt Cloud based on their purposes (e.g. Production and Staging/CI) and set one environment as Production. This will improve your CI experience and enable you to use dbt Explorer.

Environmental segmentation has always been an important part of the analytics engineering workflow:

  • When developing new models, you can process a smaller subset of your data by using target.name or an environment variable (see the sketch after this list).
  • By building your production-grade models into a different schema and database, you can experiment in peace without being worried that your changes will accidentally impact downstream users.
  • Using dedicated credentials for production runs, instead of an analytics engineer's individual dev credentials, ensures that things don't break when that long-tenured employee finally hangs up their IDE.
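
As a quick illustration of the first point, here is a minimal sketch of using target.name to limit the data scanned during development. The jaffle_shop source, orders table, and "dev" target name are assumptions for the example, not part of the original post:

select *
from {{ source('jaffle_shop', 'orders') }}
{% if target.name == 'dev' %}
-- In development, only process the most recent few days to keep runs fast
where order_date >= dateadd('day', -3, current_date)
{% endif %}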

Historically, dbt Cloud required a separate environment for Development, but was otherwise unopinionated in how you configured your account. This mostly just worked – as long as you didn't have anything more complex than a CI job mixed in with a couple of production jobs – because important constructs like deferral in CI and documentation were only ever tied to a single job.

But as companies' dbt deployments have grown more complex, it doesn't make sense to assume that a single job is enough anymore. We need to exchange a job-oriented strategy for a more mature and scalable environment-centric view of the world. To support this, a recent change in dbt Cloud enables project administrators to mark one of their environments as the Production environment, just as has long been possible for the Development environment.

Explicitly separating your Production workloads lets dbt Cloud be smarter with the metadata it creates, and is particularly important for two new features: dbt Explorer and the revised CI workflows.

· 6 min read
Kshitij Aranke
Doug Beatty

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs. One of the coolest moments of my career here thus far has been shipping the new dbt clone command as part of the dbt-core v1.6 release.

However, one of the questions I’ve received most frequently is guidance around “when” to clone that goes beyond the documentation on “how” to clone. In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

  1. What is dbt clone?
  2. How is it different from deferral?
  3. Should I defer or should I clone?

· 11 min read
Amy Chen

Introduction

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. An enterprise customer I was working with, JetBlue, asked me for help running their dbt models every 2 minutes to meet a 5-minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created lambda views to meet the need.

Flash forward to 2023. I’m writing this as my giant dog snores next to me (don’t worry, the cats have multiplied as well). JetBlue has outgrown lambda views due to performance constraints (a view can only be so performant) and we are at another milestone in dbt’s journey to support streaming. What. a. time.

Today we are announcing that we now support Materialized Views in dbt. So, what does that mean?
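
As a rough sketch of what this can look like in a model file (the model and column names are illustrative, and exact materialization support depends on your adapter and dbt version), a model opts in through its config block:

{{ config(materialized='materialized_view') }}

select
    order_id,
    sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by 1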

· 8 min read
Pedro Brito de Sa

Whether you are building your pipelines in dbt for the first time or just adding a new model once in a while, good documentation and testing should always be a priority for you and your team. Why do we avoid it like the plague, then? Because it’s a hassle to write down each individual field, describe it in layman’s terms, and figure out what tests should be performed to ensure the data is fine and dandy. How can we make this process faster and less painful?

By now, everyone knows the wonders of the GPT models for code generation and pair programming, so this shouldn’t come as a surprise. But ChatGPT really shines at inferring the context of verbosely named fields from database table schemas. So in this post I am going to help you 10x your documentation and testing speed by using ChatGPT to do most of the legwork for you.
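
For a sense of the output we're after, this is the kind of model .yml you might ask ChatGPT to draft from a table schema. The model and column names here are hypothetical, and any generated descriptions and tests should always be reviewed by a human:

version: 2

models:
  - name: stg_payments
    description: One record per payment captured by the payment gateway.
    columns:
      - name: payment_id
        description: Unique identifier of the payment.
        tests:
          - unique
          - not_null
      - name: payment_method
        description: Method used to pay, e.g. credit_card or bank_transfer.
        tests:
          - accepted_values:
              values: ['credit_card', 'bank_transfer', 'coupon', 'gift_card']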

· 15 min read
Rastislav Zdechovan
Sean McIntyre

Data Vault 2.0 is a data modeling technique designed to help scale large data warehousing projects. It is a rigid, prescriptive system detailed vigorously in a book that has become the bible for this technique.

So why Data Vault? Have you experienced a data warehousing project with 50+ data sources, with 25+ data developers working on the same data platform, or data spanning 5+ years with two or more generations of source systems? If not, it might be hard to initially understand the benefits of Data Vault, and maybe Kimball modelling is better for you. But if you are in any of the situations listed, then this is the article for you!

· 14 min read
Santiago Jauregui

Introduction

Most data modeling approaches for customer segmentation are based on a wide table with user attributes. This table only stores the current attributes for each user, and is then loaded into the various SaaS platforms via Reverse ETL tools.

Take for example a Customer Experience (CX) team that uses Salesforce as a CRM. Users create tickets to ask for assistance, and the CX team attends to them in the order they are created. This is a good first approach, but not a data-driven one.

An improvement to this would be to prioritize the tickets based on the customer segment, answering our most valuable customers first. An Analytics Engineer can build a segmentation to identify the power users (for example with an RFM approach) and store it in the data warehouse. The Data Engineering team can then export that user attribute to the CRM, allowing the customer experience team to build rules on top of it.
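
To make the RFM idea concrete, here is a hedged sketch of how an analytics engineer might derive simple recency/frequency/monetary scores in the warehouse before syncing them to the CRM. The table and column names (fct_orders, ordered_at, order_total) are invented for illustration:

with orders as (
    select * from {{ ref('fct_orders') }}
),

user_metrics as (
    select
        user_id,
        datediff('day', max(ordered_at), current_date) as days_since_last_order,  -- recency
        count(*) as order_count,                                                  -- frequency
        sum(order_total) as lifetime_value                                        -- monetary
    from orders
    group by 1
)

-- Bucket users into quartiles along each dimension
select
    user_id,
    ntile(4) over (order by days_since_last_order) as recency_score,
    ntile(4) over (order by order_count desc) as frequency_score,
    ntile(4) over (order by lifetime_value desc) as monetary_score
from user_metrics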

· 9 min read
Mikael Thorup

At Lunar, most of our dbt models are sourcing from event-driven architecture. As an example, we have the following models for our activity_based_interest folder in our ingestion layer:

  • activity_based_interest_activated.sql
  • activity_based_interest_deactivated.sql
  • activity_based_interest_updated.sql
  • downgrade_interest_level_for_user.sql
  • set_inactive_interest_rate_after_july_1st_in_bec_for_user.sql
  • set_inactive_interest_rate_from_july_1st_in_bec_for_user.sql
  • set_interest_levels_from_june_1st_in_bec_for_user.sql

This results in a lot of the same columns (e.g. account_id) existing in different models, across different layers. This means I end up:

  1. Writing/copy-pasting the same documentation over and over again
  2. Halfway through, realizing I could improve the wording to make it easier to understand, and going back to update the .yml files I’ve already written
  3. Realizing I made a syntax error in a .yml file, and going back to fix it
  4. Realizing the columns are defined differently, with different wording, in other folders of our dbt project
  5. Reconsidering my choice of career and praying that a large language model will steal my job
  6. Considering if there’s a better way to be generating documentation used across different models

· 18 min read
Sterling Paramore

This article covers an approach to handling time-varying ragged hierarchies in a dimensional model. These kinds of data structures are commonly found in manufacturing, where components of a product have both parents and children of arbitrary depth and those components may be replaced over the product's lifetime. The strategy described here simplifies many common types of analytical and reporting queries.

To help visualize this data, we're going to pretend we are a company that manufactures and rents out eBikes in a ride share application. When we build a bike, we keep track of the serial numbers of the components that make up the bike. Any time something breaks and needs to be replaced, we track the old parts that were removed and the new parts that were installed. We also precisely track the mileage accumulated on each of our bikes. Our primary analytical goal is to be able to report on the expected lifetime of each component, so we can prioritize improving that component and reduce costly maintenance.

Data model

Obviously, a real bike could have a hundred or more separate components. To keep things simple for this article, let's just consider the bike, the frame, a wheel, the wheel rim, tire, and tube. Our component hierarchy looks like:

eBike Hierarchy

This hierarchy is ragged because different paths through the hierarchy terminate at different depths. It is time-varying because specific components can be added and removed.

Now let's take a look at how this data is represented in our source data systems and how it can be transformed to make analytics queries easier.

Transactional model

Our ERP system (Enterprise Resource Planning) contains records that log when a specific component serial number (component_id) was installed in or removed from a parent assembly component (assembly_id). The top-most assembly component is the eBike itself, which has no parent assembly. So when an eBike (specifically, the eBike with serial number "Bike-1") is originally constructed, the ERP system would contain records that look like the following.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
|             | Bike-1       | 2023-01-01   |            |
| Bike-1      | Frame-1      | 2023-01-01   |            |
| Bike-1      | Wheel-1      | 2023-01-01   |            |
| Wheel-1     | Rim-1        | 2023-01-01   |            |
| Wheel-1     | Tire-1       | 2023-01-01   |            |
| Tire-1      | Tube-1       | 2023-01-01   |            |

Now let's suppose this bike has been ridden for a while, and on June 1, the user of the bike reported a flat tire. A service technician then went to the site, replaced the tube that was in the wheel, and installed a new one. They logged this in the ERP system, causing one record to be updated with a removed_at date, and another record to be created with the new tube component_id.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
| ...         | ...          | ...          | ...        |
| Tire-1      | Tube-1       | 2023-01-01   | 2023-06-01 |
| Tire-1      | Tube-2       | 2023-06-01   |            |
| ...         | ...          | ...          | ...        |

After a few more months, there is a small crash. Don't worry, everyone's OK! However, the wheel (Wheel-1) is totally broken and must be replaced (with Wheel-2). When the technician updates the ERP, the entire hierarchy under the replaced wheel is also updated, as shown below.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
| Bike-1      | Wheel-1      | 2023-01-01   | 2023-08-01 |
| Wheel-1     | Rim-1        | 2023-01-01   | 2023-08-01 |
| Wheel-1     | Tire-1       | 2023-01-01   | 2023-08-01 |
| Tire-1      | Tube-2       | 2023-06-01   | 2023-08-01 |
| Bike-1      | Wheel-2      | 2023-08-01   |            |
| Wheel-2     | Rim-2        | 2023-08-01   |            |
| Wheel-2     | Tire-2       | 2023-08-01   |            |
| Tire-2      | Tube-3       | 2023-08-01   |            |

After all of the above updates and additions, our ERP data looks like the following.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
|             | Bike-1       | 2023-01-01   |            |
| Bike-1      | Frame-1      | 2023-01-01   |            |
| Bike-1      | Wheel-1      | 2023-01-01   | 2023-08-01 |
| Wheel-1     | Rim-1        | 2023-01-01   | 2023-08-01 |
| Wheel-1     | Tire-1       | 2023-01-01   | 2023-08-01 |
| Tire-1      | Tube-1       | 2023-01-01   | 2023-06-01 |
| Tire-1      | Tube-2       | 2023-06-01   | 2023-08-01 |
| Bike-1      | Wheel-2      | 2023-08-01   |            |
| Wheel-2     | Rim-2        | 2023-08-01   |            |
| Wheel-2     | Tire-2       | 2023-08-01   |            |
| Tire-2      | Tube-3       | 2023-08-01   |            |

So that's all fine and good from the perspective of the ERP system. But this data structure can be difficult to work with if we want to generate reports that calculate the total mileage accumulated on various components, or the average mileage of a particular component type, or how one component type might affect the lifetime of another component.

Multivalued dimensional model

In dimensional modeling, we have fact tables that contain measurements and dimension tables that contain the context for those measurements (attributes). In our eBike data warehouse, we have a fact table that contains one record for each eBike for each day it is ridden and the measured mileage accumulated during rides that day. This fact table contains surrogate key columns, indicated by the _sk suffix. These are usually system-generated keys used to join to other tables in the database; the specific values of these keys are not important.

fct_daily_mileage:

| bike_sk | component_sk | ride_at    | miles |
|---------|--------------|------------|-------|
| bsk1    | csk1         | 2023-01-01 | 3     |
| bsk1    | csk1         | 2023-01-02 | 2     |
| bsk1    | csk1         | 2023-01-03 | 0     |
| bsk1    | csk1         | 2023-01-04 | 0     |
| ...     | ...          | ...        | ...   |
| bsk1    | csk3         | 2023-08-01 | 7     |
| bsk1    | csk3         | 2023-08-02 | 8     |
| bsk1    | csk3         | 2023-08-03 | 4     |

One of the dimension tables is a simple table containing information about the individual bikes we have manufactured.

dim_bikes:

| bike_sk | bike_id | color  | model_name   |
|---------|---------|--------|--------------|
| bsk1    | Bike-1  | Orange | Wyld Stallyn |

There is a simple many-to-one relationship between fct_daily_mileage and dim_bikes. If we need to calculate the total mileage accumulated for each bike in our entire fleet of eBikes, we just join the two tables and aggregate on the miles measurement.

select
    dim_bikes.bike_id,
    sum(fct_daily_mileage.miles) as miles
from
    fct_daily_mileage
inner join
    dim_bikes
on
    fct_daily_mileage.bike_sk = dim_bikes.bike_sk
group by
    1

Extending this to determine whether orange bikes get more use than red bikes, or whether certain models are preferred, is similarly straightforward.
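
For instance, a sketch of the color comparison, using only the columns shown in dim_bikes and fct_daily_mileage above, might look like:

select
    dim_bikes.color,
    sum(fct_daily_mileage.miles) as miles
from
    fct_daily_mileage
inner join
    dim_bikes
on
    fct_daily_mileage.bike_sk = dim_bikes.bike_sk
group by
    1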

Dealing with all of the components is more complicated because many components are installed on the same day. The relationship between the days when bikes are ridden and the components is thus multivalued. In dim_bikes, there is one record per bike and surrogate key. Our components dimension, however, will have multiple records sharing the same surrogate key and is therefore a multivalued dimension. To make things even more complicated, the components can change from day to day.

To construct the multivalued dimension table, we break the time-varying component hierarchy into distinct ranges of time during which all of the components in a particular bike remain constant. At the points in time where components change, a new surrogate key is created. The final dimension table for our example above looks like the following, where valid_from_at and valid_to_at represent the beginning and end of a range of time during which all the components of an eBike remain unchanged.

mdim_components:

| component_sk | assembly_id | component_id | depth | installed_at | removed_at | valid_from_at | valid_to_at |
|--------------|-------------|--------------|-------|--------------|------------|---------------|-------------|
| csk1         |             | Bike-1       | 0     | 2023-01-01   |            | 2023-01-01    | 2023-06-01  |
| csk1         | Bike-1      | Frame-1      | 1     | 2023-01-01   |            | 2023-01-01    | 2023-06-01  |
| csk1         | Bike-1      | Wheel-1      | 1     | 2023-01-01   | 2023-08-01 | 2023-01-01    | 2023-06-01  |
| csk1         | Wheel-1     | Rim-1        | 2     | 2023-01-01   | 2023-08-01 | 2023-01-01    | 2023-06-01  |
| csk1         | Wheel-1     | Tire-1       | 2     | 2023-01-01   | 2023-08-01 | 2023-01-01    | 2023-06-01  |
| csk1         | Tire-1      | Tube-1       | 3     | 2023-01-01   | 2023-06-01 | 2023-01-01    | 2023-06-01  |
| csk2         |             | Bike-1       | 0     | 2023-01-01   |            | 2023-06-01    | 2023-08-01  |
| csk2         | Bike-1      | Frame-1      | 1     | 2023-01-01   |            | 2023-06-01    | 2023-08-01  |
| csk2         | Bike-1      | Wheel-1      | 1     | 2023-01-01   | 2023-08-01 | 2023-06-01    | 2023-08-01  |
| csk2         | Wheel-1     | Rim-1        | 2     | 2023-01-01   | 2023-08-01 | 2023-06-01    | 2023-08-01  |
| csk2         | Wheel-1     | Tire-1       | 2     | 2023-01-01   | 2023-08-01 | 2023-06-01    | 2023-08-01  |
| csk2         | Tire-1      | Tube-2       | 3     | 2023-06-01   | 2023-08-01 | 2023-06-01    | 2023-08-01  |
| csk3         |             | Bike-1       | 0     | 2023-01-01   |            | 2023-08-01    |             |
| csk3         | Bike-1      | Frame-1      | 1     | 2023-01-01   |            | 2023-08-01    |             |
| csk3         | Bike-1      | Wheel-2      | 1     | 2023-08-01   |            | 2023-08-01    |             |
| csk3         | Wheel-2     | Rim-2        | 2     | 2023-08-01   |            | 2023-08-01    |             |
| csk3         | Wheel-2     | Tire-2       | 2     | 2023-08-01   |            | 2023-08-01    |             |
| csk3         | Tire-2      | Tube-3       | 3     | 2023-08-01   |            | 2023-08-01    |             |

Now, let's look at how this structure can help in writing queries. In a later section of this article, we'll examine the SQL code that can take our ERP table and convert it into this dimensional model.

Mileage for a component

Suppose we wanted to know the total mileage accumulated on "Wheel-1". The SQL code for determining this is very similar to that for determining the mileage for a given bike.

select
    mdim_components.component_id,
    sum(fct_daily_mileage.miles) as miles
from
    fct_daily_mileage
inner join
    mdim_components
on
    fct_daily_mileage.component_sk = mdim_components.component_sk
where
    mdim_components.component_id = 'Wheel-1'
group by
    1
Caution

When working with multivalued dimensions, you need to be careful interpreting aggregations. For example, suppose we chose to aggregate on top_assembly_id (to reduce clutter, this field is not shown in the data model above because it is just "Bike-1" for each record). This aggregation would over-count the total mileage on that top assembly because the join would produce a Cartesian product, giving us a "fan-out" situation.
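
One way to guard against this fan-out, shown here as a sketch rather than a definitive pattern, is to keep the join many-to-one by aggregating only the depth-0 record of each surrogate key (the component_hierarchy_depth column, abbreviated as depth in the table above):

select
    mdim_components.component_id as top_assembly_id,
    sum(fct_daily_mileage.miles) as miles
from
    fct_daily_mileage
inner join
    mdim_components
on
    fct_daily_mileage.component_sk = mdim_components.component_sk
where
    -- Only one record per surrogate key sits at depth 0 (the top assembly itself),
    -- so the join stays many-to-one and the miles are not over-counted
    mdim_components.component_hierarchy_depth = 0
group by
    1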

Bonus: Finding components installed at the same time as other components

This structure simplifies other kinds of interesting analysis. Suppose we wanted to start exploring how one component affects another, like whether certain brands of tube needed to be replaced more often if they were in a new brand of tire. We can do this by partitioning the data into the segments of time where the components are not changing and looking for other components installed at the same time. For example, to find all of the components that were ever installed at the same time "Tube-3" was installed, we can collect them with a simple window function. We could then use the results of this query in a regression or other type of statistical analysis.

select distinct
    component_id
from
    mdim_components
qualify
    sum(iff(component_id = 'Tube-3', 1, 0)) over (partition by valid_from_at, valid_to_at) > 0

SQL code to build the dimensional model

Now we get to the fun part! This section shows how to take the ERP source data and turn it into the multivalued dimensional model. This SQL code was written and tested using Snowflake, but should be adaptable to other dialects.

Traversing the hierarchy

The first step will be to traverse the hierarchy of components to find all components that belong to the same top assembly. In our example above, we only had one bike and thus just one top assembly; in a real system, there will be many (and we may even swap components between different top assemblies!).

The key here is to use a recursive join to move from the top of the hierarchy down to all children and grandchildren. Records at the top of the hierarchy are easy to identify because they are the only ones without a parent.

with recursive
-- Contains our source data with records that link a child to a parent
components as (
    select
        *,
        -- Valid dates start as installed/removed, but may be modified as we traverse the hierarchy below
        installed_at as valid_from_at,
        removed_at as valid_to_at
    from
        erp_components
),

-- Get all the source records that are at the top of the hierarchy
top_assemblies as (
    select * from components where assembly_id is null
),

-- This is where the recursion happens that traverses the hierarchy
traversal as (
    -- Start at the top of the hierarchy
    select
        -- Keep track of the depth as we traverse down
        0 as component_hierarchy_depth,
        -- Flag to determine if we've entered a circular relationship
        false as is_circular,
        -- Define an array that will keep track of all of the ancestors of a component
        [component_id] as component_trace,
        -- At the top of the hierarchy, the component is the top assembly
        component_id as top_assembly_id,

        assembly_id,
        component_id,

        installed_at,
        removed_at,
        valid_from_at,
        valid_to_at
    from
        top_assemblies

    union all

    -- Join the current layer of the hierarchy with the next layer down by linking
    -- the current component id to the assembly id of the child
    select
        traversal.component_hierarchy_depth + 1 as component_hierarchy_depth,
        -- Check for any circular dependencies
        array_contains(components.component_id::variant, traversal.component_trace) as is_circular,
        -- Append to the trace array
        array_append(traversal.component_trace, components.component_id) as component_trace,
        -- Keep track of the top of the assembly
        traversal.top_assembly_id,

        components.assembly_id,
        components.component_id,

        components.installed_at,
        components.removed_at,
        -- As we recurse down the hierarchy, we only want to consider time ranges where both
        -- parent and child are installed; so choose the latest "from" timestamp and the earliest "to".
        greatest(traversal.valid_from_at, components.valid_from_at) as valid_from_at,
        least(traversal.valid_to_at, components.valid_to_at) as valid_to_at
    from
        traversal
    inner join
        components
    on
        traversal.component_id = components.assembly_id
        and
        -- Exclude component assemblies that weren't installed at the same time
        -- This may happen due to source data quality issues
        (
            traversal.valid_from_at < components.valid_to_at
            and
            traversal.valid_to_at >= components.valid_from_at
        )
    where
        -- Stop if a circular hierarchy is detected
        not array_contains(components.component_id::variant, traversal.component_trace)
        -- There can be some bad data that might end up in hierarchies that are artificially extremely deep
        and traversal.component_hierarchy_depth < 20
),

final as (
    -- Note that there may be duplicates at this point (thus "distinct").
    -- Duplicates can happen when a component's parent is moved from one grandparent to another.
    -- At this point, we have only traced the ancestry of a component and adjusted the valid from/to dates
    -- so that all child ranges are contained in parent ranges.
    select distinct *
    from
        traversal
    where
        -- Prevent zero-time (or less) associations from showing up
        valid_from_at < valid_to_at
)

select * from final

At the end of the above step, we have a table that looks very much like the erp_components that it used as the source, but with a few additional valuable columns:

  • top_assembly_id - This is the most important output of the hierarchy traversal. It ties all subcomponents to their common parent. We'll use this in the next step to chop up the hierarchy into all the distinct ranges of time where the components that share a common top assembly are constant (with each distinct range of time and top_assembly_id getting its own surrogate key).
  • component_hierarchy_depth - Indicates how far removed a component is from the top assembly.
  • component_trace - Contains an array of all the components linking this component to the top assembly.
  • valid_from_at/valid_to_at - If you have really high-quality source data, these will be identical to installed_at/removed_at. However, in the real world, we've found cases where the installation and removal dates are not consistent between parent and child, either due to a data entry error or a technician forgetting to note when a component was removed. For example, we may have a parent assembly that was removed along with all of its children, but only the parent assembly has removed_at populated. The valid_from_at and valid_to_at columns tidy up these kinds of scenarios.

Temporal range join

The last step is to perform a temporal range join between the top assembly and all of its descendants. This is what splits the time-varying component changes into distinct ranges of time during which the component hierarchy is constant. The range join makes use of the dbt macro in this gist; its operation is out of scope for this article, but you are encouraged to investigate it along with the discourse post mentioned earlier.

-- Start with all of the assemblies at the top (hierarchy depth = 0)
with l0_assemblies as (
    select
        top_assembly_id,
        component_id,
        -- Prep fields required for temporal range join
        {{ dbt_utils.surrogate_key(['component_id', 'valid_from_at']) }} as dbt_scd_id,
        valid_from_at as dbt_valid_from,
        valid_to_at as dbt_valid_to
    from
        component_traversal
    where
        component_hierarchy_depth = 0
),

components as (
    select
        top_assembly_id,
        component_hierarchy_depth,
        component_trace,
        assembly_id,
        component_id,
        installed_at,
        removed_at,
        -- Prep fields required for temporal range join
        {{ dbt_utils.surrogate_key(['component_trace', 'valid_from_at']) }} as dbt_scd_id,
        valid_from_at as dbt_valid_from,
        valid_to_at as dbt_valid_to
    from
        component_traversal
),

-- Perform temporal range join
{{
    trange_join(
        left_model='l0_assemblies',
        left_fields=[
            'top_assembly_id',
        ],
        left_primary_key='top_assembly_id',
        right_models={
            'components': {
                'fields': [
                    'component_hierarchy_depth',
                    'component_trace',
                    'assembly_id',
                    'component_id',
                    'installed_at',
                    'removed_at',
                ],
                'left_on': 'component_id',
                'right_on': 'top_assembly_id',
            }
        }
    )
}}

select
    surrogate_key,
    top_assembly_id,
    component_hierarchy_depth,
    component_trace,
    assembly_id,
    component_id,
    installed_at,
    removed_at,
    valid_from_at,
    valid_to_at
from
    trange_final
order by
    top_assembly_id,
    valid_from_at,
    component_hierarchy_depth
Bonus: component swap

Before we go, let's investigate one other interesting scenario. Suppose we have two bikes, "Bike-1" and "Bike-2". While performing service, a technician notices that the color on the rim of "Bike-2" matches with the frame of "Bike-1" and vice-versa. Perhaps there was a mistake made during the initial assembly process? The technician decides to swap the wheels between the two bikes. The ERP system then shows that "Wheel-1" was removed from "Bike-1" on the service date and that "Wheel-1" was installed in "Bike-2" on the same date (similarly for "Wheel-2"). To reduce clutter below, we'll ignore Frames and Tubes.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
|             | Bike-1       | 2023-01-01   |            |
| Bike-1      | Wheel-1      | 2023-01-01   | 2023-06-01 |
| Wheel-1     | Rim-1        | 2023-01-01   |            |
| Wheel-1     | Tire-1       | 2023-01-01   |            |
|             | Bike-2       | 2023-02-01   |            |
| Bike-2      | Wheel-2      | 2023-02-01   | 2023-06-01 |
| Wheel-2     | Rim-2        | 2023-02-01   |            |
| Wheel-2     | Tire-2       | 2023-02-01   |            |
| Bike-2      | Wheel-1      | 2023-06-01   |            |
| Bike-1      | Wheel-2      | 2023-06-01   |            |

When this ERP data gets converted into the multivalued dimension, we get the table below. In the ERP data, only one kind of component assembly, the wheel, was removed/installed, but in the dimensional model all of the child components come along for the ride. In the table below, we see that "Bike-1" and "Bike-2" each have two distinct ranges of valid time, one prior to the wheel swap, and one after.

mdim_components:

| component_sk | top_assembly_id | assembly_id | component_id | valid_from_at | valid_to_at |
|--------------|-----------------|-------------|--------------|---------------|-------------|
| sk1          | Bike-1          |             | Bike-1       | 2023-01-01    | 2023-06-01  |
| sk1          | Bike-1          | Bike-1      | Wheel-1      | 2023-01-01    | 2023-06-01  |
| sk1          | Bike-1          | Wheel-1     | Rim-1        | 2023-01-01    | 2023-06-01  |
| sk1          | Bike-1          | Wheel-1     | Tire-1       | 2023-01-01    | 2023-06-01  |
| sk2          | Bike-1          |             | Bike-1       | 2023-06-01    |             |
| sk2          | Bike-1          | Bike-1      | Wheel-2      | 2023-06-01    |             |
| sk2          | Bike-1          | Wheel-2     | Rim-2        | 2023-06-01    |             |
| sk2          | Bike-1          | Wheel-2     | Tire-2       | 2023-06-01    |             |
| sk3          | Bike-2          |             | Bike-2       | 2023-02-01    | 2023-06-01  |
| sk3          | Bike-2          | Bike-2      | Wheel-2      | 2023-02-01    | 2023-06-01  |
| sk3          | Bike-2          | Wheel-2     | Rim-2        | 2023-02-01    | 2023-06-01  |
| sk3          | Bike-2          | Wheel-2     | Tire-2       | 2023-02-01    | 2023-06-01  |
| sk4          | Bike-2          |             | Bike-2       | 2023-06-01    |             |
| sk4          | Bike-2          | Bike-2      | Wheel-1      | 2023-06-01    |             |
| sk4          | Bike-2          | Wheel-1     | Rim-1        | 2023-06-01    |             |
| sk4          | Bike-2          | Wheel-1     | Tire-1       | 2023-06-01    |             |

Summary

In this article, we've explored a strategy for creating a dimensional model for ragged time-varying hierarchies. We used a simple toy system involving one or two eBikes. In the real world, there would be many more individual products, deeper hierarchies, more component attributes, and the install/removal dates would likely be captured with a timestamp component as well. The model described here works very well even in these messier real world cases.

If you have any questions or comments, please reach out to me by commenting on this post or contacting me on dbt slack (@Sterling Paramore).

· 12 min read
Sung Won Chung
Kira Furuichi

I, Sung, entered the data industry by chance in Fall 2014. I was using this thing called audit command language (ACL) to automate debits equal credits for accounting analytics (yes, it’s as tedious as it sounds). I remember working my butt off in a hotel room in Des Moines, Iowa where the most interesting thing there was a Panda Express. It was late in the AM. I’m thinking about 2 am. And I took a step back and thought to myself, “Why am I working so hard for something that I just don’t care about with tools that hurt more than help?”

· 5 min read
Callum McCann

Hello, my dear data people.

If you haven’t read Nick & Roxi’s blog post about what’s coming in the future of the dbt Semantic Layer, I highly recommend you read through that, as it gives helpful context around what the future holds.

With that said, it has come time for us to bid adieu to our beloved dbt_metrics package. Upon the release of dbt-core v1.6 in late July, we will be deprecating support for the dbt_metrics package.

· 12 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Alteryx is a visual data transformation platform with a user-friendly interface and drag-and-drop tools. Nonetheless, Alteryx can struggle to cope with the increasing complexity of an organization’s data pipelines, and it can become a suboptimal tool when companies start dealing with large and complex data transformations. In such cases, moving to dbt can be a natural step, since dbt is designed to manage complex data transformation pipelines in a scalable, efficient, and more explicit manner. In our client's case, this transition also involved migrating from an on-premises SQL Server to Snowflake. In this article, we describe the differences between Alteryx and dbt, and how we reduced a client's 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake at Indicium Tech.

· 20 min read
Jonathan Neo

Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT), to name a few.

Data modeling techniques on a normalization vs denormalization scale

While the relevance of dimensional modeling has been debated by data practitioners, it is still one of the most widely adopted data modeling techniques for analytics.

Despite its popularity, resources on how to create dimensional models using dbt remain scarce and lack detail. This tutorial aims to solve this by providing the definitive guide to dimensional modeling with dbt.

By the end of this tutorial, you will:

  • Understand dimensional modeling concepts
  • Set up a mock dbt project and database
  • Identify the business process to model
  • Identify the fact and dimension tables
  • Create the dimension tables
  • Create the fact table
  • Document the dimensional model relationships
  • Consume the dimensional model

· 12 min read
João Antunes
Yannick Misteli
Sean McIntyre

Teams thrive when each team member is provided with the tools that best complement and enhance their skills. You wouldn’t hand Cristiano Ronaldo a tennis racket and expect a perfect serve! At Roche, getting the right tools in the hands of our teammates was critical to our ability to grow our data team from 10 core engineers to over 100 contributors in just two years. We embraced both dbt Core and dbt Cloud at Roche (a dbt-squared solution, if you will!) to quickly scale our data platform.

· 7 min read
Benoit Perigaud

Editor's note: this post assumes intermediate knowledge of Jinja and macro development in dbt. For an introduction to Jinja in dbt, check out the documentation and the free self-serve course on Jinja, Macros, and Packages.

Jinja brings a lot of power to dbt, allowing us to use ref(), source(), conditional code, and macros. But while Jinja brings flexibility, it also brings complexity, and as so often with code, things can run in unexpected ways.

The debug() macro in dbt is a great tool to have in the toolkit for someone writing a lot of Jinja code, but it might be difficult to understand how to use it and what benefits it brings.

Let’s dive into the last time I used debug() and how it helped me solve bugs in my code.
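
As a quick illustration, a debug() call can be placed right where a macro misbehaves so you can inspect variables interactively. The macro below and its column logic are invented for this sketch, and dropping into the debugger also assumes macro debugging is enabled in your environment (e.g. via the DBT_MACRO_DEBUGGING environment variable):

{% macro get_column_names(model_name) %}
    {% set columns = adapter.get_columns_in_relation(ref(model_name)) %}
    {# Pause here and inspect `columns` interactively in the debugger #}
    {{ debug() }}
    {{ return(columns | map(attribute='name') | list) }}
{% endmacro %}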

· 15 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Auditing tables is a major part of analytics engineers’ daily tasks, especially when refactoring tables that were built using SQL Stored Procedures or Alteryx Workflows. In this article, we present how the audit_helper package can (as the name suggests) help the table auditing process to make sure a refactored model provides (pretty much) the same output as the original one, based on our experience using this package to support our clients at Indicium Tech®.

· 9 min read
Callie White
Jade Milaney

The new dbt Certification Program has been created by dbt Labs to codify the data development best practices that enable safe, confident, and impactful use of dbt. Taking the Certification allows dbt users to get recognized for the skills they’ve honed, and stand out to organizations seeking dbt expertise.

Over the last few months, Montreal Analytics, a full-stack data consultancy servicing organizations across North America, has had over 25 dbt Analytics Engineers become certified, earning them the 2022 dbt Platinum Certification award.

In this article, two Montreal Analytics consultants, Jade and Callie, discuss their experience in taking, and passing, the dbt Certification exam to help guide others looking to study for, and pass the exam.

· 8 min read
Noah Kennedy

Testing the quality of data in your warehouse is an important aspect of any mature data pipeline. One of the biggest blockers to developing a successful data quality pipeline is aggregating test failures and successes in an informational and actionable way. However, ensuring actionability can be challenging. If ignored, test failures can clog up a pipeline and create unactionable noise, rendering your testing infrastructure ineffective.
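
As one building block for this (a sketch, not the post's full approach), dbt can persist the rows that fail a test to the warehouse via the store_failures config, which gives you something queryable to aggregate and alert on. The fct_orders model and order_amount column are made-up names for illustration:

-- tests/assert_no_negative_order_amounts.sql
{{ config(store_failures = true, severity = 'warn') }}

select *
from {{ ref('fct_orders') }}
where order_amount < 0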

· 7 min read
Emily Riederer

Imagine you were responsible for monitoring the safety of a subway system. Where would you begin? Most likely, you'd start by thinking about the key risks like collision or derailment, contemplate what causal factors like scheduling software and track conditions might contribute to bad outcomes, and institute processes and metrics to detect if those situations arose. What you wouldn't do is blindly apply irrelevant industry standards like testing for problems with the landing gear (great for planes, irrelevant for trains) or obsessively worry about low probability events like accidental teleportation before you'd locked down the fundamentals. 

When thinking about real-world scenarios, we're naturally inclined to think about key risks and mechanistic causes. However, in the more abstract world of data, many of our data tests often gravitate towards one of two extremes: applying rote out-of-the-box tests (nulls, PK-FK relationships, etc.) from the world of traditional database management or playing with exciting new toys that promise to catch our wildest errors with anomaly detection and artificial intelligence. 

Between these two extremes lies a gap where human intelligence goes. Analytics engineers can create more effective tests by embedding their understanding of how the data was created, and especially how this data can go awry (a topic I've written about previously). While such expressive tests will be unique to our domain, modest tweaks to our mindset can help us implement them with our standard tools. This post demonstrates how the simple act of conducting tests by group can expand the universe of possible tests, boost the sensitivity of the existing suite, and help keep our data "on track". This feature is now available in dbt-utils.
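
To make the "test by group" idea concrete, here is a minimal sketch of a grouped check written as a plain SQL test: instead of asserting freshness for the table overall, it asserts it within each group and returns the offending groups (dbt treats returned rows as failures). The fct_trips model, station_id, and loaded_at columns are invented for illustration; see the dbt-utils documentation for the out-of-the-box grouped versions the post refers to:

-- tests/assert_recent_data_per_station.sql
-- The overall table can look fresh while individual stations silently stop reporting,
-- so we check recency within each group rather than across the whole table
select
    station_id,
    max(loaded_at) as latest_load
from {{ ref('fct_trips') }}
group by 1
having max(loaded_at) < dateadd('day', -1, current_date)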

· 9 min read
Samuel Harting

In seventh grade, I decided it was time to pick a realistic career to work toward, and since I had an accountant in my life who I really looked up to, that is what I chose. Around ten years later, I finished my accounting degree with a minor in business information systems (a fancy way of saying I coded in C# for four or five classes). I passed my CPA exams quickly and became a CPA as soon as I hit the two-year experience requirement. I spent my first few years at a small firm completing tax returns but I didn't feel like I was learning enough, so I went to a larger firm right before the pandemic started. The factors that brought me to the point of changing industries are numerous, but I’ll try to keep it concise: the tax industry relies on underpaying its workers to maintain margins and prevent itself from being top-heavy, my future work as a manager was unappealing to me, and my work was headed in a direction I wasn’t excited about.