
Developer Blog | dbt Developer Hub

Find tutorials, product updates, and developer insights in the dbt Developer Blog.

Start here

· 5 min read
Joel Labes
The Bottom Line:

You should split your Jobs across Environments in dbt Cloud based on their purposes (e.g. Production and Staging/CI) and set one environment as Production. This will improve your CI experience and enable you to use dbt Explorer.

Environmental segmentation has always been an important part of the analytics engineering workflow:

  • When developing new models, you can process a smaller subset of your data by using target.name or an environment variable (see the sketch after this list).
  • By building your production-grade models into a different schema and database, you can experiment in peace without being worried that your changes will accidentally impact downstream users.
  • Using dedicated credentials for production runs, instead of an analytics engineer's individual dev credentials, ensures that things don't break when that long-tenured employee finally hangs up their IDE.
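
As a quick illustration of the first point, here is a minimal sketch of using target.name to limit the data scanned during development. The jaffle_shop source, orders table, and "dev" target name are assumptions for the example, not part of the original post:

select *
from {{ source('jaffle_shop', 'orders') }}
{% if target.name == 'dev' %}
-- In development, only process the most recent few days to keep runs fast
where order_date >= dateadd('day', -3, current_date)
{% endif %}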

Historically, dbt Cloud required a separate environment for Development, but was otherwise unopinionated in how you configured your account. This mostly just worked – as long as you didn't have anything more complex than a CI job mixed in with a couple of production jobs – because important constructs like deferral in CI and documentation were only ever tied to a single job.

But as companies' dbt deployments have grown more complex, it doesn't make sense to assume that a single job is enough anymore. We need to exchange a job-oriented strategy for a more mature and scalable environment-centric view of the world. To support this, a recent change in dbt Cloud enables project administrators to mark one of their environments as the Production environment, just as has long been possible for the Development environment.

Explicitly separating your Production workloads lets dbt Cloud be smarter with the metadata it creates, and is particularly important for two new features: dbt Explorer and the revised CI workflows.

· 6 min read
Kshitij Aranke
Doug Beatty

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs. One of the coolest moments of my career here thus far has been shipping the new dbt clone command as part of the dbt-core v1.6 release.

However, one of the questions I’ve received most frequently is guidance around “when” to clone that goes beyond the documentation on “how” to clone. In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

  1. What is dbt clone?
  2. How is it different from deferral?
  3. Should I defer or should I clone?

· 11 min read
Amy Chen

Introduction

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. An enterprise customer I was working with, JetBlue, asked me for help running their dbt models every 2 minutes to meet a 5-minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created lambda views to meet the need.

Flash forward to 2023. I’m writing this as my giant dog snores next to me (don’t worry, the cats have multiplied as well). JetBlue has outgrown lambda views due to performance constraints (a view can only be so performant) and we are at another milestone in dbt’s journey to support streaming. What. a. time.

Today we are announcing that we now support Materialized Views in dbt. So, what does that mean?
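
As a rough sketch of what this can look like in a model file (the model and column names are illustrative, and exact materialization support depends on your adapter and dbt version), a model opts in through its config block:

{{ config(materialized='materialized_view') }}

select
    order_id,
    sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by 1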

· 8 min read
Pedro Brito de Sa

Whether you are building your pipelines in dbt for the first time or just adding a new model once in a while, good documentation and testing should always be a priority for you and your team. Why do we avoid it like the plague, then? Because it’s a hassle to write down each individual field, describe it in layman’s terms, and figure out what tests should be performed to ensure the data is fine and dandy. How can we make this process faster and less painful?

By now, everyone knows the wonders of the GPT models for code generation and pair programming, so this shouldn’t come as a surprise. But ChatGPT really shines at inferring the context of verbosely named fields from database table schemas. So in this post I am going to help you 10x your documentation and testing speed by using ChatGPT to do most of the legwork for you.
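
For a sense of the output we're after, this is the kind of model .yml you might ask ChatGPT to draft from a table schema. The model and column names here are hypothetical, and any generated descriptions and tests should always be reviewed by a human:

version: 2

models:
  - name: stg_payments
    description: One record per payment captured by the payment gateway.
    columns:
      - name: payment_id
        description: Unique identifier of the payment.
        tests:
          - unique
          - not_null
      - name: payment_method
        description: Method used to pay, e.g. credit_card or bank_transfer.
        tests:
          - accepted_values:
              values: ['credit_card', 'bank_transfer', 'coupon', 'gift_card']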

· 15 min read
Rastislav Zdechovan
Sean McIntyre

Data Vault 2.0 is a data modeling technique designed to help scale large data warehousing projects. It is a rigid, prescriptive system detailed vigorously in a book that has become the bible for this technique.

So why Data Vault? Have you experienced a data warehousing project with 50+ data sources, with 25+ data developers working on the same data platform, or data spanning 5+ years with two or more generations of source systems? If not, it might be hard to initially understand the benefits of Data Vault, and maybe Kimball modelling is better for you. But if you are in any of the situations listed, then this is the article for you!

· 14 min read
Santiago Jauregui

Introduction

Most data modeling approaches for customer segmentation are based on a wide table with user attributes. This table only stores the current attributes for each user, and is then loaded into the various SaaS platforms via Reverse ETL tools.

Take for example a Customer Experience (CX) team that uses Salesforce as a CRM. Users create tickets to ask for assistance, and the CX team attends to them in the order they are created. This is a good first approach, but not a data-driven one.

An improvement to this would be to prioritize the tickets based on the customer segment, answering our most valuable customers first. An Analytics Engineer can build a segmentation to identify the power users (for example with an RFM approach) and store it in the data warehouse. The Data Engineering team can then export that user attribute to the CRM, allowing the customer experience team to build rules on top of it.
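
To make the RFM idea concrete, here is a hedged sketch of how an analytics engineer might derive simple recency/frequency/monetary scores in the warehouse before syncing them to the CRM. The table and column names (fct_orders, ordered_at, order_total) are invented for illustration:

with orders as (
    select * from {{ ref('fct_orders') }}
),

user_metrics as (
    select
        user_id,
        datediff('day', max(ordered_at), current_date) as days_since_last_order,  -- recency
        count(*) as order_count,                                                  -- frequency
        sum(order_total) as lifetime_value                                        -- monetary
    from orders
    group by 1
)

-- Bucket users into quartiles along each dimension
select
    user_id,
    ntile(4) over (order by days_since_last_order) as recency_score,
    ntile(4) over (order by order_count desc) as frequency_score,
    ntile(4) over (order by lifetime_value desc) as monetary_score
from user_metrics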

· 9 min read
Mikael Thorup

At Lunar, most of our dbt models are sourcing from event-driven architecture. As an example, we have the following models for our activity_based_interest folder in our ingestion layer:

  • activity_based_interest_activated.sql
  • activity_based_interest_deactivated.sql
  • activity_based_interest_updated.sql
  • downgrade_interest_level_for_user.sql
  • set_inactive_interest_rate_after_july_1st_in_bec_for_user.sql
  • set_inactive_interest_rate_from_july_1st_in_bec_for_user.sql
  • set_interest_levels_from_june_1st_in_bec_for_user.sql

This results in a lot of the same columns (e.g. account_id) existing in different models, across different layers. This means I end up:

  1. Writing/copy-pasting the same documentation over and over again
  2. Halfway through, realizing I could improve the wording to make it easier to understand, and going back to update the .yml files I’ve already written
  3. Realizing I made a syntax error in a .yml file, and going back to fix it
  4. Realizing the columns are defined differently, with different wording, in other folders of our dbt project
  5. Reconsidering my choice of career and praying that a large language model will steal my job
  6. Considering if there’s a better way to be generating documentation used across different models

· 18 min read
Sterling Paramore

This article covers an approach to handling time-varying ragged hierarchies in a dimensional model. These kinds of data structures are commonly found in manufacturing, where components of a product have both parents and children of arbitrary depth and those components may be replaced over the product's lifetime. The strategy described here simplifies many common types of analytical and reporting queries.

To help visualize this data, we're going to pretend we are a company that manufactures and rents out eBikes in a ride share application. When we build a bike, we keep track of the serial numbers of the components that make up the bike. Any time something breaks and needs to be replaced, we track the old parts that were removed and the new parts that were installed. We also precisely track the mileage accumulated on each of our bikes. Our primary analytical goal is to be able to report on the expected lifetime of each component, so we can prioritize improving that component and reduce costly maintenance.

Data model

Obviously, a real bike could have a hundred or more separate components. To keep things simple for this article, let's just consider the bike, the frame, a wheel, the wheel rim, tire, and tube. Our component hierarchy looks like:

eBike Hierarchy

This hierarchy is ragged because different paths through the hierarchy terminate at different depths. It is time-varying because specific components can be added and removed.

Now let's take a look at how this data is represented in our source data systems and how it can be transformed to make analytics queries easier.

Transactional model

Our ERP system (Enterprise Resource Planning) contains records that log when a specific component serial number (component_id) was installed in or removed from a parent assembly component (assembly_id). The top-most assembly component is the eBike itself, which has no parent assembly. So when an eBike (specifically, the eBike with serial number "Bike-1") is originally constructed, the ERP system would contain records that look like the following.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
|             | Bike-1       | 2023-01-01   |            |
| Bike-1      | Frame-1      | 2023-01-01   |            |
| Bike-1      | Wheel-1      | 2023-01-01   |            |
| Wheel-1     | Rim-1        | 2023-01-01   |            |
| Wheel-1     | Tire-1       | 2023-01-01   |            |
| Tire-1      | Tube-1       | 2023-01-01   |            |

Now let's suppose this bike has been ridden for a while, and on June 1, the user of the bike reported a flat tire. A service technician then went to the site, replaced the tube that was in the wheel, and installed a new one. They logged this in the ERP system, causing one record to be updated with a removed_at date, and another record to be created with the new tube component_id.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
| ...         | ...          | ...          | ...        |
| Tire-1      | Tube-1       | 2023-01-01   | 2023-06-01 |
| Tire-1      | Tube-2       | 2023-06-01   |            |
| ...         | ...          | ...          | ...        |

After a few more months, there is a small crash. Don't worry, everyone's OK! However, the wheel (Wheel-1) is totally broken and must be replaced (with Wheel-2). When the technician updates the ERP, the entire hierarchy under the replaced wheel is also updated, as shown below.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
| Bike-1      | Wheel-1      | 2023-01-01   | 2023-08-01 |
| Wheel-1     | Rim-1        | 2023-01-01   | 2023-08-01 |
| Wheel-1     | Tire-1       | 2023-01-01   | 2023-08-01 |
| Tire-1      | Tube-2       | 2023-06-01   | 2023-08-01 |
| Bike-1      | Wheel-2      | 2023-08-01   |            |
| Wheel-2     | Rim-2        | 2023-08-01   |            |
| Wheel-2     | Tire-2       | 2023-08-01   |            |
| Tire-2      | Tube-3       | 2023-08-01   |            |

After all of the above updates and additions, our ERP data looks like the following.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
|             | Bike-1       | 2023-01-01   |            |
| Bike-1      | Frame-1      | 2023-01-01   |            |
| Bike-1      | Wheel-1      | 2023-01-01   | 2023-08-01 |
| Wheel-1     | Rim-1        | 2023-01-01   | 2023-08-01 |
| Wheel-1     | Tire-1       | 2023-01-01   | 2023-08-01 |
| Tire-1      | Tube-1       | 2023-01-01   | 2023-06-01 |
| Tire-1      | Tube-2       | 2023-06-01   | 2023-08-01 |
| Bike-1      | Wheel-2      | 2023-08-01   |            |
| Wheel-2     | Rim-2        | 2023-08-01   |            |
| Wheel-2     | Tire-2       | 2023-08-01   |            |
| Tire-2      | Tube-3       | 2023-08-01   |            |

So that's all fine and good from the perspective of the ERP system. But this data structure can be difficult to work with if we want to generate reports that calculate the total mileage accumulated on various components, or the average mileage of a particular component type, or how one component type might affect the lifetime of another component.

Multivalued dimensional model

In dimensional modeling, we have fact tables that contain measurements and dimension tables that contain the context for those measurements (attributes). In our eBike data warehouse, we have a fact table that contains one record for each eBike for each day it is ridden and the measured mileage accumulated during rides that day. This fact table contains surrogate key columns, indicated by the _sk suffix. These are usually system-generated keys used to join to other tables in the database; the specific values of these keys are not important.

fct_daily_mileage:

| bike_sk | component_sk | ride_at    | miles |
|---------|--------------|------------|-------|
| bsk1    | csk1         | 2023-01-01 | 3     |
| bsk1    | csk1         | 2023-01-02 | 2     |
| bsk1    | csk1         | 2023-01-03 | 0     |
| bsk1    | csk1         | 2023-01-04 | 0     |
| ...     | ...          | ...        | ...   |
| bsk1    | csk3         | 2023-08-01 | 7     |
| bsk1    | csk3         | 2023-08-02 | 8     |
| bsk1    | csk3         | 2023-08-03 | 4     |

One of the dimension tables is a simple table containing information about the individual bikes we have manufactured.

dim_bikes:

| bike_sk | bike_id | color  | model_name   |
|---------|---------|--------|--------------|
| bsk1    | Bike-1  | Orange | Wyld Stallyn |

There is a simple many-to-one relationship between fct_daily_mileage and dim_bikes. If we need to calculate the total mileage accumulated for each bike in our entire fleet of eBikes, we just join the two tables and aggregate on the miles measurement.

select
    dim_bikes.bike_id,
    sum(fct_daily_mileage.miles) as miles
from
    fct_daily_mileage
inner join
    dim_bikes
on
    fct_daily_mileage.bike_sk = dim_bikes.bike_sk
group by
    1

Extending this to determine whether orange bikes get more use than red bikes, or whether certain models are preferred, is similarly straightforward.
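
For instance, a sketch of the color comparison, using only the columns shown in dim_bikes and fct_daily_mileage above, might look like:

select
    dim_bikes.color,
    sum(fct_daily_mileage.miles) as miles
from
    fct_daily_mileage
inner join
    dim_bikes
on
    fct_daily_mileage.bike_sk = dim_bikes.bike_sk
group by
    1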

Dealing with all of the components is more complicated because many components are installed on the same day. The relationship between the days when bikes are ridden and the components is thus multivalued. In dim_bikes, there is one record per bike and surrogate key. Our components dimension, however, will have multiple records sharing the same surrogate key and is therefore a multivalued dimension. To make things even more complicated, the components can change from day to day.

To construct the multivalued dimension table, we break the time-varying component hierarchy into distinct ranges of time during which all of the components in a particular bike remain constant. At the points in time where components change, a new surrogate key is created. The final dimension table for our example above looks like the following, where valid_from_at and valid_to_at represent the beginning and end of a range of time during which all the components of an eBike remain unchanged.

mdim_components:

| component_sk | assembly_id | component_id | depth | installed_at | removed_at | valid_from_at | valid_to_at |
|--------------|-------------|--------------|-------|--------------|------------|---------------|-------------|
| csk1         |             | Bike-1       | 0     | 2023-01-01   |            | 2023-01-01    | 2023-06-01  |
| csk1         | Bike-1      | Frame-1      | 1     | 2023-01-01   |            | 2023-01-01    | 2023-06-01  |
| csk1         | Bike-1      | Wheel-1      | 1     | 2023-01-01   | 2023-08-01 | 2023-01-01    | 2023-06-01  |
| csk1         | Wheel-1     | Rim-1        | 2     | 2023-01-01   | 2023-08-01 | 2023-01-01    | 2023-06-01  |
| csk1         | Wheel-1     | Tire-1       | 2     | 2023-01-01   | 2023-08-01 | 2023-01-01    | 2023-06-01  |
| csk1         | Tire-1      | Tube-1       | 3     | 2023-01-01   | 2023-06-01 | 2023-01-01    | 2023-06-01  |
| csk2         |             | Bike-1       | 0     | 2023-01-01   |            | 2023-06-01    | 2023-08-01  |
| csk2         | Bike-1      | Frame-1      | 1     | 2023-01-01   |            | 2023-06-01    | 2023-08-01  |
| csk2         | Bike-1      | Wheel-1      | 1     | 2023-01-01   | 2023-08-01 | 2023-06-01    | 2023-08-01  |
| csk2         | Wheel-1     | Rim-1        | 2     | 2023-01-01   | 2023-08-01 | 2023-06-01    | 2023-08-01  |
| csk2         | Wheel-1     | Tire-1       | 2     | 2023-01-01   | 2023-08-01 | 2023-06-01    | 2023-08-01  |
| csk2         | Tire-1      | Tube-2       | 3     | 2023-06-01   | 2023-08-01 | 2023-06-01    | 2023-08-01  |
| csk3         |             | Bike-1       | 0     | 2023-01-01   |            | 2023-08-01    |             |
| csk3         | Bike-1      | Frame-1      | 1     | 2023-01-01   |            | 2023-08-01    |             |
| csk3         | Bike-1      | Wheel-2      | 1     | 2023-08-01   |            | 2023-08-01    |             |
| csk3         | Wheel-2     | Rim-2        | 2     | 2023-08-01   |            | 2023-08-01    |             |
| csk3         | Wheel-2     | Tire-2       | 2     | 2023-08-01   |            | 2023-08-01    |             |
| csk3         | Tire-2      | Tube-3       | 3     | 2023-08-01   |            | 2023-08-01    |             |

Now, let's look at how this structure can help in writing queries. In a later section of this article, we'll examine the SQL code that can take our ERP table and convert it into this dimensional model.

Mileage for a component

Suppose we wanted to know the total mileage accumulated on "Wheel-1". The SQL code for determining this is very similar to that for determining the mileage for a given bike.

select
    mdim_components.component_id,
    sum(fct_daily_mileage.miles) as miles
from
    fct_daily_mileage
inner join
    mdim_components
on
    fct_daily_mileage.component_sk = mdim_components.component_sk
where
    mdim_components.component_id = 'Wheel-1'
group by
    1
Caution

When working with multivalued dimensions, you need to be careful interpreting aggregations. For example, suppose we chose to aggregate on top_assembly_id (to reduce clutter, this field is not shown in the data model above because it is just "Bike-1" for each record). This aggregation would over-count the total mileage on that top assembly because the join would produce a Cartesian product, giving us a "fan-out" situation.
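
One way to guard against this fan-out, shown here as a sketch rather than a definitive pattern, is to keep the join many-to-one by aggregating only the depth-0 record of each surrogate key (the component_hierarchy_depth column, abbreviated as depth in the table above):

select
    mdim_components.component_id as top_assembly_id,
    sum(fct_daily_mileage.miles) as miles
from
    fct_daily_mileage
inner join
    mdim_components
on
    fct_daily_mileage.component_sk = mdim_components.component_sk
where
    -- Only one record per surrogate key sits at depth 0 (the top assembly itself),
    -- so the join stays many-to-one and the miles are not over-counted
    mdim_components.component_hierarchy_depth = 0
group by
    1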

Bonus: Finding components installed at the same time as other components

This structure simplifies other kinds of interesting analysis. Suppose we wanted to start exploring how one component affects another, like whether certain brands of tube needed to be replaced more often if they were in a new brand of tire. We can do this by partitioning the data into the segments of time where the components are not changing and looking for other components installed at the same time. For example, to find all of the components that were ever installed at the same time "Tube-3" was installed, we can collect them with a simple window function. We could then use the results of this query in a regression or other type of statistical analysis.

select distinct
    component_id
from
    mdim_components
qualify
    sum(iff(component_id = 'Tube-3', 1, 0)) over (partition by valid_from_at, valid_to_at) > 0

SQL code to build the dimensional model

Now we get to the fun part! This section shows how to take the ERP source data and turn it into the multivalued dimensional model. This SQL code was written and tested using Snowflake, but should be adaptable to other dialects.

Traversing the hierarchy

The first step will be to traverse the hierarchy of components to find all components that belong to the same top assembly. In our example above, we only had one bike and thus just one top assembly; in a real system, there will be many (and we may even swap components between different top assemblies!).

The key here is to use a recursive join to move from the top of the hierarchy down to all children and grandchildren. Records at the top of the hierarchy are easy to identify because they are the only ones without a parent.

with recursive
-- Contains our source data with records that link a child to a parent
components as (
    select
        *,
        -- Valid dates start as installed/removed, but may be modified as we traverse the hierarchy below
        installed_at as valid_from_at,
        removed_at as valid_to_at
    from
        erp_components
),

-- Get all the source records that are at the top of the hierarchy
top_assemblies as (
    select * from components where assembly_id is null
),

-- This is where the recursion happens that traverses the hierarchy
traversal as (
    -- Start at the top of the hierarchy
    select
        -- Keep track of the depth as we traverse down
        0 as component_hierarchy_depth,
        -- Flag to determine if we've entered a circular relationship
        false as is_circular,
        -- Define an array that will keep track of all of the ancestors of a component
        [component_id] as component_trace,
        -- At the top of the hierarchy, the component is the top assembly
        component_id as top_assembly_id,

        assembly_id,
        component_id,

        installed_at,
        removed_at,
        valid_from_at,
        valid_to_at
    from
        top_assemblies

    union all

    -- Join the current layer of the hierarchy with the next layer down by linking
    -- the current component id to the assembly id of the child
    select
        traversal.component_hierarchy_depth + 1 as component_hierarchy_depth,
        -- Check for any circular dependencies
        array_contains(components.component_id::variant, traversal.component_trace) as is_circular,
        -- Append to the trace array
        array_append(traversal.component_trace, components.component_id) as component_trace,
        -- Keep track of the top of the assembly
        traversal.top_assembly_id,

        components.assembly_id,
        components.component_id,

        components.installed_at,
        components.removed_at,
        -- As we recurse down the hierarchy, we only want to consider time ranges where both
        -- parent and child are installed; so choose the latest "from" timestamp and the earliest "to".
        greatest(traversal.valid_from_at, components.valid_from_at) as valid_from_at,
        least(traversal.valid_to_at, components.valid_to_at) as valid_to_at
    from
        traversal
    inner join
        components
    on
        traversal.component_id = components.assembly_id
        and
        -- Exclude component assemblies that weren't installed at the same time
        -- This may happen due to source data quality issues
        (
            traversal.valid_from_at < components.valid_to_at
            and
            traversal.valid_to_at >= components.valid_from_at
        )
    where
        -- Stop if a circular hierarchy is detected
        not array_contains(components.component_id::variant, traversal.component_trace)
        -- There can be some bad data that might end up in hierarchies that are artificially extremely deep
        and traversal.component_hierarchy_depth < 20
),

final as (
    -- Note that there may be duplicates at this point (thus "distinct").
    -- Duplicates can happen when a component's parent is moved from one grandparent to another.
    -- At this point, we have only traced the ancestry of a component and adjusted the valid from/to dates
    -- so that all child ranges are contained in parent ranges.
    select distinct *
    from
        traversal
    where
        -- Prevent zero-time (or less) associations from showing up
        valid_from_at < valid_to_at
)

select * from final

At the end of the above step, we have a table that looks very much like the erp_components that it used as the source, but with a few additional valuable columns:

  • top_assembly_id - This is the most important output of the hierarchy traversal. It ties all subcomponents to their common parent. We'll use this in the next step to chop up the hierarchy into all the distinct ranges of time where the components that share a common top assembly are constant (with each distinct range of time and top_assembly_id getting its own surrogate key).
  • component_hierarchy_depth - Indicates how far removed a component is from the top assembly.
  • component_trace - Contains an array of all the components linking this component to the top assembly.
  • valid_from_at/valid_to_at - If you have really high-quality source data, these will be identical to installed_at/removed_at. However, in the real world, we've found cases where the installation and removal dates are not consistent between parent and child, either due to a data entry error or a technician forgetting to note when a component was removed. For example, we may have a parent assembly that was removed along with all of its children, but only the parent assembly has removed_at populated. The valid_from_at and valid_to_at columns tidy up these kinds of scenarios.

Temporal range join

The last step is to perform a temporal range join between the top assembly and all of its descendants. This is what splits the time-varying component changes into distinct ranges of time during which the component hierarchy is constant. The range join makes use of the dbt macro in this gist; its operation is out of scope for this article, but you are encouraged to investigate it along with the discourse post mentioned earlier.

-- Start with all of the assemblies at the top (hierarchy depth = 0)
with l0_assemblies as (
    select
        top_assembly_id,
        component_id,
        -- Prep fields required for temporal range join
        {{ dbt_utils.surrogate_key(['component_id', 'valid_from_at']) }} as dbt_scd_id,
        valid_from_at as dbt_valid_from,
        valid_to_at as dbt_valid_to
    from
        component_traversal
    where
        component_hierarchy_depth = 0
),

components as (
    select
        top_assembly_id,
        component_hierarchy_depth,
        component_trace,
        assembly_id,
        component_id,
        installed_at,
        removed_at,
        -- Prep fields required for temporal range join
        {{ dbt_utils.surrogate_key(['component_trace', 'valid_from_at']) }} as dbt_scd_id,
        valid_from_at as dbt_valid_from,
        valid_to_at as dbt_valid_to
    from
        component_traversal
),

-- Perform temporal range join
{{
    trange_join(
        left_model='l0_assemblies',
        left_fields=[
            'top_assembly_id',
        ],
        left_primary_key='top_assembly_id',
        right_models={
            'components': {
                'fields': [
                    'component_hierarchy_depth',
                    'component_trace',
                    'assembly_id',
                    'component_id',
                    'installed_at',
                    'removed_at',
                ],
                'left_on': 'component_id',
                'right_on': 'top_assembly_id',
            }
        }
    )
}}

select
    surrogate_key,
    top_assembly_id,
    component_hierarchy_depth,
    component_trace,
    assembly_id,
    component_id,
    installed_at,
    removed_at,
    valid_from_at,
    valid_to_at
from
    trange_final
order by
    top_assembly_id,
    valid_from_at,
    component_hierarchy_depth
Bonus: component swap

Before we go, let's investigate one other interesting scenario. Suppose we have two bikes, "Bike-1" and "Bike-2". While performing service, a technician notices that the color on the rim of "Bike-2" matches with the frame of "Bike-1" and vice-versa. Perhaps there was a mistake made during the initial assembly process? The technician decides to swap the wheels between the two bikes. The ERP system then shows that "Wheel-1" was removed from "Bike-1" on the service date and that "Wheel-1" was installed in "Bike-2" on the same date (similarly for "Wheel-2"). To reduce clutter below, we'll ignore Frames and Tubes.

erp_components:

| assembly_id | component_id | installed_at | removed_at |
|-------------|--------------|--------------|------------|
|             | Bike-1       | 2023-01-01   |            |
| Bike-1      | Wheel-1      | 2023-01-01   | 2023-06-01 |
| Wheel-1     | Rim-1        | 2023-01-01   |            |
| Wheel-1     | Tire-1       | 2023-01-01   |            |
|             | Bike-2       | 2023-02-01   |            |
| Bike-2      | Wheel-2      | 2023-02-01   | 2023-06-01 |
| Wheel-2     | Rim-2        | 2023-02-01   |            |
| Wheel-2     | Tire-2       | 2023-02-01   |            |
| Bike-2      | Wheel-1      | 2023-06-01   |            |
| Bike-1      | Wheel-2      | 2023-06-01   |            |

When this ERP data gets converted into the multivalued dimension, we get the table below. In the ERP data, only one kind of component assembly, the wheel, was removed/installed, but in the dimensional model all of the child components come along for the ride. In the table below, we see that "Bike-1" and "Bike-2" each have two distinct ranges of valid time, one prior to the wheel swap, and one after.

mdim_components:

| component_sk | top_assembly_id | assembly_id | component_id | valid_from_at | valid_to_at |
|--------------|-----------------|-------------|--------------|---------------|-------------|
| sk1          | Bike-1          |             | Bike-1       | 2023-01-01    | 2023-06-01  |
| sk1          | Bike-1          | Bike-1      | Wheel-1      | 2023-01-01    | 2023-06-01  |
| sk1          | Bike-1          | Wheel-1     | Rim-1        | 2023-01-01    | 2023-06-01  |
| sk1          | Bike-1          | Wheel-1     | Tire-1       | 2023-01-01    | 2023-06-01  |
| sk2          | Bike-1          |             | Bike-1       | 2023-06-01    |             |
| sk2          | Bike-1          | Bike-1      | Wheel-2      | 2023-06-01    |             |
| sk2          | Bike-1          | Wheel-2     | Rim-2        | 2023-06-01    |             |
| sk2          | Bike-1          | Wheel-2     | Tire-2       | 2023-06-01    |             |
| sk3          | Bike-2          |             | Bike-2       | 2023-02-01    | 2023-06-01  |
| sk3          | Bike-2          | Bike-2      | Wheel-2      | 2023-02-01    | 2023-06-01  |
| sk3          | Bike-2          | Wheel-2     | Rim-2        | 2023-02-01    | 2023-06-01  |
| sk3          | Bike-2          | Wheel-2     | Tire-2       | 2023-02-01    | 2023-06-01  |
| sk4          | Bike-2          |             | Bike-2       | 2023-06-01    |             |
| sk4          | Bike-2          | Bike-2      | Wheel-1      | 2023-06-01    |             |
| sk4          | Bike-2          | Wheel-1     | Rim-1        | 2023-06-01    |             |
| sk4          | Bike-2          | Wheel-1     | Tire-1       | 2023-06-01    |             |

Summary

In this article, we've explored a strategy for creating a dimensional model for ragged time-varying hierarchies. We used a simple toy system involving one or two eBikes. In the real world, there would be many more individual products, deeper hierarchies, more component attributes, and the install/removal dates would likely be captured with a timestamp component as well. The model described here works very well even in these messier real world cases.

If you have any questions or comments, please reach out to me by commenting on this post or contacting me on dbt slack (@Sterling Paramore).

· 12 min read
Sung Won Chung
Kira Furuichi

I, Sung, entered the data industry by chance in Fall 2014. I was using this thing called audit command language (ACL) to automate debits equal credits for accounting analytics (yes, it’s as tedious as it sounds). I remember working my butt off in a hotel room in Des Moines, Iowa where the most interesting thing there was a Panda Express. It was late in the AM. I’m thinking about 2 am. And I took a step back and thought to myself, “Why am I working so hard for something that I just don’t care about with tools that hurt more than help?”

· 5 min read
Callum McCann

Hello, my dear data people.

If you haven’t read Nick & Roxi’s blog post about what’s coming in the future of the dbt Semantic Layer, I highly recommend you read through that, as it gives helpful context around what the future holds.

With that said, it has come time for us to bid adieu to our beloved dbt_metrics package. Upon the release of dbt-core v1.6 in late July, we will be deprecating support for the dbt_metrics package.

· 12 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Alteryx is a visual data transformation platform with a user-friendly interface and drag-and-drop tools. Nonetheless, Alteryx can struggle to cope with the increasing complexity of an organization’s data pipelines, and it can become a suboptimal tool when companies start dealing with large and complex data transformations. In such cases, moving to dbt can be a natural step, since dbt is designed to manage complex data transformation pipelines in a scalable, efficient, and more explicit manner. In our client's case, this transition also involved migrating from an on-premises SQL Server to Snowflake. In this article, we describe the differences between Alteryx and dbt, and how we reduced a client's 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake at Indicium Tech.

· 20 min read
Jonathan Neo

Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT), to name a few.

Data modeling techniques on a normalization vs denormalization scale

While the relevance of dimensional modeling has been debated by data practitioners, it is still one of the most widely adopted data modeling techniques for analytics.

Despite its popularity, resources on how to create dimensional models using dbt remain scarce and lack detail. This tutorial aims to solve this by providing the definitive guide to dimensional modeling with dbt.

By the end of this tutorial, you will:

  • Understand dimensional modeling concepts
  • Set up a mock dbt project and database
  • Identify the business process to model
  • Identify the fact and dimension tables
  • Create the dimension tables
  • Create the fact table
  • Document the dimensional model relationships
  • Consume the dimensional model

· 12 min read
João Antunes
Yannick Misteli
Sean McIntyre

Teams thrive when each team member is provided with the tools that best complement and enhance their skills. You wouldn’t hand Cristiano Ronaldo a tennis racket and expect a perfect serve! At Roche, getting the right tools in the hands of our teammates was critical to our ability to grow our data team from 10 core engineers to over 100 contributors in just two years. We embraced both dbt Core and dbt Cloud at Roche (a dbt-squared solution, if you will!) to quickly scale our data platform.

· 7 min read
Benoit Perigaud

Editor's note: this post assumes intermediate knowledge of Jinja and macro development in dbt. For an introduction to Jinja in dbt, check out the documentation and the free self-serve course on Jinja, Macros, and Packages.

Jinja brings a lot of power to dbt, allowing us to use ref(), source(), conditional code, and macros. But while Jinja brings flexibility, it also brings complexity, and as so often with code, things can run in unexpected ways.

The debug() macro in dbt is a great tool to have in the toolkit for someone writing a lot of Jinja code, but it might be difficult to understand how to use it and what benefits it brings.

Let’s dive into the last time I used debug() and how it helped me solve bugs in my code.
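
As a quick illustration, a debug() call can be placed right where a macro misbehaves so you can inspect variables interactively. The macro below and its column logic are invented for this sketch, and dropping into the debugger also assumes macro debugging is enabled in your environment (e.g. via the DBT_MACRO_DEBUGGING environment variable):

{% macro get_column_names(model_name) %}
    {% set columns = adapter.get_columns_in_relation(ref(model_name)) %}
    {# Pause here and inspect `columns` interactively in the debugger #}
    {{ debug() }}
    {{ return(columns | map(attribute='name') | list) }}
{% endmacro %}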

· 15 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Auditing tables is a major part of analytics engineers’ daily tasks, especially when refactoring tables that were built using SQL Stored Procedures or Alteryx Workflows. In this article, we present how the audit_helper package can (as the name suggests) help the table auditing process to make sure a refactored model provides (pretty much) the same output as the original one, based on our experience using this package to support our clients at Indicium Tech®.

· 9 min read
Callie White
Jade Milaney

The new dbt Certification Program has been created by dbt Labs to codify the data development best practices that enable safe, confident, and impactful use of dbt. Taking the Certification allows dbt users to get recognized for the skills they’ve honed, and stand out to organizations seeking dbt expertise.

Over the last few months, Montreal Analytics, a full-stack data consultancy servicing organizations across North America, has had over 25 dbt Analytics Engineers become certified, earning them the 2022 dbt Platinum Certification award.

In this article, two Montreal Analytics consultants, Jade and Callie, discuss their experience in taking, and passing, the dbt Certification exam to help guide others looking to study for, and pass the exam.

· 8 min read
Noah Kennedy

Testing the quality of data in your warehouse is an important aspect of any mature data pipeline. One of the biggest blockers to developing a successful data quality pipeline is aggregating test failures and successes in an informational and actionable way. However, ensuring actionability can be challenging. If ignored, test failures can clog up a pipeline and create unactionable noise, rendering your testing infrastructure ineffective.
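
As one building block for this (a sketch, not the post's full approach), dbt can persist the rows that fail a test to the warehouse via the store_failures config, which gives you something queryable to aggregate and alert on. The fct_orders model and order_amount column are made-up names for illustration:

-- tests/assert_no_negative_order_amounts.sql
{{ config(store_failures = true, severity = 'warn') }}

select *
from {{ ref('fct_orders') }}
where order_amount < 0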

· 7 min read
Emily Riederer

Imagine you were responsible for monitoring the safety of a subway system. Where would you begin? Most likely, you'd start by thinking about the key risks like collision or derailment, contemplate what causal factors like scheduling software and track conditions might contribute to bad outcomes, and institute processes and metrics to detect if those situations arose. What you wouldn't do is blindly apply irrelevant industry standards like testing for problems with the landing gear (great for planes, irrelevant for trains) or obsessively worry about low probability events like accidental teleportation before you'd locked down the fundamentals. 

When thinking about real-world scenarios, we're naturally inclined to think about key risks and mechanistic causes. However, in the more abstract world of data, many of our data tests often gravitate towards one of two extremes: applying rote out-of-the-box tests (nulls, PK-FK relationships, etc.) from the world of traditional database management or playing with exciting new toys that promise to catch our wildest errors with anomaly detection and artificial intelligence. 

Between these two extremes lies a gap where human intelligence goes. Analytics engineers can create more effective tests by embedding their understanding of how the data was created, and especially how this data can go awry (a topic I've written about previously). While such expressive tests will be unique to our domain, modest tweaks to our mindset can help us implement them with our standard tools. This post demonstrates how the simple act of conducting tests by group can expand the universe of possible tests, boost the sensitivity of the existing suite, and help keep our data "on track". This feature is now available in dbt-utils.
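
To make the "test by group" idea concrete, here is a minimal sketch of a grouped check written as a plain SQL test: instead of asserting freshness for the table overall, it asserts it within each group and returns the offending groups (dbt treats returned rows as failures). The fct_trips model, station_id, and loaded_at columns are invented for illustration; see the dbt-utils documentation for the out-of-the-box grouped versions the post refers to:

-- tests/assert_recent_data_per_station.sql
-- The overall table can look fresh while individual stations silently stop reporting,
-- so we check recency within each group rather than across the whole table
select
    station_id,
    max(loaded_at) as latest_load
from {{ ref('fct_trips') }}
group by 1
having max(loaded_at) < dateadd('day', -1, current_date)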

· 9 min read
Samuel Harting

In seventh grade, I decided it was time to pick a realistic career to work toward, and since I had an accountant in my life who I really looked up to, that is what I chose. Around ten years later, I finished my accounting degree with a minor in business information systems (a fancy way of saying I coded in C# for four or five classes). I passed my CPA exams quickly and became a CPA as soon as I hit the two-year experience requirement. I spent my first few years at a small firm completing tax returns but I didn't feel like I was learning enough, so I went to a larger firm right before the pandemic started. The factors that brought me to the point of changing industries are numerous, but I’ll try to keep it concise: the tax industry relies on underpaying its workers to maintain margins and prevent itself from being top-heavy, my future work as a manager was unappealing to me, and my work was headed in a direction I wasn’t excited about.