Test for Tags in Dagster

How to enforce a style in your organization.

Dagster assets can be labelled with owners, tags, kinds, and metadata. This is very useful, but if you rely on everyone applying a particular style on every merge request, something will slip through eventually. Instead, you can use pytest to enforce the style. Testing for components: you can find the file in the GitHub repo linked below, but here it is in steps. First `import pytest` and `from dagster import AssetSpec, AssetsDefinition`, then import all the assets from your project as one list. [Read More]

Dagster: all the Ways you can Differentiate Assets

tags, kinds, metadata, and more

When you have more than 10 assets in Dagster, you might want to be able to identify them quickly. Here are all the ways (I think) you can differentiate assets. Naming convention: this might not be for everyone, but with a strong naming convention you can easily identify an asset. You could use a schema like `<type>_<source>__<additional_context>` (the dbt docs have excellent naming suggestions). Prefixes: it is possible to add a prefix to an asset (I don’t really like this), which makes the key look like `groupingsname/assetname`. Groups: you can group multiple related assets together. [Read More]

How I Set Up Dagster in a Company

In the past few months I set up a Dagster deployment within a Kubernetes cluster. This was a lot of fun because I learned a lot, and I’d like to document some of the things we did so I’ll remember them later. Dagster is a scheduler, orchestrator, or workflow manager (I’ve seen all those words, but I’m not sure what the differences are). When you need to get data from one place to another, run complex data operations in a certain order, or have many, many tasks to run, you might want to use such a thing. [Read More]

Should I Move to a Database?

Long ago at a real-life meetup (remember those?), I received a t-shirt which said: “biggeR than R”. I think it was by Microsoft, who develop a special version of R with automatic parallel work. Anyway, I was thinking about the bigness (is that a word? it is now!) of your data. Is your data becoming too big? Your dataset becomes so big and unwieldy that operations take a long time. [Read More]

Walkthrough UbiOps and Tidymodels

From python cookbook to R {recipes}

In this walkthrough I modified a tutorial from the UbiOps cookbook, ‘Python Scikit learn and UbiOps’, but I replaced everything Python with R. So instead of scikit-learn I’m using {tidymodels}, and where Python uses a requirements.txt, I will use {renv}. So in a way I’m going from a Python cookbook to {recipes} in R! Components of the pipeline: the original cookbook (and my rewrite too) has three components: [Read More]

Some Thoughts About dbt for Data Engineering

Over the last week I have experimented with dbt (data build tool), a command-line tool created by Fishtown Analytics. I’m hardly the first to write or talk about it (see all the references at the bottom of this piece), but I just want to record my thoughts at this point in time. What is it? Imagine the following situation: you have a data warehouse where all your data lives. You, as a data engineer, support tens to hundreds of analysts who build dashboards and reports on top of that source data. [Read More]