Test for Tags in Dagster

How to enforce a style in your organization.

Dagster assets can be labelled with owners, tags, kinds, and metadata. This is very useful, but if you rely on everyone applying a particular style on every merge request, something will slip through eventually. Instead, you can use pytest to enforce the style. Testing for components: you can find the file in the GitHub repo linked below, but here it is in steps. First `import pytest` and `from dagster import AssetSpec, AssetsDefinition`, then import all the assets from your project as one list. [Read More]

Dagster: all the Ways you can Differentiate Assets

tags, kinds, metadata, and more

When you have more than 10 assets in Dagster, you might want to be able to identify them quickly. Here are all the ways (I think) you can differentiate assets. Naming convention: this might not be for everyone, but with a strong naming convention you can easily identify an asset. You could use a schema like `<type>_<source>__<additional_context>` (the dbt docs have excellent naming suggestions). Prefixes: it is possible to add a prefix to an asset (I don’t really like this), which makes the key look like `groupingsname/assetname`. Groups: you can group multiple related assets together. [Read More]

How I Set Up Dagster in a Company

In the past few months I set up a Dagster deployment within a Kubernetes cluster. This was a lot of fun because I learned a lot, and I’d like to document some of the things we did so I’ll remember them later. Dagster is a scheduler, orchestrator, or workflow manager (I’ve seen all those words, but I’m not sure what the differences are). When you need to get data from one place to another, run complex data operations in a certain order, or have many, many tasks to run, you might want to use such a thing. [Read More]

Should I Move to a Database?

Long ago at a real-life meetup (remember those?), I received a t-shirt which said: “biggeR than R”. I think it was by Microsoft, who develop a special version of R with automatic parallel work. Anyway, I was thinking about the bigness (is that a word? it is now!) of your data. Is your data becoming too big? Your dataset becomes so big and unwieldy that operations take a long time. [Read More]

Walkthrough UbiOps and Tidymodels

From python cookbook to R {recipes}

In this walkthrough I modified a tutorial from the UbiOps cookbook, ‘Python Scikit learn and UbiOps’, but I replaced everything Python with R. So instead of scikit-learn I’m using {tidymodels}, and where Python uses a requirements.txt, I will use {renv}. So in a way I’m going from a Python cookbook to {recipes} in R! Components of the pipeline: the original cookbook (and my rewrite too) has three components: [Read More]

Some Thoughts About dbt for Data Engineering

Over the last week I have experimented with dbt (data build tool), a command-line tool created by Fishtown Analytics. I’m hardly the first to write or talk about it (see all the references at the bottom of this piece), but I just want to record my thoughts at this point in time. What is it? Imagine the following situation: you have a data warehouse where all your data lives. You, as a data engineer, support tens to hundreds of analysts who build dashboards and reports on top of that source data. [Read More]