Intermediate level posts

How I Write Tests for Dagster

unit-testing data engineering

Posted on February 27, 2024 (Last modified on November 19, 2024) | 4 minutes | 813 words | Roel M. Hogervorst

It is vital that your data pipelines work as intended, but it is also easy to make mistakes. That is why we write tests. Testing in Airflow was a fucking pain. The best you could do was create a complete deployment in a set of containers and test your work in there, or create Python packages, test those, and hope they would work in the larger Airflow system. In large deployments you could not add new dependencies without breaking other stuff, so you had to either be creative with the Python/Airflow standard library or run your work outside of Airflow. [Read More]

Evolution of Our Dagster File Organization

File structures should make your work easier

Posted on February 24, 2024 (Last modified on February 25, 2024) | 5 minutes | 978 words | Roel M. Hogervorst

Whenever you try to do intelligent data engineering tasks (refreshing tables in order, running Python processes, ingesting and outputting data), you need a scheduler. Airflow is the best known of these beasts, but I have a fondness for Dagster. Dagster focuses on the result of computations (tables, trained model artefacts) and not on the process itself, which works really well for data science solutions. Dagster also makes it quite easy to swap implementations: you can read from disk while testing or developing, and read from a database in production. [Read More]

Dagster pipelines

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

Posted on January 28, 2024 (Last modified on February 25, 2024) | 5 minutes | 859 words | Roel M. Hogervorst

I haven’t tested this yet, but I see a very powerful combination with Dagster assets and Grist. First, something about spreadsheets. Dagster and Grist? Grist is something like Google Sheets, but more powerful, open source and self-hostable. Dagster is an orchestrator for data; if you are familiar with Airflow it is like that, but in my opinion way better for most of my work. If you don’t know Airflow or Dagster, this post is not very relevant for you. [Read More]

advanced Dagster data_engineering data_science grist quickthoughts spreadsheets

Should We Make a Guild for AI Work?

craftsmen, masters, ethics and licenses

Posted on October 30, 2023 | 6 minutes | 1233 words | Roel M. Hogervorst

I recently reread How professional Ethics work by Sibylla Bostoniensis and it resonated with other ideas about data science titles and care about the craft. So here goes my rambling: Should we make a guild for AI work? For love of the craft. Every once in a while, fellow data scientists complain about other data scientists. Some gate-keeping comments like: “These youngsters don’t know anything about data science work in practice”, “Anyone who participated in one Kaggle competition calls themselves a data scientist nowadays”, “People who were data analysts are calling themselves data scientists now”, “That is not a data scientist, that is a database administrator who took one Python course”, “That is not a data scientist, that is a Python programmer who just downloaded a model and runs it without thinking”. [Read More]

guild ethics coding skills

Poisoning the Web to Thwart Large Language Models

Be the language model failure mode you want to see in the taco.

Posted on August 13, 2023 (Last modified on October 30, 2023) | 5 minutes | 958 words | Roel M. Hogervorst

Training large language models on data that is scraped without taking licences into account is, in essence, theft. So I understand that we, as creators, want to do something. As user @mttaggart@fosstodon.org says: In a world where a bit of math can pass as human on the internet, we are obliged to be more unpredictable, more chaotic. Be the language model failure mode you want to see in the taco. [Read More]

LLMs poisoning the well problematic training data civil disobedience

High and Low Variance in Data Science Work

Consistency or peaks, pick one

Posted on March 6, 2023 (Last modified on August 13, 2023) | 2 minutes | 248 words | Roel M. Hogervorst

I recently read “High Variance Management” by Sebas Bensu and this made me think about data science work. First some examples from the post: Some work needs to be consistent, not extraordinary but always very nearly the same. Theatre actors performing multiple shows per week need to deliver their acting in the same way every day. Their work is low variance. Some work needs superb results: you don’t know if you can reach them, but you try many times, and among all of the failures you might find gold. [Read More]

MLOps

Creating One Unified Calendar of all Data Science Events in the Netherlands

Over-engineering with renv and GitHub Actions

Posted on December 2, 2022 | 3 minutes | 439 words | Roel M. Hogervorst

I enjoy learning new things about machine learning, and I enjoy meeting like-minded people too. That is why I go to meetups and conferences. But not everyone I meet becomes a member of every group. So I keep sending my coworkers new events that I hear about here in the Netherlands. And it is easy to overlook a new event that comes in over email. I, individually, cannot scale. So in this post I will walk you through an over-engineered solution to make myself unnecessary. [Read More]

calendar devops CI/CD git renv scheduling ghactions

The city, neighborhoods and streets: Organizing your MLproject

reduce your mental load by using conventions

Posted on October 25, 2022 (Last modified on November 9, 2022) | 4 minutes | 790 words | Roel M. Hogervorst

Have you received a project that someone else created and did it make you go 🤯? (Was that someone else you, from a few months back?) Sometimes a project organically grows into a mess of scripts and you don’t know how to make it better. The main problem is often the project organization. I want you to think about, and organize, your project in three levels that I call city-level, neighborhood-level and street-level. [Read More]

development modular projects optimize your code

Should I Move to a Database?

Posted on November 8, 2021 (Last modified on November 9, 2022) | 15 minutes | 2986 words | Roel M. Hogervorst

Long ago at a real-life meetup (remember those?), I received a t-shirt which said: “biggeR than R”. I think it was by Microsoft, who develop a special version of R with automatic parallel work. Anyway, I was thinking about the bigness (is that a word? it is now!) of your data. Is your data becoming too big? Your dataset becomes so big and unwieldy that operations take a long time. [Read More]

dataframe database python duckdb sqlite pandas polars

Munging and reordering Polarsteps data

Turning nested lists into a data.frame with purrr

Posted on April 23, 2020 (Last modified on November 9, 2022) | 10 minutes | 1937 words | Roel M. Hogervorst

This post is about how to extract data from a JSON file, turn it into a tibble and do some work with the result. I’m working with a download of personal data from Polarsteps. I spent a month in New Zealand, birthplace of R and home to Hobbits. I logged my travel using the Polarsteps application. The app allows you to upload pictures and write stories about your travels. [Read More]

jsonlite dplyr purrr rectangling