These levels have been defined by the software carpentry people, and I have modified them to this:

  • beginner: You have just started out in this topic. You do not yet know how things are supposed to work. You do not have a mental model of this thing
  • intermediate: You are a regular user of this software/tool/concept, you have a mental model, but it is not very sophisticated
  • advanced: You have a sophisticated mental model how things work, and you even know when the model breaks, when it does not match reality.

how I write tests for dagster

unit-testing data engineering

It is vital that your data pipelines work as intented, but it is also easy to make mistakes. That is why we write tests. Testing in Airflow was a fucking pain. The best you could do was create a complete deployment in a set of containers and test your work in there. Or creating python packages, test those and hope it would work in the larger airflow system. In large deployments you could not add new dependencies without breaking other stuff, so you have to either be creative with the python /airflow standard library or run your work outside of airflow. [Read More]

Evolution of Our Dagster File Organization

File structures should make your work easier

Evolution of Our Dagster File Organization
Whenever you try to do intelligent data engineering tasks: refreshing tables in order, running python processes, ingesting and outputting data, you need a scheduler. Airflow is the best known of these beasts, but I have a fondness for Dagster. Dagster focuses on the result of computations: tables, trained model artefacts and not on the process itself, this works really well for data science solutions. Dagster makes it also quite easy to swap implementations, you can read from disk while testing or developing, and read from a database in production. [Read More]

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

Using Grist as Part of your Data Engineering Pipeline with Dagster
I haven’t tested this yet, but I see a very powerful combination with Dagster assets and grist. First something about spreadsheets Dagster en grist? Grist is something like google sheets, but more powerful, opensource and self-hostable. Dagster is an orchestrator for data, if you are familiar with airflow it is like that, but in my opinion way better for most of my work. If you don’t know airflow, or Dagster, this post is not very relevant for you. [Read More]

Should We Make a Guild for AI Work?

craftsmen, masters, ethics and licenses

Should We Make a Guild for AI Work?
I recently reread How professional Ethics work by Sibylla Bostoniensis and it resonated with other ideas about data science titles and care about the craft. So here goes my rambling: Should we make a guild for AI work? for love of the craft Every once in a while, fellow data scientists complain about other data scientists. Some gate-keeping comments like: “These youngsters don’t know anything about data science work in practice”, “Anyone who participated in one Kaggle competition calls themselves a data scientist nowadays”, “People who were data analyst, are calling themselves data scientists now”, “That is not a data scientist, that is a database administrator who took one python course”, “That is not a data scientist, that is a python programmer who just downloaded a model and runs it without thinking”. [Read More]

Poisoning the Web to Thwart Large Language Models

Be the language model failure mode you want to see in the taco.

Poisoning the Web to Thwart Large Language Models
Large language models trained on data that is scraped without taking licences into account is in essence theft. So I understand that we want to do something as creators. As user @mttaggart@fosstodon.org says: In a world where a bit of math can pass as human on the internet, we are obliged to be more unpredictable, more chaotic. Be the language model failure mode you want to see in the taco. [Read More]

High and Low Variance in Data Science Work

Consistency or peaks, pick one

High and Low Variance in Data Science Work
I recently read “High Variance Management” by Sebas Bensu and this made me think about datascience work. First some examples from the post: Some work needs to be consistent, not extraordinary but always very nearly the same. Theatre actors performing multiple shows per week need to deliver their acting in the same way every day. Their work is low variance. Some work needs superb results, results you don’t know if you can reach it but you try it many times and between all of the failures, you might find gold. [Read More]

Creating One Unified Calendar of all Data Science Events in the Netherlands

Over engineering with renv and github actions

Creating One Unified Calendar of all Data Science Events in the Netherlands
I enjoy learning new things about machine learning, and I enjoy meeting like minded people too. That is why I go to meetups and conferences. But not everyone I meet becomes a member of every group. So I keep sending my coworkers new events that I hear about here in the Netherlands. And it is easy to overlook a new event that comes in over email. Me individually cannot scale. So in this post I will walk you through an over engineered solution to make myself unnecessary. [Read More]

The city, neighborhoods and streets: Organizing your MLproject

reduce your mental load by using conventions

Have you received a project that someone else created and did it make you go 🤯? (Was that someone else: you from a few months back?1 ) Sometimes a project organically grows into a mess of scripts and you don’t know how to make it better. The main problem is often the project organization. I want you to think about, and organize, your project in three levels that I call city-level, neighborhood-level and street-level. [Read More]

Should I Move to a Database?

Long ago at a real-life meetup (remember those?), I received a t-shirt which said: “biggeR than R”. I think it was by microsoft, who develop a special version of R with automatic parallel work. Anyways, I was thinking about bigness (is that a word? it is now!) of your data. Is your data becoming to big? big data stupid gif Your dataset becomes so big and unwieldy that operations take a long time. [Read More]

Munging and reordering Polarsteps data

Turning nested lists into a data.frame with purrr

This post is about how to extract data from a json, turn it into a tibble and do some work with the result. I’m working with a download of personal data from polarsteps. A picture of Tokomaru Wharf (New Zealand) I was a month in New Zealand, birthplace of R and home to Hobbits. I logged my travel using the Polarsteps application. The app allows you to upload pictures and write stories about your travels. [Read More]