The art (and science) of feature engineering

combining best practices from science and engineering

Data scientists, in general, do not just throw data into a model. They use feature engineering: transforming input data to make it easy for the chosen machine learning algorithm to pick up the subtleties in the data. Data scientists do this so the model can predict outcomes better. In the image below you see a transformation of data into numeric values with meaning. In this article I’ll discuss why we still need feature engineering (FE) in the age of large language models, and what some best practices are. [Read More]

How I write tests for Dagster

unit-testing data engineering

It is vital that your data pipelines work as intended, but it is also easy to make mistakes. That is why we write tests. Testing in Airflow was a fucking pain. The best you could do was create a complete deployment in a set of containers and test your work in there. Or you could create Python packages, test those, and hope they would work in the larger Airflow system. In large deployments you could not add new dependencies without breaking other stuff, so you had to either be creative with the Python/Airflow standard library or run your work outside of Airflow. [Read More]

Evolution of Our Dagster File Organization

File structures should make your work easier

Whenever you try to do intelligent data engineering tasks (refreshing tables in order, running Python processes, ingesting and outputting data), you need a scheduler. Airflow is the best known of these beasts, but I have a fondness for Dagster. Dagster focuses on the result of computations (tables, trained model artefacts) and not on the process itself. This works really well for data science solutions. Dagster also makes it quite easy to swap implementations: you can read from disk while testing or developing, and read from a database in production. [Read More]

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

I haven’t tested this yet, but I see a very powerful combination with Dagster assets and Grist. First something about spreadsheets. Dagster and Grist? Grist is something like Google Sheets, but more powerful, open source and self-hostable. Dagster is an orchestrator for data; if you are familiar with Airflow it is like that, but in my opinion way better for most of my work. If you don’t know Airflow or Dagster, this post is not very relevant for you. [Read More]

Should We Make a Guild for AI Work?

craftsmen, masters, ethics and licenses

I recently reread “How professional Ethics work” by Sibylla Bostoniensis and it resonated with other ideas about data science titles and care for the craft. So here goes my rambling: Should we make a guild for AI work? For love of the craft. Every once in a while, fellow data scientists complain about other data scientists. Some gate-keeping comments like: “These youngsters don’t know anything about data science work in practice”, “Anyone who participated in one Kaggle competition calls themselves a data scientist nowadays”, “People who were data analysts are calling themselves data scientists now”, “That is not a data scientist, that is a database administrator who took one Python course”, “That is not a data scientist, that is a Python programmer who just downloaded a model and runs it without thinking”. [Read More]

Poisoning the Web to Thwart Large Language Models

Be the language model failure mode you want to see in the taco.

Training large language models on data that is scraped without taking licences into account is in essence theft. So I understand that we want to do something as creators. As user @mttaggart@fosstodon.org says: In a world where a bit of math can pass as human on the internet, we are obliged to be more unpredictable, more chaotic. Be the language model failure mode you want to see in the taco. [Read More]

High and Low Variance in Data Science Work

Consistency or peaks, pick one

I recently read “High Variance Management” by Sebas Bensu and this made me think about data science work. First some examples from the post: Some work needs to be consistent, not extraordinary but always very nearly the same. Theatre actors performing multiple shows per week need to deliver their acting in the same way every day. Their work is low variance. Some work needs superb results: results you don’t know if you can reach, but you try many times and, among all of the failures, you might find gold. [Read More]

Creating One Unified Calendar of all Data Science Events in the Netherlands

Over-engineering with renv and GitHub Actions

I enjoy learning new things about machine learning, and I enjoy meeting like-minded people too. That is why I go to meetups and conferences. But not everyone I meet becomes a member of every group. So I keep sending my coworkers new events that I hear about here in the Netherlands. And it is easy to overlook a new event that comes in over email. I, as an individual, cannot scale. So in this post I will walk you through an over-engineered solution to make myself unnecessary. [Read More]

Are you a Fearless Deployer?

Fast experimentation and confident deployments should be your goal

How do you feel when you press the ‘deploy to production’ button? Confident, slightly afraid? I bet many data scientists find it a bit scary. It’s worth it to dig a bit deeper into this fear. In my ideal world we are not scared at all. We have a DevOps mindset. We have no anxiety, no fears at all. You should be confident that the deployment pipeline takes care of everything. [Read More]

The city, neighborhoods and streets: Organizing your ML project

reduce your mental load by using conventions

Have you received a project that someone else created and did it make you go 🤯? (Was that someone else: you from a few months back?) Sometimes a project organically grows into a mess of scripts and you don’t know how to make it better. The main problem is often the project organization. I want you to think about, and organize, your project in three levels that I call city-level, neighborhood-level and street-level. [Read More]