Creating One Unified Calendar of all Data Science Events in the Netherlands

Over-engineering with renv and GitHub Actions

I enjoy learning new things about machine learning, and I enjoy meeting like-minded people too. That is why I go to meetups and conferences. But not everyone I meet becomes a member of every group, so I keep sending my coworkers new events that I hear about here in the Netherlands. And it is easy to overlook a new event that comes in over email. I, individually, cannot scale. So in this post I will walk you through an over-engineered solution to make myself unnecessary. [Read More]

Are you a Fearless Deployer?

Fast experimentation and confident deployments should be your goal

How do you feel when you press the ‘deploy to production’ button? Confident? Slightly afraid? I bet many data scientists find it a bit scary. It’s worth digging a bit deeper into this fear. In my ideal world we are not scared at all. We have a DevOps mindset. We have no anxiety, no fears at all. You should be confident that the deployment pipeline takes care of everything. [Read More]

The city, neighborhoods and streets: Organizing your ML project

Reduce your mental load by using conventions

Have you received a project that someone else created and did it make you go 🤯? (Was that someone else you, from a few months back?) Sometimes a project organically grows into a mess of scripts and you don’t know how to make it better. The main problem is often the project organization. I want you to think about, and organize, your project in three levels that I call city-level, neighborhood-level and street-level. [Read More]

Do you Need a Feature Store?

From simple to advanced

A feature store is a central place where you get your (transformed) training and prediction data from. But do you need this? Why would you invest (engineering effort) in a feature store? All engineering is making trade-offs, and a feature store is an abstraction that can lead to more consistency between teams and between projects. A feature store is not useful for a single data scientist on a single project. It becomes useful when you do multiple projects, with multiple teams. [Read More]

Reading in your training data

Data Ingestion Patterns for ML

How do you get your training data into your model? Most tutorials and Kaggle notebooks begin with reading CSV data, but in this post I hope I can convince you to do better. I think you should spend as much time as possible on the feature engineering and modelling part of your ML project, and as little time as possible on getting the data from somewhere to your machine. [Read More]

Planning Meals in an Overly Complicated Way

Creating an over-engineered technical solution for a first-world household issue

Cooking: for some a chore, for some an absolute joy. I’m somewhere in the middle. But over the years I’ve learned that I need to plan my meals. If I plan a week of meals in advance, we can do groceries for the entire week in one go, and by thinking about the meals in advance I can vary them for nutritional value. I used to have a very strict diet to prevent stomach aches, and planning and cooking those meals was annoying, but eating the same things is very boring. [Read More]

Introducing the 'Smoll Data Stack'

The Small, Minimal, Open, Low effort, Low power (SMOLL) data stack is a pun with ambitions to grow into something larger and more educational. I wanted a cheap platform to work on improving my data engineering skills, so I repurposed some hardware for this project: a Raspberry Pi 3 with Ubuntu and a NAS that I’ve installed a Postgres database on. What people call the ‘modern data stack’ is usually [1] a cloud data warehouse such as Snowflake, BigQuery or Redshift (in my opinion a data warehouse is something that holds all the data in a table-like format and your transformations are done with SQL). [Read More]

Don't Panic! A Scientific Approach to Debugging Production Failure

Your production system just broke down. What should you do now? Can you imagine your Shiny application, Flask app, or API service breaking down? As a beginning programmer, or operations (or DevOps) person, it can be overwhelming to deal with the logs, messages, metrics and other possibly relevant information coming at you at such a point. And when something fails you want to get it back to a working state as fast as possible. [Read More]

WTF is Kubernetes and Should I Care as an R User?

Fearless to production

I’m going to give you a high-level overview of Kubernetes and how you can make your R work shine in Kubernetes. Are you an R user in a company that uses Kubernetes? Building R applications (models that do predictions, Shiny applications, APIs)? Curious about this whole Kubernetes thing that your coworkers are talking about? Somewhat afraid? Then I have the post for you! Many R users come from an academic background, statistics and social sciences. [Read More]

How I Set Up Dagster in a Company

In the past few months I set up a Dagster deployment within a Kubernetes cluster. This was a lot of fun because I learned a lot, but I’d like to document some of the things we did so I’ll remember them later. Dagster is a scheduler, orchestrator, or workflow manager (I’ve seen all those words but I’m not sure what the differences are). When you need to get data from one place to another, do complex data operations that need to happen in a certain order, or have a great many tasks to run, you might want to use such a thing. [Read More]