Should we make a guild for AI work?
for love of the craft
I recently reread How Professional Ethics Work by Sibylla Bostoniensis and it resonated with other ideas about data science titles and care for the craft. So here goes my rambling. Every once in a while, fellow data scientists complain about other data scientists, with gate-keeping comments like: “These youngsters don’t know anything about data science work in practice”, “Anyone who participated in one Kaggle competition calls themselves a data scientist nowadays”, “People who were data analysts are calling themselves data scientists now”, “That is not a data scientist, that is a database administrator who took one python course”, “That is not a data scientist, that is a python programmer who just downloaded a model and runs it without thinking”. [Read More]
Poisoning the Web to Thwart Large Language Models
Be the language model failure mode you want to see in the taco.
Training large language models on data scraped without taking licences into account is, in essence, theft. So I understand that we, as creators, want to do something about it. As user @email@example.com says: In a world where a bit of math can pass as human on the internet, we are obliged to be more unpredictable, more chaotic. Be the language model failure mode you want to see in the taco. [Read More]
High and Low Variance in Data Science Work
Consistency or peaks, pick one
I recently read “High Variance Management” by Sebas Bensu and it made me think about data science work. First some examples from the post: Some work needs to be consistent, not extraordinary but always very nearly the same. Theatre actors performing multiple shows per week need to deliver their acting in the same way every day. Their work is low variance. Other work needs superb results, results you don’t know you can reach, but you try many times and among all the failures you might find gold. [Read More]
Creating One Unified Calendar of all Data Science Events in the Netherlands
Over-engineering with renv and GitHub Actions
I enjoy learning new things about machine learning, and I enjoy meeting like-minded people too. That is why I go to meetups and conferences. But not everyone I meet becomes a member of every group, so I keep sending my coworkers new events that I hear about here in the Netherlands. And it is easy to overlook a new event that comes in over email; I alone cannot scale. So in this post I will walk you through an over-engineered solution to make myself unnecessary. [Read More]
Are you a Fearless Deployer?
Fast experimentation and confident deployments should be your goal
How do you feel when you press the ‘deploy to production’ button? Confident? Slightly afraid? I bet many data scientists find it a bit scary, and it’s worth digging a bit deeper into this fear. In my ideal world we are not scared at all: we have a devops mindset, no anxiety, no fear. You should be confident that the deployment pipeline takes care of everything. [Read More]
The city, neighborhoods and streets: Organizing your ML project
reduce your mental load by using conventions
Have you received a project that someone else created and did it make you go 🤯? (Was that someone else: you from a few months back?1 ) Sometimes a project organically grows into a mess of scripts and you don’t know how to make it better. The main problem is often the project organization. I want you to think about, and organize, your project in three levels that I call city-level, neighborhood-level and street-level. [Read More]
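The post goes into the details, but as a rough sketch (the folder names here are my own invention, not necessarily the ones from the post), the three levels could map onto a directory layout like this:

```python
from pathlib import Path

# Hypothetical skeleton for the city / neighborhood / street idea:
# - city level: the top-level areas you see first (data, src, reports)
# - neighborhood level: groupings inside an area (features vs models)
# - street level: the individual scripts and modules where work happens
SKELETON = [
    "data/raw",
    "data/processed",
    "src/features",   # feature engineering code lives together
    "src/models",     # model training and evaluation code
    "reports/figures",
]

def create_skeleton(root: Path) -> None:
    """Create the (empty) directory layout under `root`."""
    for rel in SKELETON:
        (root / rel).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    create_skeleton(Path("example_project"))
```

The point is not these particular folder names, but that a fixed convention means you never have to re-decide where things go.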
Do you Need a Feature Store?
From simple to advanced
A feature store is a central place where you get your (transformed) training and prediction data. But do you need one? Why would you invest engineering effort in a feature store?1 All engineering is about making trade-offs; a feature store is an abstraction that can lead to more consistency between teams and between projects. A feature store is not useful for a single data scientist on a single project. It becomes useful when you do multiple projects, with multiple teams. [Read More]
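To make the abstraction concrete, here is a toy in-memory sketch (not any real feature store product; all names are made up) of the core idea: register a feature transformation once, and let both training and prediction code fetch it from the same place:

```python
from typing import Callable, Dict, List


class ToyFeatureStore:
    """In-memory sketch: register a feature once, reuse it everywhere."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[dict], float]] = {}

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        self._features[name] = fn

    def get_features(self, names: List[str], row: dict) -> dict:
        # Training pipelines and prediction services both call this,
        # so the transformation logic cannot drift between them.
        return {n: self._features[n](row) for n in names}


store = ToyFeatureStore()
store.register("bmi", lambda r: r["weight_kg"] / r["height_m"] ** 2)

row = {"weight_kg": 80.0, "height_m": 2.0}
print(store.get_features(["bmi"], row))  # {'bmi': 20.0}
```

A real feature store adds storage, versioning and point-in-time correctness on top, but the consistency argument is already visible in this toy.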
Reading in your training data
Data Ingestion Patterns for ML
How do you get your training data into your model? Most tutorials and Kaggle notebooks begin with reading csv data, but in this post I hope to convince you to do better. I think you should spend as much time as possible on the feature-engineering and modelling parts of your ML project, and as little time as possible on getting the data from somewhere to your machine. [Read More]
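As a teaser, one such pattern is querying the source database directly instead of passing csv exports around. A minimal sketch with an in-memory sqlite database (the table and column names are made up):

```python
import sqlite3


def load_training_data(conn: sqlite3.Connection) -> list:
    """Pull training data straight from the database, not a hand-copied csv."""
    query = """
        SELECT age, income, churned
        FROM customers
        WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM customers)
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    # Stand-in for a real source database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (age INT, income REAL, churned INT, snapshot_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [(34, 52000.0, 0, "2024-01-01"), (51, 61000.0, 1, "2024-01-01")],
    )
    print(load_training_data(conn))
```

Because the query is code, it is reproducible and reviewable, unlike a csv file someone once downloaded.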
Planning Meals in an Overly Complicated Way
Creating an over-engineered technical solution for a first world household issue
Cooking: for some a chore, for some absolute joy. I’m somewhere in the middle. But over the years I’ve learned that I need to plan my meals. If I plan a week of meals in advance, we can do groceries for the entire week in one go, and by thinking about meals in advance you can vary them for nutritional value. I used to have a very strict diet to prevent stomach aches; planning and cooking those meals was annoying, but eating the same things is very boring. [Read More]
Introducing the 'Smoll Data Stack'
The Small, Minimal, Open, Low effort, Low power (SMOLL) data stack1 is a pun with ambitions to grow into something larger and more educational. I wanted a cheap platform to work on improving my data engineering skills, so I re-purposed some hardware for this project: a raspberry pi 3 running ubuntu, and a NAS onto which I’ve installed a postgres database. What people call the ‘modern data stack’ is usually a cloud data warehouse such as Snowflake, BigQuery or Redshift2 (in my opinion a data warehouse is something that holds all the data in table-like format, with your transformations done in SQL). [Read More]
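To illustrate that definition with a toy example (made-up table names, and sqlite standing in for a real warehouse): raw rows land in one table, and a derived table is built from it with plain SQL, no dataframe code involved:

```python
import sqlite3

# Toy version of 'transformations are done with SQL': raw data goes into a
# table, and an aggregated table is created from it entirely in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("mon", 10.0), ("mon", 5.0), ("tue", 7.5)],
)
conn.execute(
    """CREATE TABLE daily_sales AS
       SELECT day, SUM(amount) AS total
       FROM raw_sales
       GROUP BY day"""
)
print(conn.execute("SELECT day, total FROM daily_sales ORDER BY day").fetchall())
# [('mon', 15.0), ('tue', 7.5)]
```

On a real warehouse the SQL is the same idea at scale; tools like dbt mostly organize and schedule exactly this kind of statement.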