Just enough Kubernetes to be dangerous

As a data scientist who wants to achieve production results, one of your best options is to make your work available in Kubernetes, because Kubernetes runs on all clouds and many organizations use it. Make your prediction API available in Kubernetes and your organization can ‘just’ plug it into their systems. Many data scientists don’t know anything about Docker, let alone Kubernetes and its main tool Helm. I think you should learn and practice just enough Helm to be dangerous. [Read More]

OpenSanctions is an amazing example of entity resolution at scale

In one of my previous posts I talked about entity resolution and how data science plays a role. I am a big fan of OpenSanctions, and their process (entirely open) is a beautiful example of entity resolution. OpenSanctions is an international database of persons and companies of political, criminal, or economic interest. It’s a combination of sanction lists, publicly available databases, and criminal information. Companies can use this information to check their customers and to prevent money laundering and sanction evasion. [Read More]

Entity resolution for data scientists

or data matching, or data deduplication or record linkage

I have a problem. Others have it too: it is a problem of duplication. I’m trying to track the books I read in Bookwyrm so I can talk about them online. But there are so many duplicates! How do we know if Soren Kierkegaard, Søren Kierkegaard, and Sören Kierkegaard are the same person? This is an example of entity resolution. It is also called deduplication, record linkage, and data matching. We want to compare entities from different datasets and make a confident claim about whether they match. [Read More]
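To make the Kierkegaard example concrete, here is a minimal sketch of one common first step in entity resolution: normalizing names before comparing them. It is not taken from the post itself. The helper names `normalize` and `similarity` and the `EXTRA_FOLDS` table are hypothetical, and the sketch assumes that folding diacritics and case is acceptable for your data, which is not always true.

```python
import unicodedata
from difflib import SequenceMatcher

# Assumption: a tiny, incomplete table for letters that Unicode
# decomposition does not split into base letter + accent (such as "ø").
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O"})


def normalize(name: str) -> str:
    """Fold case, diacritics, and extra whitespace into a comparison key."""
    folded = name.translate(EXTRA_FOLDS)
    # NFKD splits "ö" into "o" plus a combining diaeresis, which we drop.
    decomposed = unicodedata.normalize("NFKD", folded)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def similarity(a: str, b: str) -> float:
    """Similarity of two names after normalization, between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


names = ["Soren Kierkegaard", "Søren Kierkegaard", "Sören Kierkegaard"]
keys = {normalize(n) for n in names}  # all three collapse to a single key
```

Exact matching on a normalized key only catches the easy cases. A real pipeline would typically compare more fields (for example birth year or work titles) and use a similarity threshold rather than strict equality.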

The art (and science) of feature engineering

combining best practices from science and engineering

Data scientists, in general, do not just throw data into a model. They use feature engineering: transforming input data to make it easy for the chosen machine learning algorithm to pick up the subtleties in the data. Data scientists do this so the model can predict outcomes better. In the image below you see a transformation of data into numeric values with meaning. In this article I’ll discuss why we still need feature engineering (FE) in the age of large language models, and what some best practices are. [Read More]

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

I haven’t tested this yet, but I see a very powerful combination in Dagster assets and Grist. First, something about spreadsheets. Dagster and Grist? Grist is something like Google Sheets, but more powerful, open source, and self-hostable. Dagster is an orchestrator for data; if you are familiar with Airflow, it is like that, but in my opinion way better for most of my work. If you don’t know Airflow or Dagster, this post is not very relevant for you. [Read More]

Data Science Technical Terms: Job Titles and Fields

MLE, AE, DE, DS, WTF?

What do I mean when I talk about MLOps, machine learning engineering, or data science? I call myself a data engineer, data scientist, or machine learning engineer. But never an analyst. To me these job titles all have a certain meaning, although they overlap. Here is what the job titles mean to me, right now. The first thing you need to keep in mind is the size of the organization, the size of the data team, and the data maturity of the organization. [Read More]

Not the Job Title but the Activities

What exactly would you say you do here?

I call myself a data engineer, data scientist, or machine learning engineer. But never an analyst. To me these job titles all have a certain meaning, but how would a recruiter know what these things mean? My understanding of the roles can also be different from that of someone else in the field. Some people (they are dicks) would like to keep everyone who is not building neural nets out of the title of data scientist. [Read More]

William Sealy Gosset, one of the first data scientists

The father of the t-distribution

I think William Sealy Gosset, better known as ‘Student’, is the first data scientist. He used math to solve real-world business problems; he worked on experimental design, small-sample statistics, quality control, and beer. In fact, I think we should start a fan club! And as the first member of that fan club, I have been to the Guinness brewery to take a picture of Gosset’s only visible legacy there. W. S. [Read More]