data_science

Just enough kubernetes to be dangerous

Posted on July 30, 2024 | 2 minutes | 401 words | Roel M. Hogervorst

As a data scientist who wants to achieve production results, one of the best options is to make your work available in Kubernetes, because Kubernetes runs on all clouds and many organizations use it. Make your prediction API available in Kubernetes and your organization can ‘just’ plug it into their systems. Many data scientists don’t know anything about Docker, not to mention Kubernetes and its main tool Helm. I think you should learn and practice just enough Helm to be dangerous. [Read More]

OpenSanctions is an amazing example of entity resolution at scale

Posted on May 22, 2024 | 4 minutes | 817 words | Roel M. Hogervorst

In one of my previous posts I talked about entity resolution and how data science plays a role. I am a big fan of OpenSanctions, and their process (entirely open) is a beautiful example of entity resolution. OpenSanctions is an international database of persons and companies of political, criminal, or economic interest. It’s a combination of sanction lists, publicly available databases, and criminal information. Companies can use this information to check their customers, to prevent money laundering and sanction evasion. [Read More]

data_engineering data_science entity_resolution

Entity resolution for data scientists

or data matching, or data deduplication or record linkage

Posted on April 24, 2024 (Last modified on May 22, 2024) | 12 minutes | 2384 words | Roel M. Hogervorst

I have a problem. Others have it too: it is a problem of duplication. I’m trying to track the books I read in Bookwyrm so I can talk about them online. But there are so many duplicates! How do we know if Soren Kierkegaard, Søren Kierkegaard, and Sören Kierkegaard are the same person? This is an example of entity resolution. It is also called deduplication, record linkage, and data matching. We want to compare entities from different datasets and make a confident claim about whether they match. [Read More]

data_engineering data_science entity_resolution

The art (and science) of feature engineering

combining best practices from science, and engineering

Posted on March 1, 2024 | 5 minutes | 1010 words | Roel M. Hogervorst

Data scientists, in general, do not just throw data into a model. They use feature engineering: transforming input data to make it easy for the chosen machine learning algorithm to pick up the subtleties in the data. Data scientists do this so the model can predict outcomes better. In the image below you see a transformation of data into numeric values with meaning. In this article I’ll discuss why we still need feature engineering (FE) in the age of large language models, and what some best practices are. [Read More]

data_science programming reproducability

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

Posted on January 28, 2024 (Last modified on February 25, 2024) | 5 minutes | 859 words | Roel M. Hogervorst

I haven’t tested this yet, but I see a very powerful combination with Dagster assets and Grist. First, something about spreadsheets. Dagster and Grist? Grist is something like Google Sheets, but more powerful, open source, and self-hostable. Dagster is an orchestrator for data; if you are familiar with Airflow, it is like that, but in my opinion way better for most of my work. If you don’t know Airflow or Dagster, this post is not very relevant for you. [Read More]

advanced Dagster data_engineering data_science grist quickthoughts spreadsheets

Data Science Technical Terms: Job Titles and Fields

MLE, AE, DE, DS, WTF?

Posted on January 29, 2022 | 3 minutes | 556 words | Roel M. Hogervorst

What do I mean when I talk about MLOps, Machine Learning Engineering, or data science? I call myself a data engineer, data scientist, or machine learning engineer. But never an analyst. To me these job titles all have a certain meaning, although they overlap. Here is what the job titles mean to me, right now. The first thing you need to keep in mind is the size of the organization, the size of the data team, and the data maturity of the organization. [Read More]

data_science mlops jobtitles

Not the Jobtitle but the Activities

What exactly would you say you do here?

Posted on January 29, 2022 | 2 minutes | 243 words | Roel M. Hogervorst

I call myself a data engineer, data scientist, or machine learning engineer. But never an analyst. To me these job titles all have a certain meaning, but how would a recruiter know what these things mean? My understanding of the roles can also be different from that of someone else in the field. Some people (they are dicks) would like to keep everyone who is not building neural nets out of the title of data scientist. [Read More]

data_science mlops jobtitles

William Sealy Gosset one of the first data scientists

The father of the t-distribution

Posted on August 17, 2019 (Last modified on November 9, 2022) | 3 minutes | 600 words | Roel M. Hogervorst

I think William Sealy Gosset, better known as ‘Student’, is the first data scientist. He used math to solve real-world business problems; he worked on experimental design, small sample statistics, quality control, and beer. In fact, I think we should start a fan club! And as the first member of that fan club, I have been to the Guinness brewery to take a picture of Gosset’s only visible legacy there. W. S. [Read More]

gosset data_science