Message Broker Pattern for ML Systems

Message Broker Pattern for ML Systems
I’ve seen a pattern in different places but it is most useful for streaming data. Data that comes in over time, with quite some volume. The core of the solution is a message broker, this could be light weight like redis1, or a heavier log-like solution like kafka2. In stead of sending data from one microservice to another through API calls, we publish data to a central place, and services subscribe to data, and publish their results back (that is why it is called PubSub; publish - subscribe). [Read More]

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

Using Grist as Part of your Data Engineering Pipeline with Dagster
I haven’t tested this yet, but I see a very powerful combination with Dagster assets and grist. First something about spreadsheets Dagster en grist? Grist is something like google sheets, but more powerful, opensource and self-hostable. Dagster is an orchestrator for data, if you are familiar with airflow it is like that, but in my opinion way better for most of my work. If you don’t know airflow, or Dagster, this post is not very relevant for you. [Read More]

Reading in your training data

Data Ingestion Patterns for ML

How do you get your training data into your model? Most tutorials and kaggle notebooks begin with reading of csv data, but in this post I hope I can convince you to do better. I think you should spend as much time as possible in the feature engineering and modelling part of your ML project, and as little time as possible on getting the data from somewhere to your machine. [Read More]

Don't Panic! a Scientific Approach to Debugging Production Failure

Your production system just broke down. What should you do now? Can you imagine your shiny application / flask app, or your API service breaking down? As a beginning programmer, or operations (or devops) person it can be overwhelming to deal with logs, messages, metrics and other possible relevant information that is coming at you at such a point. And when something fails you want it to get back to working state as fast as possible. [Read More]

How I Set Up Dagster in a Company

In the past few months I setup a dagster deployment within a kubernetes cluster. This was a lot of fun because I learned a lot, but I’d like to document some of the things we did so I’ll remember them later. Dagster is a scheduler/ orchestrator or workflow manager (I’ve seen all those words but I’m not sure what the differences are). When you need to get data from one place to another, do complex data operations that need to happen in a certain order or you have many many tasks to run, you might want to use such a thing. [Read More]

Testing Azure Functions Locally with Azurite

Supplying secrets and simulating storage

I’ve been developing Azure Functions with R for the past week. There are some nice basic tutorials to run custom code on ‘Functions’, the basic tutorials all create a simple web app. That is, the docker container responds to http triggers. However, if you want to use a different trigger, you need to have a storage account too. There are two ways to do this: use the actual storage account you created on azure simulate storage with the ‘azurite’ container. [Read More]

How to Use Catboost with Tidymodels

Treesnip standardizes everything

So you want to compete in a kaggle competition with R and you want to use tidymodels. In this howto I show how you can use CatBoost with tidymodels. I give very terse descriptions of what the steps do, because I believe you read this post for implementation, not background on how the elements work. This tutorial is extremely similar to my previous post about using lightGBM with Tidymodels. Why tidymodels? [Read More]

How to Use Lightgbm with Tidymodels

Treesnip standardizes everything

So you want to compete in a kaggle competition with R and you want to use tidymodels. In this howto I show how you can use lightgbm (LGBM) with tidymodels. I give very terse descriptions of what the steps do, because I believe you read this post for implementation, not background on how the elements work. Why tidymodels? It is a unified machine learning framework that uses sane defaults, keeps model definitions andimplementation separate and allows you to easily swap models or change parts of the processing. [Read More]