Do you Need a Feature Store?

From simple to advanced

A feature store is a central place where you get your (transformed) training and prediction data from. But do you need this? Why would you invest (engineering effort) in a feature store?1 All engineering is making trade offs, a feature store is an abstraction that can lead to more consistency between teams and between projects. A feature store is not useful for a single data scientist for a single project. It becomes useful when you do multiple projects, with multiple teams. [Read More]

Introducing the 'Smoll Data Stack'

The Small, Minimal, Open, Low effort, Low power (SMOLL) datastack1, is a pun with ambitions to grow into something larger, and more educational. I wanted a cheap platform to work on improving my data engineering skills and so I re-purposed some hardware for this project. A raspberry pi 3 with ubuntu & a NAS that I’ve installed a postgres database into. What people call the ‘modern data stack’ is usually [1] a cloud data warehouse such as Snowflake, Bigquery or Redshift2 (In my opinion a data warehouse is something that holds all the data in table like format and your transformations are done with SQL). [Read More]

Don't Panic! a Scientific Approach to Debugging Production Failure

Your production system just broke down. What should you do now? Can you imagine your shiny application / flask app, or your API service breaking down? As a beginning programmer, or operations (or devops) person it can be overwhelming to deal with logs, messages, metrics and other possible relevant information that is coming at you at such a point. And when something fails you want it to get back to working state as fast as possible. [Read More]

WTF is Kubernetes and Should I Care as R User?

Fearless to production

I’m going to give you a high overview of kubernetes and how you can make your R work shine in kubernetes. Are you, an R-user in a company that uses kubernetes? building R applications (models that do predictions, shiny applications, APIs)? curious about this whole kubernetes thing that your coworkers are talking about? somewhat afraid? Then I have the post for you! Many R users come from an academic background, statistics and social sciences. [Read More]

The Whole Game; a Development Workflow

Developing software together

This is a post for people who only work alone or wonder why on earth you would use all those fancy tools like linting, unit-tests, and fancy editors. I hear you, why would I use all those extra steps? That sounds like busywork you do instead of actual work! I think you just don’t haven’t experienced development work like I have, and I would like to share how my work feels and looked like in the past few years. [Read More]

Distributing data science products

Where or what is production? What does it mean when someone says to bring some data science product ‘in production’ ? What does it mean for data science products to be in production? Is your product already in production? Is it a magical place? I think two questions are of importance: does my ‘thing’ provide value? is my work repeatable? If the answer to these questions is yes, than your ‘thing’ is in production. [Read More]

Reasons to Use Tidymodels

I was listening to episode 135 of ‘Not so standard deviations’ - Moderate confidence The hosts, Hilary and Roger talked about when to use tidymodels packages and when not. Here are my 2 cents for when I think it makes sense to use these packages and when not: When not you are always using GLM models. (they are very flexible!) it makes no sense to me to go for the extra {parsnip} layer if you are always using the same models. [Read More]

Tidymodels on UbiOps

I’ve been working with UbiOps lately, a service that runs your data science models as a service. They have recently started supporting R next to python! So let’s see if we can deploy a tidymodels model to UbiOps! I am not going to tell you a lot about UbiOps, that is for another post. I presume you know what it is, you know what tidymodels means for R and you want to combine these things. [Read More]

Some Thoughts About dbt for Data Engineering

Over the last week I have experimented with dbt (data built tool), a cmdline tool created by Fishtown-analytics. I’m hardly the first to write or talk about it (see all the references at the bottom of this piece). But I just want to record my thoughts at this point in time. What is it Imagine the following situation: you have a data warehouse where all your data lives. You as a data engineer support tens to hundreds of analysts who build dashboards and reports on top of that source data. [Read More]

TIL: Vectorization in Advent of Code Day 15

Indexing vectors is super fast!

I spend a lot of time yesterday on day 15 of advent of code (I’m three days behind I think). Advent of code is a nice way to practice your programming skills, and even though I think of myself as an advanced R programmer I learned something yesterday! The challenge is this: While you wait for your flight, you decide to check in with the Elves back at the North Pole. [Read More]