A feature store is a central place where you get your (transformed) training and prediction data. But do you need one? Why would you invest engineering effort in a feature store?
All engineering is about making trade-offs. A feature store is an abstraction that can lead to more consistency between teams and between projects. It is not useful for a single data scientist on a single project; it becomes useful when you run multiple projects with multiple teams.
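As a minimal sketch of the idea (the feature names and the data.frame standing in for the store are made up for illustration): training and prediction both fetch precomputed features from the same place, which is where the consistency comes from.

```r
# Toy 'feature store': one table of precomputed (transformed) features,
# shared by training and serving
store <- data.frame(
  customer_id   = 1:3,
  avg_order_eur = c(25.0, 110.5, 7.2),  # a transformed feature
  orders_30d    = c(2L, 9L, 1L)         # a transformed feature
)

# Both training and prediction code fetch features the same way,
# instead of each re-implementing the transformations
get_features <- function(ids, features) {
  store[store$customer_id %in% ids, c("customer_id", features)]
}

get_features(c(1, 3), c("avg_order_eur", "orders_30d"))
```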
[Read More]
Introducing the 'Smoll Data Stack'
The Small, Minimal, Open, Low effort, Low power (SMOLL) data stack is a pun with ambitions to grow into something larger and more educational. I wanted a cheap platform to work on improving my data engineering skills, so I re-purposed some hardware for this project: a Raspberry Pi 3 running Ubuntu, and a NAS on which I've installed a PostgreSQL database.
What people call the ‘modern data stack’ is usually a cloud data warehouse such as Snowflake, BigQuery or Redshift (in my opinion, a data warehouse is something that holds all the data in a table-like format, with transformations done in SQL).
[Read More]
Don't Panic! a Scientific Approach to Debugging Production Failure
Your production system just broke down. What should you do now? Can you imagine your Shiny application, Flask app, or API service breaking down?
As a beginning programmer or operations (or DevOps) person, it can be overwhelming to deal with the logs, messages, metrics, and other possibly relevant information coming at you at such a moment.
And when something fails you want it to get back to working state as fast as possible.
[Read More]
WTF is Kubernetes and Should I Care as an R User?
Fearless to production
I’m going to give you a high-level overview of Kubernetes and how you can make your R work shine on it.
Are you

- an R user in a company that uses Kubernetes?
- building R applications (models that make predictions, Shiny applications, APIs)?
- curious about this whole Kubernetes thing your coworkers are talking about?
- somewhat afraid?

Then I have the post for you!
Many R users come from an academic background in statistics or the social sciences.
[Read More]
The Whole Game; a Development Workflow
Developing software together
This is a post for people who work alone, or who wonder why on earth you would use all those fancy tools like linters, unit tests, and sophisticated editors. I hear you: why would I add all those extra steps? That sounds like busywork you do instead of actual work!
I think you just haven’t experienced development work the way I have, and I would like to share how my work has felt and looked over the past few years.
[Read More]
Distributing data science products
Where, or what, is production? What does it mean when someone says to bring a data science product ‘in production’? What does it mean for data science products to be in production? Is your product already in production? Is it a magical place?
I think two questions are of importance:
- Does my ‘thing’ provide value?
- Is my work repeatable?

If the answer to both questions is yes, then your ‘thing’ is in production.
[Read More]
Reasons to Use Tidymodels
I was listening to episode 135 of ‘Not So Standard Deviations’, ‘Moderate Confidence’. The hosts, Hilary and Roger, talked about when to use tidymodels packages and when not to. Here are my two cents on when I think it makes sense to use these packages and when it doesn’t:
When not: when you are always using GLM models (they are very flexible!), it makes no sense to me to add the extra {parsnip} layer if you are always fitting the same kind of model.
[Read More]
Tidymodels on UbiOps
I’ve been working with UbiOps lately, a service that runs your data science models as a service. They recently started supporting R alongside Python! So let’s see if we can deploy a tidymodels model to UbiOps. I am not going to tell you a lot about UbiOps; that is for another post. I presume you know what it is, you know what tidymodels means for R, and you want to combine the two.
[Read More]
Some Thoughts About dbt for Data Engineering
Over the last week I have experimented with dbt (data build tool), a command-line tool created by Fishtown Analytics. I’m hardly the first to write or talk about it (see the references at the bottom of this piece), but I want to record my thoughts at this point in time.
What is it? Imagine the following situation: you have a data warehouse where all your data lives. You, as a data engineer, support tens to hundreds of analysts who build dashboards and reports on top of that source data.
[Read More]
TIL: Vectorization in Advent of Code Day 15
Indexing vectors is super fast!
I spent a lot of time yesterday on day 15 of Advent of Code (I’m three days behind, I think). Advent of Code is a nice way to practice your programming skills, and even though I think of myself as an advanced R programmer, I learned something yesterday!
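To illustrate the kind of speedup indexing gives you (a toy sketch, not the actual puzzle code): replacing an element-by-element lookup loop with a single vectorized subset produces the same result, far faster.

```r
# Toy illustration: look up 100,000 values in a named vector
lookup <- c(a = 1L, b = 2L, c = 3L)
keys <- sample(names(lookup), 1e5, replace = TRUE)

# Slow: one lookup per iteration
slow <- vapply(keys, function(k) lookup[[k]], integer(1))

# Fast: index the whole vector at once
fast <- unname(lookup[keys])

identical(unname(slow), fast)  # TRUE, with far less work per element
```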
The challenge is this:
While you wait for your flight, you decide to check in with the Elves back at the North Pole.
[Read More]