How to Use Lightgbm with Tidymodels

Treesnip standardizes everything

So you want to compete in a Kaggle competition with R and you want to use tidymodels. In this howto I show how you can use lightgbm (LGBM) with tidymodels. I give very terse descriptions of what the steps do, because I believe you read this post for implementation, not background on how the elements work. Why tidymodels? It is a unified machine learning framework that uses sane defaults, keeps model definitions and implementation separate, and allows you to easily swap models or change parts of the processing. [Read More]

How Does Catboost Deal with Factors in loading?

What are you doing catboost?

Some people at curso-r are working on an amazing extension of parsnip that allows you to use tidymodels packages like {parsnip} and {recipes} with the modern beasts of machine learning: lightgbm and catboost. The package is called treesnip and is still in development. Both lightgbm and catboost can work with categorical features, but how do you pass those to the machinery? Both lightgbm and catboost use special data structures. I was reading through the catboost documentation and it just wasn’t very clear to me. [Read More]

Expressing size in bananas: a dive into {vctrs}

Yes I made a stupid package to express lengths in bananas

Recently I’ve become interested in relative sizes of things. Maybe I’m paying more attention to my surroundings since I’m locked at home for so long. Maybe my inner child is finally breaking free. Whatever the reason, I channeled all of that into two packages: everydaysizes, a rather unfinished collection of dimensions of everyday objects, and banana, a package that displays dimensions as … bananas. I’ve collected a bunch of sizes and turned them into ‘units’. [Read More]

New Package, Pinboardr

I’ve created a new package to interact with pinboard (not to be confused with pinterest). I noticed there wasn’t a package yet and the API is fairly clear. So come and check out {pinboardr}. I did see a new package to interact with pocket: pocketapi. Since pocket is also a kind of bookmark manager, I thought there was a need for these kinds of packages. I will leave this package on github for a while to figure out if I need to make changes, and in a month or so I will push it to CRAN. [Read More]

Munging and reordering Polarsteps data

Turning nested lists into a data.frame with purrr

This post is about how to extract data from a JSON file, turn it into a tibble and do some work with the result. I’m working with a download of personal data from Polarsteps. I spent a month in New Zealand, birthplace of R and home to Hobbits. I logged my travel using the Polarsteps application. The app allows you to upload pictures and write stories about your travels. [Read More]

Where does the output of Rscript go?

stdin, stdout, stderr

We often run R interactively, through RStudio or in the terminal. But you can also run R scripts without manual intervention, using Rscript. But where does the output go? Warning: this post is very linux/unix (macos) centred; I don’t know how this works in Windows. Also, I’m using ‘bash’, the standard shell in linux. I believe there are some small differences in the commands in other shells like zsh. Why do I want to know this? [Read More]

Scraping GDPR Fines

Into the DOM with a flavour of regex

The website Privacy Affairs keeps a list of fines related to GDPR. I heard that this might be an interesting dataset for TidyTuesdays and so I scraped it. The dataset currently contains 250 fines given out for GDPR violations and was last updated (according to the website) on 31 March 2020. All data is from official government sources, such as official reports of national Data Protection Authorities. [Read More]

Gosset part 2: small sample statistics

Scientific brewing at scale

Simulation was the key to achieving world beer dominance. This post is an explainer about the small sample experiments performed by William S. Gosset for ‘scientific’ brewing at scale in the early 1900s. This post contains some R code that simulates his simulations and the resulting determination of the ideal sample size for inference. If you brew your own beer, or if you want to know how many samples you need to say something useful about your data, this post is for you. [Read More]

William Sealy Gosset, one of the first data scientists

The father of the t-distribution

I think William Sealy Gosset, better known as ‘Student’, is the first data scientist. He used math to solve real world business problems; he worked on experimental design, small sample statistics, quality control, and beer. In fact, I think we should start a fanclub! And as the first member of that fanclub, I have been to the Guinness brewery to take a picture of Gosset’s only visible legacy there. W. S. [Read More]

Setting up CSP on your hugo (+netlify) site

A content security policy is being nice to your readers’ browsers

I recently got a compliment about having a content security policy (CSP) on my blog. But I’m not special, you can have one too! In this post I will show you how I created this policy and how you can too. I’m using a service that automates a lot of the work. This is specific to building a hugo site using netlify. I am absolutely no expert and so this is mostly a description of what I did. [Read More]