Predicting links for network data

NETWORKS, PREDICT EDGES Can we predict if two nodes in the graph are connected or not? But let’s make it very practical: Let’s say you work in a social media company and your boss asks you to create a model to predict who will be friends, so you can feed those recommendations back to the website and serve those to users. You are tasked to create a model that predicts, once a day for all users, who is likely to connect to whom. [Read More]

Rectangling (Social) Network Data

Preparing data for link prediction

In this tutorial I will show you how we go from network data to a rectangular format that is suited for machine learning. Many things in the world are graphs (networks). For instance: real-life friendships, business interactions, links between websites and (digital) social networks. I find graphs (the formal name for networks) fascinating, and because I am also interested in machine learning and data engineering, the question naturally becomes: How do I get (social) network data into a rectangular structure for ML? [Read More]

How Does Catboost Deal with Factors in loading?

What are you doing catboost?

Some people at curso-r, are working on an amazing extension of parsnip and allow you to use tidymodels packages like {parsnip} and {recipes} with the modern beasts of machine learning: lightgbm and catboost. the package is called treesnip and is still in development. Both lightgbm and catboost can work with categorical features but how do you pass those to the machinery? Both lightgbm and catboost use special data structures. I was reading through the catboost documentation and it just wasn’t very clear to me. [Read More]

Expressing size in bananas a dive into {vctrs}

Yes I made a stupid package to express lengths in bananas

Recently I’ve become interested in relative sizes of things. Maybe I’m paying more attention to my surroundings since I’m locked at home for so long. Maybe my inner child is finally breaking free. Whatever the reason, I channeled all of that into two packages: everydaysizes A rather unfinished collection of dimensions of everyday objects. banana A package that displays dimensions as … bananas. I’ve collected a bunch of sizes and turned them into ‘units’. [Read More]

Scraping Gdpr Fines

Into the DOM with a flavour of regex

The website Privacy Affairs keeps a list of fines related to GDPR. I heard * that this might be an interesting dataset for TidyTuesdays and so I scraped it. The dataset contains at this moment 250 fines given out for GDPR violations and is last updated (according to the website) on 31 March 2020. All data is from official government sources, such as official reports of national Data Protection Authorities. [Read More]

Gosset part 2: small sample statistics

Scientific brewing at scale

Simulation was the key to to achieve world beer dominance. ‘Scientific’ Brewing at scale in the early 1900s Beer bottles cheers This post is an explainer about the small sample experiments performed by William S. Gosset. This post contains some R code that simulates his simulations1 and the resulting determination of the ideal sample size for inference. If you brew your own beer, or if you want to know how many samples you need to say something useful about your data, this post is for you. [Read More]

William Sealy Gosset one of the first data scientists

The father of the t-distribution

I think William Sealy Gosset, better known as ‘Student’ is the first data scientist. He used math to solve real world business problems, he worked on experimental design, small sample statistics, quality control, and beer. In fact, I think we should start a fanclub! And as the first member of that fanclub, I have been to the Guinness brewery to take a picture of Gosset’s only visible legacy there. W. S. [Read More]

Setting up CSP on your hugo (+netlify) site

Content security policy is being nice to your readers browser

I recently got a compliment about having a content security policy (CSP) on my blog. But I’m not special, you can have one too! In this post I will show you how I created this policy and how you can too. I’m using the service report-uri.com which automates a lot the work. This is specific for building a hugo site using netlify. I am absolutely no expert and so this is mostly a description of what I did. [Read More]

Graphing My Daily Phone Use

How many times do I look at my phone? I set up a small program on my phone to count the screen activations and logged to a file. In this post I show what went wrong and how to plot the results. The data I set up a small program on my phone that counts every day how many times I use my phone (to be specific, it counts the times the screen has been activated). [Read More]

Logging my phone use with tasker

In this post I’ll show you how I logged my phone use with tasker, in a follow up post I’ll show you how I visualized that. I had a great vacation last week but relaxing in Spain I thought about my use of technology and became a bit concerned with how many times I actually look at my phone. But how many times a day do I actually look at my phone? [Read More]