dplyr

Predicting links for network data

Posted on November 27, 2020 (Last modified on January 22, 2021) | 14 minutes | 2877 words | Roel M. Hogervorst

Networks, predict edges: can we predict whether two nodes in a graph are connected or not? But let’s make it very practical: let’s say you work at a social media company and your boss asks you to create a model to predict who will become friends, so you can feed those recommendations back to the website and serve them to users. You are tasked with creating a model that predicts, once a day for all users, who is likely to connect to whom. [Read More]

Rectangling (Social) Network Data

Preparing data for link prediction

Posted on November 25, 2020 (Last modified on November 28, 2020) | 17 minutes | 3440 words | Roel M. Hogervorst

In this tutorial I will show you how we go from network data to a rectangular format that is suited for machine learning. Many things in the world are graphs (networks). For instance: real-life friendships, business interactions, links between websites and (digital) social networks. I find graphs (the formal name for networks) fascinating, and because I am also interested in machine learning and data engineering, the question naturally becomes: How do I get (social) network data into a rectangular structure for ML? [Read More]

rectangling networkdata tidygraph data:fb-pages-food dplyr readr ggraph igraph ggplot2 tutorial beginner reshape2

Munging and reordering Polarsteps data

Turning nested lists into a data.frame with purrr

Posted on April 23, 2020 (Last modified on November 9, 2022) | 10 minutes | 1937 words | Roel M. Hogervorst

This post is about how to extract data from a JSON file, turn it into a tibble, and do some work with the result. I’m working with a download of personal data from Polarsteps. I spent a month in New Zealand, birthplace of R and home to Hobbits. I logged my travel using the Polarsteps application. The app allows you to upload pictures and write stories about your travels. [Read More]

jsonlite dplyr purrr rectangling

Gosset part 2: small sample statistics

Scientific brewing at scale

Posted on October 11, 2019 (Last modified on November 9, 2022) | 14 minutes | 2804 words | Roel M. Hogervorst

Simulation was the key to achieving world beer dominance. This post is an explainer about the small sample experiments performed by William S. Gosset. It contains some R code that simulates his simulations and the resulting determination of the ideal sample size for inference. If you brew your own beer, or if you want to know how many samples you need to say something useful about your data, this post is for you. [Read More]

gosset t-distribution simulation tidyverse tibble dplyr

Quick post - detect and fix this ggplot2 antipattern

Posted on March 7, 2019 (Last modified on November 9, 2022) | 6 minutes | 1168 words | Roel M. Hogervorst

Recently one of my coworkers showed me a ggplot, and although it is not wrong, it is also not ideal. Here is the TL;DR: whenever you find yourself adding multiple geom_* layers to show different groups, reshape your data. In software engineering there are things called antipatterns: ways of programming that lead you into potential trouble. This is one of them. I’m not saying it is incorrect, but it might lead you into trouble. [Read More]

ggplot2 magrittr tidyverse dplyr tidyr data:mtcars antipattern quickthoughts

Graphing My Daily Phone Use

Posted on January 28, 2019 (Last modified on November 9, 2022) | 3 minutes | 527 words | Roel M. Hogervorst

How many times do I look at my phone? I set up a small program on my phone to count the screen activations and log them to a file. In this post I show what went wrong and how to plot the results. The program counts, every day, how many times I use my phone (to be specific, it counts the times the screen has been activated). [Read More]

tasker phone ggplot2 readr dplyr

Cleaning up and combining data, a dataset for practice

Posted on March 12, 2018 (Last modified on November 9, 2022) | 3 minutes | 564 words | Roel M. Hogervorst

TL;DR: I created an open dataset for the explicit practice of data munging. Feel free to use it in assignments, but do mention where you got it from (CC-BY-4.0). Also, unicorns are awesome. Find the dataset at: https://github.com/RMHogervorst/unicorns_on_unicycles. At work I was working with two Excel files that were slightly different but could be combined into one dataset. This is very typical of the day-to-day cleaning operations that analysts and data scientists do (statisticians too). [Read More]

dirty data munging dplyr readxl unicorns unicycles exercise

Add abbreviations to your rmarkdown doc

Posted on January 24, 2018 (Last modified on November 9, 2022) | 2 minutes | 238 words | Roel M. Hogervorst

Today a small tip for when you write rmarkdown documents: add a chunk on top with abbreviations. In the first chunks you set the options and load the packages. Next, create the abbreviations; you don’t have to care about the ordering, just put them down as you realize you are creating them. The first step makes a data frame (a tibble, built row-wise), and the second step orders them. tribble( ~Abbreviation, ~Explanation, "CIA", "Central Intelligence Agency", "dplyr", "data. [Read More]

rmarkdown tibble dplyr

Where to live in the Netherlands based on temperature XKCD style

Posted on November 20, 2017 (Last modified on November 9, 2022) | 5 minutes | 1041 words | Roel M. Hogervorst

After seeing a plot of the best places to live in Spain and the USA based on the weather, I had to chime in and do the same thing for the Netherlands. The idea is simple: determine where you want to live based on your temperature preferences. This post explains how to make the plot; to see where I got the data and what procedures I followed, look at https://github. [Read More]

XKCD weather humidex dplyr ggplot2 readr Netherlands

Generate text using Markov Chains (sort of)

Posted on January 21, 2017 (Last modified on November 9, 2022) | 6 minutes | 1267 words | Roel M. Hogervorst

Inspired by the hilarious podcast The Greatest Generation, I have worked again with all the lines from all the episode scripts of TNG. Today I will make a clunky bot (although it does nothing and is absolutely not useful) that talks like Captain Picard. I actually wanted to use a Markov Chain to generate text. A Markov Chain has a specific property: it doesn’t care what happened before; it only looks at the probabilities of moving from the current state to a next state. [Read More]

Markov TNG dplyr tidytext bot