blog - Roel's R-tefacts

Zettlr to mkdocs

Let's use python, I already know that

Posted on May 1, 2024 (Last modified on April 30, 2024) | 4 minutes | 840 words | Roel M. Hogervorst

This is a technical walkthrough of how I turn Zettlr markdown files into a mkdocs website. This is an experiment and not yet finished, but I want to write down my thought process in the hope that it works for you too. I was looking for a way to publish my zettelkasten to a website that only I can see. My solution: any self-published, local website will do, as long as I put it behind a Tailscale network; then I can view the website from my devices anywhere (through Tailscale). [Read More]

Private Personal Knowledge Management

But globally accessible to me alone

Posted on April 30, 2024 | 3 minutes | 609 words | Roel M. Hogervorst

Atomic ideas, connected

I write many blog posts on the basis of my notes in my personal knowledge system. I’m using a digital zettelkasten method: you pull apart knowledge into atomic components (zettels) and write those components down, connecting them in ways that make sense for you. This allows me to make creative connections between ideas and concepts.

Living apart, together

Sometimes at work I want to look up something that I know is in my zettelkasten. [Read More]

automation bibtex PKM citations zettelkasten

Entity resolution for data scientists

or data matching, or data deduplication or record linkage

Posted on April 24, 2024 (Last modified on May 22, 2024) | 12 minutes | 2384 words | Roel M. Hogervorst

I have a problem. Others have it too: it is a problem of duplication. I’m trying to track the books I read in Bookwyrm so I can talk about them online. But there are so many duplicates! How do we know if Soren Kierkegaard, Søren Kierkegaard, and Sören Kierkegaard are the same person? This is an example of entity resolution. It is also called deduplication, record linkage, and data matching. We want to compare entities from different datasets and make a confident claim about whether they match or not. [Read More]
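To make the Kierkegaard example concrete, here is a minimal sketch of one common first step in entity resolution: normalizing names before comparing them. This is an illustration under my own assumptions, not necessarily the approach the post takes; the function name `normalize_name` and the explicit `ø` mapping are hypothetical choices for this example.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Build a crude comparison key: no accents, no case, single spaces."""
    # "ø" has no Unicode decomposition, so it is mapped by hand
    # (an assumption for this example; real data needs a longer table).
    name = name.translate(str.maketrans({"ø": "o", "Ø": "O"}))
    # NFKD splits "ö" into "o" plus a combining diaeresis ...
    decomposed = unicodedata.normalize("NFKD", name)
    # ... and the combining marks are then dropped.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() removes case differences; split/join collapses whitespace.
    return " ".join(stripped.casefold().split())


names = ["Soren Kierkegaard", "Søren Kierkegaard", "Sören Kierkegaard"]
keys = {normalize_name(n) for n in names}
print(keys)
```

All three spellings collapse to the same key here. Normalization like this only catches spelling variants; typos, reordered names, and missing fields need fuzzier comparison on top of it.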

data_engineering data_science entity_resolution

Adaptive Plasticity and Life History Theory

April Cools post

Posted on April 1, 2024 | 5 minutes | 856 words | Roel M. Hogervorst

Happy April 1st! This post is part of April Cools Club: an April 1st effort to publish genuine essays on unexpected topics. I want to tell you about the fascinating topic of adaptive plasticity and life history theory. I haven’t read anything about it since 2014, but the ideas have kept a place in my head (lived there rent free? a weird expression). This is also a free day for me, so I’m going to put minimal effort into writing about this topic: I am going to write without consulting even Wikipedia. [Read More]

april-cools biology

ChatGPT in (the Core of) your Product is a Bad Idea

Foundational models are inherently risky.

Posted on March 19, 2024 (Last modified on August 1, 2024) | 4 minutes | 826 words | Roel M. Hogervorst

Google ~bard~ Gemini, Claude, or ChatGPT seem to be able to do many things. They have easy APIs and many plugins. The price is lower than seems possible. And yet, integrating these things into your product is really risky. Here is why:

Problems with foundational models

These “AIs” are built on foundational models. They are trained on massive amounts of text data, and finally fine-tuned for specific tasks. We don’t know what data was used for training. [Read More]

ai-risks antipattern pseudoprofoundBS genAI

Pytorch on an AMD gpu (frame.work 13)

Posted on March 12, 2024 | 1 minute | 196 words | Roel M. Hogervorst

I have a frame.work laptop. It is really nice! It looks awesome and is easily repairable. I chose an AMD model, which has an integrated GPU: the AMD Ryzen 7 7840U. You can actually use this GPU with PyTorch! But you need to perform a few steps; I write them down here for future use. (I’m using Ubuntu on this device.) The steps: (1) allocate more VRAM to the GPU with a BIOS setting (go into the BIOS and change the GPU setting to gaming mode or something, see this link), (2) start a virtual environment in your project, (3) install the right versions of the PyTorch packages; go to https://pytorch. [Read More]
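Once steps like these are done, a quick check from Python tells you whether PyTorch can see the GPU. This is a small sketch and not part of the original steps; it assumes the installed build is a ROCm build, which reuses the `torch.cuda` API for AMD GPUs. The helper name `describe_gpu` is made up for this example.

```python
def describe_gpu() -> str:
    """Report whether PyTorch can see a GPU, without crashing if it cannot."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    # ROCm builds expose AMD GPUs through the torch.cuda namespace.
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no GPU is visible"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip = getattr(torch.version, "hip", None)
    backend = f"ROCm/HIP {hip}" if hip else "CUDA"
    return f"GPU visible via {backend}: {torch.cuda.get_device_name(0)}"


print(describe_gpu())
```

Run it inside the virtual environment from step 2, so you test the packages you just installed and not a system-wide PyTorch.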

hardware pytorch

The art (and science) of feature engineering

combining best practices from science, and engineering

Posted on March 1, 2024 | 5 minutes | 1010 words | Roel M. Hogervorst

Data scientists, in general, do not just throw data into a model. They use feature engineering: transforming input data to make it easy for the chosen machine learning algorithm to pick up the subtleties in the data. Data scientists do this so the model can predict outcomes better. In the image below you see a transformation of data into numeric values with meaning. In this article I’ll discuss why we still need feature engineering (FE) in the age of large language models, and what some best practices are. [Read More]
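As a small illustration of "numeric values with meaning", the sketch below turns a raw timestamp and a price into features a model can use more easily. The function name `engineer_features`, the column names, and the chosen transformations are my own assumptions for this example; they are not taken from the article.

```python
import math
from datetime import datetime


def engineer_features(timestamp: str, price: float) -> dict[str, float]:
    """Turn one raw record into numeric features."""
    moment = datetime.fromisoformat(timestamp)
    # Encode the hour on a circle so 23:00 and 00:00 end up close together.
    hour_angle = 2 * math.pi * moment.hour / 24
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        # weekday() is 0 for Monday, so 5 and 6 are Saturday and Sunday.
        "is_weekend": float(moment.weekday() >= 5),
        # log1p compresses a long-tailed value and is safe at price == 0.
        "log_price": math.log1p(price),
    }


features = engineer_features("2024-03-02T18:30:00", 99.0)
print(features)
```

None of these features add new information; they only present what is already in the record in a shape that many algorithms pick up more easily.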

data_science programming reproducability

How I write tests for Dagster

unit-testing data engineering

Posted on February 27, 2024 (Last modified on November 19, 2024) | 4 minutes | 813 words | Roel M. Hogervorst

It is vital that your data pipelines work as intended, but it is also easy to make mistakes. That is why we write tests. Testing in Airflow was a fucking pain. The best you could do was create a complete deployment in a set of containers and test your work in there. Or you could create Python packages, test those, and hope they would work in the larger Airflow system. In large deployments you could not add new dependencies without breaking other stuff, so you had to either be creative with the Python/Airflow standard library or run your work outside of Airflow. [Read More]

Dagster testing data_engineering

Evolution of Our Dagster File Organization

File structures should make your work easier

Posted on February 24, 2024 (Last modified on February 25, 2024) | 5 minutes | 978 words | Roel M. Hogervorst

Whenever you try to do intelligent data engineering tasks (refreshing tables in order, running Python processes, ingesting and outputting data), you need a scheduler. Airflow is the best known of these beasts, but I have a fondness for Dagster. Dagster focuses on the results of computations (tables, trained model artefacts) and not on the process itself; this works really well for data science solutions. Dagster also makes it quite easy to swap implementations: you can read from disk while testing or developing, and read from a database in production. [Read More]

Dagster pipelines

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

Posted on January 28, 2024 (Last modified on February 25, 2024) | 5 minutes | 859 words | Roel M. Hogervorst

I haven’t tested this yet, but I see a very powerful combination with Dagster assets and Grist. First, something about spreadsheets. Dagster and Grist? Grist is something like Google Sheets, but more powerful, open source, and self-hostable. Dagster is an orchestrator for data; if you are familiar with Airflow it is like that, but in my opinion way better for most of my work. If you don’t know Airflow or Dagster, this post is not very relevant for you. [Read More]

advanced Dagster data_engineering data_science grist quickthoughts spreadsheets