These levels were defined by the Software Carpentry people; I have modified them to this:

  • beginner: You have just started out in this topic. You do not yet know how things are supposed to work, and you do not have a mental model of this thing yet.
  • intermediate: You are a regular user of this software/tool/concept; you have a mental model, but it is not very sophisticated.
  • advanced: You have a sophisticated mental model of how things work, and you even know when the model breaks, when it does not match reality.

A (Semantic) Search Engine will not Make you Organized

Maybe this time it will work?

Ah, it is so alluring! With this search engine you will finally be able to find your organization’s documents! This vendor even has a large language model, so you can search on ‘meaning’, not only actual words! Alas, one million dollars later you still can’t find your shit. I think the main reason you can’t find shit in your company is that you don’t organize your documents. As a messy person I empathize, but you need to get your shit together. [Read More]

Test for Tags in Dagster

How to enforce a style in your organization.

Dagster assets can be labelled with owners, tags, kinds and metadata. This is super useful, but if you want to enforce a particular style on every merge request, you will mess up eventually. However, you can use pytest to enforce it. testing for components: You can find the file in the GitHub repo linked below, but here it is in steps. import pytest from dagster import AssetSpec, AssetsDefinition # Import all the assets from your project as one list. [Read More]

Dagster: all the Ways you can Differentiate Assets

tags, kinds, metadata, and more

When you have more than 10 assets in dagster, you want to be able to identify them quickly. Here are all the ways (I think) you can differentiate assets. naming convention: this might not be for everyone, but with a strong naming convention you can easily identify an asset. You could use a schema like <type>_<source>__<additional_context> (the dbt docs have excellent naming suggestions). prefixes: it is possible to add a prefix to an asset (I don’t really like this), which looks like groupingsname/assetname. groups: you can group multiple related assets together. [Read More]

Your Machine Learning Model is not the Product

I’m so sorry. Your precious AI model, with handcrafted, beautiful, perfect features and awesome hyperparameters, is not the product. Listen, it is awesome work, not a lot of people can do it, but a good ML model is not the end product1. I want to talk about value. In the jobs I’ve worked, the machine learning model was part of a larger system. And only when all the components come together do you create value. [Read More]

Dagster: Integrating Jobs with Assets and Vice Versa.

Today I learned (TIL) you can actually run jobs based on assets and vice versa.

Assets vs. Jobs

In dagster you have assets, jobs, ops, sensors and schedules. I have been using dagster for a few years, since before assets were introduced. Assets are awesome: I don’t care that much about the data engineering processes themselves, I care about the results! And that is what assets focus on. And because assets can depend on each other, you can set up processes without hacks such as checking whether a job has finished. But some processes do not deliver a product at the end: setting up permissions in a database, correcting known mistakes, some variants of triggering webhooks, etc. In those cases you want to use ‘ops’ and ‘jobs’, the traditional way of working.

But what if you have several processes, where some are traditional jobs/ops and some are assets, and your job depends on an asset? For example: run a cleaning process after a table is created. I had the feeling that you had to rewrite the second job into an asset, even if that was not semantically correct. But you don’t have to!

Assets AND jobs

You can materialize an asset, hook up an asset sensor that detects the materialization, and have that sensor start a job.

Here is some example code to make it a bit more clear:

  • asset_1 for example creates a table
  • asset_1_sensor waits for a Materialization event and triggers
  • job_a
# pseudo code, misses all the important things
from dagster import AssetKey, RunRequest, asset, asset_sensor, job, op

@asset
def asset_1():
    """An asset that you care about."""
    # do some work, for example create a table
    ...

@op
def cleanup():
    """The actual work inside the job."""
    ...

@job
def job_a():
    """A job, with no data product as end product."""
    cleanup()

@asset_sensor(asset_key=AssetKey("asset_1"), job=job_a)
def asset_1_sensor(context, asset_event):
    # fires every time asset_1 is materialized
    yield RunRequest(run_key=context.cursor)

The other way around works too: you can have a job that materializes an asset, and assets further downstream that depend on it.

  • job_b does some work and emits a materialization event for asset_2
  • asset_2 is the asset that job_b reports as materialized
  • asset_3 depends on asset_2
# pseudo code, misses all the important things
from dagster import AssetMaterialization, OpExecutionContext, asset, job, op

@op
def do_work(context: OpExecutionContext):
    """Does some work, then reports that asset_2 now exists."""
    # some work
    context.log_event(AssetMaterialization(asset_key="asset_2"))

@job
def job_b():
    """A job that materializes an asset."""
    do_work()

@asset(deps=["asset_2"])
def asset_3():
    # some work
    pass

This is super useful, but there are some caveats:

  • when you make everything assets, you can instruct dagster to refresh all the assets in a chain, but it does not ‘know’ how to start a traditional job, so a job in the middle sort of breaks the chain.
  • an asset that is only created by a job emitting a materialization event is seen as an ‘external asset’, so dagster ‘believes’ it is not under its control. You can go downstream from an external asset, but you cannot give it ‘real’ assets as dependencies.
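For the external-asset caveat there is a partial remedy: you can declare the externally materialized asset explicitly, so dagster at least shows it in the asset lineage graph. A minimal sketch in the same pseudo-code spirit as above, assuming a recent dagster version where an AssetSpec can be passed to Definitions directly (older versions use external_assets_from_specs instead):

```
# pseudo code, misses all the important things
from dagster import AssetSpec, Definitions, asset

@asset(deps=["asset_2"])
def asset_3():
    # some work
    pass

# declare asset_2 as an external asset: dagster now shows it in the
# lineage graph, even though it does not control its materialization
defs = Definitions(assets=[AssetSpec("asset_2"), asset_3])
```

This does not remove the limitation itself, but it makes the dependency visible instead of leaving asset_2 as an untracked key.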

So, if you have many jobs that all need to run after each other, you are better off rewriting them as assets; but if you have a few incidental jobs without many downstream dependencies, you can chain them to assets this way.


Just enough kubernetes to be dangerous

As a data scientist who wants to achieve production results, one of your best options is to make your work available in kubernetes, because kubernetes runs on all clouds and many organizations use it. Make your prediction API available in kubernetes and your organization can ‘just’ plug it into their systems. Many data scientists don’t know anything about docker, let alone kubernetes and its main tool helm. I think you should learn and practice just enough helm to be dangerous1. [Read More]

Search tools

local, website, logs or something else?

If you want search, you often get elastic, but there are many options, depending on what kind of search you want. Here are some ways of looking at it. I came across this post on mastodon: https://mastodon.social/@mhoye/112422822807191436 “I’m on here looking for text indexers and everything is ‘lightning fast exoscale terafloops that scales to enterprise quantawarbles with polytopplic performanations’ and it would be great if this industry could breathe into a bag until it remembers that one person with one computer is a constituency that matters.” [Read More]

zettlr to hugo

I already know hugo!

This is a technical walkthrough of how I turn zettlr markdown files into a hugo website. This is an experiment and not yet finished, but I want to write down my thought process in the hope that it works for you too. I was looking for a way to publish my zettelkasten to a website that only I can see. My solution: any self-published, local website will do, as long as I put it behind a tailscale network; then I can view the website from my devices everywhere (through tailscale). [Read More]

Zettlr to mkdocs

Let's use python, I already know that

This is a technical walkthrough of how I turn zettlr markdown files into a mkdocs website. This is an experiment and not yet finished, but I want to write down my thought process in the hope that it works for you too. I was looking for a way to publish my zettelkasten to a website that only I can see. My solution: any self-published, local website will do, as long as I put it behind a tailscale network; then I can view the website from my devices everywhere (through tailscale). [Read More]

Private Personal Knowledge Management

But globally accessible to me alone

Atomic ideas, connected: I write many blog posts on the basis of the notes in my personal knowledge system. I’m using a digital zettelkasten method (you pull knowledge apart into atomic components (zettels), write those components down, and connect them in ways that make sense for you). This allows me to make creative connections between ideas and concepts. living apart, together: Sometimes at work I want to look up something that I know is in my zettelkasten. [Read More]