Dagster: Integrating Jobs with Assets and Vice Versa

Today I learned (TIL) you can actually run jobs based on assets and vice versa.

Assets VS jobs

In dagster you have assets, jobs, ops, sensors and schedules. I have been using dagster for a few years, since before assets were introduced. Assets are awesome: I don’t care that much about the data engineering processes themselves, I care about the results! And that is exactly what assets focus on. And because assets can depend on each other, you can set up processes without hacks such as polling to check whether a job has finished. But some processes do not deliver a product at the end: setting up permissions in a database, correcting known mistakes, some variants of triggering webhooks, etc. In those cases you want to use ‘ops’ and ‘jobs’, the traditional way of working.

But what if you have several processes, where some are traditional jobs/ops and some are assets, and your job depends on an asset? For example: run a cleaning process after a table is created. I had the feeling that you had to rewrite the job into an asset, even if that was not semantically correct. But you don’t have to!

Assets AND jobs

You can materialize an asset, hook up an asset sensor that notices the materialization, and have the sensor start a job.

Here is some example code to make it a bit clearer:

  • asset_1 for example creates a table
  • asset_1_sensor waits for a materialization event of asset_1 and triggers job_a
  • job_a does the follow-up work
# pseudo code, misses all the important things
from dagster import AssetKey, RunRequest, asset, asset_sensor, job, op

@asset
def asset_1():
    """An asset that you care about."""
    # something
    return

@op
def clean_up():
    """The actual work of the job."""
    pass

@job
def job_a():
    """A job, with no data product as end product."""
    clean_up()

@asset_sensor(asset_key=AssetKey("asset_1"), job=job_a)
def asset_1_sensor(context, asset_event):
    # asset_event is the materialization that woke the sensor up
    yield RunRequest(run_key=context.cursor)

The other way around works too: you can have a job that materializes an asset, and assets that depend on it further downstream.

  • job_b materializes an asset
  • asset_2 is the asset it reports
  • asset_3 depends on asset_2
# pseudo code, misses all the important things
from dagster import AssetMaterialization, OpExecutionContext, asset, job, op

@op
def work_op(context: OpExecutionContext):
    """Do the work, then report that asset_2 was (re)created."""
    # some work

    context.log_event(
        AssetMaterialization(asset_key="asset_2")
    )

@job
def job_b():
    """A job that materializes an asset."""
    work_op()

@asset(deps=["asset_2"])
def asset_3():
    # some work
    pass

This is super useful, but there are some caveats:

  • When you make everything assets, you can instruct dagster to refresh all assets in the chain, but it does not ‘know’ how to start a traditional job. So a job in the middle sort of breaks the chain.
  • An asset that is only reported by a job through a materialization event is seen as an ‘external asset’, so dagster ‘believes’ it is not under its control. You can hang downstream assets off an external asset, but you cannot make it depend on ‘real’ assets.

So, if you have many jobs that all need to run after each other, you had better rewrite them into assets; but if you have a few incidental jobs without many downstream dependencies, you can chain them up to assets like this.

This is a technical walkthrough of how I turn zettlr markdown files into a hugo website. This is an experiment and not yet finished, but I want to write down my thoughtprocess in the hope that it works for you too. I was looking for a way to publish my zettelkasten to a website that only I can see. my solution is: any self published, local website will do, as long as I put it behind a tailscale network, then I can view the website with my devices everywhere (through tailscale). [Read More]