Dagster: all the Ways you can Differentiate Assets

tags, kinds, metadata, and more

When you have more than ten assets in Dagster, you might want to be able to quickly identify them. Here are all the ways (I think) you can differentiate assets.

  • naming convention: this might not be for everyone, but with a strong naming convention you can easily identify an asset. You could use a schema like <type>_<source>__<additional_context> (the dbt docs have excellent naming suggestions)
  • prefixes: it is possible to add a prefix to an asset (I don’t really like this); the key would then look like grouping_name/asset_name
  • groups: you can group multiple related assets together. [Read More]
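The naming schema above can be sketched as a small helper. A minimal illustration only; the type and source names (stg, stripe, and so on) are made up for the example:

```python
def asset_name(type_: str, source: str, additional_context: str = "") -> str:
    """Build an asset name following the <type>_<source>__<additional_context> schema."""
    name = f"{type_}_{source}"
    if additional_context:
        name += f"__{additional_context}"
    return name

print(asset_name("stg", "stripe", "payments"))  # stg_stripe__payments
print(asset_name("raw", "salesforce"))          # raw_salesforce
```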

Dagster: Integrating Jobs with Assets and Vice Versa.

Today I learned (TIL) you can actually run jobs based on assets and vice versa.

Assets vs. jobs

In Dagster you have assets, jobs, ops, sensors, and schedules. I have been using Dagster for a few years, since before assets were introduced. Assets are awesome: I don’t care that much about the data engineering processes themselves, I care about the results! And that is what assets are focused on. And because assets can depend on each other, you can set up processes without hacks such as checking whether a job is finished or not. But some processes do not deliver a product at the end: setting up permissions in a database, correcting known mistakes, some variants of triggering webhooks, etc. In those cases you want to use ‘ops’ and ‘jobs’, the traditional way of working.

But what if you have several processes, where some are traditional jobs/ops and some are assets, and your job depends on an asset? For example: run a cleaning process after a table is created. I had the feeling that you had to rewrite the job into an asset, even if that was not semantically correct. But you don’t have to!

Assets AND jobs

You can materialize an asset, hook up an asset sensor that detects the materialization, and let the sensor start a job.

Here is some example code to make it a bit more clear:

  • asset_1 creates a table, for example
  • asset_1_sensor waits for a materialization event of asset_1 and triggers job_a
  • job_a does the follow-up work
# simplified example; error handling and the real logic are omitted
from dagster import AssetKey, RunRequest, asset, asset_sensor, job, op

@asset
def asset_1():
    """An asset that you care about."""
    # create a table, for example
    ...

@op
def cleanup():
    """The actual work of the job."""
    ...

@job
def job_a():
    """A job, with no data product as end product."""
    cleanup()

@asset_sensor(asset_key=AssetKey("asset_1"), job=job_a)
def asset_1_sensor(context, asset_event):
    yield RunRequest(run_key=context.cursor)
	

The other way around works too: you can have a job that materializes an asset, and assets further down the chain that depend on it.

  • job_b does some work and materializes asset_2
  • asset_3 depends on asset_2
# simplified example; error handling and the real logic are omitted
from dagster import AssetMaterialization, OpExecutionContext, asset, job, op

@op
def create_asset_2(context: OpExecutionContext):
    """Do some work, then report that asset_2 was materialized."""
    # some work
    context.log_event(
        AssetMaterialization(asset_key="asset_2")
    )

@job
def job_b():
    """A job that materializes an asset."""
    create_asset_2()

@asset(deps=["asset_2"])
def asset_3():
    # some work
    ...

This is super useful, but there are some caveats:

  • when you make everything assets, you can instruct Dagster to refresh all assets in the chain, but it does not ‘know’ how to start a traditional job. So this sort of breaks the chain.
  • an asset created by a job through a materialization event is seen as an ‘external asset’, so Dagster ‘believes’ it is not under its control. You can build downstream from an external asset, but you cannot couple ‘real’ assets as its dependencies.

So, if you have many jobs that all need to run after each other, you are better off rewriting them into assets; but if you have a few incidental jobs without a lot of downstream dependencies, you can chain them up to assets.

Notes

OpenSanctions is an amazing example of entity resolution at scale

In one of my previous posts I talked about entity resolution and the role data science plays in it. I am a big fan of OpenSanctions, and their process (entirely open) is a beautiful example of entity resolution. OpenSanctions is an international database of persons and companies of political, criminal, or economic interest. It’s a combination of sanction lists, publicly available databases, and criminal information. Companies can use this information to check their customers, to prevent money laundering and sanctions evasion. [Read More]

Entity resolution for data scientists

or data matching, or data deduplication or record linkage

I have a problem. Others have it too; it is a problem of duplication. I’m trying to track the books I read in Bookwyrm so I can talk about them online. But there are so many duplicates! How do we know whether Soren Kierkegaard, Søren Kierkegaard, and Sören Kierkegaard are the same person? This is an example of entity resolution. It is also called deduplication, record linkage, and data matching. We want to compare entities from different datasets and make a confident claim about whether or not they match. [Read More]
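A first, naive step toward matching the three spellings of Kierkegaard is normalizing away diacritics and case. A minimal sketch using only the standard library; real entity resolution needs far more than this:

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Naively normalize a name: strip diacritics and fold case."""
    # NFKD splits characters like "ö" into "o" plus a combining diaeresis
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "ø" has no decomposition, so map it explicitly
    return stripped.casefold().replace("ø", "o")

names = ["Soren Kierkegaard", "Søren Kierkegaard", "Sören Kierkegaard"]
print({normalize_name(n) for n in names})  # {'soren kierkegaard'}
```

All three variants collapse to the same key, so a simple grouping step can then treat them as one candidate entity.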

How I write tests for Dagster

unit-testing data engineering

It is vital that your data pipelines work as intended, but it is also easy to make mistakes. That is why we write tests. Testing in Airflow was a fucking pain. The best you could do was create a complete deployment in a set of containers and test your work in there. Or create Python packages, test those, and hope they would work in the larger Airflow system. In large deployments you could not add new dependencies without breaking other stuff, so you had to either be creative with the Python/Airflow standard library or run your work outside of Airflow. [Read More]
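One way around that pain, in Dagster or any orchestrator, is to keep the transformation logic in plain functions and unit-test those directly, without spinning up the orchestrator at all. A minimal sketch; clean_rows is a made-up example function:

```python
def clean_rows(rows: list[dict]) -> list[dict]:
    """Drop rows without an id and strip whitespace from names."""
    return [
        {**row, "name": row["name"].strip()}
        for row in rows
        if row.get("id") is not None
    ]

def test_clean_rows():
    raw = [{"id": 1, "name": " Ada "}, {"id": None, "name": "ghost"}]
    assert clean_rows(raw) == [{"id": 1, "name": "Ada"}]

test_clean_rows()
```

The asset or op then becomes a thin wrapper around clean_rows, so the interesting logic is covered by fast tests and the orchestrator only handles scheduling and dependencies.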

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

I haven’t tested this yet, but I see a very powerful combination of Dagster assets and Grist. First something about spreadsheets. Dagster and Grist? Grist is something like Google Sheets, but more powerful, open source, and self-hostable. Dagster is an orchestrator for data; if you are familiar with Airflow it is like that, but in my opinion way better for most of my work. If you don’t know Airflow or Dagster, this post is not very relevant for you. [Read More]