Test for Tags in Dagster

How to enforce a style in your organization.

Test for Tags in Dagster
Dagster assets can be labelled with owners, tags, kinds, and metadata. This is super useful, but if you rely on everyone remembering the convention in every merge request, it will eventually slip through. You can use pytest to enforce it instead. You can find the file in the GitHub repo linked below, but here it is in steps. import pytest from dagster import AssetSpec, AssetsDefinition # Import all the assets from your project as one list. [Read More]

Dagster: all the Ways you can Differentiate Assets

tags, kinds, metadata, and more

Dagster: all the Ways you can Differentiate Assets
When you have more than 10 assets in Dagster, you might want to be able to quickly identify them. Here are all the ways (I think) you can differentiate assets. Naming convention: this might not be for everyone, but with a strong naming convention you can easily identify an asset. You could use a schema like <type>_<source>__<additional_context> (the dbt docs have excellent naming suggestions). Prefixes: it is possible to add a prefix to an asset (I don’t really like this), which turns the key into something like prefixname/assetname. Groups: you can group multiple related assets together. [Read More]

Dagster: Integrating Jobs with Assets and Vice Versa.

Today I learned (TIL) you can actually run jobs based on assets and vice versa.

Assets VS Jobs

In Dagster you have assets, jobs, ops, sensors, and schedules. I have been using Dagster for a few years, since before assets were introduced. Assets are awesome: I don’t care that much about the data engineering processes themselves, I care about the results! And that is what assets are focused on. And because you can let assets depend on each other, you can set up processes without hacks such as checking whether a job has finished or not. But some processes do not deliver a product at the end: setting up permissions in a database, correcting known mistakes, some variants of triggering webhooks, etc. In those cases you want to use ‘ops’ and ‘jobs’, the traditional way of working.

But what if you have several processes, where some are traditional jobs/ops and some are assets? And somehow your job depends on the asset? For example: run a cleaning process after a table is created. I had the feeling that you had to rewrite the second job into an asset, even if that was not semantically correct. But you don’t have to!

Assets AND jobs

You can materialize an asset, hook up an asset sensor that senses the materialization, and that sensor can then start a job.

Here is some example code to make it a bit more clear:

  • asset_1 for example creates a table
  • asset_1_sensor waits for a Materialization event and triggers
  • job_a
# pseudo code, misses all the important things
from dagster import AssetKey, RunRequest, asset, asset_sensor, job, op

@asset
def asset_1():
    """An asset that you care about."""
    # create the table here
    return

@op
def cleanup():
    """The actual work, with no data product at the end."""
    pass

@job
def job_a():
    """A job, with no data product as end product."""
    cleanup()

@asset_sensor(asset_key=AssetKey("asset_1"), job=job_a)
def asset_1_sensor(context, asset_event):
    """Fires job_a whenever asset_1 is materialized."""
    yield RunRequest()


The other way around works too: you can have a job that materializes an asset, and assets that depend on it further downstream.

  • job_b does some work and reports an AssetMaterialization
  • asset_2 is the (external) asset that job_b reports
  • asset_3 depends on asset_2
# pseudo code, misses all the important things
from dagster import AssetMaterialization, asset, job, op

@op
def do_work(context):
    """Does the work and reports that asset_2 now exists."""
    # some work
    context.log_event(
        AssetMaterialization(asset_key="asset_2")
    )

@job
def job_b():
    """A job that materializes an asset."""
    do_work()

@asset(deps=["asset_2"])
def asset_3():
    """Downstream of the externally materialized asset_2."""
    # some work
    pass

This is super useful, but there are some caveats:

  • When you make everything assets, you can instruct Dagster to refresh all assets in the chain, but it does not ‘know’ how to start a traditional job. So this sort of breaks the chain.
  • An asset created by a job through a materialization event is seen as an ‘external asset’, so Dagster ‘believes’ it is not under its control. You can build downstream from an external asset, but you cannot couple ‘real’ assets to it as dependencies.

So, if you have many jobs that all need to run after each other, you had better rewrite them as assets; but if you have a few incidental jobs without a lot of downstream dependencies, you can chain them up to assets.

Notes

how I write tests for dagster

unit-testing data engineering

It is vital that your data pipelines work as intended, but it is also easy to make mistakes. That is why we write tests. Testing in Airflow was a fucking pain. The best you could do was create a complete deployment in a set of containers and test your work in there. Or you could create Python packages, test those, and hope they would work in the larger Airflow system. In large deployments you could not add new dependencies without breaking other stuff, so you had to either be creative with the Python/Airflow standard library or run your work outside of Airflow. [Read More]

Evolution of Our Dagster File Organization

File structures should make your work easier

Evolution of Our Dagster File Organization
Whenever you try to do intelligent data engineering tasks (refreshing tables in order, running Python processes, ingesting and outputting data), you need a scheduler. Airflow is the best known of these beasts, but I have a fondness for Dagster. Dagster focuses on the results of computations (tables, trained model artefacts) and not on the process itself, and this works really well for data science solutions. Dagster also makes it quite easy to swap implementations: you can read from disk while testing or developing, and read from a database in production. [Read More]

Using Grist as Part of your Data Engineering Pipeline with Dagster

Human-in-the-loop workflows

Using Grist as Part of your Data Engineering Pipeline with Dagster
I haven’t tested this yet, but I see a very powerful combination of Dagster assets and Grist. First, something about spreadsheets. Dagster and Grist? Grist is something like Google Sheets, but more powerful, open source, and self-hostable. Dagster is an orchestrator for data; if you are familiar with Airflow, it is like that, but in my opinion way better for most of my work. If you don’t know Airflow or Dagster, this post is not very relevant for you. [Read More]

Planning Meals in an Overly Complicated Way

Creating an over-engineered technical solution for a first world household issue

Cooking: for some a chore, for some absolute joy. I’m somewhere in the middle. But over the years I’ve learned that I need to plan my meals. If I plan a week of meals in advance, we can do groceries for the entire week in one go, and by thinking about your meals in advance you can vary them for nutritional value. I used to have a very strict diet to prevent stomach aches, and planning and cooking those meals was annoying, but eating the same things is very boring. [Read More]

Introducing the 'Smoll Data Stack'

The Small, Minimal, Open, Low effort, Low power (SMOLL) data stack is a pun with ambitions to grow into something larger, and more educational. I wanted a cheap platform to work on improving my data engineering skills, so I re-purposed some hardware for this project: a Raspberry Pi 3 with Ubuntu, and a NAS that I’ve installed a Postgres database on. What people call the ‘modern data stack’ is usually a cloud data warehouse such as Snowflake, BigQuery, or Redshift (in my opinion a data warehouse is something that holds all the data in table-like format, and your transformations are done with SQL). [Read More]

How I Set Up Dagster in a Company

In the past few months I set up a Dagster deployment within a Kubernetes cluster. This was a lot of fun because I learned a lot, and I’d like to document some of the things we did so I’ll remember them later. Dagster is a scheduler/orchestrator/workflow manager (I’ve seen all those words but I’m not sure what the differences are). When you need to get data from one place to another, do complex data operations that need to happen in a certain order, or have many, many tasks to run, you might want to use such a thing. [Read More]