Distributing data science products

Where or what is production? What does it mean when someone asks you to bring a data science product ‘into production’? Is your product already in production? Is it a magical place?

I think two questions are of importance:

  • does my ‘thing’ provide value?
  • is my work repeatable?

If the answer to both questions is yes, then your ‘thing’ is in production. The devil is, as always, in the details: how do you integrate your data science product into your company’s infrastructure?

I see three to five different end products that I would call ‘production’: three end goals, the last of which can be distributed in three different ways.

Data scientists produce results with a statistical model.

From there, we go one of three ways:

  • The end goal is a report (an analysis done in R Markdown or a Jupyter notebook). The end result is delivered in a knowledge repo, on an internal website, or as a PDF via email. You use statistical models to gain insight: an explanation.
  • The end goal is a prediction. You use a statistical model to create predictions. Data goes in, and predictions (in the form of data) go out.
  • The end goal is the trained model itself. You deliver a trained statistical model, to be used downstream by someone else. This is very popular with neural networks because they take forever to train; pre-trained word and image recognition models are examples. There are several ways to distribute that model; see the next section.

Options for distributing a trained model

  1. Distribute the parameters of the model alone. For instance, if you build a linear model, you can extract the parameters and turn them into an advanced SQL query with, for example, {tidypredict} or {modeldb}. I don’t know of any Python packages that can do this, but you could program it yourself. If your model is sufficiently simple, you can even print out the decision rules for practitioners, for instance with {FFTrees}.
  2. Return the trained model artefact: save your pickled Python model or .rds R model in a central location with some metadata and pull it in where necessary. TensorFlow models are distributed like this.
  3. Wrap your model and its environment in a Docker container, give it an API, and distribute that container. The entire model is hidden behind an interface that anyone can work with in any programming language. The big cloud vendors do it in a similar way (they call it AI for some reason).
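
To make option 1 a bit more concrete, here is a rough sketch in Python of turning the parameters of a linear model into a SQL expression. It is not how {tidypredict} or {modeldb} work internally, just an illustration of the idea. The function names (`linear_model_to_sql`, `predict_in_database`) and the example table and parameters are made up for this post, and I use SQLite only because it ships with Python.

```python
import sqlite3


def linear_model_to_sql(intercept, coefficients):
    """Render a fitted linear model as a SQL expression.

    `coefficients` maps column names to fitted weights (hypothetical
    values here; in practice you would extract them from your model).
    """
    terms = [repr(float(intercept))]
    for column, weight in coefficients.items():
        # Quote the column name as a SQL identifier, escaping any quotes.
        quoted = '"' + column.replace('"', '""') + '"'
        terms.append(f"({float(weight)!r} * {quoted})")
    return " + ".join(terms)


def predict_in_database(connection, table, intercept, coefficients):
    """Score every row of `table` inside the database and return the predictions."""
    expression = linear_model_to_sql(intercept, coefficients)
    safe_table = '"' + table.replace('"', '""') + '"'
    query = f"SELECT {expression} AS prediction FROM {safe_table}"
    return [row[0] for row in connection.execute(query)]


def demo():
    # Toy data and made-up parameters, purely for illustration.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE houses (area REAL, rooms REAL)")
    connection.executemany(
        "INSERT INTO houses VALUES (?, ?)", [(50.0, 2.0), (80.0, 3.0)]
    )
    print(linear_model_to_sql(10, {"area": 2, "rooms": 5}))
    print(predict_in_database(connection, "houses", 10, {"area": 2, "rooms": 5}))


if __name__ == "__main__":
    demo()
```

The nice thing about this route is that whoever consumes the model only needs the database: there is no Python or R runtime to deploy, and the ‘model’ is just a piece of text you can put under version control.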

So what should I choose?

Talk to your stakeholders from start to finish. Plan for production from the start of your proof of concept. You already know which of the options is required: is your end goal an explanation or a prediction? If your end product is a prediction, will you predict in batches, or create an image that can be called through a standard API? Figure these things out as early as possible, so that your project has the best chance of becoming a successful product and not one of the many failed proofs of concept.

Good luck!