Serving with FastAPI

In the Local Training and Prediction guide, we saw how to create a prediction server locally using a model that we trained locally. This is great for certain use cases, but it won’t scale to bigger data or models.

In this guide, we’ll learn how to create an online prediction server using a model trained on a Flyte cluster.


Before you start, follow the Deploying to Flyte guide to:

  1. Set up a local Flyte demo cluster.

  2. Deploy a UnionML app on it.

  3. Train a model on it.

Serving a Model from a Flyte Backend

Once we’ve trained a model, the Flyte backend effectively becomes a model registry that we can serve models from. Doing so is very similar to creating the prediction service from a model that we’ve trained locally:

from fastapi import FastAPI

# dataset and model definition

app = FastAPI()

model.serve(app, remote=True, model_version="latest")


The model_version argument is "latest" by default, but you can serve a different model by passing in the unique identifier of the Flyte execution that produced it.

To list these prior model versions, do:


Or you can use the UnionML CLI:

unionml list-model-versions app:model --limit 5

Then start the server:

unionml serve app:app --reload

Once the server’s started, you can use the Python requests library or any other HTTP library to get predictions from input features:

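import requests
from sklearn.datasets import load_digits

# generate predictions
# NOTE: the URL is an assumption (local server on port 8000 with a /predict endpoint); adjust it to your setup
prediction_response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"features": load_digits(as_frame=True).frame.sample(5, random_state=42).to_dict(orient="records")},
)
print(prediction_response.text)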
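Since any HTTP library works, the same request can also be sketched with only the Python standard library. This is a minimal sketch, not part of UnionML: the helper names `build_predict_request` and `get_predictions` are hypothetical, and the URL assumes a server listening locally on port 8000 with a `/predict` endpoint.

```python
import json
import urllib.request

# Assumption: the prediction server listens on this local address and
# exposes a /predict endpoint; adjust to match your own deployment.
PREDICT_URL = "http://127.0.0.1:8000/predict"


def build_predict_request(features, url=PREDICT_URL):
    """Build a JSON POST request carrying a list of feature records."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_predictions(features, url=PREDICT_URL, timeout=10.0):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_predict_request(features, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `get_predictions(records)` would be called with the same list of dictionaries that `to_dict(orient="records")` produces. Keeping request construction separate from sending makes the payload easy to inspect without a live server.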

And that’s it 🙌


Serving online predictions on FastAPI gives you full control over the server infrastructure, but what if you want to stand up a serverless online prediction service? We’ll see how we can achieve this in the Serving with AWS Lambda guide.