MNIST: Digits Classification#
The MNIST dataset is considered the “hello world” dataset of machine learning. It consists of 70,000 small 28×28 pixel grayscale images of handwritten digits between 0 and 9, conventionally split into 60,000 training images and 10,000 test images.
In that same spirit, we’ll be making the “hello world” UnionML app using this dataset and a simple linear classifier with sklearn.
With this dataset, we’ll see just how easy it is to create a single-script UnionML app.
Note
This tutorial is adapted from this sklearn guide.
Setup and Importing Libraries#
First, let’s import our dependencies and create the UnionML Dataset and Model objects:
from typing import List, Union
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from unionml import Dataset, Model
dataset = Dataset(name="mnist_dataset", test_size=0.2, shuffle=True, targets=["class"])
model = Model(name="mnist_classifier", dataset=dataset)
Let’s break down the code cell above.
We first define a Dataset, which specifies the data that can be used for training and prediction. We also give it a few keyword options:

- test_size: indicates the fraction of the data to hold out for testing. In this case the dataset is split into a test set (20%) and a training set (80%) for evaluation.
- shuffle: randomly shuffles the data before splitting it into train and test sets.
- targets: a list of strings referring to the column names of the prediction targets in the dataset.
Then we define a Model, which specifies how to train the model, evaluate it, and generate predictions from it. Note that we bind the dataset we just defined to the model.
Caching Data#
For convenience, we cache the dataset so that MNIST loading is faster upon subsequent calls to the fetch_openml function:
from pathlib import Path
from joblib import Memory

# cache fetch_openml results on disk under ~/tmp so repeated calls are fast
memory = Memory(Path.home() / "tmp")
fetch_openml_cached = memory.cache(fetch_openml)
We do this so we don’t have to re-download the dataset every time we need to train a model.
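If you ever want to force a fresh download, for example because the cached copy has gone stale, joblib lets you wipe the cache. A minimal sketch using the memory object defined above:

# clear everything joblib cached under ~/tmp; the next call to
# fetch_openml_cached will re-download MNIST
memory.clear(warn=False)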
Define Core UnionML Functions#
The following code defines our core UnionML app functions:
@dataset.reader(cache=True, cache_version="1")
def reader() -> pd.DataFrame:
    dataset = fetch_openml_cached(
        "mnist_784",
        version=1,
        cache=True,
        as_frame=True,
    )
    # randomly sample a subset for faster training
    return dataset.frame.sample(1000, random_state=42)


@model.init
def init(hyperparameters: dict) -> Pipeline:
    estimator = Pipeline(
        [
            ("scaler", StandardScaler()),
            ("classifier", LogisticRegression()),
        ]
    )
    return estimator.set_params(**hyperparameters)


@model.trainer(cache=True, cache_version="1")
def trainer(
    estimator: Pipeline,
    features: pd.DataFrame,
    target: pd.DataFrame,
) -> Pipeline:
    return estimator.fit(features, target.squeeze())


@model.predictor
def predictor(
    estimator: Pipeline,
    features: pd.DataFrame,
) -> List[float]:
    return [float(x) for x in estimator.predict(features)]


@model.evaluator
def evaluator(
    estimator: Pipeline,
    features: pd.DataFrame,
    target: pd.DataFrame,
) -> float:
    return float(accuracy_score(target.squeeze(), estimator.predict(features)))
The Dataset and Model objects expose function decorators where we define the behavior of our machine learning app:

- reader(): register a function for getting data from some external source.
- init(): register a function for initializing a model object. This is equivalent to specifying a class or callable via the init kwarg in the Model constructor (see the sketch after this list).
- trainer(): register a function for training a model object.
- predictor(): register a function that generates predictions from a model object.
- evaluator(): register a function for evaluating a given model object.
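For illustration, here is what the constructor-based alternative to the init() decorator might look like. This is a sketch based on the equivalence noted in the init() bullet above; it assumes Model accepts a class or callable via the init kwarg and forwards hyperparameters to it, in which case the hyperparameter names would no longer need the classifier__ pipeline prefix:

# hypothetical alternative: skip @model.init and pass the estimator class
# directly; hyperparameters passed to model.train() are forwarded to it
model_alt = Model(
    name="mnist_classifier_alt",
    init=LogisticRegression,
    dataset=dataset,
)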
Training a Model Locally#
Then we can train our model locally:
estimator, metrics = model.train(
    hyperparameters={
        "classifier__penalty": "l2",
        "classifier__C": 0.1,
        "classifier__max_iter": 1000,
    }
)
print(estimator, metrics, sep="\n")
Note that we pass a dictionary of hyperparameters when we invoke model.train(), which, in this case, follows the sklearn conventions for specifying hyperparameters for sklearn Pipeline objects.
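To see which nested names a pipeline accepts, you can inspect it directly: sklearn routes "step__parameter" keys to the named pipeline step. A quick standalone illustration, reusing the sklearn classes imported in the setup cell:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)
# "<step name>__<parameter>" keys are routed to the corresponding step
pipe.set_params(classifier__C=0.1, classifier__max_iter=1000)
# list all valid hyperparameter names, e.g. "classifier__penalty"
print(sorted(pipe.get_params()))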
Serving on a Gradio Widget#
Finally, let’s create a gradio widget by simply using the model.predict() method in the gradio.Interface object.
Before we do this, however, we want to define a feature_loader() function to handle the raw input coming from the gradio widget:
import numpy as np

@dataset.feature_loader
def feature_loader(data: np.ndarray) -> pd.DataFrame:
    # flatten the sketchpad's 28x28 image into a single-row dataframe whose
    # columns (pixel1 ... pixel784) match the mnist_784 feature names
    return (
        pd.DataFrame(data.ravel())
        .transpose()
        .rename(columns=lambda x: f"pixel{x + 1}")
        .astype(float)
    )
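As a quick sanity check (not part of the app itself, and assuming the decorated function remains directly callable), we can pass a blank 28×28 array through the loader and confirm that the output matches the mnist_784 feature layout:

# a blank 28x28 "drawing" should become one row with 784 pixel columns
blank = np.zeros((28, 28))
features = feature_loader(blank)
assert features.shape == (1, 784)
assert list(features.columns[:2]) == ["pixel1", "pixel2"]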
We also need to handle the None input that arrives when we press the clear button on the widget; we do this with a lambda function:
import gradio as gr

gr.Interface(
    fn=lambda img: img if img is None else model.predict(img)[0],
    inputs="sketchpad",
    outputs="label",
    live=True,
    allow_flagging="never",
).launch()
You might notice that the model may not perform as well as you might expect…
welcome to the world of machine learning practice! To obtain a better model
given a fixed dataset, feel free to play around with the model hyperparameters,
or even switch up the model type/architecture that’s defined in the init
function.