Defining a Dataset#
In Initializing a UnionML App, we created a UnionML app project, which contains an app.py script. In this guide, we’ll learn how a Dataset is defined and how we can customize its behavior.
What’s a UnionML Dataset?#
A Dataset is one of the core parts of a UnionML app. You can think of it as a specification for a dataset’s source in addition to a set of common machine-learning-specific abstractions, which we’ll get into later in this guide.
First, let’s define a dataset:
from unionml import Dataset
dataset = Dataset(name="digits_dataset", test_size=0.2, shuffle=True, targets=["target"])
Note
In the above code snippet you might notice a few things:
We’re defining a Dataset with the name "digits_dataset".
The targets argument accepts a list of strings referring to the column names. By default unionml.Dataset understands pandas.DataFrame objects as datasets, but as we’ll see later this can be customized to accept data in any arbitrary format.
The test_size argument specifies what fraction of the dataset should be reserved as the hold-out test set for model evaluation.
The shuffle argument ensures that the data is shuffled before it’s split into training and test sets. The random_state argument, which is not set in this snippet, makes the shuffling deterministic.
Core Dataset Functions#
In this toy example, we’ll use the scikit-learn digits dataset.
Important
By default, the Dataset class understands how to work with pandas.DataFrame objects, so in this section we’ll assume that we’re working with one. If you would like built-in support for other data structures, please create an issue!
reader()#
When working with pandas.DataFrames, the only Dataset method you need to implement is the reader, which specifies how to get your training data. This is done by decorating a function with the dataset.reader decorator.
import pandas as pd
from sklearn.datasets import load_digits

@dataset.reader
def reader(sample_frac: float = 1.0, random_state: int = 12345) -> pd.DataFrame:
    data = load_digits(as_frame=True).frame
    return data.sample(frac=sample_frac, random_state=random_state)
Notice how we can define any arbitrary set of arguments. In this case, we can choose to sample the digits dataset to produce a subset of data.
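To see what the two reader arguments do, here is a minimal sketch that applies the same sample call to a small, made-up dataframe instead of the digits data. The toy dataframe and the sample_rows helper are hypothetical names used only for illustration; they are not part of UnionML.

```python
import pandas as pd

# Hypothetical stand-in for the digits dataframe: ten rows, one feature, one target.
toy = pd.DataFrame({"pixel_0": range(10), "target": [i % 2 for i in range(10)]})


def sample_rows(data: pd.DataFrame, sample_frac: float = 1.0, random_state: int = 12345) -> pd.DataFrame:
    # Same call as in the reader: draw a fraction of the rows with a fixed seed.
    return data.sample(frac=sample_frac, random_state=random_state)


half = sample_rows(toy, sample_frac=0.5)
half_again = sample_rows(toy, sample_frac=0.5)
full = sample_rows(toy)

print(len(half))                               # rows kept when sample_frac=0.5
print(half.equals(half_again))                 # same seed gives the same rows
print(sorted(full.index) == list(toy.index))   # frac=1.0 keeps every row, reordered
```

With the default sample_frac of 1.0 the reader returns the whole dataset in shuffled order, so exposing the argument only matters when you want a smaller subset.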
loader()#
The loader function should specify how to load the output of the reader function into memory.
Since the Dataset class knows how to handle pandas.DataFrames, defining the loader function is optional if you’re working with them.
However, suppose that we refactor our reader function so that it returns a parquet file. This is where the Flyte Type System comes in handy.
We can use FlyteFile as the output annotation of the reader like so:
import pandas as pd
from flytekit.types.file import FlyteFile
from sklearn.datasets import load_digits

@dataset.reader
def reader(sample_frac: float = 1.0, random_state: int = 12345) -> FlyteFile:
    data = load_digits(as_frame=True).frame
    # apply the same sampling as before so the arguments are not silently ignored
    data = data.sample(frac=sample_frac, random_state=random_state)
    output_path = "./digits.parquet"
    data.to_parquet(output_path)
    return FlyteFile(path=output_path)
Then to read the file back into memory, we specify our loader:
@dataset.loader
def loader(data: FlyteFile) -> pd.DataFrame:
    # parquet is a binary format, so open the file in binary mode
    with open(data, "rb") as f:
        return pd.read_parquet(f)
Why do we need two separate steps?
When using UnionML-supported data structures (such as pandas.DataFrames and other supported Flyte Types), UnionML automatically understands how to handle the serialization/deserialization across the data reading and model training functions.
For unrecognized types, UnionML will use Pickle Type as the fallback, which is not guaranteed to work across different versions of Python or different versions of your package dependencies.
With that context, there are two cases where you want to define both a reader and a loader function:
When the most natural way of storing a dataset is in files or a directory structure.
When you don’t want to use pickle as the data transfer format between data reading and model training.
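The two-step pattern can be sketched without Flyte at all. The example below is an illustration under stated assumptions: a plain pathlib.Path stands in for FlyteFile, CSV stands in for parquet so that no extra parquet engine is needed, and read_to_file and load_from_file are hypothetical names, not UnionML functions.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_to_file(output_dir: Path) -> Path:
    # "Reader" step: produce the dataset and store it as a file on disk.
    data = pd.DataFrame({"pixel_0": [0.0, 8.0, 16.0], "target": [0, 1, 2]})
    output_path = output_dir / "toy.csv"
    data.to_csv(output_path, index=False)
    return output_path


def load_from_file(path: Path) -> pd.DataFrame:
    # "Loader" step: bring the stored file back into memory as a dataframe.
    return pd.read_csv(path)


with tempfile.TemporaryDirectory() as tmp:
    stored = read_to_file(Path(tmp))
    loaded = load_from_file(stored)

print(loaded.shape)
print(list(loaded.columns))
```

Only a file reference crosses the boundary between the two steps, so the in-memory representation is chosen entirely by the loading step.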
splitter()#
The splitter function should specify how to split your data into train and test sets. When working with pandas.DataFrames you can supply the test_size, shuffle, and random_state arguments to the Dataset initializer to reserve a test_size fraction of your data as the test set.
If shuffle == True then the dataframe is shuffled before splitting using random_state as the random seed.
To implement your own splitting behavior, you can use the dataset.splitter decorator. The following example is roughly equivalent to the built-in behavior:
from typing import NamedTuple

Splits = NamedTuple("Splits", train=pd.DataFrame, test=pd.DataFrame)

@dataset.splitter
def splitter(data: pd.DataFrame, test_size: float, shuffle: bool, random_state: int) -> Splits:
    if shuffle:
        data = data.sample(frac=1.0, random_state=random_state)
    # index from the front so that an empty test set (n_test == 0) is handled correctly
    n_test = int(data.shape[0] * test_size)
    n_train = data.shape[0] - n_test
    return data.iloc[:n_train], data.iloc[n_train:]
Note
The splitter is expected to return an indexable type whose underlying type matches the output of reader. In this case, we return a NamedTuple of pd.DataFrames.
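As a quick check of what such a split produces, the sketch below runs the same logic as a plain function on a made-up ten-row dataframe. The split helper and the toy data are hypothetical and only mirror the behavior described here; the sketch does not call UnionML.

```python
from typing import Tuple

import pandas as pd


def split(
    data: pd.DataFrame, test_size: float, shuffle: bool, random_state: int
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    if shuffle:
        # Shuffle all rows deterministically before cutting the frame in two.
        data = data.sample(frac=1.0, random_state=random_state)
    n_test = int(data.shape[0] * test_size)
    n_train = data.shape[0] - n_test
    # The first n_train rows form the training set, the remainder the test set.
    return data.iloc[:n_train], data.iloc[n_train:]


toy = pd.DataFrame({"pixel_0": range(10), "target": [i % 2 for i in range(10)]})
train, test = split(toy, test_size=0.2, shuffle=True, random_state=42)

print(len(train), len(test))                     # sizes of the two partitions
print(set(train.index).isdisjoint(test.index))   # no row appears in both
```

The result is indexable by position (train first, test second), and both parts have the same type as the input dataframe.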
parser()#
Finally, the parser function specifies how to extract features and targets from your dataset.
By supplying the features and targets arguments for the Dataset initializer, you’re indicating which columns in the pandas.DataFrame are features and which are targets, respectively.
Note
If you only supply the targets argument, the Dataset assumes that the rest of the columns in the dataframe are features.
Similar to the dataset.splitter decorator, you can use the dataset.parser decorator to implement your own parser. The following example is roughly equivalent to the built-in behavior:
from typing import List, NamedTuple, Optional

Parsed = NamedTuple("Parsed", features=pd.DataFrame, targets=pd.DataFrame)

@dataset.parser
def parser(data: pd.DataFrame, features: Optional[List[str]], targets: List[str]) -> Parsed:
    if not features:
        # no explicit features: treat every non-target column as a feature
        features = [col for col in data if col not in targets]
    return data[features], data[targets]
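The effect of leaving features unset can be seen on a small example. The sketch below repeats the parsing logic as a plain function named parse and applies it to a made-up three-column dataframe; both are hypothetical and exist only for illustration.

```python
from typing import List, Optional, Tuple

import pandas as pd


def parse(
    data: pd.DataFrame, features: Optional[List[str]], targets: List[str]
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    if not features:
        # Fall back to every column that is not listed as a target.
        features = [col for col in data if col not in targets]
    return data[features], data[targets]


toy = pd.DataFrame({"pixel_0": [0, 8], "pixel_1": [16, 4], "target": [3, 7]})

# Only targets given: the remaining columns become the features.
X, y = parse(toy, features=None, targets=["target"])
# Explicit features: only the named columns are used.
X_sub, _ = parse(toy, features=["pixel_1"], targets=["target"])

print(list(X.columns), list(y.columns))
print(list(X_sub.columns))
```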
Dataset Functions for Prediction#
The following functions define behavior for prediction across multiple use cases.
feature_loader()#
Similar to the loader function, the feature_loader function handles loading data into memory from a file or from some raw data format.
The default feature loader is equivalent to the following:
import json
from typing import Any, Dict, List, Union
from pathlib import Path

RawFeatures = List[Dict[str, Any]]

@dataset.feature_loader
def feature_loader(features: Union[Path, RawFeatures]) -> pd.DataFrame:
    if isinstance(features, Path):
        # handle case where `features` input is a filepath
        with features.open() as f:
            features: RawFeatures = json.load(f)
    return pd.DataFrame(features)
Note that this function handles the case where the input is a file path or a list of dictionary records.
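Both input forms should lead to the same dataframe. The following sketch checks this with a plain function, load_features, that mirrors the logic above; the function name, the records, and the temporary features.json file are assumptions made for the example and are not part of UnionML.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union

import pandas as pd

RawFeatures = List[Dict[str, Any]]


def load_features(features: Union[Path, RawFeatures]) -> pd.DataFrame:
    if isinstance(features, Path):
        # File path input: read the JSON records from disk first.
        with features.open() as f:
            features = json.load(f)
    return pd.DataFrame(features)


records = [{"pixel_0": 0, "pixel_1": 16}, {"pixel_0": 8, "pixel_1": 4}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "features.json"
    path.write_text(json.dumps(records))
    from_file = load_features(path)

from_records = load_features(records)

print(from_file.equals(from_records))  # both inputs yield the same dataframe
```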
feature_transformer()#
The feature_transformer function handles additional processing steps performed on the output of feature_loader in case you want to do some stateless transforms, like normalizing the values of your feature data based on static parameters.
For example, suppose we received an image in the form of a dataframe, where pixel values are in the range 0 to 255. To normalize the data to be between 0 and 1, we’d specify a function like this:
@dataset.feature_transformer
def feature_transformer(data: pd.DataFrame) -> pd.DataFrame:
    return data / 255
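A short sanity check of this kind of stateless scaling, written as a plain function on made-up pixel values: the normalize helper and the MAX_PIXEL_VALUE constant are hypothetical, and the constant assumes 8-bit image data.

```python
import pandas as pd

MAX_PIXEL_VALUE = 255  # assumption: 8-bit pixel data with values from 0 to 255


def normalize(data: pd.DataFrame) -> pd.DataFrame:
    # Stateless transform: the divisor is a fixed constant, nothing is fit to the data.
    return data / MAX_PIXEL_VALUE


pixels = pd.DataFrame({"pixel_0": [0, 128, 255], "pixel_1": [255, 64, 0]})
scaled = normalize(pixels)

print(scaled.min().min(), scaled.max().max())  # smallest and largest scaled values
```

Because the divisor does not depend on the input, the same transform gives consistent results for a single record and for a full batch.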
Next#
Now that we’ve defined a Dataset, we need to Bind a Model and Dataset together to create our UnionML app.