unionml.dataset.Dataset#

class unionml.dataset.Dataset(name='dataset', *, features=None, targets=None, test_size=0.2, shuffle=True, random_state=12345)#

Initialize a UnionML Dataset.

The term UnionML Dataset refers to the specification of data used to train a model object from features and targets (see unionml.model.Model for more details) or to generate predictions from one based on some features. This specification is implemented by the user via the functional entrypoints, e.g. unionml.dataset.Dataset.reader().

By default, the UnionML Dataset knows how to handle pandas.DataFrame objects, meaning that the only function that needs to be implemented is unionml.dataset.Dataset.reader() (see the sketch after the parameter list below). To add support for other data structures, the user needs to implement the rest of the functional entrypoints.

Parameters:
  • name (str) – name of the dataset.

  • features (Optional[List[str]]) – a list of string keys used to access features from the data structure. The type of this data structure is determined by the output of the unionml.dataset.Dataset.reader() by default, but if unionml.dataset.Dataset.loader() is implemented then the output type of the latter function is taken.

  • targets (Optional[List[str]]) – a list of string keys used to access targets data. The type of this data is determined in the same way as the features argument.

  • test_size (float) – the fraction of the dataset to hold out as the test set.

  • shuffle (bool) – if True, shuffles the dataset before dataset splitting.

  • random_state (int) – random state used for data shuffling.
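
For example, a minimal pandas-compatible dataset needs only a constructor call and a reader function. This is a sketch; the dataset name, column names, and CSV URL are illustrative placeholders.

import pandas as pd

from unionml import Dataset

dataset = Dataset(
    name="my_dataset",                    # hypothetical dataset name
    features=["feature_1", "feature_2"],  # hypothetical feature columns
    targets=["target"],                   # hypothetical target column
    test_size=0.2,
    shuffle=True,
    random_state=12345,
)

@dataset.reader
def reader(url: str) -> pd.DataFrame:
    # fetch the raw dataset from an external source
    return pd.read_csv(url)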

Methods

dataset_task

Create a Flyte task for getting the dataset using the reader function.

feature_loader

Register an optional feature loader that loads data from some serialized format into raw features.

feature_transformer

Register an optional feature transformer that performs pre-processing on features before prediction.

from_sqlalchemy_task

Converts a sqlalchemy task to a dataset.

from_sqlite_task

Converts a sqlite task to a dataset.

get_data

Get training data from its raw form to its model-ready form.

get_features

Get feature data from its raw form to its model-ready form.

loader

Register an optional loader function for loading data into memory for model training.

parser

Register an optional parser function that produces a tuple of features and targets.

reader

Register a reader function for getting data from some external source.

splitter

Register an optional splitter function that partitions data into training and test sets.

Attributes

dataset_datatype

Get the output type of the reader or a user-defined loader.

dataset_datatype_source

Get the source of the dataset type: the reader or a user-defined loader.

feature_type

Get the type returned by the feature_transformer, falling back to the parser.

loader_kwargs_type

Dynamically create a dataclass for loader kwargs.

parser_kwargs

The keyword arguments to be forwarded to the parser function.

parser_kwargs_type

Dynamically create a dataclass for parser kwargs.

parser_return_types

Get an iterable of types produced by the parser.

reader_input_types

Get the input parameters of the reader.

splitter_kwargs

The keyword arguments to be forwarded to the splitter function.

splitter_kwargs_type

Dynamically create a dataclass for splitter kwargs.

property dataset_datatype: Dict[str, Type]#

Get the output type of the reader or a user-defined loader.

The type from the loader takes precedence.

property dataset_datatype_source: ReaderReturnTypeSource#

Get the source of the dataset type: the reader or a user-defined loader.

The type from the loader takes precedence.

dataset_task()#

Create a Flyte task for getting the dataset using the reader function.

feature_loader(fn)#

Register an optional feature loader that loads data from some serialized format into raw features.

This function handles prediction cases in two contexts:

  1. When the unionml predict cli command is invoked with the --features flag.

  2. When the FastAPI app /predict/ endpoint is invoked with features passed in as JSON or a string encoding.

In both cases, it should return the data structure needed for model training.
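
A minimal sketch for the default pandas case, assuming features arrive as a JSON file of records (the file format here is an assumption, not a library default):

from pathlib import Path

import pandas as pd

@dataset.feature_loader
def feature_loader(features: Path) -> pd.DataFrame:
    # decode a JSON file of records into the raw feature DataFrame
    return pd.read_json(features, orient="records")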

feature_transformer(fn)#

Register an optional feature transformer that performs pre-processing on features before prediction.

This function handles prediction cases in three contexts:

  1. When the unionml predict cli command is invoked with the --features flag.

  2. When the FastAPI app /predict/ endpoint is invoked with features passed in as JSON or a string encoding.

  3. When the model.predict or model.remote_predict functions are invoked.
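
For example, a hypothetical transformer that standardizes numeric feature columns before they reach the model:

import pandas as pd

@dataset.feature_transformer
def feature_transformer(features: pd.DataFrame) -> pd.DataFrame:
    # z-score normalize each feature column before prediction
    return (features - features.mean()) / features.std()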

property feature_type: Type#

Get the type returned by the feature_transformer, falling back to the parser.

The fallback behavior occurs if the user didn’t define a feature_transformer function.

classmethod from_sqlalchemy_task(task, *args, **kwargs)#

Converts a sqlalchemy task to a dataset.

This class method creates a UnionML Dataset that uses the sqlalchemy task as its reader function.

Return type:

Dataset
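
A sketch using the SQLAlchemyTask from the flytekitplugins-sqlalchemy package; the task name, query, and connection URI are placeholders:

from flytekit import kwtypes
from flytekit.types.schema import FlyteSchema
from flytekitplugins.sqlalchemy import SQLAlchemyConfig, SQLAlchemyTask

from unionml import Dataset

sqlalchemy_task = SQLAlchemyTask(
    name="get_dataset",  # hypothetical task name
    query_template="SELECT * FROM my_table LIMIT {{ .inputs.limit }}",
    inputs=kwtypes(limit=int),
    output_schema_type=FlyteSchema,
    task_config=SQLAlchemyConfig(uri="postgresql://user:password@host:5432/db"),  # placeholder URI
)

dataset = Dataset.from_sqlalchemy_task(sqlalchemy_task)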

classmethod from_sqlite_task(task, *args, **kwargs)#

Converts a sqlite task to a dataset.

This class method creates a UnionML Dataset that uses the sqlite task as its reader function.

Return type:

Dataset
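
The pattern mirrors from_sqlalchemy_task(), here sketched with flytekit's built-in SQLite3Task; the query and database URI are placeholders:

from flytekit import kwtypes
from flytekit.extras.sqlite3.task import SQLite3Config, SQLite3Task
from flytekit.types.schema import FlyteSchema

from unionml import Dataset

sqlite_task = SQLite3Task(
    name="get_dataset",  # hypothetical task name
    query_template="SELECT * FROM my_table LIMIT {{ .inputs.limit }}",
    inputs=kwtypes(limit=int),
    output_schema_type=FlyteSchema,
    task_config=SQLite3Config(
        uri="https://example.com/data.db.zip",  # placeholder: remote sqlite database archive
        compressed=True,
    ),
)

dataset = Dataset.from_sqlite_task(sqlite_task)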

get_data(raw_data, loader_kwargs=None, splitter_kwargs=None, parser_kwargs=None)#

Get training data from its raw form to its model-ready form.

Parameters:
  • raw_data – raw data in the same form as the reader output.

  • loader_kwargs – optional keyword arguments forwarded to the loader function.

  • splitter_kwargs – optional keyword arguments forwarded to the splitter function.

  • parser_kwargs – optional keyword arguments forwarded to the parser function.

Return type:

Dict[str, Any]

This function uses the following registered functions to create parsed, split data:

  • unionml.dataset.Dataset.loader()

  • unionml.dataset.Dataset.splitter()

  • unionml.dataset.Dataset.parser()
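
A usage sketch, assuming the default pandas behavior where the returned dictionary holds parsed "train" and "test" splits (the URL is a placeholder):

raw_data = reader("https://example.com/data.csv")
data = dataset.get_data(raw_data)
train_features, train_targets = data["train"]
test_features, test_targets = data["test"]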

get_features(features)#

Get feature data from its raw form to its model-ready form.

This function uses the following registered functions to create model-ready features:

  • unionml.dataset.Dataset.parser()

  • unionml.dataset.Dataset.feature_transformer()
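
For example, in the default pandas case:

import pandas as pd

raw_features = pd.DataFrame({"feature_1": [1.0, 2.0], "feature_2": [3.0, 4.0]})
model_ready_features = dataset.get_features(raw_features)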

loader(fn)#

Register an optional loader function for loading data into memory for model training.

This function should take the output of the reader function and return the data structure needed for model training. If specified, the output type of this function takes precedence over that of the reader function, and the type signatures of splitter and parser should adhere to it.

By default this is simply a pass-through function that returns the output of the reader function.

Parameters:

fn – function to register
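
A hypothetical example: a loader for a reader that returns a file reference, materializing it as the DataFrame used for training:

import pandas as pd

from flytekit.types.file import FlyteFile

@dataset.loader
def loader(data: FlyteFile) -> pd.DataFrame:
    # download the file produced by the reader and load it into memory
    return pd.read_csv(data.download())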

property loader_kwargs_type#

Dynamically create a dataclass for loader kwargs.

parser(fn, feature_key=0)#

Register an optional parser function that produces a tuple of features and targets.

Parameters:
  • fn – function to register

  • feature_key (int) – the index of the features in the output of the parser function. By default, this assumes that the first element of the output contains the features.

The following is equivalent to the default implementation.

from typing import List, Optional, Tuple

import pandas as pd

Parsed = Tuple[pd.DataFrame, pd.DataFrame]

@dataset.parser
def parser(data: pd.DataFrame, features: Optional[List[str]], targets: List[str]) -> Parsed:
    # if no feature columns are given, use all non-target columns
    if not features:
        features = [col for col in data if col not in targets]
    return data[features], data[targets]

property parser_kwargs#

The keyword arguments to be forwarded to the parser function.

property parser_kwargs_type#

Dynamically create a dataclass for parser kwargs.

property parser_return_types: Tuple[Any, ...]#

Get an iterable of types produced by the parser.

reader(fn=None, **reader_task_kwargs)#

Register a reader function for getting data from some external source.

The signature of this function is flexible and dependent on the use case.

Parameters:

fn – function to register
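
For example, the reader's arguments define how data is fetched; the URL parameter and row limit below are illustrative:

import pandas as pd

@dataset.reader
def reader(url: str, limit: int = 100) -> pd.DataFrame:
    # the arguments defined here become the inputs of the dataset task
    return pd.read_csv(url).head(limit)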

property reader_input_types: Optional[List[Parameter]]#

Get the input parameters of the reader.

splitter(fn)#

Register an optional splitter function that partitions data into training and test sets.

Parameters:

fn – function to register

The following is equivalent to the default implementation.

from typing import Tuple

import pandas as pd

Splits = Tuple[pd.DataFrame, pd.DataFrame]

@dataset.splitter
def splitter(data: pd.DataFrame, test_size: float, shuffle: bool, random_state: int) -> Splits:
    # optionally shuffle rows before splitting
    if shuffle:
        data = data.sample(frac=1.0, random_state=random_state)
    # hold out the last `test_size` fraction of rows as the test set
    n = int(data.shape[0] * test_size)
    return data.iloc[:-n], data.iloc[-n:]

property splitter_kwargs#

The keyword arguments to be forwarded to the splitter function.

property splitter_kwargs_type#

Dynamically create a dataclass for splitter kwargs.