unionml.dataset.Dataset#
- class unionml.dataset.Dataset(name='dataset', *, features=None, targets=None, test_size=0.2, shuffle=True, random_state=12345)#
Initialize a UnionML Dataset.
The term UnionML Dataset refers to the specification of data used to train a model object from features and targets (see unionml.model.Model for more details) or to generate predictions from one based on some features. This specification is implemented by the user via the functional entrypoints, e.g. unionml.dataset.Dataset.reader().
By default the UnionML Dataset knows how to handle pandas.DataFrame objects automatically, meaning that the only function that needs to be implemented is unionml.dataset.Dataset.reader(). To add support for other data structures, the user needs to implement the rest of the functional entrypoints.
- Parameters:
- name (str) – name of the dataset.
- features (Optional[List[str]]) – a list of string keys used to access features from the data structure. The type of this data structure is determined by the output of unionml.dataset.Dataset.reader() by default, but if unionml.dataset.Dataset.loader() is implemented then the output type of the latter function is taken.
- targets (Optional[List[str]]) – a list of string keys used to access targets data. The type of this data is determined in the same way as the features argument.
- test_size (float) – the fraction of the dataset to split out as the test set.
- shuffle (bool) – if True, shuffles the dataset before splitting.
- random_state (int) – random state used for data shuffling.
Methods
- dataset_task() – Create a Flyte task for getting the dataset using the reader function.
- feature_loader(fn) – Register an optional feature loader that loads data from some serialized format into raw features.
- feature_transformer(fn) – Register an optional feature transformer that performs pre-processing on features before prediction.
- from_sqlalchemy_task(task, *args, **kwargs) – Converts a sqlalchemy task to a dataset.
- from_sqlite_task(task, *args, **kwargs) – Converts a sqlite task to a dataset.
- get_data(raw_data, ...) – Get training data from its raw form to its model-ready form.
- get_features(features) – Get feature data from its raw form to its model-ready form.
- loader(fn) – Register an optional loader function for loading data into memory for model training.
- parser(fn, feature_key=0) – Register an optional parser function that produces a tuple of features and targets.
- reader(fn=None, **reader_task_kwargs) – Register a reader function for getting data from some external source.
- splitter(fn) – Register an optional splitter function that partitions data into training and test sets.
Attributes
- dataset_datatype – Get the output type of the reader or a user-defined loader.
- dataset_datatype_source – Get the output type of the reader or a user-defined loader.
- feature_type – Get the type returned by the feature_transformer, falling back to the parser.
- loader_kwargs_type – Dynamically create a dataclass for loader kwargs.
- parser_kwargs – The keyword arguments to be forwarded to the parser function.
- parser_kwargs_type – Dynamically create a dataclass for parser kwargs.
- Get an iterable of types produced by the parser.
- Get the input parameters of the reader.
- splitter_kwargs – The keyword arguments to be forwarded to the splitter function.
- splitter_kwargs_type – Dynamically create a dataclass for splitter kwargs.
- property dataset_datatype: Dict[str, Type]#
Get the output type of the reader or a user-defined loader. The type from the loader takes precedence.
- property dataset_datatype_source: ReaderReturnTypeSource#
Get the output type of the reader or a user-defined loader. The type from the loader takes precedence.
- Return type:
ReaderReturnTypeSource
- dataset_task()#
Create a Flyte task for getting the dataset using the reader function.
- feature_loader(fn)#
Register an optional feature loader that loads data from some serialized format into raw features.
This function handles prediction cases in two contexts:
- When the unionml predict cli command is invoked with the --features flag.
- When the FastAPI app /predict/ endpoint is invoked with features passed in as JSON or string encoding.
In both cases, it should return the data structure needed for model training.
- feature_transformer(fn)#
Register an optional feature transformer that performs pre-processing on features before prediction.
This function handles prediction cases in three contexts:
- When the unionml predict cli command is invoked with the --features flag.
- When the FastAPI app /predict/ endpoint is invoked with features passed in as JSON or string encoding.
- When the model.predict or model.remote_predict functions are invoked.
- property feature_type: Type#
Get the type returned by the feature_transformer, falling back to the parser. The fallback behavior occurs if the user didn't define a feature_transformer function.
- Return type:
Type
- classmethod from_sqlalchemy_task(task, *args, **kwargs)#
Converts a sqlalchemy task to a dataset.
This class method creates a UnionML Dataset that uses the sqlalchemy task as its reader function.
- Return type:
- classmethod from_sqlite_task(task, *args, **kwargs)#
Converts a sqlite task to a dataset.
This class method creates a UnionML Dataset that uses the sqlite task as its reader function.
- Return type:
- get_data(raw_data, loader_kwargs=None, splitter_kwargs=None, parser_kwargs=None)#
Get training data from its raw form to its model-ready form.
- Parameters:
raw_data – Raw data in the same form as the reader output.
This function uses the registered loader, splitter, and parser functions to create parsed, split data.
- get_features(features)#
Get feature data from its raw form to its model-ready form.
This function uses the registered feature-handling functions, e.g. the feature_loader and feature_transformer, to create model-ready features.
- loader(fn)#
Register an optional loader function for loading data into memory for model training.
This function should take the output of the reader function and return the data structure needed for model training. If specified, the output type of this function takes precedence over that of the reader function, and the type signatures of splitter and parser should adhere to it. By default this is simply a pass-through function that returns the output of the reader function.
- Parameters:
fn – function to register
- property loader_kwargs_type#
Dynamically create a dataclass for loader kwargs.
- parser(fn, feature_key=0)#
Register an optional parser function that produces a tuple of features and targets.
- Parameters:
fn – function to register
feature_key (
int
) – the index of the features in the output of the parser function. By default, this assumes that the first element of the output contains the features.
The following is equivalent to the default implementation.
```python
from typing import List, Optional, Tuple

import pandas as pd

Parsed = Tuple[pd.DataFrame, pd.DataFrame]

@dataset.parser
def parser(data: pd.DataFrame, features: Optional[List[str]], targets: List[str]) -> Parsed:
    if not features:
        features = [col for col in data if col not in targets]
    return data[features], data[targets]
```
- property parser_kwargs#
The keyword arguments to be forwarded to the parser function.
- property parser_kwargs_type#
Dynamically create a dataclass for parser kwargs.
- reader(fn=None, **reader_task_kwargs)#
Register a reader function for getting data from some external source.
The signature of this function is flexible and dependent on the use case.
- Parameters:
fn – function to register
- splitter(fn)#
Register an optional splitter function that partitions data into training and test sets.
- Parameters:
fn – function to register
The following is equivalent to the default implementation.
```python
from typing import Tuple

import pandas as pd

Splits = Tuple[pd.DataFrame, pd.DataFrame]

@dataset.splitter
def splitter(data: pd.DataFrame, test_size: float, shuffle: bool, random_state: int) -> Splits:
    if shuffle:
        data = data.sample(frac=1.0, random_state=random_state)
    n = int(data.shape[0] * test_size)
    return data.iloc[:-n], data.iloc[-n:]
```
- property splitter_kwargs#
The keyword arguments to be forwarded to the splitter function.
- property splitter_kwargs_type#
Dynamically create a dataclass for splitter kwargs.