unionml.dataset.Dataset#
- class unionml.dataset.Dataset(name='dataset', *, features=None, targets=None, test_size=0.2, shuffle=True, random_state=12345)#
Initialize a UnionML Dataset.
The term UnionML Dataset refers to the specification of data used to train a model object from features and targets (see unionml.model.Model for more details) or to generate predictions from one based on some features. This specification is implemented by the user via the functional entrypoints, e.g. unionml.dataset.Dataset.reader().
By default the UnionML Dataset knows how to handle pandas.DataFrame objects automatically, meaning that the only function that needs to be implemented is unionml.dataset.Dataset.reader(). To add support for other data structures, the user needs to implement the rest of the functional entrypoints.
- Parameters:
- name (str) – name of the dataset.
- features (Optional[List[str]]) – a list of string keys used to access features from the data structure. The type of this data structure is determined by the output of unionml.dataset.Dataset.reader() by default, but if unionml.dataset.Dataset.loader() is implemented then the output type of the latter function is taken.
- targets (Optional[List[str]]) – a list of string keys used to access targets data. The type of this data is determined in the same way as the features argument.
- test_size (float) – the fraction of the dataset to split out as the test set.
- shuffle (bool) – if True, shuffles the dataset before splitting.
- random_state (int) – random state used for data shuffling.
Methods
- dataset_task() – Create a Flyte task for getting the dataset using the reader function.
- feature_loader(fn) – Register an optional feature loader that loads data from some serialized format into raw features.
- feature_transformer(fn) – Register an optional feature transformer that performs pre-processing on features before prediction.
- from_sqlalchemy_task(task, *args, **kwargs) – Converts a sqlalchemy task to a dataset.
- from_sqlite_task(task, *args, **kwargs) – Converts a sqlite task to a dataset.
- get_data(raw_data, ...) – Get training data from its raw form to its model-ready form.
- get_features(features) – Get feature data from its raw form to its model-ready form.
- loader(fn) – Register an optional loader function for loading data into memory for model training.
- parser(fn, feature_key=0) – Register an optional parser function that produces a tuple of features and targets.
- reader(fn=None, **reader_task_kwargs) – Register a reader function for getting data from some external source.
- splitter(fn) – Register an optional splitter function that partitions data into training and test sets.
Attributes
- dataset_datatype – Get the output type of the reader or a user-defined loader.
- dataset_datatype_source – Get the output type of the reader or a user-defined loader.
- feature_type – Get the type returned by the feature_transformer, falling back to the parser.
- loader_kwargs_type – Dynamically create a dataclass for loader kwargs.
- parser_kwargs – The keyword arguments to be forwarded to the parser function.
- parser_kwargs_type – Dynamically create a dataclass for parser kwargs.
- Get an iterable of types produced by the parser.
- Get the input parameters of the reader.
- splitter_kwargs – The keyword arguments to be forwarded to the splitter function.
- splitter_kwargs_type – Dynamically create a dataclass for splitter kwargs.
- property dataset_datatype: Dict[str, Type]#
Get the output type of the reader or a user-defined loader. The type from the loader takes precedence.
- property dataset_datatype_source: ReaderReturnTypeSource#
Get the output type of the reader or a user-defined loader. The type from the loader takes precedence.
- Return type:
ReaderReturnTypeSource
- dataset_task()#
Create a Flyte task for getting the dataset using the reader function.
- feature_loader(fn)#
Register an optional feature loader that loads data from some serialized format into raw features.
This function handles prediction cases in two contexts:
- When the unionml predict cli command is invoked with the --features flag.
- When the FastAPI app /predict/ endpoint is invoked with features passed in as JSON or string encoding.
In both cases, it should return the data structure needed for model training.
- feature_transformer(fn)#
Register an optional feature transformer that performs pre-processing on features before prediction.
This function handles prediction cases in three contexts:
- When the unionml predict cli command is invoked with the --features flag.
- When the FastAPI app /predict/ endpoint is invoked with features passed in as JSON or string encoding.
- When the model.predict or model.remote_predict functions are invoked.
- property feature_type: Type#
Get the type returned by the feature_transformer, falling back to the parser. The fallback behavior occurs if the user didn't define a feature_transformer function.
- Return type:
Type
- classmethod from_sqlalchemy_task(task, *args, **kwargs)#
Converts a sqlalchemy task to a dataset.
This class method creates a UnionML Dataset that uses the sqlalchemy task as its reader function.
- Return type:
- classmethod from_sqlite_task(task, *args, **kwargs)#
Converts a sqlite task to a dataset.
This class method creates a UnionML Dataset that uses the sqlite task as its reader function.
- Return type:
- get_data(raw_data, loader_kwargs=None, splitter_kwargs=None, parser_kwargs=None)#
Get training data from its raw form to its model-ready form.
- Parameters:
raw_data – Raw data in the same form as the reader output.
This function uses the registered loader, splitter, and parser functions to create parsed, split data.
- get_features(features)#
Get feature data from its raw form to its model-ready form.
This function uses the registered feature-handling functions, e.g. the feature_loader and feature_transformer, to create model-ready features.
- loader(fn)#
Register an optional loader function for loading data into memory for model training.
This function should take the output of the reader function and return the data structure needed for model training. If specified, the output type of this function takes precedence over that of the reader function, and the type signatures of splitter and parser should adhere to it. By default this is simply a pass-through function that returns the output of the reader function.
- Parameters:
fn – function to register
- property loader_kwargs_type#
Dynamically create a dataclass for loader kwargs.
- parser(fn, feature_key=0)#
Register an optional parser function that produces a tuple of features and targets.
- Parameters:
fn – function to register
feature_key (
int
) – the index of the features in the output of the parser function. By default, this assumes that the first element of the output contains the features.
The following is equivalent to the default implementation.
```python
from typing import List, Optional, Tuple

import pandas as pd

Parsed = Tuple[pd.DataFrame, pd.DataFrame]

@dataset.parser
def parser(data: pd.DataFrame, features: Optional[List[str]], targets: List[str]) -> Parsed:
    if not features:
        features = [col for col in data if col not in targets]
    return data[features], data[targets]
```
- property parser_kwargs#
The keyword arguments to be forwarded to the parser function.
- property parser_kwargs_type#
Dynamically create a dataclass for parser kwargs.
- reader(fn=None, **reader_task_kwargs)#
Register a reader function for getting data from some external source.
The signature of this function is flexible and dependent on the use case.
- Parameters:
fn – function to register
- splitter(fn)#
Register an optional splitter function that partitions data into training and test sets.
- Parameters:
fn – function to register
The following is equivalent to the default implementation.
```python
from typing import Tuple

import pandas as pd

Splits = Tuple[pd.DataFrame, pd.DataFrame]

@dataset.splitter
def splitter(data: pd.DataFrame, test_size: float, shuffle: bool, random_state: int) -> Splits:
    if shuffle:
        data = data.sample(frac=1.0, random_state=random_state)
    n = int(data.shape[0] * test_size)
    return data.iloc[:-n], data.iloc[-n:]
```
- property splitter_kwargs#
The keyword arguments to be forwarded to the splitter function.
- property splitter_kwargs_type#
Dynamically create a dataclass for splitter kwargs.