datasets

The API for configuring datasets.

class Dataset(name: 'str', version: 'str', dependencies: 'Dependencies' = (), tasks: 'Tasks' = ())[source]

Bases: object

check_version(after_execution=())[source]
dependencies: Dependencies = ()

The first task(s) of this Dataset will be marked as downstream of each listed dependency. For a bare Task, a direct link is created; for a Dataset, links are created to all of its last tasks.

name: str

The name of the Dataset

register()[source]

Register dataset sources and targets in a single transaction. Only writes if sources or targets have changed. Creates the table if it doesn’t exist yet.

sources: DatasetSources

The sources used by the dataset. These can be tables, files, and URLs.

targets: DatasetTargets

The targets created by the dataset. These can be tables and files.

tasks: Tasks = ()

The tasks of this Dataset. A TaskGraph will automatically be converted to Tasks_.

update(session)[source]
version: str

The Dataset’s version. Can be anything from a simple semantic versioning string like “2.1.3” to a more complex string such as “2021-01-01.schleswig-holstein.0” for OpenStreetMap data. The latter encodes the Dataset’s date, region, and a sequential number for cases where the data changes without the date or region changing, for example due to implementation changes.
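As an illustration of the composite version string described above, the following minimal sketch splits it into its three parts. The helper `parse_composite_version` and the `ParsedVersion` type are hypothetical and not part of this module; the sketch assumes the layout `<date>.<region>.<sequence>` used in the example.

```python
from typing import NamedTuple


class ParsedVersion(NamedTuple):
    """Hypothetical container for the parts of a composite version."""

    date: str
    region: str
    sequence: int


def parse_composite_version(version: str) -> ParsedVersion:
    """Split "<date>.<region>.<sequence>" into its parts.

    Assumption: the date is everything before the first dot and the
    sequence number is everything after the last dot.
    """
    date, rest = version.split(".", 1)
    region, sequence = rest.rsplit(".", 1)
    return ParsedVersion(date, region, int(sequence))


parsed = parse_composite_version("2021-01-01.schleswig-holstein.0")
print(parsed.date, parsed.region, parsed.sequence)
```

A plain semantic version such as “2.1.3” does not follow this layout and would need to be handled separately.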

class DatasetSources(tables: 'Dict[str, str]' = <factory>, files: 'Dict[str, str]' = <factory>, urls: 'Dict[str, str]' = <factory>)[source]

Bases: object

empty()[source]
files: Dict[str, str]
classmethod from_dict(data)[source]
get_table_name(key: str) → str[source]

Returns the table name of the table identified by key.

get_table_schema(key: str) → str[source]

Returns the schema of the table identified by key.
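The two lookups above could work roughly as in the following sketch. It is a stand-in, not the real DatasetSources: the class name `SourcesSketch` is hypothetical, and it assumes that each value in `tables` is a string of the form `"schema.table"`, which the documentation does not state. The key and table names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SourcesSketch:
    """Hypothetical stand-in mirroring the `tables` attribute."""

    # Assumption: values look like "schema.table".
    tables: Dict[str, str] = field(default_factory=dict)

    def get_table_schema(self, key: str) -> str:
        # Everything before the first dot is treated as the schema.
        return self.tables[key].split(".", 1)[0]

    def get_table_name(self, key: str) -> str:
        # Everything after the first dot is treated as the table name.
        return self.tables[key].split(".", 1)[1]


sources = SourcesSketch(tables={"buildings": "openstreetmap.osm_buildings"})
print(sources.get_table_schema("buildings"))
print(sources.get_table_name("buildings"))
```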

tables: Dict[str, str]
to_dict()[source]
urls: Dict[str, str]
class DatasetTargets(tables: 'Dict[str, str]' = <factory>, files: 'Dict[str, str]' = <factory>)[source]

Bases: object

empty()[source]
files: Dict[str, str]
from_dict(data)[source]
get_table_name(key: str) → str[source]

Returns the table name of the table identified by key.

get_table_schema(key: str) → str[source]

Returns the schema of the table identified by key.

tables: Dict[str, str]
to_dict()[source]
Dependencies

A dataset can depend on other datasets or the tasks of other datasets.

alias of Iterable[Dataset | Callable[[], None] | BaseOperator]

class Model(**kwargs)[source]

Bases: Base

dependencies
epoch
id
name
scenarios
version
class SourcesTargetsModel(**kwargs)[source]

Bases: Base

name
sources
targets
Task

A Task is an Airflow Operator or any Callable taking no arguments and returning None. Callables will be converted to Operators by wrapping them in a PythonOperator and setting the task_id to the Callable’s __name__, with underscores replaced with hyphens. If the Callable’s __module__ attribute contains the string "egon.data.datasets.", the task_id is also prefixed with the module name, followed by a dot and with "egon.data.datasets." removed.

alias of Callable[[], None] | BaseOperator
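The naming rule for wrapped Callables can be sketched without Airflow as follows. The function `derive_task_id` is a hypothetical helper written only from the description above, not the library's implementation; in particular, it leaves the module part of the prefix unchanged, which is an assumption.

```python
from typing import Callable

DATASETS_PREFIX = "egon.data.datasets."


def derive_task_id(task: Callable[[], None]) -> str:
    """Hypothetical helper following the documented naming rule."""
    # Underscores in the function name become hyphens.
    task_id = task.__name__.replace("_", "-")
    module = task.__module__
    if DATASETS_PREFIX in module:
        # Prefix with the module name, minus "egon.data.datasets.".
        task_id = module.replace(DATASETS_PREFIX, "") + "." + task_id
    return task_id


def download_data() -> None:
    """Example task taking no arguments and returning None."""


plain_id = derive_task_id(download_data)

# Pretend the function lives in a dataset module (illustration only).
download_data.__module__ = "egon.data.datasets.osm"
prefixed_id = derive_task_id(download_data)

print(plain_id)
print(prefixed_id)
```

Under these assumptions the first call yields `download-data` and the second yields `osm.download-data`.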

TaskGraph

A graph of tasks is, in its simplest form, just a single node, i.e. a single Task. More complex graphs can be specified by nesting sets and tuples of TaskGraphs. A set of TaskGraphs means that they are unordered and can be executed in parallel. A tuple specifies an implicit ordering so a tuple of TaskGraphs will be executed sequentially in the given order.

alias of Callable[[], None] | BaseOperator | Set[TaskGraph] | Tuple[TaskGraph, …]
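To make the set and tuple semantics concrete, the sketch below computes the first and last tasks of a nested graph using plain functions as tasks. The helpers `first_tasks` and `last_tasks` are hypothetical and only follow the description above; how Tasks_ derives its `first` and `last` attributes internally is not shown in this documentation.

```python
def first_tasks(graph) -> set:
    """Return the tasks that can start first in a nested graph."""
    if isinstance(graph, (set, frozenset)):
        # Unordered: every member may start, so collect all of them.
        return set().union(*(first_tasks(g) for g in graph))
    if isinstance(graph, tuple):
        # Ordered: only the first element may start.
        return first_tasks(graph[0]) if graph else set()
    return {graph}  # A single task is a graph with one node.


def last_tasks(graph) -> set:
    """Return the tasks that finish last in a nested graph."""
    if isinstance(graph, (set, frozenset)):
        return set().union(*(last_tasks(g) for g in graph))
    if isinstance(graph, tuple):
        return last_tasks(graph[-1]) if graph else set()
    return {graph}


def download(): pass
def clean_a(): pass
def clean_b(): pass
def load(): pass


# download runs first, then both clean tasks in parallel, then load.
sequential = (download, {clean_a, clean_b}, load)
# Two independent branches: (download -> load) alongside clean_a.
parallel = {(download, load), clean_a}

print(first_tasks(sequential), last_tasks(sequential))
print(first_tasks(parallel), last_tasks(parallel))
```

Note that a Python set cannot contain another set, so in this sketch parallel groups hold single tasks or tuples.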

Tasks

A type alias to help specify that something can be an explicit Tasks_ object or a TaskGraph, i.e. something that can be converted to Tasks_.

alias of Tasks_ | Callable[[], None] | BaseOperator | Set[TaskGraph] | Tuple[TaskGraph, …]

class Tasks_(graph: 'TaskGraph')[source]

Bases: dict

first: Set[Callable[[], None] | BaseOperator]
graph: Callable[[], None] | BaseOperator | Set[TaskGraph] | Tuple[TaskGraph, ...] = ()
last: Set[Callable[[], None] | BaseOperator]
export_dataset_io_to_json(output_path: str = 'dataset_io_overview.json') → None[source]

Export all sources and targets of datasets to a JSON file.

Parameters:

output_path (str) – Path to the output JSON file.

load_sources_and_targets(name: str) → tuple[DatasetSources, DatasetTargets][source]

Load DatasetSources and DatasetTargets from dataset_sources_targets table.

Parameters:

name (str) – Name of the dataset.

Returns:

Tuple[DatasetSources, DatasetTargets]

prefix(o)[source]
setup()[source]

Create the database structure for storing dataset information.

wrapped_partial(func, *args, **kwargs)[source]

Like functools.partial(), but preserves the original function’s name and docstring. Also allows adding a postfix to the function’s name.
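The general technique can be sketched with the standard library alone. The function `partial_with_name` below is a hypothetical stand-in, not the source of wrapped_partial; the keyword name `postfix` in particular is an assumption, since the signature above only shows `*args` and `**kwargs`.

```python
import functools


def partial_with_name(func, *args, postfix=None, **kwargs):
    """Hypothetical sketch: a partial that keeps name and docstring."""
    partial_func = functools.partial(func, *args, **kwargs)
    # Copy __name__, __doc__ and related metadata from the original.
    functools.update_wrapper(partial_func, func)
    if postfix:
        # Optionally distinguish several partials of the same function.
        partial_func.__name__ = f"{func.__name__}{postfix}"
    return partial_func


def scale(value, factor):
    """Multiply value by factor."""
    return value * factor


double = partial_with_name(scale, factor=2, postfix="_double")
print(double.__name__, double.__doc__, double(21))
```

A distinct name per partial matters when the function's name is later used to derive an identifier, as described for Task above.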