datasets

The API for configuring datasets.

class Dataset(name: str, version: str, dependencies: Dependencies = (), tasks: Tasks = ())

Bases: object

check_version(after_execution=())

dependencies: Dependencies = (): The first task(s) of this Dataset will be marked as downstream of every listed dependency. For a bare Task, a direct link is created, whereas for a Dataset the link is made to all of its last tasks.

name: str: The name of the Dataset

register(): Register dataset sources and targets in a single transaction. Only writes if sources or targets have changed. Creates the table if it doesn’t exist yet.

sources: DatasetSources: The sources used by the dataset. These can be tables, files, and URLs.

targets: DatasetTargets: The targets created by the dataset. These can be tables and files.

tasks: Tasks = (): The tasks of this Dataset. A TaskGraph will automatically be converted to Tasks_.

update(session)

version: str: The Dataset’s version. Can be anything from a simple semantic versioning string like “2.1.3”, to a more complex string, like for example “2021-01-01.schleswig-holstein.0” for OpenStreetMap data. Note that the latter encodes the Dataset’s date, region and a sequential number in case the data changes without the date or region changing, for example due to implementation changes.
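As an illustration of the more complex version format, the following sketch splits such a string into its three parts. The helper split_version and the VersionParts tuple are hypothetical and not part of this module; the sketch assumes that the parts are separated by dots and that the date is ISO formatted.

```python
from datetime import date
from typing import NamedTuple


class VersionParts(NamedTuple):
    """Hypothetical container for illustration; not part of this module."""

    day: date
    region: str
    sequence: int


def split_version(version: str) -> VersionParts:
    """Split "<date>.<region>.<number>" into typed parts (illustrative only)."""
    # The date uses hyphens, so splitting on dots yields exactly three parts.
    day, region, sequence = version.split(".")
    return VersionParts(date.fromisoformat(day), region, int(sequence))


parts = split_version("2021-01-01.schleswig-holstein.0")
print(parts.day, parts.region, parts.sequence)
```

Nothing in the Dataset API requires this layout; a version is only a string, so a plain semantic version such as “2.1.3” is equally valid and would not fit this helper.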

class DatasetSources(tables: Dict[str, str] = <factory>, files: Dict[str, str] = <factory>, urls: Dict[str, str] = <factory>)

Bases: object

empty()

files: Dict[str, str]

classmethod from_dict(data)

get_table_name(key: str) → str: Returns the table name of the table identified by key.

get_table_schema(key: str) → str: Returns the schema of the table identified by key.

tables: Dict[str, str]

to_dict()

urls: Dict[str, str]
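The two table accessors can be pictured with a small stand-in class. This is a sketch under the assumption that each value in tables is a string of the form "schema.table"; the documentation above does not state this. SourcesSketch and the table name used below are hypothetical, not the actual DatasetSources implementation or a real table.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SourcesSketch:
    """Hypothetical stand-in for DatasetSources (assumes "schema.table" values)."""

    tables: Dict[str, str] = field(default_factory=dict)
    files: Dict[str, str] = field(default_factory=dict)
    urls: Dict[str, str] = field(default_factory=dict)

    def get_table_schema(self, key: str) -> str:
        # Everything before the first dot is treated as the schema.
        return self.tables[key].split(".", 1)[0]

    def get_table_name(self, key: str) -> str:
        # Everything after the first dot is treated as the table name.
        return self.tables[key].split(".", 1)[1]


# Made-up example entry, used only to exercise the accessors.
sources = SourcesSketch(tables={"buildings": "openstreetmap.osm_buildings"})
schema = sources.get_table_schema("buildings")
table = sources.get_table_name("buildings")
print(schema, table)
```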

class DatasetTargets(tables: Dict[str, str] = <factory>, files: Dict[str, str] = <factory>)

Bases: object

empty()

files: Dict[str, str]

from_dict(data)

get_table_name(key: str) → str: Returns the table name of the table identified by key.

get_table_schema(key: str) → str: Returns the schema of the table identified by key.

tables: Dict[str, str]

to_dict()

Dependencies

A dataset can depend on other datasets or the tasks of other datasets.

alias of Iterable[Dataset | Callable[[], None] | BaseOperator]

class Model(**kwargs)

Bases: Base

dependencies

epoch

id

name

scenarios

version

class SourcesTargetsModel(**kwargs)

Bases: Base

name

sources

targets

Task

A Task is an Airflow Operator or any Callable taking no arguments and returning None. Callables will be converted to Operators by wrapping them in a PythonOperator and setting the task_id to the Callable’s __name__, with underscores replaced by hyphens. If the Callable’s __module__ attribute contains the string "egon.data.datasets.", the task_id is additionally prefixed with the module name, with "egon.data.datasets." removed, followed by a dot.

alias of Callable[[], None] | BaseOperator
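The naming rule for wrapped Callables can be sketched without Airflow. The helper derive_task_id below is hypothetical and only mirrors the description above; it is not the function this module uses, and the module name "egon.data.datasets.osm" is assigned by hand purely for the demonstration.

```python
from typing import Callable

DATASETS_PACKAGE = "egon.data.datasets."


def derive_task_id(task: Callable[[], None]) -> str:
    """Hypothetical helper mirroring the task_id rule described above."""
    task_id = task.__name__.replace("_", "-")
    module = task.__module__ or ""
    if DATASETS_PACKAGE in module:
        # Prefix with the module name minus the package prefix, plus a dot.
        task_id = f"{module.replace(DATASETS_PACKAGE, '')}.{task_id}"
    return task_id


def download_data() -> None:
    """Placeholder task body."""


def clean_up() -> None:
    """Placeholder task defined outside the datasets package."""


# Pretend the first function lives in a dataset module (demo only).
download_data.__module__ = "egon.data.datasets.osm"

osm_task_id = derive_task_id(download_data)  # "osm.download-data"
plain_task_id = derive_task_id(clean_up)     # "clean-up"
print(osm_task_id, plain_task_id)
```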

TaskGraph

A graph of tasks is, in its simplest form, just a single node, i.e. a single Task. More complex graphs can be specified by nesting sets and tuples of TaskGraphs. A set of TaskGraphs means that they are unordered and can be executed in parallel. A tuple specifies an implicit ordering so a tuple of TaskGraphs will be executed sequentially in the given order.

alias of Callable[[], None] | BaseOperator | Set[TaskGraph] | Tuple[TaskGraph, …]
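A nested TaskGraph can be written with plain Python literals. The sketch below uses placeholder functions as Tasks and two hypothetical helpers, first_tasks and last_tasks, that compute which tasks start and finish a graph under the set/tuple semantics described above. They are an illustration only and are not claimed to be how Tasks_ computes its first and last attributes.

```python
def download() -> None: ...
def extract() -> None: ...
def transform() -> None: ...
def clean() -> None: ...
def load() -> None: ...


# download runs first; then clean runs in parallel with the chain
# extract -> transform; load runs once both branches have finished.
graph = (download, {clean, (extract, transform)}, load)


def first_tasks(graph):
    """Tasks without a predecessor inside the graph (illustrative helper)."""
    if isinstance(graph, tuple):
        return first_tasks(graph[0]) if graph else set()
    if isinstance(graph, (set, frozenset)):
        return {task for sub in graph for task in first_tasks(sub)}
    return {graph}  # a single Task


def last_tasks(graph):
    """Tasks without a successor inside the graph (illustrative helper)."""
    if isinstance(graph, tuple):
        return last_tasks(graph[-1]) if graph else set()
    if isinstance(graph, (set, frozenset)):
        return {task for sub in graph for task in last_tasks(sub)}
    return {graph}  # a single Task


parallel_part = graph[1]
first_names = {task.__name__ for task in first_tasks(parallel_part)}
last_names = {task.__name__ for task in last_tasks(parallel_part)}
print(sorted(first_names), sorted(last_names))
```

Note that a Python set cannot directly contain another set, because sets are not hashable; nesting inside a set therefore goes through tuples, as in the example.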

Tasks

A type alias to help specify that something can be an explicit Tasks_ object or a TaskGraph, i.e. something that can be converted to Tasks_.

alias of Tasks_ | Callable[[], None] | BaseOperator | Set[TaskGraph] | Tuple[TaskGraph, …]

class Tasks_(graph: TaskGraph)

Bases: dict

first: Set[Callable[[], None] | BaseOperator]

graph: Callable[[], None] | BaseOperator | Set[TaskGraph] | Tuple[TaskGraph, ...] = ()

last: Set[Callable[[], None] | BaseOperator]

export_dataset_io_to_json(output_path: str = 'dataset_io_overview.json') → None: Export all sources and targets of datasets to a JSON file.

Parameters: output_path (str) – Path to the output JSON file.

load_sources_and_targets(name: str) → tuple[DatasetSources, DatasetTargets]

Load DatasetSources and DatasetTargets from the dataset_sources_targets table.

Parameters: name (str) – Name of the dataset.
Returns: Tuple[DatasetSources, DatasetTargets]

prefix(o)

setup(): Create the database structure for storing dataset information.

wrapped_partial(func, *args, **kwargs): Like functools.partial(), but preserves the original function’s name and docstring. Also allows adding a postfix to the function’s name.
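The behaviour described for wrapped_partial can be approximated with the standard library alone. The sketch below is not the module’s implementation: the function name wrapped_partial_sketch is hypothetical, and the assumption that the postfix is passed as a keyword argument called postfix is not confirmed by the signature above.

```python
from functools import partial, update_wrapper


def wrapped_partial_sketch(func, *args, **kwargs):
    """Illustrative stand-in: a partial that keeps func's name and docstring."""
    # Assumption: the postfix arrives as a keyword argument named "postfix".
    postfix = kwargs.pop("postfix", None)
    partial_func = partial(func, *args, **kwargs)
    # Copy __name__, __doc__ and related metadata from func onto the partial.
    update_wrapper(partial_func, func)
    if postfix:
        partial_func.__name__ = f"{func.__name__}{postfix}"
    return partial_func


def scale(value, factor):
    """Multiply value by factor."""
    return value * factor


double = wrapped_partial_sketch(scale, factor=2, postfix="_double")
print(double.__name__, double(21))
```

A plain functools.partial object has no __name__ attribute, which matters here because a Callable’s task_id is derived from its __name__.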