datasets

The API for configuring datasets.

class Dataset(name: 'str', version: 'str', dependencies: 'Dependencies' = (), tasks: 'Tasks' = ())[source]

Bases: object

check_version(after_execution=())[source]

dependencies = ()

The first task(s) of this Dataset will be marked as downstream of all of the listed dependencies. In the case of a bare Task, a direct link will be created, whereas for a Dataset the link will be made to all of its last tasks.
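The linking rule above can be illustrated with a small stand-alone sketch. It uses neither egon.data nor Airflow: ToyDataset and upstream_of_first_tasks are hypothetical names introduced only for this illustration, and the real Dataset presumably derives its last tasks from its task graph rather than from a stored tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in that only records a name and its last task names."""

    name: str
    last_tasks: tuple = ()


def upstream_of_first_tasks(dependencies):
    """Collect the task names that a dataset's first task(s) would follow."""
    upstream = []
    for dependency in dependencies:
        if isinstance(dependency, ToyDataset):
            # A dataset dependency links to all of its last tasks.
            upstream.extend(dependency.last_tasks)
        else:
            # A bare task (here: a plain callable) is linked directly.
            upstream.append(dependency.__name__)
    return upstream


def download():
    pass


osm = ToyDataset(name="osm", last_tasks=("import", "index"))
print(upstream_of_first_tasks([osm, download]))
```

In this toy model, depending on a dataset fans out to each of its final tasks, while depending on a bare task adds exactly one link.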

name = None

The name of the Dataset.

tasks = ()

The tasks of this Dataset. A TaskGraph will automatically be converted to Tasks_.

update(session)[source]

version = None

The Dataset’s version. This can be anything from a simple semantic versioning string such as “2.1.3” to a more complex string such as “2021-01-01.schleswig-holstein.0” for OpenStreetMap data. The latter encodes the Dataset’s date, its region, and a sequential number for cases where the data changes without the date or region changing, for example due to implementation changes.
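As an illustration of the composite format, the following sketch splits such a version string into its three parts. The helper parse_composite_version is hypothetical and not part of this API; it assumes the exact "date.region.number" layout of the OpenStreetMap example and would not work for arbitrary version strings.

```python
from datetime import date


def parse_composite_version(version):
    """Split a 'YYYY-MM-DD.region.number' string into typed parts.

    Assumes the region itself contains no dots.
    """
    day, region, sequence = version.split(".")
    return date.fromisoformat(day), region, int(sequence)


parsed = parse_composite_version("2021-01-01.schleswig-holstein.0")
print(parsed)
```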

Dependencies = typing.Iterable[typing.Union[ForwardRef('Dataset'), typing.Callable[[], NoneType], airflow.models.baseoperator.BaseOperator]]

A dataset can depend on other datasets or the tasks of other datasets.

class Model(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

dependencies
epoch
id
name
version

Task = typing.Union[typing.Callable[[], NoneType], airflow.models.baseoperator.BaseOperator]

A Task is an Airflow Operator or any Callable taking no arguments and returning None. Callables will be converted to Operators by wrapping them in a PythonOperator and setting the task_id to the Callable’s __name__, with underscores replaced by hyphens. If the Callable’s __module__ attribute contains the string "egon.data.datasets.", the task_id is also prefixed with the module name, with "egon.data.datasets." removed, followed by a dot.
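The naming rule described above can be sketched without Airflow. The function derive_task_id is a hypothetical name used only here, and the snippet is an approximation of the described behaviour rather than the library's implementation; the example module name "egon.data.datasets.osm" is likewise assumed for illustration.

```python
MARKER = "egon.data.datasets."


def derive_task_id(func):
    """Build a task_id from a callable, following the rule described above."""
    # Underscores in the callable's name become hyphens.
    name = func.__name__.replace("_", "-")
    module = getattr(func, "__module__", "") or ""
    if MARKER in module:
        # Prefix with the module name, minus the marker, plus a dot.
        return module.replace(MARKER, "") + "." + name
    return name


def download_pbf_file():
    pass


# Pretend the callable was defined in a dataset module (assumed name).
download_pbf_file.__module__ = "egon.data.datasets.osm"
print(derive_task_id(download_pbf_file))
```

A callable defined outside the "egon.data.datasets." namespace would keep only the hyphenated name, without any prefix.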

TaskGraph = typing.Union[typing.Callable[[], NoneType], airflow.models.baseoperator.BaseOperator, typing.Set[ForwardRef('TaskGraph')], typing.Tuple[ForwardRef('TaskGraph'), ...]]

A graph of tasks is, in its simplest form, just a single node, i.e. a single Task. More complex graphs can be specified by nesting sets and tuples of TaskGraphs. A set of TaskGraphs means that they are unordered and can be executed in parallel. A tuple specifies an implicit ordering so a tuple of TaskGraphs will be executed sequentially in the given order.
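To make the set and tuple semantics concrete, the sketch below walks a nested graph of plain callables and derives the "runs before" edges. The function link is hypothetical and independent of Airflow; it only demonstrates the ordering rules stated above and is not how Tasks_ is implemented.

```python
def link(graph):
    """Return (first_tasks, last_tasks, edges) for a nested TaskGraph."""
    if isinstance(graph, (set, frozenset)):
        # Unordered: members are independent and may run in parallel.
        firsts, lasts, edges = set(), set(), set()
        for sub in graph:
            f, l, e = link(sub)
            firsts |= f
            lasts |= l
            edges |= e
        return firsts, lasts, edges
    if isinstance(graph, tuple):
        # Ordered: each element runs after the one before it.
        firsts, edges, previous = set(), set(), None
        for sub in graph:
            f, l, e = link(sub)
            edges |= e
            if previous is None:
                firsts = f
            else:
                edges |= {(up, down) for up in previous for down in f}
            previous = l
        return firsts, previous or set(), edges
    # A single task is both the first and the last node of its graph.
    return {graph}, {graph}, set()


def extract(): ...
def clean(): ...
def validate(): ...
def load(): ...


firsts, lasts, edges = link((extract, {clean, validate}, load))
edge_names = sorted((up.__name__, down.__name__) for up, down in edges)
print(edge_names)
```

Here extract runs first, clean and validate may run in parallel afterwards, and load waits for both of them.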

Tasks = typing.Union[ForwardRef('Tasks_'), typing.Callable[[], NoneType], airflow.models.baseoperator.BaseOperator, typing.Set[ForwardRef('TaskGraph')], typing.Tuple[ForwardRef('TaskGraph'), ...]]

A type alias that helps to specify that something can be an explicit Tasks_ object or a TaskGraph, i.e. something that can be converted to Tasks_.

class Tasks_(graph: 'TaskGraph')[source]

Bases: dict

graph = ()

prefix(o)[source]

setup()[source]

Create the database structure for storing dataset information.