Workflows

SCHEMA api supports the execution of computational workflows through a declarative specification. This specification acts as the language against which submitted workflows are validated and on which the required business logic is based.

Workflows in SCHEMA api consist of multiple inter-dependent jobs called executors. These executors may run serially or in parallel and may produce files or directories that other executors depend on, enabling complex computational tasks to be encoded in a single submission.

This document elaborates on the support for declarative computational workflows in SCHEMA api. It introduces the specification that SCHEMA api natively uses to represent a workflow and provides an overview of how submitted workflows are managed by SCHEMA api.

Workflow specification

SCHEMA api introduces a declarative workflow specification that draws inspiration from the corresponding task definition structures. Essentially, it slightly modifies the task specification so that a declarative workflow can be defined. Fields like volumes or input.path and output.path, which users can declare to imperatively describe a workflow execution, are ignored. Conversely, the specification introduces certain fields and requirements that SCHEMA api uses to validate a submitted workflow and map executor dependencies.

Executors

Similar to tasks, executors in workflows are the main components that specify the job execution. Through the defined executors, SCHEMA api can infer the required inputs and the expected outputs. Every executor in a native workflow definition must declare the image and the command that will run within the executor's containerized environment. Additionally, if the executor's job produces output files that should be accessible within the workflow context, the executor definition should specify these files and the location where they are expected to be found after the executor job finishes.

Workflow executors in SCHEMA api's native specification may declare the expected outputs in a property named yields. For example, the following executor definition declares a job that runs within a containerized environment based on an image named user/python-software:latest and that is expected to produce a file at /data/outputs/reports.pdf and a directory at /data/outputs/execution-data/.

{
  "image": "user/python-software:latest",
  "command": ["python", "run.py", "-o", "/data/outputs"],
  "yields": [
    {
      "name": "report",
      "path": "/data/outputs/reports.pdf"
    },
    {
      "name": "metrics-dir",
      "path": "/data/outputs/execution-data/"
    }
  ]
}

After the executor terminates, the execution backend will utilize the declared yields; it will seek the outputs in their respective paths and store them in persistent storage that outlives the executor. Subsequent executors may require these yielded outputs, in which case the execution backend will inject them into the corresponding containerized environment.

Fields like stdin, stdout, stderr and command may reference transient files through workflow metavariables. In essence, a metavariable maps to a transient file, either passed as a workflow input or produced by an executor. Metavariables are written as the transient file's name prefixed with a double dollar sign, '$$'. SCHEMA api will look for these references and determine whether each one refers to an input transient file (dependency) or an output transient file (yield). Specifically, any transient file reference in the command field that is not declared as an executor yield is automatically construed as a dependency. References in the standard stream fields are considered to be of the same nature as the stream they are defined for: references in stdout and stderr are considered yields, while a reference in stdin is considered a dependency.

The following example shows a workflow executor definition that performs a machine learning job. The executor declares two yields: "training_logs" and "TRAINED_MODEL". "training_logs" is used to store data logged during the job execution. "TRAINED_MODEL" is referenced within the command and maps to the trained model produced by the machine learning process. Apart from the aforementioned transient file references, an additional transient file, $$CSV_DATASET_DIR, is referenced within the command. Since "CSV_DATASET_DIR" is not declared as an executor yield, it is construed as an executor dependency. Finally, the executor definition specifies the working directory and sets a necessary environment variable.

{
  "command": [
    "python", "train.py", "-i", "$$CSV_DATASET_DIR", "-o", "$$TRAINED_MODEL"
  ],
  "image": "user/ml:latest",
  "stdout": "$$training_logs",
  "workdir": "/training",
  "yields": [
    {
      "name": "training_logs",
      "path": "/data/outputs/logs/training_logs.log"
    },
    {
      "name": "TRAINED_MODEL",
      "path": "/data/outputs/model.bin"
    }
  ],
  "env": {
    "TRAINING_LOGGING_LEVEL": "debug"
  }
}
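
To complement the example above, the following minimal sketch illustrates how a reference in stdin is interpreted; the transient file names rawText and lineCount are purely illustrative and not part of any example elsewhere in this document. Since $$rawText appears in stdin and is not declared as a yield, SCHEMA api would construe it as a dependency, whereas $$lineCount is referenced in stdout and declared as a yield, making it an output of the executor.

{
  "image": "busybox:latest",
  "command": ["wc", "-l"],
  "stdin": "$$rawText",
  "stdout": "$$lineCount",
  "yields": [
    {
      "name": "lineCount",
      "path": "/outputs/line-count.txt"
    }
  ]
}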

A workflow typically consists of multiple executors, each with its own dependencies and yields. The declarative nature of SCHEMA api's native workflow specification allows users to disregard the order in which executors are defined. SCHEMA api will analyze submitted workflows and resolve the proper order of execution based on the dependencies and yields of each workflow executor. Workflow executors, as in the case of tasks, are defined as a list of objects.

The following example is a snippet of a workflow definition, showing the definition of multiple executors. The first defined executor counts the number of characters of a transient file named apiResponse, ultimately storing the number in a transient file named numOfCharacters. The second executor polls an API and retrieves a text which is stored in the transient file apiResponse.

note

Note that apiResponse is created by the second executor but is referenced by the first one. SCHEMA api will be able to identify this dependency and will make sure that the second executor runs first.

{
  ...
  "executors": [
    {
      "image": "busybox:latest",
      "command": ["wc", "-m", "$$apiResponse"],
      "stdout": "$$numOfCharacters",
      "yields": [
        {
          "name": "numOfCharacters",
          "path": "/outputs/number-of-characters.txt"
        }
      ]
    },
    {
      "image": "alpine/curl:latest",
      "command": ["curl", "http://metaphorpsum.com/sentences/3"],
      "stdout": "$$apiResponse",
      "yields": [
        {
          "name": "apiResponse",
          "path": "/data/outputs/api-response.log"
        }
      ]
    }
  ],
  ...
}

Workflow inputs and outputs

Workflow inputs refer to files or directories which should be available prior to a workflow's execution. Equivalently, workflow outputs refer to files or directories which are expected to be generated after a workflow's execution. Workflow input and output definitions use a structure similar to the one used by tasks. However, in the context of workflows, paths are not relevant, since these are managed directly by the workflow execution backend. Furthermore, workflow inputs and outputs must be identified by a name, so that they can be referenced in executors as dependencies or yields, as shown above.

Workflow inputs and outputs are declared as lists of objects, identified by the respective properties. The example below showcases a snippet of a workflow definition, specifying one input directory and two output files.

{
  ...
  "inputs": [
    {
      "name": "CSV_DATASET_DIR",
      "url": "datasets/crossref-publications/",
      "type": "DIRECTORY"
    }
  ],
  "outputs": [
    {
      "name": "training_logs",
      "url": "results/training.log"
    },
    {
      "name": "TRAINED_MODEL",
      "url": "results/trained_model.bin",
      "type": "FILE"
    }
  ],
  ...
}
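
To tie the pieces together, the sketch below assembles the machine learning executor from the earlier example with the input and output declarations shown above into a single workflow definition. It is only an illustrative outline based on the snippets in this document; any additional top-level fields elided with "..." in those snippets are omitted here.

{
  "inputs": [
    {
      "name": "CSV_DATASET_DIR",
      "url": "datasets/crossref-publications/",
      "type": "DIRECTORY"
    }
  ],
  "executors": [
    {
      "command": [
        "python", "train.py", "-i", "$$CSV_DATASET_DIR", "-o", "$$TRAINED_MODEL"
      ],
      "image": "user/ml:latest",
      "stdout": "$$training_logs",
      "workdir": "/training",
      "yields": [
        {
          "name": "training_logs",
          "path": "/data/outputs/logs/training_logs.log"
        },
        {
          "name": "TRAINED_MODEL",
          "path": "/data/outputs/model.bin"
        }
      ],
      "env": {
        "TRAINING_LOGGING_LEVEL": "debug"
      }
    }
  ],
  "outputs": [
    {
      "name": "training_logs",
      "url": "results/training.log"
    },
    {
      "name": "TRAINED_MODEL",
      "url": "results/trained_model.bin",
      "type": "FILE"
    }
  ]
}

In this sketch, the workflow input CSV_DATASET_DIR is injected as the executor's dependency, while the yields training_logs and TRAINED_MODEL are mapped to the declared workflow outputs by name.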