Workflows
SCHEMA api supports the execution of computational workflows through a declarative specification. It adopts this specification as the language against which submitted workflows are validated and on which the required business logic is performed.
Workflows in SCHEMA api consist of multiple inter-dependent jobs called executors. These executors may run serially or in parallel and may produce files or directories that other executors depend on, thus enabling complex computational tasks to be encoded in one go.
This document elaborates on the support for declarative computational workflows in SCHEMA api. It introduces the specification that SCHEMA api uses natively to represent a workflow and provides an overview of how submitted workflows are managed by SCHEMA api.
Workflow specification
SCHEMA api introduces a declarative workflow specification that draws inspiration from the corresponding task definition structures. Essentially, it slightly modifies the task specification so that a declarative workflow can be defined. Fields such as volumes, input.path and output.path, which users can declare in order to imperatively describe a workflow execution, are ignored. Conversely, the specification introduces certain fields and requirements which SCHEMA api uses to validate a submitted workflow and map executor dependencies.
Executors
Similar to tasks, executors in workflows are the main components that specify the job execution. Through the defined executors, SCHEMA api can infer the required inputs and the expected outputs. Every executor in a native workflow definition must declare the image and the command that will run within the executor's containerized environment. Additionally, if the executor's job produces output files which should be accessible within the workflow context, the executor definition should specify these files and the location where they are expected to be found after the executor job finishes.
Workflow executors in SCHEMA api's native specification may declare the expected outputs in a property named yields.
For example, the following executor definition declares a job that runs within a containerized environment based on an image named user/python-software:latest and which is expected to produce a file at /data/outputs/reports.pdf and a directory at /data/outputs/execution-data/.
{
  "image": "user/python-software:latest",
  "command": ["python","run.py","-o","/data/outputs"],
  "yields": [
    {
      "name": "report",
      "path": "/data/outputs/reports.pdf"
    },
    {
      "name": "metrics-dir",
      "path": "/data/outputs/execution-data/"
    }
  ]
}
After the executor terminates, the execution backend will utilize the declared yields; it will seek the outputs at their respective paths and store them in persistent storage that outlives the executor. Subsequent executors may require these yielded outputs, in which case the execution backend will inject them into the corresponding containerized environment.
Fields like stdin, stdout, stderr and command may reference transient files through workflow metavariables. In essence, metavariables map to a transient file, either passed as a workflow input or produced by an executor. They are defined with the transient file's name and a double dollar sign prefix '$$'. SCHEMA api will look for these references and will attempt to determine whether each one refers to an input transient file (dependency) or an output transient file (yield). Specifically, any transient file reference in the command field which is not declared as an executor yield is automatically construed as a dependency. References in the standard stream fields are considered of the same nature as the stream they are defined for; references in stdout and stderr are considered yields, while a reference in stdin is considered a dependency.
The following example shows a workflow executor definition that performs a machine learning job. The executor declares two yields: "training_logs" and "TRAINED_MODEL". "training_logs" is used to store data logged during the job execution. "TRAINED_MODEL" is referenced within the command and maps to the trained model produced by the machine learning process. Apart from the aforementioned transient file references, an additional transient file is referenced within the command, $$CSV_DATASET_DIR. Since "CSV_DATASET_DIR" is not declared as an executor yield, it is construed as an executor dependency. Finally, the executor definition specifies the working directory and sets necessary environment variables.
{
  "command": [
    "python","train.py","-i","$$CSV_DATASET_DIR","-o","$$TRAINED_MODEL"
  ],
  "image": "user/ml:latest",
  "stdout": "$$training_logs",
  "workdir": "/training",
  "yields": [
    {
      "name": "training_logs",
      "path": "/data/outputs/logs/training_logs.log"
    },
    {
      "name": "TRAINED_MODEL",
      "path": "/data/outputs/model.bin"
    }
  ],
  "env": {
    "TRAINING_LOGGING_LEVEL": "debug"
  }
}
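For completeness, the same rules can be illustrated with the remaining standard stream fields. The snippet below is a minimal, hypothetical sketch; the image, command and the transient file names rawText and sortErrors are illustrative and do not come from the examples above. The reference in stdin is construed as a dependency, since stdin is an input stream, whereas the reference in stderr is construed as a yield and, following the pattern of the stdout reference in the previous example, is assumed to also be declared under yields together with the path where the stream contents are expected to be found.
{
  "image": "busybox:latest",
  "command": ["sort"],
  "stdin": "$$rawText",
  "stderr": "$$sortErrors",
  "yields": [
    {
      "name": "sortErrors",
      "path": "/data/outputs/sort-errors.log"
    }
  ]
}
In this sketch, rawText would have to be provided either as a workflow input or as a yield of another executor, exactly like CSV_DATASET_DIR above.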
A workflow typically consists of multiple executors, each with its own dependencies and yields. The declarative nature of SCHEMA api's native workflow specification allows users to disregard the order in which executors are defined. SCHEMA api will analyze submitted workflows and resolve the proper order of execution based on the dependencies and yields of each workflow executor. Workflow executors, similar to the case of tasks, are defined as a list of objects.
The following example is a snippet of a workflow definition, showing the definition of multiple executors. The first defined executor counts the number of characters of a transient file named apiResponse, ultimately storing the number in a transient file named numOfCharacters. The second executor polls an API and retrieves text which is stored in the transient file apiResponse.
Note that apiResponse is created by the second executor but is referenced by the first one. SCHEMA api will be able to identify this dependency and will make sure that the second executor runs first.
{
  ...
  "executors": [
    {
      "image": "busybox:latest",
      "command": ["wc","-m","$$apiResponse"],
      "stdout": "$$numOfCharacters",
      "yields": [
        {
          "name": "numOfCharacters",
          "path": "/outputs/number-of-characters.txt"
        }
      ]
    },
    {
      "image": "alpine/curl:latest",
      "command": ["curl","http://metaphorpsum.com/sentences/3"],
      "stdout": "$$apiResponse",
      "yields": [
        {
          "name": "apiResponse",
          "path": "/data/outputs/api-response.log"
        }
      ]
    }
  ],
  ...
}
Workflow inputs and outputs
Workflow inputs refer to files or directories which should be available prior to a workflow's execution. Correspondingly, workflow outputs refer to files or directories which are expected to be generated after a workflow's execution. Workflow input and output definitions use a structure similar to the one used by tasks. However, in the context of workflows, paths are not relevant, since these are directly managed by the workflow execution backend. Furthermore, workflow inputs and outputs must be identified by a name, so that they can be referenced within executors as dependencies or yields, as shown above.
Workflow inputs and outputs are declared as lists of objects under the respective inputs and outputs properties. The example below showcases a snippet of a workflow definition, specifying one input directory and two output files.
{
  ...
  "inputs": [
    {
      "name": "CSV_DATASET_DIR",
      "url": "datasets/crossref-publications/",
      "type": "DIRECTORY"
    }
  ],
  "outputs": [
    {
      "name": "training_logs",
      "url": "results/training.log"
    },
    {
      "name": "TRAINED_MODEL",
      "url": "results/trained_model.bin",
      "type": "FILE"
    }
  ],
  ...
}
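Putting the pieces together, the snippets above can be assembled into a single workflow definition. The following is a minimal sketch that reuses the machine learning executor and the inputs and outputs shown earlier; it assumes that the inputs, executors and outputs properties alone form a complete submission and omits any additional top-level fields (such as a workflow name or description) that SCHEMA api may expect.
{
  "inputs": [
    {
      "name": "CSV_DATASET_DIR",
      "url": "datasets/crossref-publications/",
      "type": "DIRECTORY"
    }
  ],
  "executors": [
    {
      "image": "user/ml:latest",
      "command": [
        "python","train.py","-i","$$CSV_DATASET_DIR","-o","$$TRAINED_MODEL"
      ],
      "stdout": "$$training_logs",
      "workdir": "/training",
      "yields": [
        {
          "name": "training_logs",
          "path": "/data/outputs/logs/training_logs.log"
        },
        {
          "name": "TRAINED_MODEL",
          "path": "/data/outputs/model.bin"
        }
      ],
      "env": {
        "TRAINING_LOGGING_LEVEL": "debug"
      }
    }
  ],
  "outputs": [
    {
      "name": "training_logs",
      "url": "results/training.log"
    },
    {
      "name": "TRAINED_MODEL",
      "url": "results/trained_model.bin",
      "type": "FILE"
    }
  ]
}
In this sketch, the workflow input CSV_DATASET_DIR is matched to the executor dependency of the same name, while the executor yields training_logs and TRAINED_MODEL are matched to the declared workflow outputs.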