Five Safes Crate profile

TRE-FX draft in development

(comments and suggestions welcome)

Permalink: https://w3id.org/ro/five-safes/0.1-DRAFT (TODO)

This document specifies a draft profile of RO-Crate for the purpose of TRE-FX implementation of workflow execution in a distributed trusted research environment.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Archive serialisation

A compliant Five Safes Crate SHOULD be stored and transferred as a ZIP archive containing a single BagIt directory (bag) of an arbitrary name, whose payload data/ contains the RO-Crate Metadata File ro-crate-metadata.json and any required data files (e.g. inputs).

The BagIt payload manifest [RFC8493] MUST be present using sha-512 checksums, and the tag manifest SHOULD be included as sha-512 [FIPS 180-4]. Payload and tag manifests using other checksums MAY be included, taking care to exclude tagmanifest-* from their checksums.

Example:

query-12389/
  |   bagit.txt                 # MUST indicate BagIt 1.0 or later
  |   bag-info.txt              # As per BagIt specification
  |   manifest-sha512.txt       # As per BagIt specification
  |   tagmanifest-sha512.txt    # As per BagIt specification
  |   fetch.txt                 # Optional, per BagIt Specification
  |   data/                     # Payload: RO-Crate root directory
      |   ro-crate-metadata.json           # RO-Crate Metadata File MUST be present
      |   [payload files and directories]  # 1 or more SHOULD be present
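As an illustration of the payload manifest requirement, the following minimal Python sketch computes the lines of manifest-sha512.txt for the files under data/. The function name sha512_manifest_lines is hypothetical and not part of this profile; a production implementation would more likely rely on an existing BagIt library and stream large files instead of reading them whole.

```python
import hashlib
from pathlib import Path


def sha512_manifest_lines(bag_dir):
    """Return manifest-sha512.txt lines for every file under <bag_dir>/data/."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.sha512(path.read_bytes()).hexdigest()
            # Each manifest line is "<checksum> <path relative to the bag root>".
            lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    return lines
```

Writing the returned lines to manifest-sha512.txt in the bag directory would give a payload manifest for the example layout above.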

BagIt expectations

The RO-Crate BagIt expectations for Adding RO-Crate to Bagit MUST be followed. The bag-info.txt MUST include a generated External-Identifier: field which SHOULD be a UUID URN [RFC4122], e.g.:

External-Identifier: urn:uuid:9796155a-fe44-4614-89b8-71945f718ffb

It is RECOMMENDED to not modify this identifier as the Five Safes Crate progresses through the distributed TRE processing.
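A minimal sketch of generating such an identifier with the Python standard library is shown below. The helper names new_external_identifier and bag_info_line are hypothetical, and a random (version 4) UUID is an assumption; this profile only asks for a UUID URN.

```python
import uuid


def new_external_identifier():
    """Return a fresh random UUID in URN form, e.g. urn:uuid:9796155a-..."""
    return uuid.uuid4().urn


def bag_info_line(identifier):
    """Format the bag-info.txt line that carries the identifier."""
    return f"External-Identifier: {identifier}"
```

The identifier would be generated once when the bag is created and then carried along unchanged.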

Zip expectations

The ZIP archive MUST contain only a single top-level entry for the bag directory, identified by the bagit.txt marker. For interoperability in terms of ZIP features, implementations SHOULD follow guidance for an OCF ZIP Container (ignoring OCF Abstract Container content).
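The single top-level entry rule can be checked before unpacking anything. The sketch below uses Python's zipfile module; find_bag_directory is a hypothetical helper name, and the check covers only the layout rule, not the OCF ZIP feature guidance.

```python
import zipfile


def find_bag_directory(zip_path):
    """Return the name of the single top-level bag directory, or raise ValueError."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    # The first path segment of every entry must be the same directory.
    top_level = {name.split("/", 1)[0] for name in names if name}
    if len(top_level) != 1:
        raise ValueError(f"expected one top-level entry, found {sorted(top_level)}")
    bag_dir = top_level.pop()
    # The bag directory is identified by its bagit.txt marker.
    if f"{bag_dir}/bagit.txt" not in names:
        raise ValueError("top-level directory has no bagit.txt marker")
    return bag_dir
```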

Metadata file expectations

The RO-Crate Metadata File MUST conform to RO-Crate 1.2 (or later minor version). The compliant version MUST be declared in the metadata file descriptor:

{
    "@type": "CreativeWork",
    "@id": "ro-crate-metadata.json",
    "about": {"@id": "./"},
    "conformsTo": {"@id": "https://w3id.org/ro/crate/1.2-DRAFT"}
}

The root data entity of a Five Safes Crate MUST have the @id equal to "./" (as it is stored within the BagIt ZIP archive).
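The descriptor and root entity rules above could be checked along the following lines. This is a sketch only: check_crate_basics and ids are hypothetical helpers, the metadata is assumed to be the parsed ro-crate-metadata.json with a flattened @graph, and the version comparison (1.2 or later) is deliberately left out.

```python
def ids(value):
    """Return the @id strings referenced by a JSON-LD property value."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] for item in items if isinstance(item, dict) and "@id" in item]


def check_crate_basics(metadata):
    """Return a list of problems found in the descriptor and root data entity."""
    entities = {e.get("@id"): e for e in metadata.get("@graph", [])}
    descriptor = entities.get("ro-crate-metadata.json")
    if descriptor is None:
        return ["missing metadata file descriptor"]
    problems = []
    conforms = ids(descriptor.get("conformsTo"))
    if not any(c.startswith("https://w3id.org/ro/crate/") for c in conforms):
        problems.append("descriptor does not declare an RO-Crate version")
    if ids(descriptor.get("about")) != ["./"]:
        problems.append("descriptor is not about ./")
    if "./" not in entities:
        problems.append("root data entity ./ not found")
    return problems
```

An empty list means these particular checks passed; it does not imply full conformance with RO-Crate or with this profile.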

Profile

Crates conforming to this profile specification SHOULD indicate this on the Root Data Entity:

{
    "@id": "./",
    "@type": "Dataset",
    "conformsTo": {"@id": "https://w3id.org/ro/five-safes/0.1-DRAFT"},
    "hasPart": [
    ],
    "mainEntity": {"@id": "https://workflowhub.eu/workflows/289?version=1"},
    "mentions": {"@id": "#query-37252371-c937-43bd-a0a7-3680b48c0538"},
    "sourceOrganization": {"@id": "#project-be6ffb55-4f5a-4c14-b60e-47e0951090c70"}
},

Referencing a Workflow Crate

The metadata file MUST reference a Workflow RO-Crate Dataset as its mainEntity, indicating the workflow to execute.

The identifier SHOULD be a permalink or versioned URL (e.g. https://workflowhub.eu/workflows/289?version=1) or MAY be a nested directory within the BagIt payload directory (e.g. "data/workflow289.1/").

Note: unlike in the Workflow Run profile, the programming language of the workflow and its other metadata are not expressed in this RO-Crate, but within the referenced Workflow RO-Crate. The programmingLanguage inside the Workflow RO-Crate SHOULD be either https://w3id.org/workflowhub/workflow-ro-crate#cwl or https://w3id.org/workflowhub/workflow-ro-crate#nextflow.

Finding the RO-Crate archive

If the identifier is a URI, a URL to the downloadable Workflow RO-Crate ZIP archive SHOULD be included with distribution. Otherwise, clients SHOULD use Signposting to find the RO-Crate by looking for the link with rel="item" type="application/zip" profile="https://w3id.org/ro/crate", for instance:

$ curl -I "https://workflowhub.eu/workflows/419"
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Link: <https://workflowhub.eu/workflows/419/ro_crate?version=1> ;
      rel="item" ;
      type="application/zip" ;
      profile="https://w3id.org/ro/crate"

Example:

{
    "@id": "https://workflowhub.eu/workflows/289?version=1",
    "@type": "Dataset",
    "name": "CWL Protein MD Setup tutorial with mutations",
    "conformsTo": {"@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0"},
    "distribution": {"@id": "https://workflowhub.eu/workflows/289/ro_crate?version=1"}
},
{
    "@id": "https://workflowhub.eu/workflows/289/ro_crate?version=1",
    "@type": "DataDownload",
    "conformsTo": {"@id": "https://w3id.org/ro/crate"},
    "encodingFormat": "application/zip"
}
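When no distribution is given, the Signposting lookup described earlier in this section amounts to scanning the HTTP Link header for a matching link. The sketch below is a simplified parser written for illustration: find_crate_download is a hypothetical name, the regular expressions cover only the common header shapes (quoted or unquoted parameter values), and a real client may prefer a dedicated Link header or Signposting library.

```python
import re

# One link value: <target> followed by any number of ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def find_crate_download(link_header):
    """Return the target of the first rel="item" ZIP link with the RO-Crate profile."""
    for target, raw_params in LINK_RE.findall(link_header):
        params = {
            name.lower(): (quoted or unquoted)
            for name, quoted, unquoted in PARAM_RE.findall(raw_params)
        }
        rels = params.get("rel", "").split()
        if (
            "item" in rels
            and params.get("type") == "application/zip"
            and params.get("profile") == "https://w3id.org/ro/crate"
        ):
            return target
    return None
```

Given the Link header value from the curl example, the function would return https://workflowhub.eu/workflows/419/ro_crate?version=1, and None when no such link is advertised.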

Requested Workflow Run

The metadata file MUST include a CreateAction, which MUST be referenced from mentions of the root entity. The identifier SHOULD be based on a UUID (different from the BagIt External-Identifier).

The CreateAction MUST reference the Workflow Crate using instrument.

Example:

{
    "@id": "#query-37252371-c937-43bd-a0a7-3680b48c0538",
    "@type": "CreateAction",
    "actionStatus": "https://schema.org/PotentialActionStatus",
    "agent": {"@id": "https://orcid.org/0000-0001-9842-9718"},
    "instrument": {"@id": "https://workflowhub.eu/workflows/289?version=1"},
    "name": "Execute query 12389 on workflow ",
    "object": [
        {"@id": "input1.txt"}
    ]
},
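A receiving implementation might locate the requested run as sketched below: follow mentions from the root entity and keep the CreateAction whose instrument matches the mainEntity. The names find_requested_run and ids are hypothetical, and the metadata is assumed to be the parsed ro-crate-metadata.json with a flattened @graph.

```python
def ids(value):
    """Return the @id strings referenced by a JSON-LD property value."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] for item in items if isinstance(item, dict) and "@id" in item]


def find_requested_run(metadata):
    """Return the CreateAction mentioned by the root entity, or None."""
    entities = {e["@id"]: e for e in metadata["@graph"]}
    root = entities["./"]
    workflow_ids = set(ids(root.get("mainEntity")))
    for action_id in ids(root.get("mentions")):
        action = entities.get(action_id, {})
        types = action.get("@type", [])
        types = types if isinstance(types, list) else [types]
        # The action must be a CreateAction that uses the workflow as instrument.
        if "CreateAction" in types and workflow_ids & set(ids(action.get("instrument"))):
            return action
    return None
```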

Execution state

The main purpose of a Five Safes Crate is to trigger and communicate a workflow execution within a distributed TRE. The state of the Five Safes Crate is indicated by the actionStatus of this main action.

When the execution is in CompletedActionStatus or FailedActionStatus, the crate SHOULD also follow the Provenance Crate profile, e.g. the workflow outputs will be listed as result.

Requesting Agent

The individual person who is requesting the run MUST be indicated as the agent of the CreateAction. This Person SHOULD have an affiliation to the organization they are representing, for access control purposes.

{
    "@id": "https://orcid.org/0000-0001-9842-9718",
    "@type": "Person",
    "name": "Stian Soiland-Reyes",
    "affiliation": {"@id": "https://ror.org/027m9bs27"}
},
{
    "@id": "https://ror.org/027m9bs27",
    "@type": "Organization",
    "name": "The University of Manchester"
},

Note: The organisation under affiliation is typically the employing organization, e.g. a university or hospital. Virtual organisations such as research projects can be listed using memberOf (see also Responsible Project below).

Responsible Project

The project that the request is sent on behalf of, typically related to permission to use a TRE, MUST be indicated from the root dataset using sourceOrganization to a Project.

Note: The responsible project is not necessarily a ResearchProject corresponding to a funded grant, but may be more specific studies within such a project. Various TREs may have different granularity and identifiers for such projects.

It is RECOMMENDED to include TRE-specific ids under identifier (which MAY be an array). If an identifier is not globally unique (unlike e.g. a PID), it is RECOMMENDED to use a repository-specific identifier with a PropertyValue entity.

The project MAY indicate the member organisations, in which case one of them SHOULD match the affiliation of the Requesting Agent.

{
    "@id": "#project-be6ffb55-4f5a-4c14-b60e-47e0951090c70",
    "@type": "Project",
    "name": "Investigation of cancer",
    "identifier": [
        "https://doi.org/10.3030/101057344",
        "https://gtr.ukri.org/projects?ref=10038992"
    ],
    "member": [
        {"@id": "https://ror.org/027m9bs27"},
        {"@id": "https://ror.org/01ee9ar58"}
    ]
}

Inputs

Requested inputs SHOULD be set on the CreateAction using the object property:

{
    "@id": "#query-37252371-c937-43bd-a0a7-3680b48c0538",
    "@type": "CreateAction",
    "object": [
        {"@id": "input1.txt"},
        {"@id": "#enableFastMode"}
    ],
    "…": {}
}

Each input MUST have a corresponding data entity:

{
    "@id": "input1.txt",
    "@type": "File",
    "name": "input1"
},
{
    "@id": "#enableFastMode",
    "@type": "PropertyValue",
    "name": "--fast-mode",
    "value": "True"
},
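The requirement that each input has a corresponding data entity could be verified as in the sketch below. The helper names inputs_without_entity and ids are hypothetical, and the metadata is assumed to be the parsed ro-crate-metadata.json with a flattened @graph.

```python
def ids(value):
    """Return the @id strings referenced by a JSON-LD property value."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] for item in items if isinstance(item, dict) and "@id" in item]


def inputs_without_entity(metadata, action_id):
    """List the inputs of the given CreateAction that are not described in the crate."""
    entities = {e["@id"]: e for e in metadata["@graph"]}
    action = entities[action_id]
    return [ref for ref in ids(action.get("object")) if ref not in entities]
```

An empty list indicates that every object reference resolves to an entity in the @graph.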

TODO: How to reference existing secure TRE data? Do we have an identifier scheme?

TODO: Link up inputs to corresponding FormalParameter to indicate their input port binding. How to import these from the Workflow Bundle?

Outputs

If the workflow has successfully executed, that is, the CreateAction has actionStatus set to CompletedActionStatus, the output data entities SHOULD be referenced from the result property.

Output entities MUST be described as in the Workflow Run Crate profile, and their type SHOULD be File, Dataset, Collection, DigitalDocument or PropertyValue.

Implementations MAY include the outputs within the Crate BagIt archive, in which case it is RECOMMENDED to use the folder outputs/ to avoid conflict with other files in the crate.

{
    "@id": "#query-37252371-c937-43bd-a0a7-3680b48c0538",
    "@type": "CreateAction",
    "result": [
        {"@id": "outputs/qa.csv"},
        {"@id": "outputs/diagrams/"}
    ],
    "…": {}
},
{
    "@id": "outputs/qa.csv",
    "@type": "File",
    "encodingFormat": "text/csv",
    "name": "Tabular listing of quality assessment"
},
{
    "@id": "outputs/diagrams/",
    "@type": "Dataset",
    "name": "Diagrams of regions of interest"
}

Tip: Implementations may need to inspect the FormalParameter of the Workflow Crate to propagate a human-readable name and the encodingFormat (file format) of the inputs and outputs.

Sensitive data

Outputs MAY include references to sensitive data that is only accessible from within the TRE or through URIs that require authentication. The requirement for permission SHOULD be indicated by typing the data entity as a DigitalDocument that uses hasDigitalDocumentPermission to reference a DigitalDocumentPermission entity, typically assigning https://schema.org/ReadPermission with grantee set only to the Responsible Project.

{
    "@id": "urn:uuid:07b81e0f-7ac4-5428-9940-878b241e2397",
    "@type": "DigitalDocument",
    "encodingFormat": "text/csv",
    "name": "Patient measurement 07b81e0f-7ac4-5428-9940-878b241e2397",
    "hasDigitalDocumentPermission": {"@id": "#permissions-07b81e0f"}
},
{
    "@id": "#permissions-07b81e0f",
    "@type": "DigitalDocumentPermission",
    "permissionType": "https://schema.org/ReadPermission",
    "grantee": {"@id": "#project-be6ffb55-4f5a-4c14-b60e-47e0951090c70"}
}
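A consumer that wants to know who may read such a document could resolve the permission entities as sketched below. The names read_grantees and ids are hypothetical, and the sketch only reports what the metadata declares; it does not enforce access control.

```python
def ids(value):
    """Return the @id strings referenced by a JSON-LD property value."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] for item in items if isinstance(item, dict) and "@id" in item]


def read_grantees(metadata, document_id):
    """Return the @ids granted ReadPermission on the given DigitalDocument."""
    entities = {e["@id"]: e for e in metadata["@graph"]}
    grantees = []
    for permission_id in ids(entities[document_id].get("hasDigitalDocumentPermission")):
        permission = entities.get(permission_id, {})
        if permission.get("permissionType") == "https://schema.org/ReadPermission":
            grantees.extend(ids(permission.get("grantee")))
    return grantees
```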

Security considerations

Allowing execution of any Workflow Crate effectively allows execution of arbitrary code. It is RECOMMENDED that implementers apply strong access control.

It is currently out of scope for this specification how to verify that a Five Safes Crate was requested by the given person, or how to verify whether the person has access to a particular TRE. It is therefore RECOMMENDED that implementers check authentication and authorization of a submitted query and use strong encryption. Implementers SHOULD check that the @id and affiliation of the Requesting Agent and Responsible Project correspond to the authentication, and MAY inject/overwrite these.

Malicious clients submitting a Five Safes Crate may have included additional entities, properties and @types, which may cause security concerns in an implementation. Implementers SHOULD sanity check inputs, including ensuring that all paths are relative within the bag or absolute URIs. Malicious clients may attempt to reference URLs that are only accessible within a TRE. Implementers MUST perform any URL downloads (such as Workflow RO-Crates or required Containers) from within a DMZ firewalled from the TRE.
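One part of this sanity check, ensuring that a relative identifier cannot escape the bag, could look like the sketch below. The function name is_relative_within_bag is hypothetical. Absolute URIs are rejected here on purpose, on the assumption that they are vetted by a separate policy before any download from the DMZ.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def is_relative_within_bag(identifier):
    """Return True if the identifier is a relative path that stays inside the crate."""
    parts = urlsplit(identifier)
    if parts.scheme or parts.netloc:
        return False  # absolute URI or //host/path: handle under a separate policy
    # Decode percent-escapes first so that %2e%2e cannot hide a ".." segment.
    path = unquote(parts.path)
    if path.startswith("/") or "\\" in path:
        return False
    normalised = posixpath.normpath(path)
    return normalised != ".." and not normalised.startswith("../")
```

For example, input1.txt and outputs/diagrams/ would pass, while ../etc/passwd, a/../../b and %2e%2e/secret would be rejected.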

As an executed Five Safes Crate may be intended for publishing (possibly following an embargo period), it SHOULD NOT include sensitive data or security tokens within the metadata file or the BagIt archive.

The crate MAY include references (e.g. S3 URIs) to sensitive data, in which case the implementation and executed workflow SHOULD protect against divulging sensitive data (directly or indirectly) in the File identifiers. For example, the command below uses UUID v5 hashing [RFC4122] to hide the sensitive identifier patient-d373740c3fd2 (Note: predictable identifiers like patient-456 would still be vulnerable here due to iteration attacks):

$ uuidgen --sha1 --namespace @url --name file://tre.example.com/project-123/patient-d373740c3fd2.txt

07b81e0f-7ac4-5428-9940-878b241e2397
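The same kind of name-based identifier can be derived with Python's standard library, as in the sketch below. The function name opaque_identifier is hypothetical; uuid.NAMESPACE_URL is the RFC 4122 URL namespace, which is expected to correspond to the @url namespace used in the command above.

```python
import uuid


def opaque_identifier(sensitive_uri):
    """Derive a stable UUID v5 URN from a URI without exposing the URI itself."""
    # uuid5 hashes the namespace and name with SHA-1, so equal inputs give equal UUIDs.
    return uuid.uuid5(uuid.NAMESPACE_URL, sensitive_uri).urn
```

Because the mapping is deterministic, anyone who can guess the original URI can recompute the UUID, which is why predictable names remain exposed to iteration attacks.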

Media type and profiles

When transferring a Five Safes Crate using HTTP, implementations SHOULD use the following HTTP headers for content-type and profile:

Content-Type: application/zip
Link: <https://w3id.org/ro/crate>; rel="profile"

HTML landing pages that reference a Five Safes Crate SHOULD include Signposting using HTTP Link headers:

Link: <https://example.com/query-12389.zip>; rel="item"; type="application/zip"
Link: <https://w3id.org/ro/crate>; rel="profile"; type="application/zip";
  anchor="https://example.com/query-12389.zip"