Skip to content

Updating DAR resource files

Every Data Access Request (DAR) project consists of below key files:

1) config

2) governance/project

3) access/access

4) metadata/dataset

Initial state of the project repository

They are all toml files. The cr8tor initiate command uses a cookie-cutter template which provides you with the basic version of them. The content needs to be updated with relevant information before you can successfully run the orchestration workflow.

config

The config file contains the following fields, which must be populated with relevant information:

Field Description
Source-Organization eg. LSC SDE
Organization-Address eg. Lancashire Teaching Hospitals NHS Trust, PR2 9HT
Contact-Name eg. LSC SDE Program Team
Contact-Email eg. lsc.sde@test.com

governance/project

The governance/project contains the following fields, which must be populated with relevant information:

[project]

Field Description
description eg. example-project created using CR8TOR.
reference auto-generated by cookie-cutter based on project_name. Pattern cr8-<project_name>
name auto-generated by cookie-cutter based on project_name
project_name follow organisation policy for project naming convention

[project.destination]

Field Description
type Destination type. Supported values: [filestore, postgresql]
name Destination name (for filestore destinations, this drives storage location), e.g., [LSC, NW]
format Data format for filestore destinations. Supported values: [csv, duckdb]

Supported Destination Types

The CR8TOR Publisher supports two main destination types, each with different behaviors and requirements:

Filestore Destination (filestore)

When type = "filestore", the Publish Service stores data as files in a mounted filestore rather than a database.

Key Characteristics:

  • File-based Storage: Data is stored as files (CSV or DuckDB format) in mounted filesystem paths
  • Destination-Specific Storage: The name field drives storage location via environment variables (TARGET_STORAGE_ACCOUNT_{NAME}_SDE_MNT_PATH)
  • Two-stage Process: Data flows through staging container first, then moves to production container
  • BagIt Packaging: Files are organized following BagIt standards with SHA512 checksums for integrity verification
  • Format Options: Supports both CSV and DuckDB formats for flexible data consumption

Required Fields:

  • type = "filestore"
  • name - Destination name (e.g., "LSC", "NW") that determines storage mount point
  • format - Data format: "csv" or "duckdb"

Directory Structure Created:

{storage_mount_path}/
├── staging/
│   └── {project_name}/
│       └── {project_start_time}/
│           └── data/outputs/
└── production/
    └── {project_name}/
        └── {project_start_time}/
            └── data/outputs/

PostgreSQL Destination (postgresql)

When type = "postgresql", the Publish Service loads data directly into a PostgreSQL database and creates OPAL resources for secure access.

Key Characteristics:

  • Database Storage: Data is loaded directly into PostgreSQL tables
  • OPAL Integration: Automatically creates Obiba OPAL components for secure data access:
  • Creates OPAL project named after the CR8TOR project
  • Establishes OPAL resources pointing to PostgreSQL tables
  • Creates DataSHIELD permission groups ({project_name}_group)
  • Sets up DataSHIELD and resource-level permissions
  • Access Control: Leverages OPAL's DataSHIELD framework for secure, privacy-preserving data analysis
  • Table Filtering: Filters tables based on project name and start time patterns, excluding DLT internal tables

Required Fields:

  • type = "postgresql"
  • name - Destination name (used for OPAL project identification)
  • format - Not applicable for PostgreSQL destinations

OPAL Resources Created:

  • Resource Naming: tre_postgresql_{schema}_{table}
  • Project Structure: One OPAL project per CR8TOR project
  • Group Management: DataSHIELD groups with "use" and "view" permissions
  • User Management: Ensures default DataSHIELD user (dsuser_default) exists

Choose filestore for file-based data delivery and postgresql for database-hosted data with integrated access control through OPAL's DataSHIELD framework.

[repository]

Field Description
codeRepository Auto-generated by cookie-cutter. e.g., https://github.com/lsc-sde-crates/cr8-example-project
description Auto-generated by cookie-cutter.
name Auto-generated by cookie-cutter.

[requesting_agent]

Field Description
name e.g. Prof. Jane Doe

[requesting_agent.affiliation]

Field Description
url e.g. https://someuni.com
name e.g. Jane Doe's University

access/access

The access/access contains the following fields, which must be populated with relevant information:

[source]

The required source fields depend on the type of the source. The Solution supports extracting data from multiple source types including Databricks SQL warehouse endpoints and various SQL databases. The required fields for each type are validated by cr8tor and cr8tor-publisher app (using pydantic models).

Supported Source Types

The following source types are currently supported:

  • databrickssql - Databricks Unity Catalog
  • postgresql - PostgreSQL databases
  • mysql - MySQL databases
  • mssql - Microsoft SQL Server databases
  • sqlserver - Microsoft SQL Server databases (alternative identifier)

Common Source Fields

Field Description
name Name of the data source, e.g., My Database Connection.
type Must match one of the supported source types listed above.
host_url Host URL of the database server.

Databricks Unity Catalog (databrickssql)

For Databricks Unity Catalog sources, the following additional fields are required:

Field Description
http_path HTTP path for the Databricks SQL Warehouse endpoint, e.g., /sql/1.0/warehouses/0aec44b2e70e201d.
port Optional. Default is 443 for Databricks SQL endpoints.
catalog Databricks Unity Catalog name from which data will be extracted.

SQL Databases (postgresql, mysql, mssql)

For SQL database sources, the following additional fields are required:

Field Description
database Database name to connect to.
port Database connection port (varies by database type).

[credentials]

Credentials fields are tied to the source type. Different source types require different credential configurations:

Databricks Unity Catalog Credentials

For Databricks SQL endpoints, we use Databricks Service Principal (SPN). See here how to create a new SPN and assign required roles and permissions.

Field Description
provider Credential provider, e.g., AzureKeyVault.
spn_clientid Key name in secrets provider containing Service Principal client ID, e.g., databricksspnclientid.
spn_secret Key name in secrets provider containing Service Principal secret, e.g., databricksspnsecret.

SQL Database Credentials

For SQL database sources (PostgreSQL, MySQL, MSSQL), we use username/password authentication:

Field Description
provider Credential provider, e.g., AzureKeyVault.
username_key Key name in secrets provider containing database username, e.g., db-username.
password_key Key name in secrets provider containing database password, e.g., db-password.

metadata/dataset

We expect a single file per dataset (in terms of Databricks SQL source, it is per Unity Catalog schema). The metadata/dataset file is contains the following fields, which must be populated with relevant information:

Field Description
name Name of the dataset, e.g., dataset_1.
description Description of our dataset.
schema_name Schema name, e.g., z__cr8tor_poc.
Warning

Now, we can define the tables and columns we want to retrieve, but this is optional. If we do not provide tables and/or columns, the solution will retrieve all available tables and/or columns.

[[tables]]

Field Description
name Name of the table, e.g., domain.

[[tables.columns]]

Field Description
name Name of the column, e.g., domain_concept_id.