CR8TOR Publisher Overview
The CR8TOR Publisher is a microservices-based platform that facilitates secure data access requests, enabling users to request, approve, and retrieve datasets safely and efficiently. The publisher consists of three FastAPI-based microservices that work together to orchestrate the data transfer process.
Architecture Components
The CR8TOR Publisher comprises three microservices:
1. Approval Service
The Approval Service acts as an API gateway, taking requests from the outside world and forwarding them to the relevant services. It serves as the main entry point for all CR8TOR operations.
Key Features:
- API gateway functionality for routing requests
- Request validation and authentication
- Coordination between Metadata and Publish services
- Centralized error handling and response formatting
Main Endpoints:
- POST project/validate - Validates connections and retrieves metadata
- POST project/package - Initiates data packaging to staging
- POST project/publish - Publishes data to production storage
See detailed Approval Service documentation
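As a rough illustration of how a client might address the gateway, the sketch below builds (but does not send) a POST request for one of the endpoints listed above. The endpoint paths come from this page; the base URL, the JSON payload shape, and the `x-api-key` header name are assumptions made for the example and may not match the real service.

```python
import json
import urllib.request

# Endpoint paths as listed above; everything else in this sketch is assumed.
APPROVAL_ENDPOINTS = {
    "validate": "project/validate",
    "package": "project/package",
    "publish": "project/publish",
}


def build_approval_request(
    base_url: str, action: str, payload: dict, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for one Approval Service endpoint."""
    url = f"{base_url.rstrip('/')}/{APPROVAL_ENDPOINTS[action]}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # hypothetical header name
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the example stops short of that so it can be read without a running service.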
2. Metadata Service
The Metadata Service fetches dataset metadata, including table-column level descriptions, data types, and names, without exposing the actual data.
Key Features:
- Validates source and destination connections
- Retrieves metadata from data sources (e.g. SQL Server, MySQL, PostgreSQL, Databricks Unity Catalog)
- Provides schema information for requested datasets
- Ensures metadata accuracy without data exposure
Main Endpoints:
- POST metadata/project - Retrieves and validates dataset metadata
See detailed Metadata Service documentation
3. Publish Service
The Publish Service handles the actual data transfer operations, retrieving datasets from source systems and packaging them for consumption.
Key Features:
- Data extraction from source databases (e.g. SQL Server, MySQL, PostgreSQL, Databricks Unity Catalog)
- Data packaging and format conversion (CSV or DuckDB)
- Staging and production data management
- Hash calculation and integrity verification using BagIt
Main Endpoints:
- POST data-publish/validate - Validates source/destination connections
- POST data-publish/package - Packages data to staging container
- POST data-publish/publish - Publishes data to production container
See detailed Publish Service documentation
Destination Type Behaviors
The Publish Service adapts its data handling behavior based on the specified project destination type:
PostgreSQL Destination
When the project destination is configured as postgresql, the Publish Service:
- Data Loading: Loads the source data directly into a PostgreSQL database
- OPAL Integration: Creates and configures Obiba OPAL components for secure data access:
  - Creates an OPAL project for the dataset
  - Establishes OPAL resources pointing to PostgreSQL tables within the project
  - Creates DataSHIELD permission groups (named {project_name}_group)
  - Assigns DataSHIELD permissions to the created groups
  - Sets resource-level permissions for the groups to access project data
- Access Control: Leverages OPAL's DataSHIELD framework for secure, privacy-preserving data analysis
Filestore Destination
When the project destination is configured as filestore, the Publish Service:
- File-based Storage: Loads data to the mounted filestore rather than a database
- Destination-Specific Storage: Uses the destination name to determine the target storage location via environment variables (e.g., TARGET_STORAGE_ACCOUNT_{DESTINATION_NAME}_SDE_MNT_PATH)
- Two-stage Process:
  - Staging Phase: Data is first written to a staging container/filestore
  - Production Phase: Data is then moved from staging to the production container/filestore
- Format Options: Data can be packaged in multiple formats (CSV or DuckDB) for flexible consumption
- BagIt Packaging: Files are organized following BagIt standards with checksums for integrity verification
This destination-specific behavior ensures optimal data handling and access patterns for different target environments while maintaining consistent security and governance standards.
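To make the BagIt integrity step more concrete, the following minimal sketch computes SHA-256 checksums for the payload files under a bag's `data/` directory and formats them as manifest lines (checksum, whitespace, relative path), as a BagIt `manifest-sha256.txt` would list them. It uses only the standard library; the function names are hypothetical, and the Publish Service itself may well rely on a dedicated BagIt library and a different checksum algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large exports need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(bag_dir: Path) -> list[str]:
    """Return 'checksum  relative/path' lines for every file under data/."""
    payload_dir = bag_dir / "data"
    lines = []
    for file_path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        # Paths are recorded relative to the bag root, with forward slashes.
        relative = file_path.relative_to(bag_dir).as_posix()
        lines.append(f"{sha256_of(file_path)}  {relative}")
    return lines
```

A consumer can recompute the same checksums after transfer and compare them against the manifest to detect corrupted or missing files.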
Required Environment Variables
The Publish Service requires different environment variables depending on the destination type:
PostgreSQL Destination Environment Variables
When using PostgreSQL as the destination, the following environment variables are required:
OPAL Configuration:
- DESTINATION_OPAL_HOST - The OPAL server host URL
- DESTINATION_OPAL_USERNAME - Username for OPAL authentication
- DESTINATION_OPAL_PASSWORD_SECRET_NAME - Name of the secret containing the OPAL password
- DESTINATION_OPAL_NO_SSL_VERIFY - Whether to skip SSL verification (default: "false")
PostgreSQL Configuration:
- DESTINATION_POSTGRESQL_HOST - PostgreSQL server host
- DESTINATION_POSTGRESQL_PORT - PostgreSQL server port
- DESTINATION_POSTGRESQL_DATABASE - Target database name
- DESTINATION_POSTGRESQL_OPAL_READONLY_USERNAME - Read-only username for OPAL resource access
- DESTINATION_POSTGRESQL_OPAL_READONLY_PASSWORD_SECRET_NAME - Name of the secret containing the read-only password
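A deployment could check these settings before starting a transfer. The sketch below is one possible pre-flight check, not the service's own validation: the variable names are taken from the lists above, while the helper names and the rule that only the string "true" (case-insensitive) disables SSL verification are assumptions.

```python
import os

# Variables listed above as required; DESTINATION_OPAL_NO_SSL_VERIFY is
# left out because it has a documented default of "false".
REQUIRED_POSTGRESQL_DESTINATION_VARS = (
    "DESTINATION_OPAL_HOST",
    "DESTINATION_OPAL_USERNAME",
    "DESTINATION_OPAL_PASSWORD_SECRET_NAME",
    "DESTINATION_POSTGRESQL_HOST",
    "DESTINATION_POSTGRESQL_PORT",
    "DESTINATION_POSTGRESQL_DATABASE",
    "DESTINATION_POSTGRESQL_OPAL_READONLY_USERNAME",
    "DESTINATION_POSTGRESQL_OPAL_READONLY_PASSWORD_SECRET_NAME",
)


def missing_postgresql_vars(environ=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_POSTGRESQL_DESTINATION_VARS if not env.get(name)]


def opal_ssl_verification_disabled(environ=None) -> bool:
    """Assumed parsing rule: only the string 'true' disables verification."""
    env = os.environ if environ is None else environ
    return env.get("DESTINATION_OPAL_NO_SSL_VERIFY", "false").strip().lower() == "true"
```

Calling `missing_postgresql_vars()` at start-up and failing fast on a non-empty result tends to give clearer errors than a connection failure later in the pipeline.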
Filestore Destination Environment Variables
When using filestore as the destination, the system requires destination-specific environment variables for storage mount paths:
Storage Mount Configuration:
- TARGET_STORAGE_ACCOUNT_{DESTINATION_NAME}_SDE_MNT_PATH - Base path to the mounted storage account for the specific destination
Where {DESTINATION_NAME} is the uppercase version of the destination name specified in the project configuration. For example:
- If the destination name is "LSC", the environment variable would be TARGET_STORAGE_ACCOUNT_LSC_SDE_MNT_PATH
- If the destination name is "NW", the environment variable would be TARGET_STORAGE_ACCOUNT_NW_SDE_MNT_PATH
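The naming rule above can be expressed in a few lines. This is a minimal sketch with hypothetical helper names; it assumes the only transformation applied to the destination name is upper-casing, as described.

```python
import os


def storage_mount_env_var(destination_name: str) -> str:
    """Derive the environment variable name for a destination's mount path."""
    return f"TARGET_STORAGE_ACCOUNT_{destination_name.upper()}_SDE_MNT_PATH"


def storage_mount_path(destination_name: str, environ=None) -> str:
    """Look up the mount path, naming the missing variable if it is unset."""
    env = os.environ if environ is None else environ
    var_name = storage_mount_env_var(destination_name)
    try:
        return env[var_name]
    except KeyError:
        raise KeyError(f"Environment variable {var_name} is not set") from None
```

For example, `storage_mount_env_var("lsc")` yields `TARGET_STORAGE_ACCOUNT_LSC_SDE_MNT_PATH`.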
The system creates the following directory structure within the mounted storage:
{base_path}/
├── staging/
│   └── {project_name}/
│       └── {project_start_time}/
│           └── data/outputs/
└── production/
    └── {project_name}/
        └── {project_start_time}/
            └── data/outputs/
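The layout above maps directly onto a path-building helper. The sketch below is illustrative only: the function name is hypothetical, and the format of `project_start_time` is not specified on this page, so the timestamp in the usage note is a made-up placeholder.

```python
from pathlib import PurePosixPath

VALID_STAGES = ("staging", "production")


def output_dir(
    base_path: str, stage: str, project_name: str, project_start_time: str
) -> PurePosixPath:
    """Build {base_path}/{stage}/{project_name}/{project_start_time}/data/outputs."""
    if stage not in VALID_STAGES:
        raise ValueError(f"stage must be one of {VALID_STAGES}, got {stage!r}")
    return (
        PurePosixPath(base_path)
        / stage
        / project_name
        / project_start_time
        / "data"
        / "outputs"
    )
```

With a placeholder timestamp, `output_dir("/mnt/lsc", "staging", "demo", "20250101_120000")` resolves to `/mnt/lsc/staging/demo/20250101_120000/data/outputs`; the production path differs only in the second component, which is what makes the staging-to-production move a simple relocation.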
Additional DLT Configuration:
- DLTHUB_PIPELINE_WORKING_DIR - Working directory for DLT Hub pipeline operations
- DATA_WRITER__FILE_MAX_BYTES - Maximum file size in bytes (default: 100MB)
- DATA_WRITER__DISABLE_COMPRESSION - Whether to disable compression for CSV files
OPAL Integration Details
For PostgreSQL destinations, the system performs the following OPAL operations:
- Project Creation: Creates an OPAL project named after the CR8TOR project
- Group Management: Creates a DataSHIELD group named {project_name}_group
- User Management: Ensures a default DataSHIELD user (dsuser_default) exists and is assigned to the group
- Resource Creation: Creates OPAL resources for each PostgreSQL table with naming pattern tre_postgresql_{schema}_{table}
- Permission Assignment:
  - Adds the group to DataSHIELD permissions with "use" permission
  - Sets resource-level permissions for the group with "view" permission on the project
The OPAL resources are configured as SQL resources pointing to the specific PostgreSQL tables, enabling secure DataSHIELD-compliant data access for approved users.
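The naming conventions described in this section can be captured as small helpers. This sketch only derives the names; it does not call OPAL, and the function names and the example schema and table are hypothetical.

```python
DEFAULT_DATASHIELD_USER = "dsuser_default"


def datashield_group_name(project_name: str) -> str:
    """Group name pattern: {project_name}_group."""
    return f"{project_name}_group"


def opal_resource_name(schema: str, table: str) -> str:
    """Resource name pattern: tre_postgresql_{schema}_{table}."""
    return f"tre_postgresql_{schema}_{table}"


def planned_opal_names(project_name: str, tables: list[tuple[str, str]]) -> dict:
    """Summarize the names the steps above would create for one project."""
    return {
        "project": project_name,
        "group": datashield_group_name(project_name),
        "user": DEFAULT_DATASHIELD_USER,
        "resources": [opal_resource_name(schema, table) for schema, table in tables],
    }
```

For a project called `demo` with one table `public.patients`, this gives the group `demo_group` and the resource `tre_postgresql_public_patients`.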
Data Flow Architecture
graph TD
A[CR8TOR CLI] --> B[Approval Service]
B --> C[Metadata Service]
B --> D[Publish Service]
C --> E[Data Source]
D --> E
D --> F[Staging Storage]
D --> G[Production Storage]
F --> G
Deployment Architecture
The microservices are containerized using Docker and designed to be deployed on Kubernetes clusters such as Azure Kubernetes Service (AKS). Each service:
- Runs in its own container with isolated dependencies
- Supports volume mounting for secrets and configuration
- Provides health checks and monitoring endpoints
- Scales independently based on workload demands
Security Features
- API Key Authentication: Each service uses static API keys for inter-service communication
- Secret Management: Secrets are mounted at container level and stored in a secure credential store (e.g. Azure Key Vault)
- Data Isolation: Services operate with minimal data exposure principles
- Audit Logging: Comprehensive logging of all data access operations
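As a sketch of how static API keys and mounted secrets might fit together, the example below reads a secret from a mounted directory and compares it with a presented key in constant time. It assumes each secret is exposed as a file named after the secret, which is a common Kubernetes pattern but is not confirmed by this page; the helper names are hypothetical.

```python
import hmac
from pathlib import Path


def read_mounted_secret(secrets_dir: Path, secret_name: str) -> str:
    """Read a secret exposed as a file (assumed layout: one file per secret)."""
    return (secrets_dir / secret_name).read_text(encoding="utf-8").strip()


def api_key_matches(presented: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

`hmac.compare_digest` is preferred over `==` here because its running time does not depend on where the two strings first differ.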
Integration with CR8TOR Workflow
The publisher services integrate seamlessly with the CR8TOR CLI workflow:
- Validation Phase: Metadata Service validates data source connections
- Staging Phase: Publish Service extracts and stages data
- Publication Phase: Publish Service moves data to production storage
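The ordering of the phases above can be sketched as a small driver that stops at the first failure. Mapping each phase to an Approval Service endpoint is an inference from the endpoint descriptions on this page, and the `post` callable is a hypothetical stand-in for whatever HTTP client the CLI uses; any approval steps that happen between phases are omitted.

```python
from typing import Callable

# Phase name paired with the Approval Service endpoint assumed to drive it.
PHASES = (
    ("validation", "project/validate"),
    ("staging", "project/package"),
    ("publication", "project/publish"),
)


def run_phases(post: Callable[[str, dict], bool], payload: dict) -> list[str]:
    """Call each phase in order and return the phases that completed.

    `post` takes an endpoint path and a payload and returns True on success.
    Processing stops at the first failure, so data is never published
    without having been validated and staged first.
    """
    completed = []
    for phase, endpoint in PHASES:
        if not post(endpoint, payload):
            break
        completed.append(phase)
    return completed
```

Injecting `post` keeps the ordering logic independent of the transport, which also makes it straightforward to exercise with a fake client.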
Infrastructure Requirements
The Lancashire and Cumbria Secure Data Environment (SDE) department uses Azure Kubernetes Service (AKS) to host and run the microservices within its SDE. The specific infrastructure and Kubernetes (K8S) configuration can be found here.
Configuration Management
All services support:
- Environment variable configuration
- Docker network communication
- Secrets mounting for sensitive data
- Configurable storage paths and endpoints
For detailed configuration information, see the individual service documentation linked above.
