Dataflow Deployment Guide
📘 Overview
CSGHub Dataflow is the data flow management and labeling subsystem of the CSGHub platform.
It handles data processing, annotation, preprocessing, and distribution for model training workflows.
This Helm Chart enables fast deployment of Dataflow and its dependencies (Label Studio, Redis, PostgreSQL, MongoDB, etc.) in a Kubernetes environment.
The chart supports both built-in mode (all dependencies deployed automatically) and external resource mode (connect to managed databases and caches).
⚙️ System Requirements
| Item | Description |
|---|---|
| Kubernetes Version | v1.28+ |
| Helm Version | v3.12+ |
| Network | Cluster nodes must be able to reach the CSGHub main service (externalUrl) |
| Permissions | Requires permissions to create Namespace, Service, PVC, Ingress, etc. |
| Cluster Storage | Must support ReadWriteMany persistent volumes |
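The Helm version and permission requirements can be checked before installing. The following is a minimal sketch rather than part of the chart: the function names `version_ge`, `check_helm`, and `check_permissions` are hypothetical, `sort -V` assumes GNU coreutils or a compatible `sort`, and the resource list simply mirrors the Permissions row above.

```shell
# Return success when version $1 is greater than or equal to version $2.
# A leading "v" (as in "v3.14.0") is stripped before comparing.
version_ge() {
  have="${1#v}"
  want="${2#v}"
  # sort -V orders version strings; if the smaller one is $want, then $have >= $want.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Compare the local Helm client against the v3.12 minimum.
check_helm() {
  helm_version="$(helm version --template '{{.Version}}')"
  if version_ge "$helm_version" "3.12"; then
    echo "helm $helm_version: ok"
  else
    echo "helm $helm_version: older than v3.12" >&2
    return 1
  fi
}

# Ask the API server whether the current user may create each resource type.
check_permissions() {
  namespace="${1:-csghub}"
  for resource in namespaces services persistentvolumeclaims ingresses; do
    if kubectl auth can-i create "$resource" -n "$namespace" >/dev/null 2>&1; then
      echo "create $resource: allowed"
    else
      echo "create $resource: denied" >&2
    fi
  done
}

# Usage against a live cluster:
#   check_helm && check_permissions csghub
```

ReadWriteMany support depends on the storage provisioner and is not covered by this sketch.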
🧩 1. Preparation
Add the CSGHub Helm Repository
helm repo add csghub https://charts.opencsg.com/repository/csghub
helm repo update
Create Namespace (Optional)
kubectl create namespace csghub
🏗️ 2. Deploy Dataflow
Basic Installation (with built-in dependencies)
For testing or development, you can use the default configuration:
- Get the externalUrl of CSGHub:
helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
- Install Dataflow:
helm install dataflow csghub/dataflow \
--namespace csghub \
--create-namespace \
--set global.ingress.domain="example.com" \
--set externalUrl="<csghub externalUrl>" \
--set dataflow.postgresql.database="csghub_dataflow" \
--set labelStudio.postgresql.database="csghub_label_studio"
This installation automatically deploys:
- Dataflow main service
- Label Studio annotation service
- Built-in PostgreSQL, Redis, and MongoDB
- Built-in NGINX Ingress Controller
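Before running the install, the manifests can be previewed with `helm template`, which renders the chart locally without creating any resources. This is a sketch under one assumption: that externalUrl should carry an http:// or https:// scheme, as the chart's default value does. The helper names `is_http_url` and `preview_dataflow` are hypothetical.

```shell
# Succeed only when $1 starts with http:// or https:// and has something after it.
is_http_url() {
  case "$1" in
    http://?*|https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Render the basic installation locally; $1 is the CSGHub externalUrl.
preview_dataflow() {
  external_url="$1"
  is_http_url "$external_url" || {
    echo "externalUrl must start with http:// or https://" >&2
    return 1
  }
  helm template dataflow csghub/dataflow \
    --namespace csghub \
    --set global.ingress.domain="example.com" \
    --set externalUrl="$external_url" \
    --set dataflow.postgresql.database="csghub_dataflow" \
    --set labelStudio.postgresql.database="csghub_label_studio"
}

# Usage:
#   preview_dataflow "https://csghub.example.com" | less
```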
Using External Resources
For production environments, it is recommended to use managed external services:
helm install dataflow csghub/dataflow \
--namespace csghub \
--create-namespace \
--set dataflow.postgresql.database="csghub_dataflow" \
--set labelStudio.postgresql.database="csghub_label_studio" \
-f custom-values.yaml
Example custom-values.yaml:
global:
  ingress:
    domain: "csghub.company.com"
    tls:
      enabled: true
      secretName: "csghub-tls"
  postgresql:
    enabled: false
    external:
      host: "pg.company.com"
      port: 5432
      user: "csghub"
      password: "******"
      sslmode: "require"
  redis:
    enabled: false
    external:
      host: "redis.company.com"
      port: 6379
      password: "******"
  mongo:
    enabled: false
    external:
      host: "mongo.company.com"
      port: 27017
      user: "admin"
      password: "******"
externalUrl: "https://csghub.company.com"
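The `******` values in the example are placeholders and must be replaced with real credentials. A small guard can catch a values file that still contains them before Helm is invoked. This is a sketch only: `has_placeholder_password` and `install_with_values` are hypothetical helper names, and the check only recognizes `password:` lines whose value consists entirely of asterisks.

```shell
# Succeed when the file still contains a masked password such as "******".
has_placeholder_password() {
  grep -Eq '^[[:space:]]*password:[[:space:]]*"?\*+"?[[:space:]]*$' "$1"
}

# Install with an external-resources values file, refusing masked passwords.
install_with_values() {
  values_file="${1:-custom-values.yaml}"
  if has_placeholder_password "$values_file"; then
    echo "replace the ****** passwords in $values_file first" >&2
    return 1
  fi
  helm install dataflow csghub/dataflow \
    --namespace csghub \
    --create-namespace \
    --set dataflow.postgresql.database="csghub_dataflow" \
    --set labelStudio.postgresql.database="csghub_label_studio" \
    -f "$values_file"
}

# Usage:
#   install_with_values custom-values.yaml
```

Values passed with `--set` take precedence over those loaded with `-f`, so the two database names above override any conflicting entries in the file.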
⚙️ 3. Key Configuration Parameters
Global Configuration (global)
| Parameter | Default | Description |
|---|---|---|
| global.edition | ee | Edition: Community (ce) / Enterprise (ee) |
| global.ingress.domain | example.com | Base domain for ingress access |
| global.image.tag | v1.12.0 | Default image version tag |
| global.persistence.size | 10Gi | Default persistent volume size |
| global.postgresql.enabled | true | Enable built-in PostgreSQL |
| global.redis.enabled | true | Enable built-in Redis |
| global.mongo.enabled | true | Enable built-in MongoDB |
Dataflow Service Configuration
| Parameter | Default | Description |
|---|---|---|
| externalUrl | https://csghub.example.com | CSGHub main system URL |
| dataflow.image.repository | opencsghq/dataflow | Dataflow image repository |
| dataflow.image.tag | v1.12.0 | Dataflow image tag |
| dataflow.persistence.size | 100Gi | Persistent volume size |
| dataflow.postgresql | | Override PostgreSQL settings |
| dataflow.redis | | Override Redis settings |
| dataflow.mongo | | Override MongoDB settings |
Worker Configuration
| Parameter | Default | Description |
|---|---|---|
| worker.logging.level | info | Logging level for Celery Worker |
Label Studio Configuration
| Parameter | Default | Description |
|---|---|---|
| labelStudio.image.repository | opencsghq/label-studio | Label Studio image |
| labelStudio.image.tag | v1.12.0 | Label Studio image tag |
| labelStudio.persistence.size | 100Gi | Persistent volume size |
| labelStudio.securityContext.runAsUser | 0 | Container user UID |
| labelStudio.postgresql.database | csghub_label_studio | Label Studio DB name |
Built-in Dependencies
| Component | Parameter | Default | Description |
|---|---|---|---|
| PostgreSQL | postgresql.image.repository | opencsghq/postgres | Built-in database image |
| | postgresql.databases | [csghub_dataflow, csghub_label_studio] | Pre-created databases |
| | postgresql.persistence.size | 50Gi | Persistent volume size |
| Redis | redis.image.repository | redis | Redis image |
| | redis.persistence.size | 10Gi | Persistent volume size |
| MongoDB | mongo.image.repository | opencsghq/mongo | MongoDB image |
| | mongo.persistence.size | 10Gi | Persistent volume size |
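For capacity planning, the default persistent volume sizes listed in the tables can be added up. The sketch below assumes one volume per component (Dataflow, Label Studio, PostgreSQL, Redis, MongoDB) at its default size. The chart may create additional or fewer volumes, so treat the result as a rough estimate.

```shell
# Default sizes in Gi: dataflow, labelStudio, postgresql, redis, mongo.
total=0
for size in 100 100 50 10 10; do
  total=$((total + size))
done
echo "${total}Gi"   # prints 270Gi
```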
🔍 4. Verify Deployment
Check running Pods:
kubectl get pods -n csghub
Check services:
kubectl get svc -n csghub
Functional testing requires connection to the main CSGHub service.
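When many Pods are listed, the ones that are not yet healthy can be filtered out mechanically. The following sketch relies on the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...). The function name `not_ready_pods` is hypothetical, and treating `Completed` Pods as healthy is an assumption that suits one-off job Pods.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the names of
# Pods that are neither Completed nor Running with all containers ready.
not_ready_pods() {
  awk '{
    split($2, ready, "/")
    ok = ($3 == "Completed") || ($3 == "Running" && ready[1] == ready[2])
    if (!ok) print $1
  }'
}

# Usage against a live cluster:
#   kubectl get pods -n csghub --no-headers | not_ready_pods
```

An empty result means every listed Pod is either fully ready or has completed.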
🔄 5. Upgrade and Uninstall
Upgrade Chart
helm upgrade dataflow csghub/dataflow -n csghub -f custom-values.yaml
Uninstall Chart
helm uninstall dataflow -n csghub
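`helm uninstall` only finds a release in the namespace where it was installed (csghub in the install commands above). If the namespace is in doubt, it can be looked up from `helm list -A` first. This is a sketch: `release_namespace` is a hypothetical helper that relies on the NAME and NAMESPACE columns being the first two in the listing.

```shell
# Read `helm list -A` output on stdin and print the namespace of release $1.
release_namespace() {
  awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Usage against a live cluster:
#   ns="$(helm list -A | release_namespace dataflow)"
#   helm uninstall dataflow -n "$ns"
#   kubectl get pvc -n "$ns"   # list any volumes left behind
```

PersistentVolumeClaims created from StatefulSet volume claim templates are generally not removed by `helm uninstall`. Whether that applies here depends on how the chart provisions its volumes, so the last command is worth running before reinstalling.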
🧠 FAQ
| Issue | Cause | Solution |
|---|---|---|
| Dataflow cannot reach main CSGHub | externalUrl misconfigured | Verify URL and TLS settings |
| Label Studio failed to start | Database or PVC misconfigured | Check PostgreSQL/Mongo mount paths |
| Image pull failure | Missing registry credentials | Add image.pullSecrets configuration |
| Redis/Mongo not starting | Conflict with external config | Disable built-in components and redeploy |