Dataflow
1. Overview
Dataflow is the data stream management and annotation subsystem within the CSGHub platform. It is designed to handle model training data, annotation tasks, data preprocessing, and distribution workflows.
Deploying the Helm chart runs Dataflow and its dependencies, Label Studio and PostgreSQL, in a Kubernetes environment. As of chart v2.2.0, the Dataflow service no longer requires Redis or MongoDB. The chart supports an all-in-one installation (built-in mode) as well as connecting to externally managed resources.
2. Environment Requirements
| Item | Description |
|---|---|
| Kubernetes Version | v1.33+ |
| Helm Version | v3.12+ |
| Network | Cluster nodes must be able to access the CSGHub main service (externalUrl) |
| Permissions | Permission to create Namespaces, Services, PVCs, Gateways, etc. |
| Storage | Requires storage volumes that support ReadWriteOnce (RWO) (changed from RWX in chart v2.2.0) |
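The version requirements in the table can be checked before installing. The following is a minimal sketch, assuming GNU `sort -V` is available on the workstation; the function names `version_ge` and `preflight` are illustrative and not part of the chart.

```shell
# version_ge A B: succeeds when version A is at least version B.
# A leading "v" and any "+build" suffix are ignored; relies on GNU `sort -V`.
version_ge() {
  a="${1#v}"; a="${a%%+*}"
  b="${2#v}"; b="${b%%+*}"
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}

# preflight: compares the local Helm client and the cluster's server version
# against the minimums from the table (Helm v3.12, Kubernetes v1.33).
preflight() {
  helm_ver="$(helm version --short)" || return 1
  k8s_ver="$(kubectl version -o yaml |
    awk '/^serverVersion:/ {s = 1} s && /gitVersion:/ {print $2; exit}')"
  version_ge "$helm_ver" "v3.12" ||
    { echo "Helm $helm_ver is older than v3.12" >&2; return 1; }
  version_ge "$k8s_ver" "v1.33" ||
    { echo "Kubernetes $k8s_ver is older than v1.33" >&2; return 1; }
  echo "OK: Helm $helm_ver, Kubernetes $k8s_ver"
}
```

Running `preflight` requires a working kubeconfig for the target cluster. It does not check storage classes or permissions, which still need to be verified separately.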
3. Deployment
3.1 Add Helm Repository
helm repo add csghub https://charts.opencsg.com/csghub
helm repo update
3.2 Create Namespace (Optional)
kubectl create namespace csghub
3.3 Deploy Dataflow
- Obtain externalUrl:
  Get the CSGHub access address using the following command:
  helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
- Execute Installation:
  💡 Tip for China-based deployments:
  Add these flags to use local mirrors:
  --set global.image.registry="opencsg-registry.cn-beijing.cr.aliyuncs.com"
  --set global.imageRegistry="opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"
  helm install dataflow csghub/dataflow \
    --namespace csghub \
    --create-namespace \
    --set global.gateway.external.domain="example.com" \
    --set externalUrl="<csghub externalUrl>" \
    --set dataflow.postgresql.database="csghub_dataflow" \
    --set labelStudio.postgresql.database="csghub_label_studio"
  This will automatically start:
- Dataflow Main Service
- Label Studio Annotation Service
- Built-in PostgreSQL
- Built-in Gateway API Controller
ℹ️ Redis and MongoDB are no longer required by the Dataflow service in v2.2.0. The chart only needs PostgreSQL.
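The two steps above can be chained so that the address does not have to be copied by hand. This is a sketch under one assumption that the source does not confirm: that the release notes print the access address as a plain `http://` or `https://` token. The helper name `first_url` is hypothetical.

```shell
# first_url: prints the first http(s) URL found on stdin; fails if there is none.
first_url() {
  grep -oE 'https?://[^[:space:]"]+' | head -n 1 | grep .
}

# Usage (needs access to the cluster where the csghub release is installed):
#   EXTERNAL_URL="$(helm get notes csghub -n csghub \
#     | grep -A 6 'Access your CSGHub' | first_url)"
#   echo "$EXTERNAL_URL"   # inspect before passing it to --set externalUrl=
```

Check the extracted value before using it: if the notes end the URL with punctuation, that character would be captured as well.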
4. Using External Resources
For production environments, externally managed databases are recommended instead of the built-in PostgreSQL:
helm upgrade --install dataflow csghub/dataflow \
--namespace csghub \
--create-namespace \
--set dataflow.postgresql.database="csghub_dataflow" \
--set labelStudio.postgresql.database="csghub_label_studio" \
-f custom-values.yaml
Example custom-values.yaml:
global:
  gateway:
    external:
      domain: "company.com"
      tls:
        enabled: true
        secretName: "csghub-tls"
  postgresql:
    enabled: false
    external:
      host: "pg.company.com"
      port: 5432
      user: "csghub"
      password: "******"
      sslmode: "require"
externalUrl: "https://csghub.company.com"
ℹ️ mongo.* is no longer used by Dataflow in chart v2.2.0. The mongo block is kept for backward compatibility with other sub-charts (e.g. label-studio when deployed standalone), but Dataflow itself does not require MongoDB.
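Before installing against an external database, it can help to confirm that the cluster can actually reach it with the configured host, port, user, and sslmode. The sketch below runs `psql` from a throwaway pod in the csghub namespace. It assumes the public `postgres:16` image is pullable from the cluster and that `PGPASSWORD` is set in the calling shell; the function names are illustrative.

```shell
# pg_conninfo HOST PORT USER DBNAME SSLMODE: prints a libpq keyword/value string.
pg_conninfo() {
  printf 'host=%s port=%s user=%s dbname=%s sslmode=%s\n' "$1" "$2" "$3" "$4" "$5"
}

# check_external_pg DBNAME: runs `SELECT 1` from a temporary pod, so the check
# uses the same network path as the Dataflow pods will.
# Host, port, user and sslmode mirror the example custom-values.yaml.
check_external_pg() {
  kubectl run "pg-check-$$" --rm -i --restart=Never -n csghub \
    --image=postgres:16 --env="PGPASSWORD=$PGPASSWORD" -- \
    psql "$(pg_conninfo pg.company.com 5432 csghub "$1" require)" -c 'SELECT 1'
}

# Usage:
#   PGPASSWORD='<password>' check_external_pg csghub_dataflow
```

Passing the password through `--env` makes it visible in the temporary pod's spec, so this is suited to a quick connectivity test rather than routine use.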
5. Configuration Parameters
5.1 Global Configuration
| Parameter | Default | Description |
|---|---|---|
| global.edition | ee | Edition: ce / ee / saas (added in v2.2.0) |
| global.gateway.external.domain | example.com | Access domain |
| global.image.tag | v2.2.0 | Image version tag (csghub chart) |
| global.persistence.size | 10Gi | Default PV size |
| global.redis.enabled | true | Enable built-in Redis (not required by dataflow v2.2.0+) |
| global.mongo.enabled | true | Enable built-in MongoDB |
| global.postgresql.enabled | true | Enable built-in PostgreSQL |
5.2 Service Configuration
| Parameter | Default | Description |
|---|---|---|
| externalUrl | https://csghub.example.com | CSGHub main system URL |
| dataflow.image.repository | opencsghq/dataflow | Dataflow image repository |
| dataflow.image.tag | v2.2.0-api | Dataflow image tag (note: -api suffix in v2.2.0) |
| dataflow.persistence.size | 50Gi | Dataflow PV size (was 100Gi in v2.1.x) |
| dataflow.persistence.accessModes | ["ReadWriteOnce"] | PV access mode (changed from ReadWriteMany in v2.2.0) |
| dataflow.postgresql | {} | Override default PostgreSQL config |
5.3 Label Studio Configuration
| Parameter | Default | Description |
|---|---|---|
| labelStudio.image.repository | opencsghq/label-studio | Label Studio image repository |
| labelStudio.image.tag | v2.2.0 | Label Studio image tag |
| labelStudio.persistence.size | 100Gi | Annotation data PV size |
| labelStudio.securityContext.runAsUser | 0 | Container user UID |
| labelStudio.postgresql.database | csghub_label_studio | Database name for Label Studio |
5.4 Built-in Third-party Components
| Component | Parameter | Default | Description |
|---|---|---|---|
| PostgreSQL | postgresql.image.repository | opencsghq/postgres | Built-in database image |
| PostgreSQL | postgresql.databases | [csghub_dataflow, csghub_label_studio] | Databases created automatically at startup |
| PostgreSQL | postgresql.persistence.size | 50Gi | Persistent volume storage size |
ℹ️ Redis and MongoDB are no longer required by the dataflow service as of v2.2.0. If other sub-charts (e.g. csghub core) need them, configure them via the parent chart's global.redis.* / global.mongo.* values.
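To see how a parameter override changes what the chart would create, the manifests can be rendered locally with `helm template` and summarized by resource kind. This is a sketch that assumes the chart renders without cluster access; `count_kinds` is a hypothetical helper name.

```shell
# count_kinds: reads rendered manifests on stdin and prints "<count> <Kind>"
# for each top-level resource kind, sorted by kind name.
count_kinds() {
  awk '/^kind:/ {n[$2]++} END {for (k in n) print n[k], k}' | sort -k2
}

# Usage: compare the default render with one that disables built-in PostgreSQL.
#   helm template dataflow csghub/dataflow -n csghub | count_kinds
#   helm template dataflow csghub/dataflow -n csghub \
#     --set global.postgresql.enabled=false | count_kinds
```

A difference in the StatefulSet, Deployment, or PersistentVolumeClaim counts between the two renders shows which workloads the override adds or removes.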
6. Verification
# Check Pod status
kubectl get pods -n csghub
# Verify services
kubectl get svc -n csghub
Note: Full functional verification requires successful integration with the CSGHub main system.
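When the namespace holds many pods, scanning the status column by eye is error-prone. The filter below is a small sketch, not part of the chart, that reads the pod listing and prints only the pods that need attention; `not_ready` is an illustrative name.

```shell
# not_ready: reads `kubectl get pods --no-headers` output on stdin and prints
# the name of every pod that is neither Completed nor Running with all of its
# containers ready (READY column x/y with x equal to y).
not_ready() {
  awk '{
    split($2, r, "/")
    if ($3 == "Completed") next
    if ($3 != "Running" || r[1] != r[2]) print $1
  }'
}

# Usage:
#   kubectl get pods -n csghub --no-headers | not_ready
```

Empty output means every pod is either running with all containers ready or has completed, which is the expected state for one-off Job pods.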
7. Upgrade & Uninstallation
7.1 Upgrade Chart
helm upgrade dataflow csghub/dataflow -n csghub -f custom-values.yaml
⚠️ Breaking changes when upgrading from chart v2.1.x to v2.2.0:
- StatefulSet → Deployment migration: The dataflow workload switched from StatefulSet to Deployment. PVCs from the old StatefulSet will not be reused. Before upgrading, back up the database and then delete the old PVC:
  kubectl delete pvc data-<release-name>-dataflow-0 -n csghub
- Pre-upgrade migration Job: A new pre-upgrade Helm hook Job runs automatically. It applies all SQL files in /scripts/*_dataflow_*.sql with idempotency tracking via the _migrations table. The initial migration snapshots and truncates 6 tables (collection_tasks, data_format_tasks, datasources, deletion_status, job, workers). Make sure to dump the csghub_dataflow database beforehand.
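The database dump called for above can be scripted so that the upgrade is not started against an empty or failed backup. This is a minimal sketch, assuming `pg_dump` is installed locally and that the standard libpq variables `PGHOST`, `PGPORT`, `PGUSER`, and `PGPASSWORD` point at the PostgreSQL instance Dataflow uses; the function name is illustrative.

```shell
# backup_dataflow_db OUTFILE: dumps the csghub_dataflow database to OUTFILE as
# plain SQL and fails if pg_dump errors out or the resulting file is empty.
backup_dataflow_db() {
  out="$1"
  pg_dump --dbname=csghub_dataflow --format=plain --file="$out" || return 1
  [ -s "$out" ] || { echo "dump $out is empty" >&2; return 1; }
  echo "wrote $out"
}

# Usage: only run the upgrade when the dump succeeded.
#   backup_dataflow_db "csghub_dataflow-$(date +%Y%m%d-%H%M%S).sql" &&
#     helm upgrade dataflow csghub/dataflow -n csghub -f custom-values.yaml
```

For the built-in PostgreSQL, the database is not reachable from outside the cluster by default, so the dump would have to go through something like `kubectl port-forward` or `kubectl exec`; the Service and pod names depend on the release and are not given in this document.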
7.2 Uninstall Chart
helm uninstall dataflow -n csghub
8. FAQ
- Dataflow cannot access main system: Ensure externalUrl and TLS settings are correctly configured.
- Label Studio startup failure: Check for database connection issues or PVC mount path errors.
- Image pull failure: Ensure image.pullSecrets are added if using a private registry.
- Upgrade stuck / migration Job fails: Inspect the Job logs (kubectl logs -l app.kubernetes.io/name=dataflow -n csghub). If the migration SQL fails, restore the csghub_dataflow database from the pre-upgrade dump.
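If the label selector matches nothing, the Jobs can be listed and their logs printed one at a time. This is a sketch under the assumption that the hook Job still exists, since Helm may delete hook resources depending on the chart's hook delete policy; `job_logs` is a hypothetical helper name.

```shell
# job_logs NAMESPACE: prints the last 200 log lines of every Job in the
# namespace, with a header line naming each Job.
job_logs() {
  ns="$1"
  kubectl get jobs -n "$ns" -o name | while read -r job; do
    echo "=== ${job##*/} ==="
    # </dev/null keeps kubectl from consuming the loop's input
    kubectl logs "job/${job##*/}" -n "$ns" --all-containers --tail=200 </dev/null
  done
}

# Usage:
#   job_logs csghub
```

`kubectl describe job <name> -n csghub` shows scheduling and back-off events, which helps when the Job's pod never started and there are no logs to read.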