Dataflow

1. Overview

Dataflow is the data stream management and annotation subsystem within the CSGHub platform. It is designed to handle model training data, annotation tasks, data preprocessing, and distribution workflows.

Deploying with the Helm chart runs Dataflow and its dependencies (Label Studio and PostgreSQL) in a Kubernetes environment. As of chart v2.2.0, the Dataflow service no longer requires Redis or MongoDB. The chart supports an all-in-one installation (built-in mode) as well as connecting to externally managed resources.

2. Environment Requirements

| Item | Description |
| --- | --- |
| Kubernetes Version | v1.33+ |
| Helm Version | v3.12+ |
| Network | Cluster nodes must be able to access the CSGHub main service (externalUrl) |
| Permissions | Authority to create Namespaces, Services, PVCs, Gateways, etc. |
| Storage | Requires storage volumes that support ReadWriteOnce (RWO) (changed from RWX in chart v2.2.0) |

3. Deployment

3.1 Add Helm Repository

helm repo add csghub https://charts.opencsg.com/csghub
helm repo update

3.2 Create Namespace (Optional)

kubectl create namespace csghub

3.3 Deploy Dataflow

  1. Obtain externalUrl:

    Get the CSGHub access address using the following command:

    helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
  2. Execute Installation:

    💡 Tip for China-based deployments:

    Add these flags to use local mirrors:

    • --set global.image.registry="opencsg-registry.cn-beijing.cr.aliyuncs.com"

    • --set global.imageRegistry="opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"

      helm install dataflow csghub/dataflow \
      --namespace csghub \
      --create-namespace \
      --set global.gateway.external.domain="example.com" \
      --set externalUrl="<csghub externalUrl>" \
      --set dataflow.postgresql.database="csghub_dataflow" \
      --set labelStudio.postgresql.database="csghub_label_studio"

      This will automatically start:

      • Dataflow Main Service
      • Label Studio Annotation Service
      • Built-in PostgreSQL
      • Built-in Gateway API Controller

      ℹ️ Redis and MongoDB are no longer required by the Dataflow service in v2.2.0. The chart only needs PostgreSQL.
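
The install command above can be wrapped in a small shell function so that the mirror flags are appended only when needed. This is a minimal sketch, not part of the chart: the function name install_dataflow and the USE_CN_MIRROR variable are hypothetical, and every flag is copied from the steps above.

```shell
# Sketch only: install_dataflow and USE_CN_MIRROR are hypothetical names.
# Usage: install_dataflow <csghub externalUrl> <gateway domain>
install_dataflow() {
  external_url="$1"
  domain="$2"

  # Base arguments, identical to the install command in step 2.
  set -- install dataflow csghub/dataflow \
    --namespace csghub \
    --create-namespace \
    --set "global.gateway.external.domain=${domain}" \
    --set "externalUrl=${external_url}" \
    --set "dataflow.postgresql.database=csghub_dataflow" \
    --set "labelStudio.postgresql.database=csghub_label_studio"

  # Append the mirror flags from the tip only when explicitly requested.
  if [ "${USE_CN_MIRROR:-0}" = "1" ]; then
    set -- "$@" \
      --set "global.image.registry=opencsg-registry.cn-beijing.cr.aliyuncs.com" \
      --set "global.imageRegistry=opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"
  fi

  helm "$@"
}
```

For example, `USE_CN_MIRROR=1 install_dataflow "https://csghub.example.com" "example.com"` would run the same install with both mirror flags added.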

4. Using External Resources

For production environments, it is recommended to use externally managed resources, such as a managed PostgreSQL instance, instead of the built-in components:

helm upgrade --install dataflow csghub/dataflow \
--namespace csghub \
--create-namespace \
--set dataflow.postgresql.database="csghub_dataflow" \
--set labelStudio.postgresql.database="csghub_label_studio" \
-f custom-values.yaml

Example custom-values.yaml:

global:
  gateway:
    external:
      domain: "company.com"
      tls:
        enabled: true
        secretName: "csghub-tls"

  postgresql:
    enabled: false
    external:
      host: "pg.company.com"
      port: 5432
      user: "csghub"
      password: "******"
      sslmode: "require"

externalUrl: "https://csghub.company.com"

ℹ️ mongo.* is no longer used by Dataflow in chart v2.2.0. The mongo block is kept for backward compatibility with other sub-charts (e.g. label-studio when deployed standalone), but dataflow itself does not require MongoDB.
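
The example values file references a TLS secret named csghub-tls. The chart presumably expects that secret to already exist in the release namespace. A minimal sketch for creating it with kubectl, assuming a PEM certificate and private key are available locally; the helper name and the file paths are hypothetical.

```shell
# Sketch only: create_csghub_tls_secret is a hypothetical helper.
# Usage: create_csghub_tls_secret <cert.pem> <key.pem>
create_csghub_tls_secret() {
  cert_file="$1"
  key_file="$2"

  # The secret name must match the tls secretName value in
  # custom-values.yaml ("csghub-tls" in the example).
  kubectl create secret tls csghub-tls \
    --cert="$cert_file" \
    --key="$key_file" \
    --namespace csghub
}
```

Creating the secret before running helm upgrade --install avoids a window in which the gateway references a secret that is not there yet.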

5. Configuration Parameters

5.1 Global Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| global.edition | ee | Edition: ce / ee / saas (added in v2.2.0) |
| global.gateway.external.domain | example.com | Access domain |
| global.image.tag | v2.2.0 | Image version tag (csghub chart) |
| global.persistence.size | 10Gi | Default PV size |
| global.redis.enabled | true | Enable built-in Redis (not required by dataflow v2.2.0+) |
| global.mongo.enabled | true | Enable built-in MongoDB |
| global.postgresql.enabled | true | Enable built-in PostgreSQL |

5.2 Service Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| externalUrl | https://csghub.example.com | CSGHub main system URL |
| dataflow.image.repository | opencsghq/dataflow | Dataflow image repository |
| dataflow.image.tag | v2.2.0-api | Dataflow image tag (note: -api suffix in v2.2.0) |
| dataflow.persistence.size | 50Gi | Dataflow PV size (was 100Gi in v2.1.x) |
| dataflow.persistence.accessModes | ["ReadWriteOnce"] | PV access mode (changed from ReadWriteMany in v2.2.0) |
| dataflow.postgresql | {} | Override default PostgreSQL config |

5.3 Label Studio Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| labelStudio.image.repository | opencsghq/label-studio | Label Studio image repository |
| labelStudio.image.tag | v2.2.0 | Label Studio image tag |
| labelStudio.persistence.size | 100Gi | Annotation data PV size |
| labelStudio.securityContext.runAsUser | 0 | Container user UID |
| labelStudio.postgresql.database | csghub_label_studio | Database name for Label Studio |

5.4 Built-in Third-party Components

| Component | Parameter | Default | Description |
| --- | --- | --- | --- |
| PostgreSQL | postgresql.image.repository | opencsghq/postgres | Built-in database image |
| PostgreSQL | postgresql.databases | [csghub_dataflow, csghub_label_studio] | Databases created automatically at startup |
| PostgreSQL | postgresql.persistence.size | 50Gi | Persistent volume storage size |

ℹ️ Redis and MongoDB are no longer required by the dataflow service as of v2.2.0. If other sub-charts (e.g. csghub core) need them, configure via the parent chart's global.redis.* / global.mongo.*.
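
Defaults can differ between chart versions, so it may help to check the values above against the chart you actually pulled. A minimal sketch using helm show values; the helper name show_dataflow_default is hypothetical, and it performs a plain text search rather than parsing YAML.

```shell
# Sketch only: show_dataflow_default is a hypothetical helper.
# Usage: show_dataflow_default <text to search for> [lines of context]
show_dataflow_default() {
  pattern="$1"
  context="${2:-3}"

  # helm show values prints the chart's default values.yaml;
  # grep -F treats the pattern as a literal string, -n adds line numbers.
  helm show values csghub/dataflow | grep -n -F -A "$context" -- "$pattern"
}
```

For example, `show_dataflow_default "persistence:" 4` would print each persistence block in the defaults together with the four lines that follow it.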

6. Verification

# Check Pod status
kubectl get pods -n csghub

# Verify services
kubectl get svc -n csghub

Note: Full functional verification requires successful integration with the CSGHub main system.
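
On a busy namespace, reading the pod list by eye is error-prone. The following sketch filters the output of kubectl get pods down to pods that are neither fully ready nor completed. It assumes the default column layout (NAME, READY, STATUS, ...), and the function name list_unhealthy_pods is hypothetical.

```shell
# Sketch only: list_unhealthy_pods is a hypothetical helper.
# Usage: list_unhealthy_pods [namespace]   (defaults to csghub)
list_unhealthy_pods() {
  namespace="${1:-csghub}"

  kubectl get pods -n "$namespace" --no-headers |
    awk '{
      split($2, ready, "/")            # READY column, e.g. "1/1"
      if ($3 == "Completed") next      # finished Job pods are fine
      if ($3 != "Running" || ready[1] != ready[2])
        print $1 " (" $3 ", " $2 ")"
    }'
}
```

An empty result means every pod is either running with all containers ready or has completed.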

7. Upgrade & Uninstallation

7.1 Upgrade Chart

helm upgrade dataflow csghub/dataflow -n csghub -f custom-values.yaml

⚠️ Breaking changes when upgrading from chart v2.1.x to v2.2.0:

  1. StatefulSet → Deployment migration: The dataflow workload switched from StatefulSet to Deployment. PVCs from the old StatefulSet will not be reused. Before upgrading, back up the database and then delete the old PVC:

    kubectl delete pvc data-<release-name>-dataflow-0 -n csghub
  2. Pre-upgrade migration Job: A new pre-upgrade Helm hook Job runs automatically. It applies all SQL files in /scripts/*_dataflow_*.sql with idempotency tracking via the _migrations table. The initial migration snapshots and truncates 6 tables (collection_tasks, data_format_tasks, datasources, deletion_status, job, workers). Make sure to dump the csghub_dataflow database beforehand.
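
Both breaking changes call for a database backup before the upgrade. A minimal sketch of dumping csghub_dataflow from the built-in PostgreSQL pod with pg_dump follows. The helper name, the pod name, and the database user are assumptions to adapt to your release. It also assumes pg_dump is present in the PostgreSQL image and that the user can connect without a password prompt. With an external database, run pg_dump against that host instead.

```shell
# Sketch only: dump_dataflow_db is a hypothetical helper.
# Usage: dump_dataflow_db <postgres pod name> <db user> [output file]
dump_dataflow_db() {
  pg_pod="$1"
  pg_user="$2"
  out_file="${3:-csghub_dataflow-$(date +%Y%m%d%H%M%S).sql}"

  # Run pg_dump inside the pod and stream the SQL dump to a local file.
  kubectl exec -n csghub "$pg_pod" -- \
    pg_dump -U "$pg_user" csghub_dataflow > "$out_file" || return 1

  # Treat an empty dump as a failure so the upgrade is not started without a backup.
  [ -s "$out_file" ] || { echo "dump is empty: $out_file" >&2; return 1; }
  echo "wrote $out_file"
}
```

Running the dump first and starting helm upgrade only if it succeeds keeps a restore point for the tables the initial migration truncates.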

7.2 Uninstall Chart

helm uninstall dataflow -n csghub

8. FAQ

  • Dataflow cannot access main system: Ensure externalUrl and TLS settings are correctly configured.
  • Label Studio startup failure: Check for database connection issues or PVC mount path errors.
  • Image pull failure: Ensure image.pullSecrets are added if using a private registry.
  • Upgrade stuck / migration Job fails: Inspect the migration Job's pod logs (kubectl logs -l app.kubernetes.io/name=dataflow -n csghub). If the migration SQL fails, restore the csghub_dataflow database from the pre-upgrade dump.