Dataflow

1. Overview

Dataflow is the data stream management and annotation subsystem within the CSGHub platform. It is designed to handle model training data, annotation tasks, data preprocessing, and distribution workflows.

Deploying with the Helm chart runs Dataflow and its dependencies (Label Studio, Redis, PostgreSQL, and MongoDB) in a Kubernetes environment. The chart supports both an all-in-one installation (built-in mode) and connecting to externally managed resources.

2. Environment Requirements

Item                 Description
Kubernetes Version   v1.33+
Helm Version         v3.12+
Network              Cluster nodes must be able to access the CSGHub main service (externalUrl)
Permissions          Authority to create Namespaces, Services, PVCs, Gateways, etc.
Storage              Requires storage volumes that support ReadWriteMany (RWX)
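
Some of the requirements above can be checked from a workstation before installing. The following is a minimal preflight sketch, assuming GNU `sort -V` is available and that `helm` and `kubectl` are on the PATH and pointed at the target cluster. The helper names `version_ge`, `normalize_version`, and `preflight` are illustrative and not part of the chart.

```shell
#!/usr/bin/env bash
# Preflight sketch for the requirements table (helper names are hypothetical).

# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# normalize_version V: strip a leading "v" and any "+build" suffix.
normalize_version() {
  printf '%s\n' "$1" | sed -e 's/^v//' -e 's/+.*$//'
}

preflight() {
  local helm_version
  helm_version="$(normalize_version "$(helm version --short)")"
  if version_ge "$helm_version" "3.12"; then
    echo "helm $helm_version: ok"
  else
    echo "helm $helm_version: need 3.12 or newer" >&2
    return 1
  fi

  # Kubernetes: prints client and server versions; compare the server
  # version against v1.33 by eye.
  kubectl version

  # Permissions: a few of the resource types named in the table.
  kubectl auth can-i create namespaces
  kubectl auth can-i create services -n csghub
  kubectl auth can-i create persistentvolumeclaims -n csghub

  # Storage: RWX support depends on the provisioner behind each class,
  # so this only lists candidates for manual confirmation.
  kubectl get storageclass
}
```

Calling `preflight` runs the checks. The version comparison is kept as a separate function so it can be reused for other minimum-version checks.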

3. Deployment

3.1 Add Helm Repository

helm repo add csghub https://charts.opencsg.com/csghub
helm repo update

3.2 Create Namespace (Optional)

kubectl create namespace csghub

3.3 Deploy Dataflow

  1. Obtain externalUrl:

    Get the CSGHub access address using the following command:

    helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
  2. Execute Installation:

    💡 Tip for China-based deployments:

    Add these flags to use local mirrors:

    • --set global.image.registry="opencsg-registry.cn-beijing.cr.aliyuncs.com"
    • --set global.imageRegistry="opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"

    helm install dataflow csghub/dataflow \
      --namespace csghub \
      --create-namespace \
      --set global.gateway.external.domain="example.com" \
      --set externalUrl="<csghub externalUrl>" \
      --set dataflow.postgresql.database="csghub_dataflow" \
      --set labelStudio.postgresql.database="csghub_label_studio"

    This will automatically start:

    • Dataflow Main Service
    • Label Studio Annotation Service
    • Built-in PostgreSQL, Redis, and MongoDB
    • Built-in Gateway API Controller
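
The two steps above can be chained in a small script. This is a sketch rather than the chart's documented procedure: it assumes the first URL in the matched notes excerpt is the CSGHub access address, which should be confirmed by eye before installing. `first_url` and `install_dataflow` are illustrative helper names.

```shell
#!/usr/bin/env bash
# Sketch: derive externalUrl from the CSGHub release notes, then install Dataflow.

# first_url: print the first http(s) URL found on stdin (nothing if none).
first_url() {
  grep -Eo 'https?://[^[:space:]"]+' | head -n 1
}

install_dataflow() {
  local external_url
  external_url="$(helm get notes csghub -n csghub \
    | grep -A 6 'Access your CSGHub' | first_url)"

  if [ -z "$external_url" ]; then
    echo "No URL found in the CSGHub notes; set externalUrl manually." >&2
    return 1
  fi
  echo "Using externalUrl=$external_url"

  helm install dataflow csghub/dataflow \
    --namespace csghub \
    --create-namespace \
    --set global.gateway.external.domain="example.com" \
    --set externalUrl="$external_url" \
    --set dataflow.postgresql.database="csghub_dataflow" \
    --set labelStudio.postgresql.database="csghub_label_studio"
}
```

Failing early when no URL is found avoids installing with an empty externalUrl, which would otherwise only show up later as Dataflow being unable to reach the main system.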

4. Using External Resources

For production environments, use externally managed databases and caches instead of the built-in ones:

helm upgrade --install dataflow csghub/dataflow \
--namespace csghub \
--create-namespace \
--set dataflow.postgresql.database="csghub_dataflow" \
--set labelStudio.postgresql.database="csghub_label_studio" \
-f custom-values.yaml

Example custom-values.yaml:

# Nesting follows the parameter paths in section 5.1 (e.g. global.postgresql.enabled).
# Keys not listed there, such as tls, should be checked against:
#   helm show values csghub/dataflow
global:
  gateway:
    external:
      domain: "company.com"
      tls:
        enabled: true
        secretName: "csghub-tls"

  postgresql:
    enabled: false
    external:
      host: "pg.company.com"
      port: 5432
      user: "csghub"
      password: "******"
      sslmode: "require"

  redis:
    enabled: false
    external:
      host: "redis.company.com"
      port: 6379
      password: "******"

  mongo:
    enabled: false
    external:
      host: "mongo.company.com"
      port: 27017
      user: "admin"
      password: "******"

externalUrl: "https://csghub.company.com"
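
The example file uses ****** as a password placeholder, so it should not be applied verbatim. The sketch below is one possible guard, assuming the file is named custom-values.yaml as above; `has_placeholders` and `deploy_external` are hypothetical helper names. It refuses to deploy while placeholders remain, then does a `--dry-run` pass before the real upgrade.

```shell
#!/usr/bin/env bash
# Sketch: refuse to deploy while the values file still contains "******".

# has_placeholders FILE: succeeds if any line still holds the ****** placeholder.
has_placeholders() {
  grep -q -F '******' "$1"
}

deploy_external() {
  local values_file="${1:-custom-values.yaml}"

  if has_placeholders "$values_file"; then
    echo "Replace the ****** placeholders in $values_file first:" >&2
    grep -n -F '******' "$values_file" >&2
    return 1
  fi

  # Same arguments as the documented command, kept in one place.
  local args=(
    upgrade --install dataflow csghub/dataflow
    --namespace csghub --create-namespace
    --set dataflow.postgresql.database="csghub_dataflow"
    --set labelStudio.postgresql.database="csghub_label_studio"
    -f "$values_file"
  )

  # Simulate the release first; apply only if the dry run succeeds.
  helm "${args[@]}" --dry-run >/dev/null && helm "${args[@]}"
}
```

Usage: `deploy_external custom-values.yaml`.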

5. Configuration Parameters

5.1 Global Configuration

Parameter                        Default       Description
global.edition                   ee            Edition: ce (Community) / ee (Enterprise)
global.gateway.external.domain   example.com   Access domain
global.image.tag                 v1.16.0       Image version tag
global.persistence.size          10Gi          Default PV size
global.postgresql.enabled        true          Enable built-in PostgreSQL
global.redis.enabled             true          Enable built-in Redis
global.mongo.enabled             true          Enable built-in MongoDB

5.2 Service Configuration

Parameter                   Default                      Description
externalUrl                 https://csghub.example.com   CSGHub main system URL
dataflow.image.repository   opencsghq/dataflow           Dataflow image repository
dataflow.persistence.size   100Gi                        Dataflow PV size

5.3 Label Studio Configuration

Parameter                               Default                  Description
labelStudio.image.repository            opencsghq/label-studio   Label Studio image repository
labelStudio.persistence.size            100Gi                    Annotation data PV size
labelStudio.securityContext.runAsUser   0                        Container user UID
labelStudio.postgresql.database         csghub_label_studio      Database name for Label Studio

5.4 Built-in Third-party Components

Component    Parameter                     Default                                  Description
PostgreSQL   postgresql.image.repository   opencsghq/postgres                       Built-in database image
             postgresql.databases          [csghub_dataflow, csghub_label_studio]   Databases created automatically at startup
             postgresql.persistence.size   50Gi                                     Persistent volume storage size
Redis        redis.image.repository        redis
             redis.persistence.size        10Gi
MongoDB      mongo.image.repository        opencsghq/mongo
             mongo.persistence.size        10Gi

6. Verification

# Check Pod status
kubectl get pods -n csghub

# Verify services
kubectl get svc -n csghub

Note: Full functional verification requires successful integration with the CSGHub main system.
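
Scanning the pod list by eye is error-prone in a busy namespace. The sketch below filters the listing down to pods that are not fully ready and polls until none remain. `not_ready_pods` and `wait_for_pods` are illustrative helpers; they assume the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# Sketch: report pods that are not yet fully ready.

# not_ready_pods: read `kubectl get pods --no-headers` output on stdin and
# print "NAME READY STATUS" for each pod that is neither Completed nor
# Running with all of its containers ready (READY column x/y with x == y).
not_ready_pods() {
  awk '{
    split($2, r, "/")
    if ($3 != "Completed" && ($3 != "Running" || r[1] != r[2]))
      print $1, $2, $3
  }'
}

# wait_for_pods [NAMESPACE] [ATTEMPTS]: poll every 10 seconds.
wait_for_pods() {
  local namespace="${1:-csghub}" attempts="${2:-30}" pending=""
  for _ in $(seq "$attempts"); do
    pending="$(kubectl get pods -n "$namespace" --no-headers | not_ready_pods)"
    if [ -z "$pending" ]; then
      echo "all pods in $namespace are ready"
      return 0
    fi
    sleep 10
  done
  echo "pods still not ready in $namespace:" >&2
  echo "$pending" >&2
  return 1
}
```

Usage: `wait_for_pods csghub 30`. This only confirms that the pods came up; as the note above says, functional verification still depends on integration with the CSGHub main system.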

7. Upgrade & Uninstallation

7.1 Upgrade Chart

helm upgrade dataflow csghub/dataflow -n csghub -f custom-values.yaml

7.2 Uninstall Chart

helm uninstall dataflow -n csghub

8. FAQ

  • Dataflow cannot access the main system: Ensure externalUrl and the TLS settings are correctly configured.
  • Label Studio fails to start: Check for database connection issues or PVC mount path errors.
  • Image pull failure: Ensure image.pullSecrets is set if using a private registry.
  • Redis/Mongo fails to start: Check for configuration conflicts between the built-in and external resource settings.
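
The FAQ entries above usually surface as a pod status in `kubectl get pods`. The helper below is a hypothetical triage aid that turns a status string into a first diagnostic step. The status names are standard Kubernetes ones, but mapping them to these FAQ items is a suggestion, not something the chart defines.

```shell
#!/usr/bin/env bash
# Sketch: map a pod status to a first diagnostic step (hypothetical helper).

# faq_hint STATUS: print a suggested first check for the given pod status.
faq_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image pull failure: check image.pullSecrets and the registry address" ;;
    Pending)
      echo "not scheduled or PVC unbound: run kubectl describe pod and kubectl get pvc" ;;
    CrashLoopBackOff|Error)
      echo "startup failure: run kubectl logs --previous and check database settings" ;;
    *)
      echo "no hint for status '$1': run kubectl describe pod" ;;
  esac
}

# triage [NAMESPACE]: print a hint for every pod that is not Running or Completed.
triage() {
  local namespace="${1:-csghub}"
  kubectl get pods -n "$namespace" --no-headers |
    while read -r name _ status _; do
      case "$status" in
        Running|Completed) continue ;;
      esac
      printf '%s: %s\n' "$name" "$(faq_hint "$status")"
    done
}
```

Usage: `triage csghub`.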