Runner Deployment Guide
📘 Overview
CSGHub Runner is the core component of the CSGHub platform responsible for executing model training, inference, and job scheduling workloads.
Through the Runner, the system communicates with the CSGHub Server and dynamically creates and destroys user workloads within the Kubernetes cluster.
This Helm Chart provides a standardized way to deploy the Runner, supporting flexible configuration, external resource integration, and automated resource management.
⚙️ System Requirements
| Item | Description |
|---|---|
| Kubernetes Version | v1.28+ |
| Helm Version | v3.12+ |
| Network Requirement | Nodes must access the CSGHub Server and external image registries (if required) |
| Permissions | cluster-admin or equivalent privileges to create namespaces and RBAC resources |
📦 Installation Steps
1️⃣ Add the Helm Repository
helm repo add csghub https://charts.opencsg.com/repository/csghub
helm repo update
2️⃣ Create a Namespace (Optional)
kubectl create namespace csghub
3️⃣ Deploy the Runner
You’ll need to obtain the following information from the CSGHub main service:
- domain
  The base domain under which the Runner service is exposed as a subdomain. If your domain is example.com, the Runner is exposed at runner.example.com by default.
- externalUrl
  helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
  Use this command to get the CSGHub external access URL.
- hubAPIToken
  kubectl get cm csghub-core -o yaml -n csghub | grep 'API_TOKEN' | awk '{print $NF}'
- region
  A custom label to identify the cluster region (e.g., cn-north).
- registry
  helm get notes csghub -n csghub | grep -A 8 'Distribution Registry'
  This provides the registry domain, username, and password. The insecure flag depends on whether HTTPS is enabled.
- objectStore
  helm get notes csghub -n csghub | grep -A 8 'Minio Console'
  This provides the endpoint, accessKey, and secretKey. bucket, region, and pathStyle are fixed values. secure depends on whether HTTPS is enabled.
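The hubAPIToken pipeline above keeps only the last whitespace-separated field of the matching ConfigMap line. A minimal sketch of that parsing step, run against a made-up sample line rather than a live cluster (the key name and token value are hypothetical and not taken from a real csghub-core ConfigMap):

```shell
# Hypothetical sample of a line that `kubectl get cm csghub-core -o yaml` might print.
sample_line='  EXAMPLE_API_TOKEN: 0123456789abcdef'

# Same filter as in the guide: keep lines containing API_TOKEN, print the last field.
HUB_API_TOKEN=$(printf '%s\n' "$sample_line" | grep 'API_TOKEN' | awk '{print $NF}')

echo "$HUB_API_TOKEN"
```

Against a live cluster, replacing the printf with the kubectl command from the list should allow the captured value to be passed as --set hubAPIToken="$HUB_API_TOKEN".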
Deploy
💡 Tip: The object store and image registry can be replaced with external infrastructure.
helm install runner csghub/runner \
--namespace csghub \
--create-namespace \
--set global.ingress.domain="example.com" \
--set externalUrl="<csghub external_url>" \
--set hubAPIToken="<csghub hub_api_token>" \
--set region="<region name>" \
--set registry.registry="<csghub registry>" \
--set registry.repository="csghub" \
--set registry.username="<csghub registry username>" \
--set registry.password="<csghub registry password>" \
--set registry.insecure="<if registry is insecure>" \
--set objectStore.endpoint="<csghub minio>" \
--set objectStore.accessKey="<csghub minio username>" \
--set objectStore.secretKey="<csghub minio password>" \
--set objectStore.bucket="csghub-registry" \
--set objectStore.region="cn-north-1" \
--set objectStore.secure="false" \
--set objectStore.pathStyle="true"
💡 Tip: For long-term management, it’s recommended to save custom configurations in a custom-values.yaml file.
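As a sketch of what such a file could look like, the snippet below writes the same keys used in the --set flags above into custom-values.yaml. The angle-bracket placeholders are carried over unchanged and must be replaced with real values; the insecure value shown is only an assumed example.

```shell
# Write the values from the install command into a reusable file.
# Every <...> placeholder must be replaced before installing.
cat > custom-values.yaml <<'EOF'
global:
  ingress:
    domain: "example.com"
externalUrl: "<csghub external_url>"
hubAPIToken: "<csghub hub_api_token>"
region: "<region name>"
registry:
  registry: "<csghub registry>"
  repository: "csghub"
  username: "<csghub registry username>"
  password: "<csghub registry password>"
  insecure: false   # assumed example; set to true if the registry is not served over HTTPS
objectStore:
  endpoint: "<csghub minio>"
  accessKey: "<csghub minio username>"
  secretKey: "<csghub minio password>"
  bucket: "csghub-registry"
  region: "cn-north-1"
  secure: false
  pathStyle: true
EOF

# Print the equivalent install command; run it once the placeholders are filled in.
echo "helm install runner csghub/runner --namespace csghub --create-namespace -f custom-values.yaml"
```

Keeping the values in a file also means the same file can be reused for later upgrades instead of retyping each flag.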
🧾 Configuration Reference (values.yaml)
Global Configuration
| Parameter | Default | Description |
|---|---|---|
| global.ingress.domain | example.com | Base domain for the platform |
| global.ingress.tls.enabled | false | Whether to enable TLS |
| global.image.tag | - | Image version tag |
Runner Configuration
| Parameter | Default | Description |
|---|---|---|
| name | runner | Resource name prefix (used for domain exposure) |
| region | region-0 | Runner region identifier |
| interval | 60 | Communication interval with the Server (in seconds) |
| namespace | spaces | Default namespace for user workloads |
| autoConfigure | true | Auto-install knative, argo, and lws components |
| kymlMode | update | Cluster resource management mode (create/update/replace) |
| mergingNamespace | disable | Namespace merging mode (multi/single/disable) |
| usePublicDomain | true | Use public domain for access (false may restrict functionality) |
Package & Image Management
| Parameter | Default | Description |
|---|---|---|
| pipIndexUrl | https://pypi.tuna.tsinghua.edu.cn/simple/ | Custom PyPI mirror |
| extraBuildArgs | [] | Additional Kaniko args |
| modelRegistry | OpenCSG ACR | Model image registry URL |
GPU Configuration
| Parameter | Default | Description |
|---|---|---|
| gpuModelLabel.typeLabel | nvidia.com/gpu.product | GPU model label key |
| gpuModelLabel.capacityLabel | nvidia.com/gpu | GPU capacity label |
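To see what the cluster nodes actually report for these keys, something like the following helper may be useful. It is a sketch only: the function name show_gpu_nodes is hypothetical, and the second command assumes the NVIDIA device plugin is installed so that nvidia.com/gpu appears under each node's capacity.

```shell
# Hypothetical helper: print GPU-related information for every node.
show_gpu_nodes() {
  # Add a column with the value of the GPU model label.
  kubectl get nodes -L nvidia.com/gpu.product

  # Show the GPU count each node advertises (dots in the key are escaped).
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
}
```

Calling show_gpu_nodes on a machine with cluster access lists each node; an empty label column or a GPU value of <none> suggests the node is not labeled or does not expose GPUs.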
Knative Serving Configuration
💡 Note: These parameters are deprecated since v1.12.0 and retained for backward compatibility.
| Parameter | Default | Description |
|---|---|---|
| knative.serving.domain | "example.com" | Knative service domain suffix |
| knative.serving.services | [] | Legacy configuration |
RBAC Configuration
| Parameter | Default | Description |
|---|---|---|
| rbac.create | true | Whether to create ServiceAccount & Roles |
| rbac.serviceAccountName | runner-admin | ServiceAccount name (currently fixed) |
Logging & Monitoring
| Parameter | Default | Description |
|---|---|---|
| logging.level | info | Log level (info/debug/error) |
| logcollector.enabled | false | Enable log collector |
| logcollector.loki.address | "" | Loki service address |
| tempo.address | "" | Tempo tracing endpoint |
- Loki service is not exposed by default. Enable it in the main CSGHub chart with loki.ingress.enabled=true to use it.
- Tempo tracing is currently internal only; external exposure is planned.
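Assuming Loki has already been exposed from the main chart, the Runner side might be configured as in this sketch. The parameter names come from the table above; the Loki URL and the file name logging-values.yaml are hypothetical placeholders.

```shell
# Write logging overrides into a separate values file.
cat > logging-values.yaml <<'EOF'
logging:
  level: debug
logcollector:
  enabled: true
  loki:
    address: "http://loki.example.com"   # hypothetical address; use your Loki ingress URL
EOF

# Print an upgrade command that layers the overrides on top of the base values.
# When the same key appears in several -f files, the last file wins.
echo "helm upgrade runner csghub/runner -n csghub -f custom-values.yaml -f logging-values.yaml"
```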
External Resources
🔹 Image Registry
registry:
registry: "registry.example.com"
repository: "csghub"
username: "user"
password: "pass"
insecure: false
🔹 Object Storage
objectStore:
endpoint: "https://minio.example.com"
accessKey: "admin"
secretKey: "password"
bucket: "csghub-registry"
region: "us-east-1"
secure: true
pathStyle: true
Resource & Scheduling
| Parameter | Default | Description |
|---|---|---|
| resources | - | Pod resource requests/limits config |
| nodeSelector | - | Node selector |
| tolerations | [] | Tolerations |
| affinity | - | Affinity rules |
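A sketch of how these parameters might be filled in, assuming the chart passes them through to the Runner pod spec in the standard Kubernetes shape. The sizes, the taint key, and the file name scheduling-values.yaml are illustrative assumptions, not recommended values.

```shell
# Write scheduling overrides into a separate values file.
cat > scheduling-values.yaml <<'EOF'
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
nodeSelector:
  kubernetes.io/os: linux
tolerations:
  - key: "dedicated"        # hypothetical taint key
    operator: "Equal"
    value: "csghub"         # hypothetical taint value
    effect: "NoSchedule"
EOF
```

The file can then be supplied with an additional -f scheduling-values.yaml on helm install or helm upgrade.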
🔍 Verify Deployment
Check the status:
kubectl get pods -n csghub
kubectl get svc -n csghub
View Runner logs:
kubectl logs -f deploy/runner-runner -n csghub
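For scripted checks, a blocking wait on the Deployment may be more convenient than reading pod listings. A minimal sketch, assuming the Deployment is named runner-runner as in the log command above; the function name wait_for_runner and the 120-second timeout are arbitrary choices.

```shell
# Hypothetical helper: block until the Runner Deployment has rolled out,
# or return a non-zero status if it is not ready within the timeout.
wait_for_runner() {
  namespace="${1:-csghub}"
  kubectl rollout status deploy/runner-runner -n "$namespace" --timeout=120s
}
```

Running wait_for_runner csghub returns zero once all replicas are available, which makes it usable as a gate in CI or install scripts.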
🔄 Upgrade & Uninstall
Upgrade Chart
helm upgrade runner csghub/runner -n csghub -f custom-values.yaml
Uninstall Chart
helm uninstall runner -n csghub
🧠 Troubleshooting
| Issue | Solution |
|---|---|
| Runner cannot reach Server | Verify externalUrl and hubAPIToken are configured correctly |
| Knative not auto-installed | Ensure autoConfigure: true and proper cluster permissions |
| GPU job not scheduled | Check node GPU labels and drivers |
| Image pull failed | Verify registry credentials and image.pullSecrets settings |