Hardware Requirements
1. Description
CSGHub is a cloud-native AI hosting platform that comprises the following core workload types:
- Control Plane Services (API / Web / Scheduling)
- Data Plane Services (Model, Dataset, Artifact storage)
- Computing Tasks (Dataflow / Runner / Inference tasks)
- Optional AI Components (GPU / Knative / Argo)
Therefore, hardware requirements are highly dependent on the deployment scale and usage scenarios.
2. Deployment Mode Classification
| Deployment Mode | Target Scenarios | Characteristics |
|---|---|---|
| Docker (Single Machine) | Development / Demo | Simple, low resource consumption |
| Single-node K8s | Testing / POC | Close to production architecture |
| Standard K8s Cluster | Production | Scalable and high-availability cluster |
| Large-scale Production | Large-scale production environments | Multi-node redundancy |
3. Testing/Development Environment (Minimum Configuration)
Applicable to:
- Functional verification
- Local development
- Individual use
3.1 Recommended Configuration
| Resource | Configuration |
|---|---|
| CPU | 4 Cores |
| Memory | 8 GB |
| Storage | ≥ 200 GB (SSD) |
| Network | ≥ 1 Gbps |
3.2 Notes
- Can use Docker.
- Not recommended to enable: Dataflow, large-scale Runner, or GPU inference.
- Local disk (hostPath) can be used for storage.
4. Small to Medium-scale Production (Recommended Configuration)
Applicable to:
- Team use (10–100 people)
- Model / Dataset management
- Medium-scale task scheduling
4.1 Cluster Scale
- A 3- to 5-node Kubernetes cluster.
4.2 Per-node Configuration
| Resource | Recommended |
|---|---|
| CPU | 8–16 Cores |
| Memory | 16–32 GB |
| Storage | ≥ 1 TB SSD |
| Network | 1–10 Gbps |
4.3 Total Resources (Example)
- Total CPU: ≥ 32 Cores
- Total Memory: ≥ 64 GB
- Storage: ≥ 3 TB
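As a sanity check, the example totals above can be derived from the per-node figures. The sketch below assumes a 4-node cluster at the low end of the recommended per-node range (these exact values are illustrative assumptions, not fixed requirements):

```python
# Derive cluster totals from per-node minimums (assumed: 4 nodes,
# each with 8 cores / 16 GB RAM / 1 TB SSD, the low end of the range).
nodes = 4
cpu_per_node = 8        # cores
mem_per_node_gb = 16    # GB
disk_per_node_tb = 1    # TB

total_cpu = nodes * cpu_per_node        # 32 cores
total_mem = nodes * mem_per_node_gb     # 64 GB
total_disk = nodes * disk_per_node_tb   # 4 TB, clears the >= 3 TB target

print(total_cpu, total_mem, total_disk)
```

Scaling the node count or per-node size up keeps the same arithmetic; the totals just need to stay at or above the floors listed above.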
5. Large-scale Production (High Load)
Applicable to:
- Multi-team / Multi-tenant
- High-frequency task scheduling
- AI Inference / Training
- Large-scale datasets
5.1 Cluster Scale
- 5 to 20+ nodes.
5.2 Per-node Configuration
| Resource | Recommended |
|---|---|
| CPU | 16–64 Cores |
| Memory | 64–256 GB |
| Storage | ≥ 2 TB NVMe SSD |
| Network | ≥ 10 Gbps |
6. GPU Resources (Optional)
Applicable to:
- Model inference
- AI training
- Model evaluation
6.1 Recommended Configuration
| Scenario | GPU |
|---|---|
| Lightweight Inference | 1 × T4 / L4 |
| Medium Load | 1–4 × A10 / A100 |
| Large-scale Training | Multi-node GPU |
6.2 Requirements
- Required on GPU nodes: the NVIDIA Driver and the NVIDIA Device Plugin for Kubernetes.
7. Storage Requirements (Critical)
7.1 Mandatory Capabilities
- ✅ CSI support
- ✅ Dynamic Provisioning support
- ✅ At least one StorageClass
7.2 Storage Type Recommendations
| Type | Usage | Recommendation |
|---|---|---|
| Local SSD | Testing | ✅ Recommended |
| NAS / NFS | RWX scenarios | ⚠️ Usable (average performance) |
| Distributed Storage (Ceph / Longhorn) | Production | ✅ Recommended |
| Object Storage (S3) | Dataset / Model | ✅ Recommended |
7.3 RWX (ReadWriteMany) Requirements
The following components must support RWX:
- Dataflow
- CSGShip
- Parts of task scheduling
👉 Without RWX support, these tasks will fail and data cannot be shared across Pods.
7.4 Storage Capacity Estimation
Total Storage = (Model Size × Quantity) + (Dataset Size × Quantity) + Build Cache (~20%) + Logs (~10%).

Example: 500 GB of models + 2 TB of datasets + ~500 GB of build cache and logs ≈ 3 TB total.
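The estimation formula can be sketched as a small helper. It assumes the cache and log percentages apply to the combined model + dataset footprint (the document does not pin down the base, so treat this as one reasonable reading and adjust to your workload):

```python
def estimate_storage_gb(model_gb: float, dataset_gb: float,
                        cache_pct: float = 0.20, log_pct: float = 0.10) -> float:
    """Rough total-storage estimate: artifacts plus build cache and logs.

    cache_pct and log_pct are applied to the model + dataset footprint.
    """
    base = model_gb + dataset_gb
    return base + base * cache_pct + base * log_pct

# The worked example above: 500 GB of models + 2 TB (2000 GB) of datasets.
total = estimate_storage_gb(500, 2000)
print(f"{total:.0f} GB")  # ~3250 GB, in line with the "~3 TB" example
```

Round the result up to the next provisionable increment; running storage near 100% utilization is itself a failure mode.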
8. Component Resource Consumption
| Component | CPU | Memory | Storage | Characteristics |
|---|---|---|---|---|
| API / Web | Low | Low | Low | Control Plane |
| Dataflow | Med | Med | High | Heavy I/O dependency |
| Runner | High | Med | Med | Elastic scaling |
| Knative | Med | Med | Low | Auto-scaling |
| Argo | Med | Med | Med | Workflow scheduling |
9. Deployment Method vs. Hardware Advice
| Method | Configuration Details |
|---|---|
| Docker Single Machine | CPU ≥ 4 Core, RAM ≥ 8 GB (for Demos) |
| K8s Single-node | CPU ≥ 8 Core, RAM ≥ 16 GB |
| Standard K8s | ≥ 3 nodes, 8C / 16GB per node |
| High Availability | K8s: 3 masters (4C/8GB) + ≥ 3 workers (8C/16GB); PostgreSQL: 3 nodes (4C/8GB); Object Storage: 4 nodes (4C/8GB); Gitaly: 3 nodes (8C/16GB) |
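To gauge the total hardware footprint of the high-availability layout, the per-row figures can simply be tallied. This is a sketch using only the numbers from the table above (node counts and per-node sizes as listed):

```python
# Tally total CPU cores and memory for the HA reference layout.
# Each entry: (node_count, cores_per_node, mem_gb_per_node).
layout = {
    "k8s-master":     (3, 4, 8),
    "k8s-worker":     (3, 8, 16),   # minimum worker count; scale as needed
    "postgresql":     (3, 4, 8),
    "object-storage": (4, 4, 8),
    "gitaly":         (3, 8, 16),
}

total_cores = sum(n * c for n, c, _ in layout.values())
total_mem_gb = sum(n * m for n, _, m in layout.values())
print(total_cores, total_mem_gb)  # 88 cores, 176 GB
```

So the reference HA deployment starts at roughly 88 cores and 176 GB of RAM across 16 nodes, before any worker scaling or GPU nodes.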
10. Common Problems & Risks
- Resource Depletion: Leads to Pod OOMKilled events, scheduling failures, and stuck tasks. I/O bottlenecks are the most common issue.
- Storage Issues: Lack of RWX causes Dataflow startup failures; slow I/O degrades training and inference performance.
- Network Issues: Insufficient bandwidth slows model pulling; high latency causes service instability.
11. Final Summary Recommendations
- Testing Environment: 4C / 8GB / 200GB.
- Production Environment: 8C+ / 16GB+ / 1TB+.
- Preferred Deployment: Kubernetes cluster.
- Storage Priority: RWX + High I/O.
- AI Scenario: Dedicated GPU nodes are strongly recommended.