Hardware Requirements
1. Description
CSGHub is a cloud-native AI hosting platform that comprises the following core workload types:
- Control Plane Services (API / Web / Scheduling)
- Data Plane Services (Model, Dataset, Artifact storage)
- Computing Tasks (Dataflow / Runner / Inference tasks)
- Optional AI Components (GPU / Knative / Argo)
Therefore, hardware requirements are highly dependent on the deployment scale and usage scenarios.
2. Deployment Mode Classification
| Deployment Mode | Target Scenarios | Characteristics |
|---|---|---|
| Docker (Single Machine) | Development / Demo | Simple, low resource consumption |
| Single-node K8s | Testing / POC | Close to production architecture |
| Standard K8s Cluster | Production | Scalable and high-availability cluster |
| Large-scale Production | Large-scale production environments | Multi-node redundancy |
3. Testing/Development Environment (Minimum Configuration)
Applicable to:
- Functional verification
- Local development
- Individual use
3.1 Recommended Configuration
| Resource | Configuration |
|---|---|
| CPU | 4 Cores |
| Memory | 8 GB |
| Storage | ≥ 200 GB (SSD) |
| Network | ≥ 1 Gbps |
3.2 Notes
- Can use Docker.
- Not recommended to enable: Dataflow, large-scale Runner, or GPU inference.
- Local disk (hostPath) can be used for storage.
4. Small to Medium-scale Production (Recommended Configuration)
Applicable to:
- Team use (10–100 people)
- Model / Dataset management
- Medium-scale task scheduling
4.1 Cluster Scale
- A 3- to 5-node Kubernetes cluster.
4.2 Per-node Configuration
| Resource | Recommended |
|---|---|
| CPU | 8–16 Cores |
| Memory | 16–32 GB |
| Storage | ≥ 1 TB SSD |
| Network | 1–10 Gbps |
4.3 Total Resources (Example)
- Total CPU: ≥ 32 Cores
- Total Memory: ≥ 64 GB
- Storage: ≥ 3 TB
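As a sanity check, the example totals above can be derived from the per-node figures. The sketch below assumes a 4-node cluster at the low end of the recommended per-node range (these exact values are illustrative assumptions, not fixed requirements):

```python
# Derive cluster totals from per-node minimums (assumed: 4 nodes,
# each with 8 cores / 16 GB RAM / 1 TB SSD, the low end of the range).
nodes = 4
cpu_per_node = 8        # cores
mem_per_node_gb = 16    # GB
disk_per_node_tb = 1    # TB

total_cpu = nodes * cpu_per_node        # 32 cores
total_mem = nodes * mem_per_node_gb     # 64 GB
total_disk = nodes * disk_per_node_tb   # 4 TB, clears the >= 3 TB target

print(total_cpu, total_mem, total_disk)
```

Scaling the node count or per-node size up keeps the same arithmetic; the totals just need to stay at or above the floors listed above.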
5. Large-scale Production (High Load)
Applicable to:
- Multi-team / Multi-tenant
- High-frequency task scheduling
- AI Inference / Training
- Large-scale datasets
5.1 Cluster Scale
- 5 to 20+ nodes.
5.2 Per-node Configuration
| Resource | Recommended |
|---|---|
| CPU | 16–64 Cores |
| Memory | 64–256 GB |
| Storage | ≥ 2 TB NVMe SSD |
| Network | ≥ 10 Gbps |
6. GPU Resources (Optional)
Applicable to:
- Model inference
- AI training
- Model evaluation
6.1 Recommended Configuration
| Scenario | GPU |
|---|---|
| Lightweight Inference | 1 × T4 / L4 |
| Medium Load | 1–4 × A10 / A100 |
| Large-scale Training | Multi-node GPU |
6.2 Requirements
- Required on GPU nodes: the NVIDIA Driver and the NVIDIA Device Plugin for Kubernetes.
7. Storage Requirements (Critical)
7.1 Mandatory Capabilities
- ✅ CSI support
- ✅ Dynamic Provisioning support
- ✅ At least one StorageClass
7.2 Storage Type Recommendations
| Type | Usage | Recommendation |
|---|---|---|
| Local SSD | Testing | ✅ Recommended |
| NAS / NFS | RWX scenarios | ⚠️ Usable (average performance) |
| Distributed Storage (Ceph / Longhorn) | Production | ✅ Recommended |
| Object Storage (S3) | Dataset / Model | ✅ Recommended |
7.3 RWX (ReadWriteMany) Requirements
The following components must support RWX:
- Dataflow
- CSGShip
- Parts of task scheduling
👉 Without RWX support, these tasks will fail and data cannot be shared across Pods.
7.4 Storage Capacity Estimation
Total Storage = (Model Size × Quantity) + (Dataset Size × Quantity) + Build Cache (~20%) + Logs (~10%).

Example: 500 GB of models + 2 TB of datasets + ~500 GB of build cache and logs ≈ 3 TB total.
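The estimation formula can be sketched as a small helper. It assumes the cache and log percentages apply to the combined model + dataset footprint (the document does not pin down the base, so treat this as one reasonable reading and adjust to your workload):

```python
def estimate_storage_gb(model_gb: float, dataset_gb: float,
                        cache_pct: float = 0.20, log_pct: float = 0.10) -> float:
    """Rough total-storage estimate: artifacts plus build cache and logs.

    cache_pct and log_pct are applied to the model + dataset footprint.
    """
    base = model_gb + dataset_gb
    return base + base * cache_pct + base * log_pct

# The worked example above: 500 GB of models + 2 TB (2000 GB) of datasets.
total = estimate_storage_gb(500, 2000)
print(f"{total:.0f} GB")  # ~3250 GB, in line with the "~3 TB" example
```

Round the result up to the next provisionable increment; running storage near 100% utilization is itself a failure mode.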
8. Component Resource Consumption
| Component | CPU | Memory | Storage | Characteristics |
|---|---|---|---|---|
| API / Web | Low | Low | Low | Control Plane |
| Dataflow | Med | Med | High | Heavy I/O dependency |
| Runner | High | Med | Med | Elastic scaling |
| Knative | Med | Med | Low | Auto-scaling |
| Argo | Med | Med | Med | Workflow scheduling |
9. Deployment Method vs. Hardware Advice
| Method | Configuration Details |
|---|---|
| Docker Single Machine | CPU ≥ 4 Core, RAM ≥ 8 GB (for Demos) |
| K8s Single-node | CPU ≥ 8 Core, RAM ≥ 16 GB |
| Standard K8s | ≥ 3 nodes, 8C / 16GB per node |
| High Availability | K8s: 3 masters (4C/8GB) + ≥ 3 workers (8C/16GB); PostgreSQL: 3 nodes (4C/8GB); Object Storage: 4 nodes (4C/8GB); Gitaly: 3 nodes (8C/16GB) |
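To gauge the total hardware footprint of the high-availability layout, the per-row figures can simply be tallied. This is a sketch using only the numbers from the table above (node counts and per-node sizes as listed):

```python
# Tally total CPU cores and memory for the HA reference layout.
# Each entry: (node_count, cores_per_node, mem_gb_per_node).
layout = {
    "k8s-master":     (3, 4, 8),
    "k8s-worker":     (3, 8, 16),   # minimum worker count; scale as needed
    "postgresql":     (3, 4, 8),
    "object-storage": (4, 4, 8),
    "gitaly":         (3, 8, 16),
}

total_cores = sum(n * c for n, c, _ in layout.values())
total_mem_gb = sum(n * m for n, _, m in layout.values())
print(total_cores, total_mem_gb)  # 88 cores, 176 GB
```

So the reference HA deployment starts at roughly 88 cores and 176 GB of RAM across 16 nodes, before any worker scaling or GPU nodes.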
10. Common Problems & Risks
- Resource Depletion: Leads to Pod OOMKilled events, scheduling failures, and stuck tasks. I/O bottlenecks are the most common issue.
- Storage Issues: Lack of RWX causes Dataflow startup failures; slow I/O degrades training and inference performance.
- Network Issues: Insufficient bandwidth slows model pulling; high latency causes service instability.
11. Final Summary Recommendations
- Testing Environment: 4C / 8GB / 200GB.
- Production Environment: 8C+ / 16GB+ / 1TB+.
- Preferred Deployment: Kubernetes cluster.
- Storage Priority: RWX + High I/O.
- AI Scenario: Dedicated GPU nodes are strongly recommended.