
Introduction

The CSGHub Getting Started Guide aims to provide comprehensive instructions to help users efficiently manage LLM assets with CSGHub.

CSGHub and LLM

What is CSGHub?

CSGHub is an open-source, trusted large model asset management platform designed to help users govern the assets involved in the lifecycle of LLMs (datasets, model files, code, etc.). It aims to provide an asset management solution built specifically for LLMs, with support for private deployment and offline operation. It offers functionality similar to platforms such as Hugging Face, with the addition of private deployment, and plays the same role for LLM assets that GitLab plays for source code, OpenStack Glance for virtual machine images, Harbor for container images, and Sonatype Nexus for artifacts.

You can get the CSGHub source code from GitHub, and you can get a quick overview of the CSGHub architecture design on this page.

We welcome and encourage users to open issues on GitHub for discussion or to contribute code to the CSGHub open-source project, helping the platform keep developing and improving.

What is a Model?

Definition

In the fields of machine learning and natural language processing, a model is a trained mathematical representation used to perform a specific task, such as text generation, sentiment analysis, or machine translation. A model learns the relationships between inputs and outputs by analyzing large amounts of data.
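
To make "learning the relationship between inputs and outputs" concrete, here is a minimal sketch that fits a one-variable linear model by least squares. It is only an illustration of the idea: the data, the function names, and the choice of a linear model are assumptions made for this example and have nothing to do with how LLMs hosted on CSGHub are trained.

```python
# Illustrative only: a one-variable linear model fitted by least squares.
# The data and function names are made up for this sketch.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error on the pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, x):
    """Apply the learned parameters to a new input."""
    slope, intercept = params
    return slope * x + intercept


# "Training data" generated by the rule y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

params = fit_line(xs, ys)        # training: learn parameters from data
print(predict(params, 10.0))     # inference: apply them to an unseen input
```

The same two phases, training on data and then running inference on new inputs, apply to the far larger models discussed in the rest of this guide.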

Models in CSGHub

CSGHub provides a rich model library containing pre-trained models that users can use directly for inference or fine-tune further. CSGHub models are fully compatible with the Hugging Face ecosystem, so users can work with them through the Hugging Face Transformers library. The library supports architectures such as GPT, BERT, and T5 for a range of tasks:

  • Text Classification: e.g., sentiment analysis
  • Named Entity Recognition: identifying specific entities in text
  • Text Generation: generating new text based on input
  • Translation: translating one language into another

How to Use Models?

Users can load pre-trained models using simple API calls. For example, using Python code:

from transformers import pipeline

# Create a sentiment analysis pipeline
classifier = pipeline('sentiment-analysis')

# Perform inference with the model
result = classifier("I love Huggingface!")
print(result)

What is a Dataset?

Definition

A dataset is a collection of data used to train and evaluate machine learning models. In natural language processing, datasets typically consist of text and labels, such as sentences, articles, and annotated sentiments.
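
As a rough sketch of what "text and labels" can look like in practice, the example below builds a tiny in-memory sentiment dataset and splits it into training and evaluation portions. The records, the `train_test_split` helper, and the 80/20 split are all hypothetical choices for illustration, not a CSGHub format or API.

```python
import random

# Hypothetical records: each pairs a text with a sentiment label.
records = [
    {"text": "A wonderful film.", "label": "positive"},
    {"text": "Dull and far too long.", "label": "negative"},
    {"text": "I would watch it again.", "label": "positive"},
    {"text": "The plot made no sense.", "label": "negative"},
    {"text": "Great acting throughout.", "label": "positive"},
]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(records)
print(len(train), "training rows,", len(test), "evaluation rows")
```

Keeping the evaluation rows separate from the training rows is what lets a dataset serve both purposes named in the definition above.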

Datasets in CSGHub

CSGHub's dataset library offers a variety of publicly available datasets covering a wide range of topics and tasks. Users can download, load, and use these datasets for their own model training. These datasets may include:

  • Text Classification Datasets: for training sentiment analysis models
  • Parallel Translation Datasets: for training translation models
  • Question-Answering Datasets: for use in Q&A systems
  • Conversational System Datasets: for training chatbots

How to Use Datasets?

Users can easily load datasets using the Hugging Face datasets library. Here's an example of loading and viewing a dataset:

from datasets import load_dataset

# Load a sentiment analysis dataset
dataset = load_dataset("imdb")

# View the structure of the dataset
print(dataset)

What is a Space?

Definition

A Space is a service provided by CSGHub for quickly building and hosting applications, allowing users to showcase their machine learning models interactively. Users can create web applications to demonstrate the capabilities of their models, allowing others to experience the model's real-world performance.

Features of Space in CSGHub

  • Interactivity: Provides an interface for user interaction with the model
  • Simple Deployment: Users only need to upload code and models, and CSGHub takes care of building, deploying, and hosting the application
  • Privacy: Can be used in enterprise or personal environments

How to Create a Space?

Users can create their own Space with the following steps:

  1. Log in to your CSGHub account
  2. Click on the "Create New Space" button in the upper right corner
  3. Select the type of application (e.g., Gradio or Streamlit)
  4. Fill in the required code and configuration files
  5. Publish and use

Example Code (Gradio)

import gradio as gr
from transformers import pipeline

classifier = pipeline('sentiment-analysis')

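def predict_sentiment(text):
    # The pipeline returns a list such as [{'label': ..., 'score': ...}];
    # the "label" output expects a {label: confidence} mapping.
    result = classifier(text)[0]
    return {result['label']: result['score']}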

iface = gr.Interface(fn=predict_sentiment, inputs="text", outputs="label")
iface.launch()

What is a Code Repository?

Definition

A code repository is a place for storing and version-controlling code, typically used for managing project code, documentation, and other resources. On the CSGHub platform, code repositories allow users to store, share, and collaborate on machine learning projects.

Features of Code Repositories in CSGHub

  • Version Control: Users can view code modification history and revert to previous versions
  • Collaborative Development: Supports multi-user collaboration, enhancing project management efficiency
  • Public & Private: Users can set repositories as public or private

How to Use a Code Repository?

Users can create and manage code repositories through the CSGHub web interface. The steps are as follows:

  1. Create your CSGHub account
  2. Click on the "Create New Repository" button in the upper right corner
  3. Fill in the repository name and description
  4. Upload code files or documentation

To upload local code, you can use git commands:

git clone https://huggingface.co/username/repository_name
cd repository_name
# Add your code files
git add .
git commit -m "Initial commit"
git push

Why Use CSGHub?

In the era of rapid advancement and diversified evolution of large language models (LLMs), data and models have become the most critical digital assets for enterprises and individuals. However, fragmented toolchains, large-file transfer bottlenecks, and disconnected compute scheduling hinder sustainable AI innovation. CSGHub has evolved from a standalone “model and dataset hosting repository” into a full-lifecycle, end-to-end, LLM-native asset management platform.

Core Capabilities:

Unified Multi-Dimensional Asset Management & Traceability

  • Centrally manage model files, datasets, code repositories, and application Spaces.
  • Native support for Prompt repositories and MCP (Model Context Protocol) repositories.
  • Visualized Model Tree and Asset Relationship Graph for tracing model lineage and dependencies.

End-to-End LLMOps

  • Built-in online Notebook instances for interactive development.
  • One-click dataset mounting and model fine-tuning (LLaMA-Factory, MS-SWIFT supported).
  • Multi-framework evaluation support (OpenCompass, EvalScope, lm-evaluation-harness).
  • One-click publishing of fine-tuned models as public APIs or dedicated inference services.

Integrated Data Processing Toolchain

  • Direct data ingestion from MySQL and MongoDB.
  • Multi-format file extraction and conversion (Word, Excel).
  • Visualized processing console with cleaning, deduplication (SimHash/MinHash), and LLM-assisted operators.
  • Deep integration with Label Studio for multimodal data annotation.
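
The deduplication operators above mention MinHash. As a rough sketch of the general technique (not CSGHub's implementation, which is not described here), the example below estimates the similarity of two documents from MinHash signatures over word shingles. The shingle size, the number of hash functions, and all function names are assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item, seed):
    """Seeded 64-bit hash built from SHA-256 (one 'hash function' per seed)."""
    data = seed.to_bytes(4, "big") + item.encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river edge"
doc_c = "large language models need carefully curated training corpora today"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
sig_c = minhash_signature(shingles(doc_c))

sim_ab = estimated_similarity(sig_a, sig_b)  # near-duplicates
sim_ac = estimated_similarity(sig_a, sig_c)  # unrelated documents
print(sim_ab, sim_ac)
```

In a deduplication pass, pairs whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; the threshold itself is a tuning decision this sketch leaves open.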

Ecosystem Compatibility & Storage Acceleration

  • Compatible with Hugging Face SDK.
  • Supports Git, Web UI, CLI, and Python SDK workflows.
  • XNet intelligent block acceleration engine enables chunk-level deduplication, incremental updates, and parallel downloads.

Enterprise Security & Private Deployment

  • One-click private deployment without internet dependency.
  • Integration with enterprise SSO systems (e.g., Casdoor, Paraview).
  • Organization-based fine-grained role control and asset visibility isolation.

Global Resource Scheduling & Multi-Source Sync

  • Dedicated admin console for compute resource monitoring and log inspection.
  • Resumable synchronization of models and datasets from remote communities into private environments.
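
The idea behind resumable synchronization can be sketched as follows: before copying, check how many bytes the destination already holds and continue from that offset instead of starting over. This is a minimal illustration that uses in-memory streams in place of a remote community and a private store; the `resume_copy` function and its parameters are hypothetical, and CSGHub's actual sync mechanism is not described in this guide.

```python
import io


def resume_copy(source, dest, block_size=4, max_blocks=None):
    """Copy source into dest, starting where dest currently ends.

    max_blocks limits how many blocks are copied, which lets the example
    simulate an interrupted transfer. Returns the total bytes now in dest.
    """
    offset = dest.seek(0, io.SEEK_END)  # bytes already transferred
    source.seek(offset)                 # skip what the destination has
    copied = 0
    while max_blocks is None or copied < max_blocks:
        block = source.read(block_size)
        if not block:
            break
        dest.write(block)
        copied += 1
    return dest.tell()


remote = io.BytesIO(b"0123456789abcdef")  # stands in for a remote file
local = io.BytesIO()                      # stands in for the private copy

after_interrupt = resume_copy(remote, local, max_blocks=2)  # stops early
after_resume = resume_copy(remote, local)                   # picks up the rest
print(after_interrupt, after_resume)
```

A real implementation would also verify the already-transferred portion (for example with a checksum) before trusting it, which this sketch omits.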

Technical Architecture Highlights

Revolutionary Storage Architecture

  • Integrates Git Server, Git LFS, and OSS object storage with a self-developed XNet backend.
  • Encrypted hashing and intelligent chunking ensure data integrity, high deduplication rates, and large-scale parallel transfer.
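
To illustrate how hashing and chunking together can give both integrity checks and deduplication, here is a minimal content-addressed chunk store. It is a sketch under stated assumptions, not XNet's design: fixed-size chunks, SHA-256 as the hash, the tiny `CHUNK_SIZE`, and the `ChunkStore` class are all choices made for this example, and production systems often use content-defined chunk boundaries instead.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the example is easy to follow


class ChunkStore:
    """Content-addressed store: each unique chunk is kept once, keyed by hash."""

    def __init__(self):
        self.chunks = {}

    def put_file(self, data):
        """Split data into chunks, store unseen ones, return the hash list."""
        manifest = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            manifest.append(digest)
        return manifest

    def get_file(self, manifest):
        """Reassemble a file, verifying every chunk against its hash."""
        parts = []
        for digest in manifest:
            chunk = self.chunks[digest]
            if hashlib.sha256(chunk).hexdigest() != digest:
                raise ValueError("chunk failed integrity check")
            parts.append(chunk)
        return b"".join(parts)


store = ChunkStore()
v1 = store.put_file(b"AAAABBBBCCCC")  # first version of a file
v2 = store.put_file(b"AAAABBBBDDDD")  # second version, last chunk changed

# Six chunks were written in total, but only the unique ones are stored.
print(len(store.chunks))
```

Because the two versions share their first two chunks, uploading the second version only adds one new chunk to the store, which is the essence of an incremental update.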

Cloud-Native Architecture & Reliable Scheduling

  • Production-ready deployment via Docker Compose and Kubernetes Helm Charts.
  • Compute scheduling migrated to Volcano for improved reliability and fault tolerance.

Broad AI Framework Integration

  • Framework-agnostic architecture with deep integration of leading stacks.
  • Inference: vLLM, SGLang, TGI, llama.cpp, KTransformers, MindIE
  • Fine-tuning & evaluation: mainstream open-source toolchains

Modern Big Data Engine

  • Frontend preview powered by Apache Arrow and DuckDB for instant browsing of large-scale Parquet, CSV, and JSON files.
  • DataFlow leverages Celery distributed task queues for efficient large-scale corpus processing.

Enterprise Infrastructure Integration

  • Abstract SSO interface design for rapid enterprise integration.
  • Seamless connectivity with private cloud storage and compute pools for hybrid deployment.

Tutorial Content

This tutorial introduces CSGHub from four perspectives: practical operations, quick deployment, basic concepts, and applications, so that you can efficiently master CSGHub and LLM capabilities. Even if you have no deployment experience, it will help you get started quickly. For more advanced content and features, we also provide additional documentation aimed at advanced users and developers, with detailed explanations and guidance.

Contact Us

If you encounter any problems during use, you can contact us through any of the following methods:

  1. Open an Issue on GitHub
  2. Scan the WeChat QR code below to add our assistant, then reply "Open Source" to join our WeChat group
  3. Join our Discord channel: OpenCSG Discord Channel
  4. Join our Slack channel: OpenCSG Slack Channel
