Introduction
The CSGHub Getting Started Guide aims to provide comprehensive instructions to help users efficiently manage LLM assets with CSGHub.
CSGHub and LLM
What is CSGHub?
CSGHub is an open-source, trusted asset management platform for large models, designed to help users govern the assets involved in the LLM lifecycle (datasets, model files, code, and so on). It aims to provide an asset management solution built specifically for LLMs, with support for private deployment and offline operation. Its functionality is similar to that of platforms such as Hugging Face, but it can be deployed privately, much as GitLab manages source code, OpenStack Glance manages virtual machine images, Harbor manages container images, and Sonatype Nexus manages artifacts.
You can get more details and the latest updates by visiting the CSGHub open-source project at https://github.com/OpenCSGs/CSGHub or the OpenCSG Community website at https://opencsg.com.
We welcome and encourage users to open issues on GitHub for discussion or to contribute code to the CSGHub open-source project, helping the platform keep developing and improving.
What is a Model?
Definition
In the fields of machine learning and natural language processing, a model is a trained mathematical representation used to perform a specific task, such as text generation, sentiment analysis, or machine translation. A model learns the relationships between inputs and outputs by analyzing large amounts of data.
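As a toy illustration of "learning the relationship between inputs and outputs", the sketch below fits a straight line to a handful of points using ordinary least squares. The data and the function name `fit_line` are made up for this example. Real LLMs have vastly more parameters and use different training procedures, but the underlying idea of adjusting parameters so that predictions match the data is the same.

```python
def fit_line(xs, ys):
    """Fit y = w * x + b by ordinary least squares (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Illustrative training data generated from y = 2x + 1
inputs = [0.0, 1.0, 2.0, 3.0]
outputs = [1.0, 3.0, 5.0, 7.0]

# "Training": estimate the parameters from the data
w, b = fit_line(inputs, outputs)

# "Inference": apply the learned parameters to an unseen input
prediction = w * 4.0 + b
print(w, b, prediction)
```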
Models in CSGHub
CSGHub provides a rich library of pre-trained models that users can use directly for inference or fine-tune further. CSGHub models are fully compatible with the Hugging Face ecosystem, so users can work with them through the Hugging Face Transformers library, which supports architectures such as GPT, BERT, and T5 for a range of tasks:
- Text Classification: e.g., sentiment analysis
- Named Entity Recognition: identifying specific entities in text
- Text Generation: generating new text based on input
- Translation: translating one language into another
How to Use Models?
Users can load pre-trained models using simple API calls. For example, using Python code:
from transformers import pipeline
# Create a sentiment analysis pipeline
classifier = pipeline('sentiment-analysis')
# Perform inference with the model
result = classifier("I love Huggingface!")
print(result)
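A text-classification pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below shows one way to post-process such a result. The helper names `top_label` and `to_label_scores` are hypothetical, and the sample values are illustrative rather than real model output, so the snippet runs without downloading a model.

```python
def top_label(results):
    """Return (label, score) for the highest-scoring prediction."""
    best = max(results, key=lambda r: r["score"])
    return best["label"], best["score"]


def to_label_scores(results):
    """Convert the list of predictions into a {label: score} mapping."""
    return {r["label"]: r["score"] for r in results}


# Sample data shaped like a sentiment-analysis pipeline result (values are illustrative)
sample = [{"label": "POSITIVE", "score": 0.9998}]

label, score = top_label(sample)
scores = to_label_scores(sample)
print(f"{label} ({score:.2%})")
```

In a real script, `sample` would be replaced by the value returned from `classifier(...)`.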
What is a Dataset?
Definition
A dataset is a collection of data used to train and evaluate machine learning models. In natural language processing, datasets typically consist of text and labels, such as sentences, articles, and annotated sentiments.
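To make the "text and labels" structure concrete, here is a minimal sketch of a tiny in-memory sentiment dataset, split into a training part and an evaluation part. The records, the label convention (1 = positive, 0 = negative), and the helper `train_test_split` are all made up for illustration. Real datasets are far larger and are usually loaded with a dedicated library.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split off a test portion."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Illustrative records: each pairs a piece of text with a sentiment label
records = [
    {"text": "A wonderful, moving film.", "label": 1},
    {"text": "Dull plot and flat acting.", "label": 0},
    {"text": "I would watch it again.", "label": 1},
    {"text": "A waste of two hours.", "label": 0},
]

train, test = train_test_split(records)
print(len(train), "training records,", len(test), "test records")
```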
Datasets in CSGHub
CSGHub's dataset library offers a variety of publicly available datasets covering a wide range of topics and tasks. Users can download, load, and use these datasets for their own model training. These datasets may include:
- Text Classification Datasets: for training sentiment analysis models
- Parallel Translation Datasets: for training translation models
- Question-Answering Datasets: for use in Q&A systems
- Conversational System Datasets: for training chatbots
How to Use Datasets?
Users can easily load datasets using the Hugging Face datasets library. Here's an example of loading and viewing a dataset:
from datasets import load_dataset
# Load a sentiment analysis dataset
dataset = load_dataset("imdb")
# View the structure of the dataset
print(dataset)
What is a Space?
Definition
A Space is a CSGHub service for quickly building and hosting applications that showcase machine learning models interactively. Users can create web applications that demonstrate their models' capabilities so that others can try the models for themselves.
Features of Space in CSGHub
- Interactivity: Provides an interface for user interaction with the model
- Simple Deployment: Users only need to upload code and models, and CSGHub takes care of building, deploying, and hosting the application
- Privacy: Can be used in enterprise or personal environments
How to Create a Space?
Users can create their own Space with the following steps:
- Log in to your CSGHub account
- Click on the "Create New Space" button in the upper right corner
- Select the type of application (e.g., Gradio or Streamlit)
- Fill in the required code and configuration files
- Publish and use
Example Code (Gradio)
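import gradio as gr
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
def predict_sentiment(text):
    # The pipeline returns a list of {'label': ..., 'score': ...} dicts;
    # the Gradio "label" output expects a {label: confidence} mapping
    return {r['label']: float(r['score']) for r in classifier(text)}
iface = gr.Interface(fn=predict_sentiment, inputs="text", outputs="label")
iface.launch()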
What is a Code Repository?
Definition
A code repository is a place for storing and version-controlling code, typically used for managing project code, documentation, and other resources. On the CSGHub platform, code repositories allow users to store, share, and collaborate on machine learning projects.
Features of Code Repositories in CSGHub
- Version Control: Users can view code modification history and revert to previous versions
- Collaborative Development: Supports multi-user collaboration, enhancing project management efficiency
- Public & Private: Users can set repositories as public or private
How to Use a Code Repository?
Users can create and manage code repositories through the CSGHub web interface. The steps are as follows:
- Create your CSGHub account
- Click on the "Create New Repository" button in the upper right corner
- Fill in the repository name and description
- Upload code files or documentation
To upload local code, you can use git commands:
# Replace <repository_clone_url> with the clone URL shown on your CSGHub repository page
git clone <repository_clone_url>
cd repository_name
# Add your code files
git add .
git commit -m "Initial commit"
git push
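The git steps above can also be scripted. The following is a minimal sketch, not an official CSGHub tool: the helper names `upload_commands` and `run_upload` are hypothetical, and the clone URL is a placeholder that you would replace with the one shown on your repository page. By default it only prints the commands (a dry run) so you can check them before anything is executed.

```python
import subprocess


def upload_commands(clone_url, repo_dir, message="Initial commit"):
    """Return the git commands for clone, add, commit, and push as argument lists."""
    return [
        ["git", "clone", clone_url, repo_dir],
        # Copy your code files into repo_dir between the clone and the add
        ["git", "-C", repo_dir, "add", "."],
        ["git", "-C", repo_dir, "commit", "-m", message],
        ["git", "-C", repo_dir, "push"],
    ]


def run_upload(clone_url, repo_dir, dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for cmd in upload_commands(clone_url, repo_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Placeholder URL for illustration only
run_upload("https://csghub.example.com/username/repository_name.git", "repository_name")
```

Passing `-C repo_dir` to git runs each command inside the cloned directory, which avoids changing the script's working directory.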
Why Use CSGHub?
As LLMs rapidly evolve and diversify, data and models have become the most critical digital assets for enterprises and individual users alike. However, scattered management tools, simplistic management methods, and isolated deployments introduce security risks and hinder continued innovation in, and application of, LLM technology.
We believe that LLMs will be a driving force behind the information technology revolution. Therefore, exploring a more efficient, secure, and reliable management strategy to optimize and protect core assets (i.e., models, data, and LLM application code) has become a significant task for both individuals and enterprises. In response, the CSGHub project aims to provide practical solutions to these challenges.
CSGHub can offer you the following capabilities:
- Unified Asset Management: A single hub for managing model files, datasets, and LLM application code.
- Development Ecosystem Compatibility: Supports Git commands and web operations over HTTPS and SSH protocols; provides a Hugging Face SDK-compatible development ecosystem for ease of use.
- Extensive LLM Features: Native support for version management, model format conversion, automatic data preprocessing, and dataset previewing.
- Security and Permissions: Integration with enterprise user systems, asset visibility settings, and internal/external API authentication to meet security requirements.
- Private Deployment Support: No dependence on the internet or cloud providers; private deployment can be initiated with one click.
- Native LLM Design: Supports natural language interaction, one-click model deployment, and asset management for Agent and Copilot apps.
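As one illustration of the Hugging Face SDK compatibility mentioned above: the `huggingface_hub` library reads the `HF_ENDPOINT` environment variable to decide which hub to talk to. The sketch below sets that variable from Python. Whether your CSGHub deployment exposes a compatible endpoint, and at which URL, is an assumption here; the address used is a placeholder and the helper name `use_csghub_endpoint` is hypothetical, so check your deployment's documentation for the real value.

```python
import os


def use_csghub_endpoint(base_url):
    """Point Hugging Face client libraries at another hub via HF_ENDPOINT.

    Call this before importing huggingface_hub (or libraries built on it),
    because the variable is read when the library is first imported.
    """
    os.environ["HF_ENDPOINT"] = base_url.rstrip("/")
    return os.environ["HF_ENDPOINT"]


# Placeholder address; replace with the endpoint of your own deployment
endpoint = use_csghub_endpoint("https://csghub.example.com/hf/")
print(endpoint)
```

Setting the variable in the shell before starting Python has the same effect and avoids import-order concerns.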
Technical Features of CSGHub
CSGHub's technical features include:
- CSGHub integrates multi-source Git servers, the Git LFS large-file storage protocol, and object storage (OSS), providing a reliable data storage layer, a flexible infrastructure integration layer, and high compatibility with development tools.
- CSGHub offers a service-based architecture consisting of the CSGHub Server backend service and the CSGHub Web Service management interface. Ordinary users can quickly start the services using Docker Compose or a Kubernetes Helm Chart to achieve production-level asset management. Users with development capabilities can build on CSGHub Server to integrate management functions into external systems or to customize advanced features.
- Leveraging outstanding open-source projects such as Apache Arrow and DuckDB, CSGHub supports previewing Parquet data files, making it convenient for researchers and enthusiasts to manage localized datasets.
- CSGHub provides an intuitive web interface and an enterprise-oriented permission design, allowing users to manage versions, browse and download assets online, set the visibility of datasets and model files for data security isolation, and start topic discussions on models and datasets.
Tutorial Content
This tutorial aims to provide a comprehensive introduction to CSGHub, covering practical operations, quick deployment, basic concepts, and applications, so that you can efficiently master CSGHub and its LLM capabilities. Even if you have no deployment experience, this tutorial will help you get started quickly. For more advanced content and features, we also provide additional documentation aimed at advanced users and developers, with detailed explanations and guidance.
Contact Us
If you encounter any problems during use, you can contact us through any of the following methods:
- Open an Issue on GitHub
- Scan the WeChat QR code below to add our assistant, then reply "Open Source" to join our WeChat group
- Join our Discord channel: OpenCSG Discord Channel
- Join our Slack channel: OpenCSG Slack Channel