# Tools
The Tools module provides a variety of data processing tools that users can freely choose from based on their needs.
## Tools List
Here is a brief description of each tool.
Name | Description |
---|---|
analysis common | This Analyzer class is used to analyze a specific dataset. It will compute stats for all filter ops in the config file, apply multiple analyses (e.g. OverallAnalysis, ColumnWiseAnalysis, etc.) on these stats, and generate the analysis results (stats tables, distribution figures, etc.) to help users understand the input dataset better. |
dataset splitter by language preprocess | Load the dataset from the source directory, apply language identification with the LanguageIDScoreFilter op, and finally split the dataset by language and save the splits. A simplified sketch of the split step follows the table. |
prepare dataset from repo preprocess | Prepare a dataset from a code repo with a format like this: Repository Name, Filepath in the Repository, File Contents. |
raw alpaca cot merge add meta preprocess | Convert the raw Alpaca-CoT data downloaded from HuggingFace into jsonl files, merge the instruction/input/output fields into a single text field for processing, and add meta info. |
raw arxiv to jsonl preprocess | Convert the raw arXiv data (gzipped tar file) into the jsonl format. |
raw stackexchange to jsonl preprocess | Convert the raw Stack Exchange data downloaded from the Internet Archive (ref: https://archive.org/download/stackexchange) to several jsonl files. |
reformat csv nan value preprocess | Reformat csv or tsv files that may contain NaN values so that HuggingFace Datasets can load them with extra args, e.g. keep_default_na set to False. |
reformat jsonl nan value preprocess | Reformat jsonl files that may contain NaN values. Traverse the jsonl files to find the first object that contains no NaN values, use it as the reference feature type, and then load all jsonl files with that type. See the sketch after the table. |
serialize meta preprocess | Serialize all fields in the jsonl file except those specified by the user, so that jsonl files whose lines have inconsistent text formats can still be loaded normally as a dataset. See the sketch after the table. |
count token postprocess | Count the number of tokens in a given dataset with a given tokenizer. Only 'jsonl' is supported for now. See the sketch after the table. |
data mixture postprocess | Mix multiple datasets into one. Randomly select samples from every dataset, mix these samples, and export them as a new mixed dataset. Supported suffixes include: ["jsonl", "json", "parquet"]. See the sketch after the table. |
deserialize meta postprocess | Deserialize the specified field in the jsonl file. |
quality classifier common | This Quality Classifier class is used to predict document scores on a dataset. It will compute a score for every row and add 2 columns, score and should_keep, to help users decide which rows should be removed. By default, a row is marked as should_keep=1 if its score is higher than 0.9. See the sketch after the table. |
opencsg data extraction preprocess | A high-quality tool for converting PDF to Markdown and JSON. |
opencsg scrape url data preprocess | A data scraping tool based on large language models for websites and native documents (XML, HTML, JSON, etc.). |
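
The following is a minimal sketch of the split step performed by the dataset splitter by language tool, assuming each sample has already been tagged by a language-identification filter; the `lang` field name, file paths, and one-file-per-language output layout are illustrative assumptions rather than the tool's actual interface.

```python
# Sketch only: group jsonl samples by a pre-computed language tag and write
# one jsonl file per language. The "lang" key is an assumed field name.
import json
from collections import defaultdict
from pathlib import Path


def split_by_language(src_path: str, out_dir: str, lang_key: str = "lang") -> None:
    groups = defaultdict(list)
    with open(src_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                sample = json.loads(line)
                groups[sample.get(lang_key, "unknown")].append(sample)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for lang, samples in groups.items():
        with open(out / f"{lang}.jsonl", "w", encoding="utf-8") as f:
            for sample in samples:
                f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```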
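
The reference-feature-type idea behind the reformat jsonl nan value tool can be sketched as follows; this is a simplified stand-in that rewrites files directly instead of going through HuggingFace Datasets, and the empty-value fallback is an assumption.

```python
# Sketch only: pick the first NaN-free record as the schema reference, then
# replace NaN/None values in every record with an empty value of the
# reference field's type before re-writing the file.
import json
import math


def _has_nan(obj) -> bool:
    if isinstance(obj, float):
        return math.isnan(obj)
    if isinstance(obj, dict):
        return any(_has_nan(v) for v in obj.values())
    if isinstance(obj, list):
        return any(_has_nan(v) for v in obj)
    return obj is None


def reformat_jsonl(src_path: str, dst_path: str) -> None:
    with open(src_path, "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    # The first record without NaN/None values defines the field types.
    reference = next((r for r in records if not _has_nan(r)), None)
    if reference is None:
        raise ValueError("no NaN-free record found to use as a reference")
    ref_types = {key: type(value) for key, value in reference.items()}

    with open(dst_path, "w", encoding="utf-8") as f:
        for record in records:
            fixed = {}
            for key, ref_type in ref_types.items():
                value = record.get(key)
                fixed[key] = ref_type() if _has_nan(value) else value
            f.write(json.dumps(fixed, ensure_ascii=False) + "\n")
```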
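
For serialize meta preprocess and its counterpart deserialize meta postprocess, the per-record idea can be sketched as below; the `source_info` field name used to hold the serialized extras is an assumption for illustration.

```python
# Sketch only: pack every field except the user-specified ones into a single
# serialized string field so that lines with inconsistent schemas become
# uniform; the inverse function unpacks that field again.
import json


def serialize_meta(record: dict, keep_keys: set, meta_key: str = "source_info") -> dict:
    kept = {k: v for k, v in record.items() if k in keep_keys}
    extras = {k: v for k, v in record.items() if k not in keep_keys}
    kept[meta_key] = json.dumps(extras, ensure_ascii=False)
    return kept


def deserialize_meta(record: dict, meta_key: str = "source_info") -> dict:
    extras = json.loads(record.pop(meta_key, "{}"))
    return {**record, **extras}
```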
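
Token counting as done by count token postprocess can be approximated with a HuggingFace tokenizer as in the sketch below; the `text` field name and the example model id are assumptions, not the tool's defaults.

```python
# Sketch only: sum the token counts of one text field across a jsonl dataset.
import json

from transformers import AutoTokenizer


def count_tokens(jsonl_path: str, tokenizer_name: str, text_key: str = "text") -> int:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    total = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                sample = json.loads(line)
                total += len(tokenizer.tokenize(sample[text_key]))
    return total


# Usage example (model id is illustrative):
# print(count_tokens("dataset.jsonl", "gpt2"))
```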
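
The sampling-and-mixing behaviour of data mixture postprocess can be sketched as follows for the jsonl case; the per-dataset sample counts and the fixed random seed are illustrative assumptions.

```python
# Sketch only: draw a random subset from each source dataset, shuffle the
# union, and export it as one mixed jsonl file.
import json
import random


def mix_datasets(sources: dict, out_path: str, seed: int = 42) -> None:
    """sources maps each jsonl path to the number of samples to draw from it."""
    rng = random.Random(seed)
    mixed = []
    for path, n_samples in sources.items():
        with open(path, "r", encoding="utf-8") as f:
            samples = [json.loads(line) for line in f if line.strip()]
        mixed.extend(rng.sample(samples, min(n_samples, len(samples))))

    rng.shuffle(mixed)
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in mixed:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


# Usage example (paths and counts are illustrative):
# mix_datasets({"web.jsonl": 10000, "code.jsonl": 5000}, "mixed.jsonl")
```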
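
Finally, the score/should_keep convention of the quality classifier can be illustrated with the toy sketch below; `predict_score` is a hypothetical stand-in for the trained classifier, which is not shown here.

```python
# Sketch only: annotate each row with its quality score and a should_keep
# flag; by default a row is kept when its score exceeds 0.9.
def annotate_quality(rows, predict_score, keep_threshold: float = 0.9):
    for row in rows:
        score = predict_score(row)
        row["score"] = score
        row["should_keep"] = 1 if score > keep_threshold else 0
    return rows
```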