DataFlow Release Notes v202411
New Feature: Toolkit Pool
This release of DataFlow introduces the Toolkit Pool, a set of tools covering dataset analysis, preprocessing, and postprocessing. The 13 newly added tools are described below:
Analysis Tools
analysis_common_internal
This Analyzer class analyzes a specific dataset. It computes stats for all filter operations in the config file, applies multiple analyses (e.g., OverallAnalysis, ColumnWiseAnalysis) to these stats, and generates analysis results (stats tables, distribution figures, etc.) to help users better understand the input dataset.
quality_classifier_common_internal
This Quality Classifier class predicts document scores on a dataset. It computes a score for every row and adds two columns, score and should_keep, to help users decide which rows to remove. By default, a row is marked should_keep=1 if its score is higher than 0.9.
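The default keep rule described above can be sketched in a few lines of Python. This is an illustration only, not the DataFlow implementation: the function names, the "text" column, and the toy scoring function are all assumptions standing in for the real classifier.

```python
# Illustrative sketch only; names and the "text" column are assumptions.
DEFAULT_KEEP_THRESHOLD = 0.9  # from the notes: keep if score is higher than 0.9


def annotate_rows(rows, score_fn, threshold=DEFAULT_KEEP_THRESHOLD):
    """Return copies of rows with added 'score' and 'should_keep' columns."""
    annotated = []
    for row in rows:
        score = score_fn(row["text"])
        should_keep = 1 if score > threshold else 0
        annotated.append({**row, "score": score, "should_keep": should_keep})
    return annotated


def toy_score(text):
    """Hypothetical stand-in for the classifier: share of letters and spaces."""
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)


rows = [{"text": "A clean sentence about data"}, {"text": "$$$ 1234 ###"}]
result = annotate_rows(rows, toy_score)
kept = [row for row in result if row["should_keep"] == 1]
```

Because the comparison is strict, a row whose score is exactly 0.9 is not kept under this reading of the default.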
Preprocessing Tools
dataset_spliter_by_language_preprocess_internal
Loads a dataset from the source directory, applies language identification using the LanguageIDScoreFilter filter operation, and finally splits the dataset by language and saves it.
prepare_dataset_from_repo_preprocess_internal
Prepares a dataset from a code repository in the following format: Repository Name, File Path in the Repository, File Contents.
raw_alpaca_cot_merge_add_meta_preprocess_internal
Converts the raw Alpaca-Cot data into jsonl files, merges instruction/input/output into text for processing, and adds meta info.
raw_arxiv_to_jsonl_preprocess_internal
Converts the raw arXiv data (gzipped tar file) into the jsonl format.
raw_stackexchange_to_jsonl_preprocess_internal
Converts the raw Stack Exchange data downloaded from Archive (ref: https://archive.org/download/stackexchange) into several jsonl files.
reformat_csv_nan_value_preprocess_internal
Reformats csv or tsv files that may contain NaN values by loading them with extra arguments, e.g., setting keep_default_na to False.
reformat_jsonl_nan_value_preprocess_internal
Reformats jsonl files that may contain NaN values. Traverses the jsonl files to find the first object that does not contain NaN, takes it as the reference feature type, then uses that type when loading all jsonl files.
serialize_meta_preprocess_internal
Serializes all fields in the jsonl file except those specified by the user, so that a jsonl file whose lines have inconsistent field formats can still be loaded normally as a dataset.
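The idea behind serialize_meta_preprocess_internal can be illustrated with a minimal sketch. It assumes that the non-excluded fields are packed into one JSON string stored under a single field; the function name, the "text" default, and the "serialized_meta" field name are hypothetical and are not taken from the tool itself.

```python
import json


def serialize_meta_line(line, keep_fields=("text",), target_field="serialized_meta"):
    """Keep keep_fields as-is and pack every other field into one JSON string.

    After this step each line has the same keys and types, even when the
    original lines disagreed on the shape of their metadata.
    """
    record = json.loads(line)
    kept = {k: v for k, v in record.items() if k in keep_fields}
    rest = {k: v for k, v in record.items() if k not in keep_fields}
    kept[target_field] = json.dumps(rest, ensure_ascii=False, sort_keys=True)
    return json.dumps(kept, ensure_ascii=False)


# Two lines whose "meta" field has inconsistent types (object vs. string).
lines = [
    '{"text": "first", "meta": {"source": "web"}}',
    '{"text": "second", "meta": "plain string", "id": 7}',
]
serialized = [json.loads(serialize_meta_line(line)) for line in lines]
```

Applying json.loads to the packed field recovers the original metadata, which is the kind of round trip a matching deserialize step would perform.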
Postprocessing Tools
count_token_postprocess_internal
Counts the number of tokens for a given dataset and tokenizer. Currently supports only the 'jsonl' format.
data_mixture_postprocess_internal
Mixes multiple datasets into one dataset. Randomly selects samples from every dataset and mixes these samples, then exports to a new mixed dataset. Supported suffixes include: ["jsonl", "json", "parquet"].
deserialize_meta_postprocess_internal
Deserializes the specified field in the jsonl file.
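As a rough illustration of the mixing step in data_mixture_postprocess_internal, the sketch below draws random samples from each dataset and shuffles them together. The per-dataset weights, the total sample count, the seed, and the function name are assumptions made for this example; the notes do not specify how the tool chooses how many samples to take from each dataset.

```python
import random


def mix_datasets(datasets, weights, total, seed=42):
    """Sample from each dataset in proportion to its weight, then shuffle.

    datasets: list of lists of samples; weights: one number per dataset.
    A dataset never contributes more samples than it contains.
    """
    rng = random.Random(seed)
    weight_sum = sum(weights)
    mixed = []
    for samples, weight in zip(datasets, weights):
        quota = min(len(samples), round(total * weight / weight_sum))
        mixed.extend(rng.sample(samples, quota))
    rng.shuffle(mixed)
    return mixed


web = [{"text": f"web-{i}", "source": "web"} for i in range(10)]
code = [{"text": f"code-{i}", "source": "code"} for i in range(10)]
mixed = mix_datasets([web, code], weights=[3, 1], total=8)
```

With weights 3 and 1 and a total of 8, this sketch takes 6 samples from the first dataset and 2 from the second; exporting the result to jsonl, json, or parquet is left out.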