Data Processing Algorithm Templates

DataFlow provides a variety of built-in data processing templates, including Basic Data Processing, Advanced Data Processing, and Data Augmentation templates. The platform will continue to expand with new templates to enhance data processing capabilities.

Users can also create custom templates or modify existing ones to build personalized data processing pipelines tailored to specific needs.

Default Template

Creating a New Algorithm Template

Modify a Built-in Template: Click Create on a built-in template card to open the template creation page.

Modify Template

Create a Template from Scratch: Click the Create Template button on the right to begin.

Create New Template

Fill in the Template Name, Task Type, and Template Description fields, and select the necessary operators and their execution order.

Note: Some operators require parameter configuration.

Once configured, click Creation Completed to start using the new template for data processing tasks.
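The form fields described above can be pictured as a small data structure: a name, a task type, a description, and an ordered list of operators, some of which carry parameters. The sketch below is only an illustration of that shape, not DataFlow's actual storage format or API. The class names, the task type value, and the parameter names (`min_len`, `max_len`) are assumptions made for the example. Only the operator IDs come from the operator table.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OperatorStep:
    """One selected operator plus its optional parameter configuration."""
    op_id: str  # operator ID as listed in the platform's operator table
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Template:
    """Hypothetical in-memory model of the template creation form."""
    name: str
    task_type: str
    description: str
    steps: list[OperatorStep]  # list order stands for execution order

    def validate(self) -> None:
        # Mirror the form's basic requirements: a name and at least one operator.
        if not self.name.strip():
            raise ValueError("Template Name is required")
        if not self.steps:
            raise ValueError("select at least one operator")


template = Template(
    name="web-text-cleaning",
    task_type="Basic Data Processing",  # assumed value, for illustration only
    description="Clean scraped text, then drop very short samples.",
    steps=[
        OperatorStep("clean_html_mapper"),
        OperatorStep("whitespace_normalization_mapper"),
        # Parameter names here are illustrative assumptions.
        OperatorStep("text_length_filter", {"min_len": 10, "max_len": 10000}),
    ],
)
template.validate()
print([step.op_id for step in template.steps])
```

Keeping the operators in a list makes the execution order explicit: cleaning mappers run first, so that the length filter measures the cleaned text rather than the raw input.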

Template Field Select Ops

Operators Supported by the Platform

ID	Name	Type	Description
chinese_convert_mapper	Chinese Converter	Mapper	Mapper to convert Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji.
clean_copyright_mapper	Copyright Cleaner	Mapper	Mapper to clean copyright comments at the beginning of the text samples.
clean_email_mapper	Email Cleaner	Mapper	Mapper to clean email addresses in text samples.
clean_html_mapper	HTML Code Cleaner	Mapper	Mapper to clean HTML code in text samples.
clean_ip_mapper	IP Cleaner	Mapper	Mapper to clean IPv4 and IPv6 addresses in text samples.
clean_links_mapper	Link Cleaner	Mapper	Mapper to clean links like http/https/ftp in text samples.
expand_macro_mapper	Expand Macro Definitions	Mapper	Mapper to expand macro definitions in the document body of LaTeX samples.
generate_code_qa_pair_mapper	Convert code to QA pair	Mapper	Mapper to generate new instruction data based on code.
extract_qa_mapper	QA pair extractor	Mapper	Mapper to extract question-and-answer pairs from text samples.
fix_unicode_mapper	Unicode Corrector	Mapper	Mapper to fix Unicode errors in text samples.
nlpaug_en_mapper	English Augment	Mapper	Mapper to apply simple augmentation to English samples using the nlpaug library.
nlpcda_zh_mapper	Chinese Augment	Mapper	Mapper to apply simple augmentation to Chinese samples using the nlpcda library.
optimize_instruction_mapper	Instruction Optimizer	Mapper	Mapper to optimize instruction.
punctuation_normalization_mapper	Unicode Punctuations Normalizor	Mapper	Mapper to normalize Unicode punctuation to English punctuation in text samples.
remove_bibliography_mapper	Bibliography Cleaner	Mapper	Mapper to remove the bibliography at the end of documents in LaTeX samples.
remove_comments_mapper	Comments Cleaner	Mapper	Mapper to remove comments in different kinds of documents. Only 'tex' is supported for now.
remove_header_mapper	Remove Header	Mapper	Mapper to remove headers at the beginning of documents in LaTeX samples.
remove_long_words_mapper	Long Words Cleaner	Mapper	Mapper to remove long words within a specific range.
remove_non_chinese_character_mapper	Non Chinese Cleaner	Mapper	Mapper to remove non-Chinese characters from text samples.
remove_repeat_sentences_mapper	Sentence De-duplication	Mapper	Mapper to remove repeated sentences in text samples.
remove_specific_chars_mapper	Specific Chars Cleaner	Mapper	Mapper to clean specific characters in text samples. Currently supported: ◆●■►▼▲▴∆▻▷❖♡□
remove_table_text_mapper	Table Texts Cleaner	Mapper	Mapper to remove table text from text samples. A regular expression removes tables whose number of columns falls within a specified range.
remove_words_with_incorrect_substrings_mapper	Incorrect Substring Cleaner	Mapper	Mapper to remove words with incorrect substrings.
replace_content_mapper	Content Replacement	Mapper	Mapper to replace all content in the text that matches a specific regular expression pattern with a designated replacement string.
sentence_split_mapper	Sentence Spliter	Mapper	Mapper to split text samples into sentences.
whitespace_normalization_mapper	Whitespace Normalizor	Mapper	Mapper to normalize different kinds of whitespace characters to the space character ' ' (0x20) in text samples.
alphanumeric_filter	Alphabet/Numeric Ratio Filter	Filter	Filter to keep samples with alphabet/numeric ratio within a specific range.
average_line_length_filter	Average Line Length Filter	Filter	Filter to keep samples with average line length within a specific range.
character_repetition_filter	Char-Level Repetition Ratio Filter	Filter	Filter to keep samples with char-level n-gram repetition ratio within a specific range.
flagged_words_filter	Flagged-Word Ratio Filter	Filter	Filter to keep samples with flagged-word ratio less than a specific max value.
language_id_score_filter	Specific Language Filter	Filter	Filter to keep samples in a specific language with confidence score larger than a specific min value.
maximum_line_length_filter	Maximum Line Length Filter	Filter	Filter to keep samples with maximum line length within a specific range.
perplexity_filter	Perplexity Score Filter	Filter	Filter to keep samples with perplexity score less than a specific max value.
special_characters_filter	Special-Char Ratio Filter	Filter	Filter to keep samples with special-char ratio within a specific range.
specified_field_filter	Specified Field Information Filter	Filter	Filter based on specified field information. If the value of the specified field in a sample is not among the specified target values, the sample is filtered out.
specified_numeric_field_filter	Specified Numeric Field Filter	Filter	Filter based on specified numeric field information. If the specified numeric value in a sample is not within the specified range, the sample is filtered out.
stopwords_filter	Stopword Ratio Filter	Filter	Filter to keep samples with stopword ratio larger than a specific min value.
suffix_filter	Specified Suffix Filter	Filter	Filter to keep samples with specified suffix.
text_action_filter	Texts Contain Actions Filter	Filter	Filter to keep texts that contain actions.
text_entity_dependency_filter	Texts Containing Entities Filter	Filter	Filter that identifies entities in the text that are independent of other tokens and filters them out. Texts containing no entities are omitted.
text_length_filter	Total Text Length Filter	Filter	Filter to keep samples with total text length within a specific range.
token_num_filter	Total Token Number Filter	Filter	Filter to keep samples with total token number within a specific range.
word_repetition_filter	Word-Level Repetition Ratio Filter	Filter	Filter to keep samples with word-level n-gram repetition ratio within a specific range.
words_num_filter	Total Words Number Filter	Filter	Filter to keep samples with total words number within a specific range.
document_deduplicator	Document Deduplicator(MD5 Hash)	Deduplicator	Deduplicator to deduplicate samples at document-level using exact matching. Uses an MD5 hash to deduplicate samples.
document_minhash_deduplicator	Document Deduplicator(MinHashLSH)	Deduplicator	Deduplicator to deduplicate samples at document-level using MinHashLSH. Unlike SimHash values, MinHash values are stored as bytes, so they are not kept in the final dataset.
document_simhash_deduplicator	Document Deduplicator(SimHash)	Deduplicator	Deduplicator to deduplicate samples at document-level using SimHash.
frequency_specified_field_selector	Sorted Frequency Selector	Selector	Selector to select samples based on the sorted frequency of specified field.
random_selector	Random Selector	Selector	Selector to randomly select samples.
range_specified_field_selector	Sorted Range Selector	Selector	Selector to select a range of samples based on the sorted specified field value from smallest to largest.
topk_specified_field_selector	Top Samples Selector	Selector	Selector to select top samples based on the sorted specified field value.
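The table groups operators into four types. As a rough illustration of how they differ, the sketch below uses simplified stand-ins for one operator of each type: a Mapper rewrites each sample, a Filter decides whether each sample is kept, a Deduplicator drops repeats across the dataset, and a Selector picks a subset. These functions are hypothetical and are not the platform's implementations. The function names, parameters, and sample data are assumptions made for the example.

```python
import hashlib
import random

samples = [
    {"text": "Hello\u00a0world"},  # contains a non-breaking space
    {"text": "Hello world"},
    {"text": "Hi"},
    {"text": "A longer\tsample of text"},
]


def whitespace_mapper(sample):
    """Mapper: transform one sample (replace any whitespace char with ' ')."""
    text = "".join(" " if ch.isspace() else ch for ch in sample["text"])
    return {**sample, "text": text}


def text_length_filter(sample, min_len=5, max_len=100):
    """Filter: return True to keep a sample whose length is within range."""
    return min_len <= len(sample["text"]) <= max_len


def md5_deduplicate(items):
    """Deduplicator: keep the first sample for each exact-match MD5 digest."""
    seen, kept = set(), []
    for item in items:
        digest = hashlib.md5(item["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(item)
    return kept


def random_select(items, k, seed=0):
    """Selector: pick up to k samples at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


mapped = [whitespace_mapper(s) for s in samples]
filtered = [s for s in mapped if text_length_filter(s)]
deduped = md5_deduplicate(filtered)
selected = random_select(deduped, k=1)

print([s["text"] for s in deduped])  # ['Hello world', 'A longer sample of text']
```

Order matters here: the two "Hello world" variants only become exact duplicates after the mapper has normalized the non-breaking space, so running the deduplicator before the mapper would keep both.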
