Data Processing Algorithm Templates

DataFlow provides a variety of built-in data processing templates, including Basic Data Processing, Advanced Data Processing, and Data Augmentation templates. The platform will continue to expand with new templates to enhance data processing capabilities.

Users can also create custom templates or modify existing ones to build personalized data processing pipelines tailored to specific needs.

[Figure: Default Template]

Creating a New Algorithm Template

  • Modify a Built-in Template: Click Create on a built-in template card to open the template creation page.

[Figure: Modify Template]

  • Create a Template from Scratch: Click the Create Template button on the right to begin.

[Figure: Create New Template]

Fill in the Template Name, Task Type, and Template Description fields, and select the necessary operators and their execution order.

Note: Some operators require parameter configuration.

Once configured, click Creation Completed to start using the new template for data processing tasks.
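The form fields described above can be pictured as a small data structure: a name, a task type, a description, and an ordered list of operators, some of which carry parameters. The sketch below is only an illustration of that shape. The class names, the task type value, and the parameter names `min_len` and `max_len` are assumptions made for the example, not the platform's actual schema or API; only the operator IDs come from the operator table.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OperatorStep:
    """One operator in a template, identified by its operator ID."""
    op_id: str
    # Parameter names are hypothetical; check each operator's form in the UI.
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Template:
    """Hypothetical in-memory model of the template creation form."""
    name: str
    task_type: str
    description: str
    steps: list[OperatorStep]  # list order stands for the execution order

    def validate(self) -> None:
        """Reject a template with no name or no operators selected."""
        if not self.name.strip():
            raise ValueError("Template Name is required")
        if not self.steps:
            raise ValueError("Select at least one operator")


template = Template(
    name="web-text-cleanup",
    task_type="Basic Data Processing",  # assumed value, for illustration only
    description="Strip HTML and links, then drop very short or very long samples.",
    steps=[
        OperatorStep("clean_html_mapper"),
        OperatorStep("clean_links_mapper"),
        OperatorStep("text_length_filter", {"min_len": 50, "max_len": 10000}),
    ],
)
template.validate()
print([step.op_id for step in template.steps])
```

Order matters in a template like this one: cleaning HTML and links before the length filter means the filter measures the cleaned text rather than the raw markup.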

[Figure: Template Field Select Ops]

Operators Supported by the Platform

| ID | Name | Type | Description |
|---|---|---|---|
| chinese_convert_mapper | Chinese Converter | Mapper | Mapper to convert Chinese text between Traditional Chinese, Simplified Chinese, and Japanese Kanji. |
| clean_copyright_mapper | Copyright Cleaner | Mapper | Mapper to clean copyright comments at the beginning of text samples. |
| clean_email_mapper | Email Cleaner | Mapper | Mapper to clean email addresses in text samples. |
| clean_html_mapper | HTML Code Cleaner | Mapper | Mapper to clean HTML code in text samples. |
| clean_ip_mapper | IP Cleaner | Mapper | Mapper to clean IPv4 and IPv6 addresses in text samples. |
| clean_links_mapper | Link Cleaner | Mapper | Mapper to clean links such as http/https/ftp in text samples. |
| expand_macro_mapper | Expand Macro Definitions | Mapper | Mapper to expand macro definitions in the document body of LaTeX samples. |
| generate_code_qa_pair_mapper | Convert code to QA pair | Mapper | Mapper to generate new instruction data based on code. |
| extract_qa_mapper | QA pair extractor | Mapper | Mapper to extract question-and-answer pairs from text samples. |
| fix_unicode_mapper | Unicode Corrector | Mapper | Mapper to fix Unicode errors in text samples. |
| nlpaug_en_mapper | English Augment | Mapper | Mapper to augment English samples using the nlpaug library. |
| nlpcda_zh_mapper | Chinese Augment | Mapper | Mapper to augment Chinese samples using the nlpcda library. |
| optimize_instruction_mapper | Instruction Optimizer | Mapper | Mapper to optimize instructions. |
| punctuation_normalization_mapper | Unicode Punctuations Normalizor | Mapper | Mapper to normalize Unicode punctuation to English punctuation in text samples. |
| remove_bibliography_mapper | Bibliography Cleaner | Mapper | Mapper to remove the bibliography at the end of documents in LaTeX samples. |
| remove_comments_mapper | Comments Cleaner | Mapper | Mapper to remove comments in different kinds of documents. Only 'tex' is supported for now. |
| remove_header_mapper | Remove Header | Mapper | Mapper to remove headers at the beginning of documents in LaTeX samples. |
| remove_long_words_mapper | Long Words Cleaner | Mapper | Mapper to remove long words within a specific range. |
| remove_non_chinese_character_mapper | Non Chinese Cleaner | Mapper | Mapper to remove non-Chinese characters in text samples. |
| remove_repeat_sentences_mapper | Sentence De-duplication | Mapper | Mapper to remove repeated sentences in text samples. |
| remove_specific_chars_mapper | Specific Chars Cleaner | Mapper | Mapper to clean specific characters in text samples. Currently supported: ◆●■►▼▲▴∆▻▷❖♡□ |
| remove_table_text_mapper | Table Texts Cleaner | Mapper | Mapper to remove table text from text samples. A regular expression removes tables whose column count falls within a specified range. |
| remove_words_with_incorrect_substrings_mapper | Incorrect Substring Cleaner | Mapper | Mapper to remove words with incorrect substrings. |
| replace_content_mapper | Content Replacement | Mapper | Mapper to replace all content in the text that matches a specific regular expression pattern with a designated replacement string. |
| sentence_split_mapper | Sentence Spliter | Mapper | Mapper to split text samples into sentences. |
| whitespace_normalization_mapper | Whitespace Normalizor | Mapper | Mapper to normalize different kinds of whitespace to the space character ' ' (0x20) in text samples. |
| alphanumeric_filter | Alphabet/Numeric Ratio Filter | Filter | Filter to keep samples with alphabet/numeric ratio within a specific range. |
| average_line_length_filter | Average Line Length Filter | Filter | Filter to keep samples with average line length within a specific range. |
| character_repetition_filter | Char-Level Repetition Ratio Filter | Filter | Filter to keep samples with char-level n-gram repetition ratio within a specific range. |
| flagged_words_filter | Flagged-Word Ratio Filter | Filter | Filter to keep samples with flagged-word ratio less than a specific max value. |
| language_id_score_filter | Specific Language Filter | Filter | Filter to keep samples in a specific language with confidence score larger than a specific min value. |
| maximum_line_length_filter | Maximum Line Length Filter | Filter | Filter to keep samples with maximum line length within a specific range. |
| perplexity_filter | Perplexity Score Filter | Filter | Filter to keep samples with perplexity score less than a specific max value. |
| special_characters_filter | Special-Char Ratio Filter | Filter | Filter to keep samples with special-char ratio within a specific range. |
| specified_field_filter | Specified Field Information Filter | Filter | Filter based on specified field information. Samples whose specified field value is not among the specified target values are filtered out. |
| specified_numeric_field_filter | Specified Numeric Field Filter | Filter | Filter based on specified numeric field information. Samples whose specified numeric field value is outside the specified range are filtered out. |
| stopwords_filter | Stopword Ratio Filter | Filter | Filter to keep samples with stopword ratio larger than a specific min value. |
| suffix_filter | Specified Suffix Filter | Filter | Filter to keep samples with a specified suffix. |
| text_action_filter | Texts Contain Actions Filter | Filter | Filter to keep texts that contain actions. |
| text_entity_dependency_filter | Texts Containing Entities Filter | Filter | Identifies entities in the text that are independent of other tokens and filters them out. Texts containing no entities are omitted. |
| text_length_filter | Total Text Length Filter | Filter | Filter to keep samples with total text length within a specific range. |
| token_num_filter | Total Token Number Filter | Filter | Filter to keep samples with total token number within a specific range. |
| word_repetition_filter | Word-Level Repetition Ratio Filter | Filter | Filter to keep samples with word-level n-gram repetition ratio within a specific range. |
| words_num_filter | Total Words Number Filter | Filter | Filter to keep samples with total number of words within a specific range. |
| document_deduplicator | Document Deduplicator(MD5 Hash) | Deduplicator | Deduplicator to deduplicate samples at document level using exact matching. Uses MD5 hashes to deduplicate samples. |
| document_minhash_deduplicator | Document Deduplicator(MinHashLSH) | Deduplicator | Deduplicator to deduplicate samples at document level using MinHashLSH. Unlike SimHash, MinHash values are stored as bytes, so they are not kept in the final dataset. |
| document_simhash_deduplicator | Document Deduplicator(SimHash) | Deduplicator | Deduplicator to deduplicate samples at document level using SimHash. |
| frequency_specified_field_selector | Sorted Frequency Selector | Selector | Selector to select samples based on the sorted frequency of a specified field. |
| random_selector | Random Selector | Selector | Selector to randomly select samples. |
| range_specified_field_selector | Sorted Range Selector | Selector | Selector to select a range of samples based on the specified field value, sorted from smallest to largest. |
| topk_specified_field_selector | Top Samples Selector | Selector | Selector to select top samples based on the sorted specified field value. |
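The four operator types in the table differ in what they return: a Mapper rewrites a sample, a Filter decides whether to keep it, a Deduplicator drops repeats across the whole dataset, and a Selector picks a subset. The following is a minimal sketch of that distinction in plain Python. The function names echo operator IDs from the table, but the signatures, the sample layout (`text` and `score` keys), the whitespace character set, and the descending sort in the selector are assumptions for illustration, not the platform's implementation.

```python
import hashlib
import re


def whitespace_normalization(sample: dict) -> dict:
    """Mapper: rewrite the text and keep the sample."""
    # Assumed subset of whitespace variants: tab, no-break space, en/em space,
    # and ideographic space.
    text = re.sub(r"[\t\u00a0\u2002\u2003\u3000]", " ", sample["text"])
    return {**sample, "text": text}


def text_length_filter(sample: dict, min_len: int = 10, max_len: int = 1000) -> bool:
    """Filter: return True to keep the sample."""
    return min_len <= len(sample["text"]) <= max_len


def document_deduplicate(samples: list[dict]) -> list[dict]:
    """Deduplicator: keep the first sample seen for each MD5 digest of the text."""
    seen: set[str] = set()
    unique = []
    for sample in samples:
        digest = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def topk_selector(samples: list[dict], field_name: str, k: int) -> list[dict]:
    """Selector: pick the k samples with the largest value in field_name."""
    return sorted(samples, key=lambda s: s[field_name], reverse=True)[:k]


samples = [
    {"text": "DataFlow\u3000cleans\ttext samples.", "score": 0.9},
    {"text": "DataFlow cleans text samples.", "score": 0.4},
    {"text": "too short", "score": 0.8},
    {"text": "A second, distinct document to keep.", "score": 0.7},
]

# Apply the operators in a fixed order, as a template would.
mapped = [whitespace_normalization(s) for s in samples]
filtered = [s for s in mapped if text_length_filter(s)]
deduped = document_deduplicate(filtered)
selected = topk_selector(deduped, "score", k=1)
print(selected)
```

Running the mapper before the deduplicator matters here: the first two samples only become exact duplicates once their whitespace is normalized, which is why exact-match deduplication is usually placed after the cleaning steps.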