Creating and Running a Data Processing Task
This tutorial uses the sample dataset file data_sample.jsonl to demonstrate DataFlow's data processing capabilities. Supported dataset formats include: jsonl, json, parquet, csv, txt, tsv, and jsonl.zst.
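If you need a starting point, the sketch below writes a minimal JSONL file of the kind DataFlow accepts and verifies that each line parses on its own. The records and field names (text, source) are hypothetical placeholders, not a schema required by DataFlow.

```python
import json

# Hypothetical example records; DataFlow does not mandate these field names.
records = [
    {"text": "DataFlow supports several dataset formats.", "source": "docs"},
    {"text": "Each JSONL line holds one standalone JSON object.", "source": "docs"},
]

# JSONL: one JSON object per line.
with open("data_sample.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Sanity check: every line must parse as JSON on its own.
with open("data_sample.jsonl", encoding="utf-8") as f:
    for line in f:
        json.loads(line)  # raises json.JSONDecodeError on a malformed line
print("data_sample.jsonl is valid JSONL")
```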
DataFlow Entry Points
DataFlow offers two entry points on the CSGHub platform for efficient data management:
- Entry 1: In the dataflow-dataset repository, click the Data Processing button to create a task. Note: make sure the dataset is under your personal repository; otherwise, the "Data Processing" button will not be available.
- Entry 2: Access DataFlow from the Data Pipelines option under your avatar.
Creating a Data Processing Task
When entering through Entry 1, the Data Source field is pre-filled. If using Entry 2, you will need to select it manually.
There are two types of data processing tasks: operator-based and tool-based.
- Operator-based tasks: Fill in the task name, data source, and branch, then select an algorithm template. You can also adjust operator parameter settings as needed (see the sketch after this list).
- Tool-based tasks: Fill in the task name, data source, and branch, then select a tool. Different tools require different parameter configurations; adjust them as needed.
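For orientation only, here is a sketch of what an operator-based task's settings might look like once filled in. Every name below (the keys, the template, the operator names, and their parameters) is a hypothetical illustration, not DataFlow's actual configuration schema; the real values are chosen in the CSGHub UI.

```python
# Purely illustrative sketch of a task configuration.
# All keys, operator names, and values are hypothetical placeholders.
task_config = {
    "task_name": "clean-data-sample",
    "data_source": "dataflow-dataset/data_sample.jsonl",
    "branch": "main",
    "template": "text-cleaning",  # hypothetical algorithm template
    "operators": [
        # Each operator carries its own adjustable parameters.
        {"name": "deduplicate", "params": {"similarity_threshold": 0.9}},
        {"name": "length_filter", "params": {"min_chars": 20}},
    ],
}
```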
Viewing Task Details
After completion, click Details to view the task's status and results.
- Processing Details: Displays operator information, running status, and processed data volume for each step.
- Session Processing Result: Compare session data before and after processing to analyze performance; a local comparison sketch follows this list.
- Task Log: View complete logs to track execution steps. Logs can be downloaded for further analysis.
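To double-check the processed data volume reported in the details view, you can compare record counts locally. The file names below are hypothetical; substitute the dataset you uploaded and the output you downloaded.

```python
import json

def count_records(path: str) -> int:
    """Count records in a JSONL file (one JSON object per non-empty line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Hypothetical file names: original upload vs. processed output.
before = count_records("data_sample.jsonl")
after = count_records("data_sample.processed.jsonl")

print(f"records before: {before}")
print(f"records after:  {after}")
print(f"removed by processing: {before - after}")
```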