
Creating and Running a Data Processing Task

This tutorial uses the sample dataset file data_sample.jsonl to demonstrate DataFlow’s data processing capabilities. Supported dataset formats include: jsonl, json, parquet, csv, txt, tsv, and jsonl.zst.
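
If you want to build a dataset like data_sample.jsonl yourself, note that JSON Lines stores one standalone JSON object per line. The sketch below writes a minimal file in this format; the field names ("id", "text") are assumptions, since the tutorial does not show the sample file's schema.

```python
import json

# Hypothetical records: the actual schema of data_sample.jsonl is not shown
# in this tutorial, so the "id" and "text" fields are assumptions.
records = [
    {"id": 1, "text": "DataFlow cleans and transforms raw text datasets."},
    {"id": 2, "text": "Each line of a .jsonl file is one standalone JSON object."},
]

# JSON Lines format: one JSON object per line, UTF-8 encoded.
with open("data_sample.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```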

DataFlow Entry Points

DataFlow offers two entry points on the CSGHub platform for efficient data management:

  • Entry 1: Open the dataflow-dataset repository and click the Data Processing button to create a task.

    Note: The dataset must be in your personal repository; otherwise, the "Data Processing" button will not be available.

Dataflow Entry 1

  • Entry 2: Access DataFlow via the Data Pipelines option in your avatar menu.

Dataflow Entry 2

Creating a Data Processing Task

When entering through Entry 1, the Data Source field is pre-filled; if you use Entry 2, select it manually. Then fill in the remaining fields, such as Task Name, Data Source Branch, Processing Text, and Task Template. Adjust the operator parameters as needed, and click the Creation Completed button to finish creating the task.

The fields in this tutorial will be configured as shown below:

Create Task Ops Setting
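
If you prefer to script task creation rather than use the form, CSGHub's platform actions are generally backed by HTTP APIs. The sketch below is purely illustrative: the endpoint path, payload field names, and authentication header are hypothetical and not taken from CSGHub's documented API; they simply mirror the form fields shown above.

```python
import requests

CSGHUB_URL = "https://hub.example.com"      # placeholder: your CSGHub instance
TOKEN = "YOUR_ACCESS_TOKEN"                 # placeholder credential

# Hypothetical payload mirroring the UI form fields above.
payload = {
    "task_name": "clean-data-sample",       # Task Name
    "data_source": "dataflow-dataset",      # Data Source
    "branch": "main",                       # Data Source Branch
    "text_field": "text",                   # Processing Text
    "template": "default",                  # Task Template
}

# Hypothetical endpoint: check your CSGHub instance's API reference.
resp = requests.post(
    f"{CSGHUB_URL}/api/v1/dataflow/tasks",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```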

Viewing Task Details

After creating the task, click Details to view its status and results.

Task List

  • Processing Details: Displays operator information, running status, and processed data volume for each step.

Task Details

  • Session Processing Result: Compares session data before and after processing to analyze performance; a small offline comparison sketch follows the screenshot below.

Session Result
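
Beyond the side-by-side view in the UI, you can compare the datasets offline once both files are available locally. A minimal sketch, assuming the original and processed datasets are exported as .jsonl files (the filenames are placeholders):

```python
def count_records(path: str) -> int:
    """Count non-empty lines (one JSON object each) in a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Placeholder filenames: download or export both versions first.
before = count_records("data_sample.jsonl")
after = count_records("data_sample_processed.jsonl")
print(f"records before: {before}, after: {after}, removed: {before - after}")
```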

  • Task Log: View complete logs to track execution steps. Logs can be downloaded for further analysis; a filtering sketch follows the screenshot below.

Task Log
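
Once downloaded, the log is plain text and can be filtered with ordinary tooling. The sketch below surfaces warning and error lines; the filename and level keywords are assumptions, since the tutorial does not specify the log format.

```python
# Assumed filename and level keywords; adjust to the actual downloaded log.
LEVELS = ("ERROR", "WARN")

with open("dataflow_task.log", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if any(level in line for level in LEVELS):
            print(f"{lineno}: {line.rstrip()}")
```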