Creating and Running a Data Processing Task
This tutorial uses the sample dataset file `data_sample.jsonl` to demonstrate DataFlow's data processing capabilities. Supported dataset formats include `jsonl`, `json`, `parquet`, `csv`, `txt`, `tsv`, and `jsonl.zst`.
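As a quick illustration of the JSONL format, each line of the file is an independent JSON object. The sketch below writes and reads a tiny JSONL file; the field name `text` is a hypothetical example, not taken from the actual `data_sample.jsonl`:

```python
import json

# Write a tiny JSONL sample (the "text" field is hypothetical,
# not the schema of the real data_sample.jsonl).
rows = [
    {"text": "Hello, world."},
    {"text": "DataFlow processes one JSON object per line."},
]
with open("data_sample.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Read it back: every non-empty line parses as a standalone JSON object.
with open("data_sample.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))  # 2
```

The same line-per-record structure applies to `jsonl.zst`, which is simply a Zstandard-compressed JSONL file.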
DataFlow Entry Points
DataFlow offers two entry points on the CSGHub platform for efficient data management:
- Entry 1: Open the dataset in the `dataflow-dataset` repository and click the Data Processing button to create a task. Note: the dataset must be under your personal repository; otherwise, the Data Processing button will not be available.
- Entry 2: Access DataFlow via the Data Pipelines option under your avatar.
Creating a Data Processing Task
When entering through Entry 1, the Data Source field is pre-filled; with Entry 2, you must select it manually. Then fill in the remaining fields, such as Task Name, Data Source Branch, Processing Text, and Task Template. Adjust the operator parameters as needed, then click the Creation Completed button to finalize the task.
The field values used in this tutorial are configured as shown below:
Viewing Task Details
After the task completes, click Details to view its status and results.
- Processing Details: Displays operator information, running status, and processed data volume for each step.
- Session Processing Result: Compare session data before and after processing to analyze performance.
- Task Log: View complete logs to track execution steps. Logs can be downloaded for further analysis.
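Once a log is downloaded, it can be inspected locally. A minimal sketch for flagging problem lines, assuming a conventional timestamped, level-tagged log format (the sample lines and level keywords below are illustrative, not actual DataFlow output):

```python
# Filter a downloaded task log for warnings and errors.
# The log lines and level keywords are assumed examples.
keywords = ("ERROR", "WARNING")

log_lines = [
    "2024-01-01 10:00:00 INFO step 1 started",
    "2024-01-01 10:00:05 WARNING 3 rows skipped",
    "2024-01-01 10:00:09 ERROR operator failed on row 42",
]

flagged = [line for line in log_lines if any(k in line for k in keywords)]
for line in flagged:
    print(line)
```

In practice, `log_lines` would be read from the downloaded log file (e.g. with `open(path).readlines()`), and the keyword list adjusted to match the levels that actually appear in the log.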