Dataset

Each Dataset maps to an underlying storage medium in [R]DP, such as a streaming topic, a blob storage location, or an Iceberg table.

Overview

Datasets are primarily used in the Transformation Engine for two reasons:

  1. To store intermediate or final results of Transformations for later ingest or visualization.

  2. To allow cross-Transformer coordination at high data volumes while providing fault tolerance.

Many Transformers work entirely with Datasets; that is, all of their inputs and outputs are Dataset connections. The notable exception is [R]DP Workflows, which follow a fundamentally different Transformer paradigm.

There are several types of Datasets available:

  • INTERNAL_KAFKA Datasets map to streaming topics in [R]DP and allow configuration of the underlying topic settings.

  • INTERNAL_MINIO Datasets map to buckets and/or key prefixes in [R]DP’s blob storage.

  • INTERNAL_ICEBERG Datasets map to Iceberg tables within [R]DP.
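
The mapping from Dataset type to storage medium can be summarized in a small Python sketch. Only the three type names come from the list above; the `DatasetType` enum, the `BACKING_STORAGE` table, the `DatasetRef` class, and the example Dataset names are hypothetical illustrations, not part of any [R]DP API.

```python
from dataclasses import dataclass
from enum import Enum


class DatasetType(Enum):
    """The three Dataset types named in the list above."""
    INTERNAL_KAFKA = "INTERNAL_KAFKA"
    INTERNAL_MINIO = "INTERNAL_MINIO"
    INTERNAL_ICEBERG = "INTERNAL_ICEBERG"


# Storage medium behind each type, paraphrased from the list above.
BACKING_STORAGE = {
    DatasetType.INTERNAL_KAFKA: "streaming topic",
    DatasetType.INTERNAL_MINIO: "blob storage bucket and/or key prefix",
    DatasetType.INTERNAL_ICEBERG: "Iceberg table",
}


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical handle for a Dataset (illustration only, not an [R]DP class)."""
    name: str
    dataset_type: DatasetType

    def describe(self) -> str:
        # Render "<name>: <type> -> <storage medium>" for quick inspection.
        medium = BACKING_STORAGE[self.dataset_type]
        return f"{self.name}: {self.dataset_type.value} -> {medium}"


if __name__ == "__main__":
    # The Dataset names below are made up for the example.
    for ref in (
        DatasetRef("orders-stream", DatasetType.INTERNAL_KAFKA),
        DatasetRef("raw-files", DatasetType.INTERNAL_MINIO),
        DatasetRef("orders-table", DatasetType.INTERNAL_ICEBERG),
    ):
        print(ref.describe())
```

Keeping the type-to-storage table in one place makes it easy to see at a glance which medium a given Dataset connection will read from or write to.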