Dataset
Datasets map to underlying storage media in [R]DP.
Overview
Datasets are primarily used in the Transformation Engine for two reasons:
- To store intermediate or final results of Transformations for later ingest or visualization.
- To allow cross-Transformer coordination at high data volumes while providing fault tolerance.
Many Transformers work entirely with Datasets; that is, all of their inputs and outputs are Dataset connections. The exception to this rule is [R]DP Workflows, which represent a fundamentally different Transformer paradigm.
There are several types of Datasets available:
- INTERNAL_KAFKA Datasets map to streaming topics in [R]DP and allow configuration of the underlying topic settings.
- INTERNAL_MINIO Datasets map to buckets and/or key prefixes in [R]DP’s blob storage.
- INTERNAL_ICEBERG Datasets map to Iceberg tables within [R]DP.
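As a rough illustration, the mapping from Dataset type to backing storage can be modeled as a small lookup. This is a minimal sketch and not [R]DP's actual API: only the three type names come from the list above, while `DatasetType`, `BACKING_STORE`, `DatasetRef`, and `describe` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class DatasetType(Enum):
    """The three Dataset types named in the list above."""
    INTERNAL_KAFKA = "INTERNAL_KAFKA"
    INTERNAL_MINIO = "INTERNAL_MINIO"
    INTERNAL_ICEBERG = "INTERNAL_ICEBERG"


# What each type maps to, as described in the list above.
BACKING_STORE = {
    DatasetType.INTERNAL_KAFKA: "streaming topic",
    DatasetType.INTERNAL_MINIO: "blob storage bucket and/or key prefix",
    DatasetType.INTERNAL_ICEBERG: "Iceberg table",
}


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical reference to a Dataset (assumption, not an [R]DP type)."""
    name: str
    kind: DatasetType

    def backing_store(self) -> str:
        # Look up the storage medium for this Dataset's type.
        return BACKING_STORE[self.kind]


def describe(ref: DatasetRef) -> str:
    """Return a one-line summary of a Dataset and its storage medium."""
    return f"{ref.name}: {ref.kind.value} -> {ref.backing_store()}"
```

For example, `describe(DatasetRef("orders-enriched", DatasetType.INTERNAL_ICEBERG))` returns `"orders-enriched: INTERNAL_ICEBERG -> Iceberg table"`. The Dataset name here is made up for the example.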