Understand repartitioning logic

Pentaho Data Integration

Version
9.3.x
Part Number
MK-95PDIA003-15

Data distribution in the steps is shown in the following table.


Data distributions by steps

As you can see, the CSV file input step divides the work between two step copies, and each copy reads 50 rows of data. However, these two step copies must also make sure that each row ends up on the correct count by state step copy, where the rows arrive in a 43/57 split. For this reason, as a general rule, the step that performs the repartitioning (row redistribution) of the data, which is a non-partitioned step placed before a partitioned one, keeps an internal buffer from every source step copy to every target step copy, as shown below.


Work division between step copies with partitioning
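
The buffer layout described above can be sketched in a few lines of Python. This is only an illustration, not PDI code: the names repartition, target_copy, and N_TARGETS are hypothetical, the sample rows are invented, and a CRC32 checksum stands in for whatever routing rule the transformation actually uses.

```python
import zlib
from collections import deque

N_TARGETS = 2  # assumed number of copies of the partitioned target step


def target_copy(row):
    # Hypothetical routing rule: checksum of the state name modulo the copy count.
    return zlib.crc32(row["state"].encode("utf-8")) % N_TARGETS


def repartition(source_copies, n_targets, route):
    # One buffer per (source copy, target copy) pair.
    buffers = [[deque() for _ in range(n_targets)] for _ in source_copies]
    for src, rows in enumerate(source_copies):
        for row in rows:
            buffers[src][route(row)].append(row)
    return buffers


# Invented sample data: two source copies, three rows each.
source_copies = [
    [{"state": "NY"}, {"state": "CA"}, {"state": "TX"}],
    [{"state": "CA"}, {"state": "NY"}, {"state": "FL"}],
]
buffers = repartition(source_copies, N_TARGETS, target_copy)

for tgt in range(N_TARGETS):
    received = sum(len(buffers[src][tgt]) for src in range(len(source_copies)))
    print(f"target copy {tgt} receives {received} rows")
```

With two source copies and two target copies the sketch allocates four buffers. Each source copy reads the same number of rows, but the number of rows each target copy receives depends on the routing rule, which is why an even split on the input side can become an uneven split on the output side.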

This is where partitioning becomes useful: it applies a rule to direct rows to step copies, so that all rows from the same state go to the same step copy instead of being split arbitrarily. In the example below, a partition schema called State is applied to the count by state step, and the Remainder of division partitioning rule is applied to the State field. The count by state aggregation step now produces consistent, correct results because the rows are split up according to the partition schema and rule, as shown in the preview data.


Partitioning data using rule-based aggregation
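
The effect of a remainder-based rule on an aggregation can be sketched as follows. This is a simplified model under stated assumptions, not the PDI implementation: partition_of and count_by_state are hypothetical names, the state values are invented, and CRC32 is used here only as a deterministic stand-in for turning a string into a number before taking the remainder.

```python
import zlib
from collections import Counter


def partition_of(value, n_partitions):
    # Integers use the remainder directly; other values are first reduced
    # to a number (CRC32 is an assumption made for this sketch).
    if isinstance(value, int):
        return value % n_partitions
    return zlib.crc32(str(value).encode("utf-8")) % n_partitions


def count_by_state(states, n_copies, partitioned):
    # Each step copy keeps its own counts, as a parallel aggregation would.
    copies = [Counter() for _ in range(n_copies)]
    for i, state in enumerate(states):
        target = partition_of(state, n_copies) if partitioned else i % n_copies
        copies[target][state] += 1
    return copies


states = ["NY", "CA", "NY", "TX", "CA", "NY"]  # invented sample values
round_robin = count_by_state(states, 2, partitioned=False)
by_rule = count_by_state(states, 2, partitioned=True)

print("round robin:", [dict(c) for c in round_robin])
print("partitioned:", [dict(c) for c in by_rule])
```

Without the rule, rows for the same state land on both copies and each copy reports only a partial count. With the rule, every row for a given state reaches the same copy, so each state is counted exactly once and in full.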

Note: To view this transformation in the PDI client, open the Pentaho/…/design-tools/data-integration/samples/transformations/General - parallel reading and aggregation.ktr sample file.