Cluster tab, Pentaho MapReduce

The following options configure the Hadoop cluster connection for the Pentaho MapReduce entry:

Hadoop job name: Enter the name of the Hadoop job you are running. This name is required for the Pentaho MapReduce entry to work.

Hadoop Cluster: Specify the Hadoop cluster configuration to use for this job. See the Install Pentaho Data Integration and Analytics document for general information on Hadoop cluster configurations.
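
For orientation, a named cluster configuration ultimately supplies standard Hadoop client settings to the job that PDI submits. The sketch below is illustrative only; the class name, hostnames, and ports are hypothetical placeholders, and the real values come from your cluster configuration.

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterConnectionSketch {
    // Illustrative only: a cluster configuration resolves to standard
    // Hadoop client settings like these when the job is submitted.
    // All hostnames and ports below are hypothetical placeholders.
    public static Configuration buildConf() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");     // HDFS entry point
        conf.set("mapreduce.framework.name", "yarn");                     // run on YARN
        conf.set("yarn.resourcemanager.address", "rm.example.com:8032");  // job submission endpoint
        return conf;
    }
}
```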

Number of Mapper Tasks: Enter the number of mapper tasks to assign to this job. The size of the input should determine the number of mapper tasks. Typically, there should be between 10 and 100 maps per node, though you can specify a higher number when the mapper tasks are not CPU-intensive.
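
In the Hadoop client API this setting corresponds to the standard mapreduce.job.maps property, which is only a hint: the actual number of map tasks is ultimately driven by the number of input splits. A minimal sketch, with a hypothetical job name and example value:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapperCountSketch {
    public static Job withMapHint(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "wordcount"); // hypothetical job name
        // Hint only: Hadoop derives the real map count from the input splits,
        // so the input size should drive this number, e.g. 4 nodes * 10 maps.
        job.getConfiguration().setInt("mapreduce.job.maps", 40);
        return job;
    }
}
```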
Number of Reducer Tasks: Enter the number of reducer tasks to assign to this job. With lower numbers, the reduce operations can launch immediately and start transferring map outputs as the maps finish. With higher numbers, the nodes finish their first round of reduces sooner and launch a second round sooner. Increasing the number of reduce operations increases the Hadoop framework overhead, but improves load balancing.
Note: If this value is set to 0, no reduce operation is performed, and the output of the mapper becomes the output of the entire job. Combiner operations are also not performed.
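
The equivalent knob in Hadoop's job API is Job.setNumReduceTasks. A minimal sketch; the reducer count of 8 is an arbitrary example:

```java
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    public static void configure(Job job) {
        // More reducers: more framework overhead, better load balancing.
        job.setNumReduceTasks(8); // arbitrary example value

        // Zero reducers turns this into a map-only job: mapper output is
        // written directly as the job output and any combiner is skipped.
        // job.setNumReduceTasks(0);
    }
}
```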
Logging Interval: Enter the number of seconds between log messages.
Enable Blocking: Select to force the job to wait until each step completes before continuing to the next step. Blocking is the only way for PDI to be aware of a Hadoop job's status.
Note: If this option is not selected, the Hadoop job executes blindly, and PDI moves on to the next job entry. Error handling and routing do not work unless this option is selected.
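
The blocking behavior mirrors the two submission styles in Hadoop's job API: waitForCompletion blocks until the job finishes and reports success or failure, while submit returns immediately. A minimal sketch, with a hypothetical job name and the mapper and I/O setup omitted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BlockingSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-mapreduce-job"); // hypothetical name
        // Mapper, input, and output configuration omitted for brevity.

        // With blocking: the call returns only when the job finishes, so the
        // caller can observe success or failure and route errors accordingly.
        boolean ok = job.waitForCompletion(true); // true = print progress to the log

        // Without blocking, submission would be fire-and-forget:
        // job.submit(); // returns immediately; job status is never observed here

        System.exit(ok ? 0 : 1);
    }
}
```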