The following table describes the options for setting up configurations for the Hadoop cluster connection:
Option | Definition |
---|---|
Hadoop job name | Enter the name of the Hadoop job you are running. It is required for the Pentaho MapReduce entry to work. |
Hadoop Cluster | Specify the configuration of your Hadoop cluster. See the Install Pentaho Data Integration and Analytics document for general information on Hadoop cluster configurations. |
Number of Mapper Tasks | Enter the number of mapper tasks you want to assign to this job. The size of the inputs should determine the number of mapper tasks. Typically, there should be between 10 and 100 mapper tasks per node, though you can specify a higher number for mapper tasks that are not CPU-intensive. |
Number of Reducer Tasks | Enter the number of reducer tasks you want to assign to this job. Lower numbers mean that the reduce operations can launch immediately and start transferring map outputs as the maps finish. The higher the number, the quicker the nodes will finish their first round of reduces and launch a second round. Increasing the number of reduce operations increases the Hadoop framework overhead, but improves load balancing. Note: If this is set to 0, then no reduce operation is performed, and the output of the mapper becomes the output of the entire job; combiner operations are also not performed (see the sketch after this table). |
Logging Interval | Enter the number of seconds between log messages. |
Enable Blocking | Select to force the job to wait until each step completes before continuing to the next step. This is the only way for PDI to be aware of a Hadoop job's status. Note: If this option is not selected, the Hadoop job executes blindly and PDI moves on to the next job entry. Error handling and routing do not work unless this option is selected. |
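The job name, mapper/reducer task counts, and blocking behavior described above mirror settings in the underlying Hadoop MapReduce API. The following is a minimal sketch written against the plain Hadoop Java API (not code generated by Pentaho MapReduce) to show where each option surfaces when a job is written by hand; the class name, job name, and task counts are illustrative assumptions only.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReducerCountSketch {

    // Minimal word-count mapper: emits (word, 1) for each token in a line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Minimal reducer: sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // "Hadoop job name" corresponds to the name given when the job is created.
        Job job = Job.getInstance(conf, "word-count-sketch");
        job.setJarByClass(ReducerCountSketch.class);

        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // skipped entirely when reducers = 0
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // "Number of Mapper Tasks" is only a hint to the framework;
        // the actual count is driven by the number of input splits.
        job.getConfiguration().setInt("mapreduce.job.maps", 20);

        // "Number of Reducer Tasks": setting this to 0 skips the reduce (and combine)
        // phase entirely, so the mapper output becomes the output of the entire job.
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // "Enable Blocking" is analogous to waitForCompletion(), which blocks until
        // the job finishes and reports success or failure; job.submit() would return
        // immediately without tracking the job's status.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```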