General

Pentaho Data Integration

Version: 9.3.x
Part Number: MK-95PDIA003-15

Setup information for the Spark Submit job entry is detailed below. For additional information about configuring the spark-submit utility, see the Cloudera documentation.

The following fields are used to set up your Spark job:

• Entry Name: Specify the name of the entry. You can customize it or leave it as the default.
• Spark Submit Utility: Specify the name of the script that launches the Spark job, which is the batch or shell file name of the underlying spark-submit tool. For example, spark2-submit.
• Master URL: Select a master URL for the cluster from the drop-down list:
    • yarn-cluster: Runs the driver program as a thread of the YARN application master, on one of the node managers in the cluster. This option is similar to the way MapReduce works.
    • yarn-client: Runs the driver program on the YARN client. Tasks still execute in the node managers of the YARN cluster.
• Type: Select the file type of the Spark job you want to submit. Your job can be written in Java, Scala, or Python. The fields displayed on the Files tab depend on the language you select. Python support on Windows requires Spark version 2.3.x or higher. See the example commands after this list.
• Enable Blocking: Select Enable Blocking to have the Spark Submit entry wait until the Spark job finishes running. If this option is not selected, the Spark Submit entry proceeds with its execution once the Spark job is submitted. Blocking is enabled by default.
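
For orientation, the entry ultimately drives the spark-submit utility named above. The following is a rough sketch of the kind of command that corresponds to a Java or Scala job in yarn-cluster mode; the utility name, class, jar path, and arguments are placeholders, not values prescribed by PDI:

    # Hypothetical example: submit a Java/Scala job in cluster mode.
    # "--master yarn --deploy-mode cluster" is the spark-submit equivalent
    # of the yarn-cluster option in the Master URL drop-down.
    spark2-submit \
      --master yarn \
      --deploy-mode cluster \
      --class org.example.MyApp \
      /path/to/my-app.jar \
      arg1 arg2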

The yarn-cluster and yarn-client modes are supported. For descriptions of these modes, see the Spark documentation.
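
A Python job in yarn-client mode might be sketched as follows; again, the script path and arguments are placeholders:

    # Hypothetical example: submit a Python job in client mode.
    # "--master yarn --deploy-mode client" is the spark-submit equivalent
    # of the yarn-client option: the driver runs on the client machine
    # while tasks execute in the node managers of the cluster.
    spark2-submit \
      --master yarn \
      --deploy-mode client \
      /path/to/my_job.py \
      arg1 arg2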

Note: If you have configured your Hadoop cluster and Spark for Kerberos, a valid Kerberos ticket must already be in the ticket cache on your client machine before you launch the Spark Submit job.
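
For example, assuming a keytab-based login (the principal and keytab path below are placeholders), you could obtain and verify a ticket before running the job:

    # Obtain a Kerberos ticket using a keytab; principal and path are placeholders.
    kinit -kt /path/to/user.keytab user@EXAMPLE.COM

    # Confirm that a valid ticket is in the ticket cache.
    klist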