Complete the following steps to install and
configure the Spark client:
- On the client, download the Spark distribution that is the same version as, or a later version than, the one used on the cluster.
- Set the HADOOP_CONF_DIR environment variable to a folder containing cluster configuration files, as shown in the following sample for an already-configured driver:
  - <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>
- Navigate to <SPARK_HOME>/conf and create the spark-defaults.conf file using the instructions outlined in https://spark.apache.org/docs/latest/configuration.html.
- Create a ZIP archive containing all the JAR files in the SPARK_HOME/jars directory.
- Copy the ZIP file from the local file system to a world-readable location on the cluster.
- Edit the spark-defaults.conf file to set the spark.yarn.archive property to the world-readable location of your ZIP file on the cluster, as shown in the following example:
  - spark.yarn.archive hdfs://<NameNode hostname>:8020/user/spark/lib/<your ZIP file>
- Add the following line to the spark-defaults.conf file:
  - spark.hadoop.yarn.timeline-service.enabled false
- If you are connecting to an HDP cluster, add the following lines to the spark-defaults.conf file:
  - spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557
  - spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557
  Note: The -Dhdp.version value should match the HDP version used on the cluster.
- If you are connecting to an HDP cluster, also create a text file named java-opts in the <SPARK_HOME>/conf folder and add your HDP version to it, as shown in the following example:
  - -Dhdp.version=2.3.0.0-2557
  Note: Run the hdp-select status Hadoop client command to determine your version of HDP.
- If you are connecting to a supported version of an HDP or CDH cluster, open the core-site.xml file, then comment out the net.topology.script.file.name property, as shown in the following code block:
<!--
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology_script.py</value>
</property>
-->
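The spark-defaults.conf entries named in the steps above can be collected in one place. The following is a minimal Python sketch, not part of the product: the function name render_spark_defaults and the host and file names in the usage block are hypothetical placeholders, and the HDP-only lines are emitted only when a version is supplied.

```python
def render_spark_defaults(archive_uri, hdp_version=None):
    """Return spark-defaults.conf text for the properties named in the steps above.

    archive_uri: world-readable location of the ZIP file on the cluster.
    hdp_version: HDP version string; pass None when the cluster is not HDP.
    """
    entries = [
        ("spark.yarn.archive", archive_uri),
        ("spark.hadoop.yarn.timeline-service.enabled", "false"),
    ]
    if hdp_version is not None:
        # These two properties apply to HDP clusters only.
        java_opt = f"-Dhdp.version={hdp_version}"
        entries.append(("spark.driver.extraJavaOptions", java_opt))
        entries.append(("spark.yarn.am.extraJavaOptions", java_opt))
    # One "key value" pair per line, matching the samples in the steps.
    return "".join(f"{key} {value}\n" for key, value in entries)


if __name__ == "__main__":
    # Placeholder values: substitute your own NameNode host, ZIP file name,
    # and the HDP version reported for your cluster.
    print(
        render_spark_defaults(
            "hdfs://namenode.example.com:8020/user/spark/lib/spark-jars.zip",
            hdp_version="2.3.0.0-2557",
        ),
        end="",
    )
```

The output can be pasted into <SPARK_HOME>/conf/spark-defaults.conf; the sketch does not write the file itself, so an existing configuration is never overwritten.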
The Spark client is now ready for use with Spark
Submit in PDI.
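The step that creates a ZIP archive of the JAR files in the SPARK_HOME/jars directory can be scripted. The following is a minimal Python sketch under stated assumptions: the function name zip_spark_jars and the output name spark-jars.zip are hypothetical, and the JAR files are placed at the root of the archive, which is the layout the Spark documentation describes for spark.yarn.archive. Copying the archive to the cluster is not covered here.

```python
import os
import zipfile
from pathlib import Path


def zip_spark_jars(spark_home, archive_path):
    """Write every *.jar in <spark_home>/jars into a ZIP file; return the count."""
    jars_dir = Path(spark_home) / "jars"
    jar_files = sorted(jars_dir.glob("*.jar"))
    # JAR files are already compressed, so store them without recompression.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for jar in jar_files:
            # arcname=jar.name drops the directory part, so each JAR
            # sits at the root of the archive.
            archive.write(jar, arcname=jar.name)
    return len(jar_files)


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        count = zip_spark_jars(spark_home, "spark-jars.zip")
        print(f"Archived {count} JAR files")
    else:
        print("SPARK_HOME is not set")
```

Run the script on the client after SPARK_HOME is set, then copy the resulting ZIP file to the world-readable location that spark.yarn.archive points to.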