To add a cluster connection manually, you need access to the location of the required site.xml files, which are typically provided by your cluster administrator. If you are using high availability (HA) clusters, you must manually add the connection information using this method.
This task assumes you are in the PDI client.
Perform the following steps to manually add a named connection in the Hadoop Clusters dialog box.
- In the PDI client, create a new job or transformation or open an existing one.
- Click the View tab and then right-click the Hadoop Clusters folder.
- From the menu that displays, click New cluster. The Hadoop Clusters dialog box appears.
- Enter the connection information from your cluster administrator in the Hadoop Clusters dialog box.
Note: As a best practice, use Kettle variables for each connection parameter value to reduce risks associated with running jobs and transformations in environments that are disconnected from the repository.

| Option | Description |
| --- | --- |
| Cluster Name | Enter the name you want to assign to the cluster connection. Note: Valid cluster names may include uppercase and lowercase letters, numbers, and hyphens. However, the cluster name cannot end with a hyphen. To ensure a valid cluster name, do not use any other symbols, punctuation characters, or blank spaces. After you create the connection, you can locate this named connection in the View tab on the PDI client. |
| Driver and Version | Select the distribution of Hadoop on your cluster and its version number. Pentaho ships with supported versions of Amazon EMR, Cloudera, Google Dataproc, and Hortonworks that you can install. |
| Where are your site XML files? (Optional) | Enter the location of the site.xml files provided by your cluster administrator. Click Browse to select file(s) and browse to the directory containing your site.xml files. Pentaho creates the applicable directory on the machine where the PDI client is located and copies the site.xml files to that directory. If you leave this option blank, Pentaho creates the directory for the distribution and version of Hadoop you selected in the Driver and Version options. You must then copy the site.xml files to that directory. |
| Hostname (HDFS) | Enter the hostname for the HDFS node in your Hadoop cluster. |
| Port (HDFS) | Enter the port for the HDFS node in your Hadoop cluster. Note: If your cluster is enabled for high availability (HA), then you do not need a port number. Clear the port number. |
| Username (HDFS) and Password (HDFS) | Enter the user name and password for the HDFS node, which are provided by your cluster administrator. |
| Hostname (JobTracker) and Port (JobTracker) | Enter the hostname and port for the JobTracker node in your Hadoop cluster. If you have a separate job tracker node, enter the hostname here. |
| Hostname (ZooKeeper) and Port (ZooKeeper) | Enter the hostname and port for the ZooKeeper node in your Hadoop cluster. Supply these options only if you want to connect to a ZooKeeper service. |
| URL (Oozie) | Enter the Oozie client address. Supply this address only if you want to connect to the Oozie service. |
| Bootstrap servers (Kafka) | Enter the host/port pair(s) for the initial connection to the Kafka cluster. Use a comma-separated list for multiple servers, for example, ‘host1:port1,host2:port2’. Although you do not need to include all the servers used for Kafka, you might want to include more than one in the event that a server is down. |

- Click Next and specify the security option for your cluster.
  - If your Hadoop cluster is non-secure, select None and click Next to test your connection.
  - If your Hadoop cluster is secure, you need to add security to your cluster connection. See Add security to cluster connections for instructions.
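
Before typing values into the Hadoop Clusters dialog box, it can help to check them against the rules described for the Cluster Name and Bootstrap servers (Kafka) options. The following Python sketch is illustrative only and is not part of PDI: the function names `is_valid_cluster_name` and `parse_bootstrap_servers` are hypothetical, the name check assumes that "letters" means ASCII letters, and the server check assumes that each port is a plain number.

```python
import re

# Assumption: "letters" means ASCII letters. The pattern allows letters,
# digits, and hyphens, and requires the last character not to be a hyphen.
_CLUSTER_NAME = re.compile(r"[A-Za-z0-9-]*[A-Za-z0-9]")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name follows the documented cluster name rules."""
    return _CLUSTER_NAME.fullmatch(name) is not None


def parse_bootstrap_servers(value: str) -> list[tuple[str, int]]:
    """Split a comma-separated 'host:port' list into (host, port) pairs."""
    servers = []
    for entry in value.split(","):
        # rpartition splits on the last colon, so the port is the final field.
        host, sep, port = entry.strip().rpartition(":")
        if not sep or not host or not port.isdecimal():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers


if __name__ == "__main__":
    print(is_valid_cluster_name("prod-cluster-01"))
    print(is_valid_cluster_name("prod-cluster-"))
    print(parse_bootstrap_servers("host1:9092,host2:9093"))
```

A check like this only catches formatting mistakes early. It does not confirm that the hosts are reachable, so you still need to test the connection in the dialog box.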