The HBase Row Decoder step is designed specifically for use in MapReduce transformations to decode the key and value data output by the TableInputFormat. The key is the HBase row key; the value is an HBase Result object containing all of the column values for that row.
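For reference, the sketch below shows, in plain HBase MapReduce terms, the record shape that TableInputFormat produces: an ImmutableBytesWritable row key and a Result holding every matched column for that row. The class and its output fields are illustrative only and are not part of the Pentaho step.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Illustrative mapper: TableInputFormat supplies the HBase row key as an
// ImmutableBytesWritable and all column values for the row as a Result.
public class RowShapeMapper extends TableMapper<Text, Text> {
  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    String rowKey = Bytes.toString(key.copyBytes()); // the HBase row key
    for (Cell cell : value.rawCells()) {             // one Cell per column value
      String column = Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
          + Bytes.toString(CellUtil.cloneQualifier(cell));
      context.write(new Text(rowKey),
          new Text(column + "=" + Bytes.toString(CellUtil.cloneValue(cell))));
    }
  }
}
```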
First, create a Pentaho MapReduce job entry that includes a transformation containing a MapReduce Input step and an HBase Row Decoder step, as shown below:
In the transformation, open the MapReduce Input step. Configure the Key field and Value field to produce a serialized result by selecting Serializable in the Type field:
Next, open the HBase Row Decoder step and set the Key field to the key and the HBase result field to the value produced by the MapReduce Input step.
Then, define or load a mapping in the Create/Edit mappings tab. Note that once defined (or loaded), this mapping is captured in the transformation metadata.
Next, configure the Pentaho MapReduce job entry so that input splits are created using the TableInputFormat. Define the Input Path and Input format fields in the Job Setup tab, as shown below.
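To see what this configuration corresponds to outside of Pentaho, the hedged sketch below wires TableInputFormat onto a job using the standard HBase helper TableMapReduceUtil, so that input splits are created from the table's regions. The table name is a placeholder, the mapper is the illustrative RowShapeMapper from earlier, and none of this code is written when using the Pentaho job entry itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TableScanJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hbase-table-scan");
    job.setJarByClass(TableScanJob.class);
    // Installs TableInputFormat on the job so that one input split is
    // created per table region. "weblogs" is a placeholder table name.
    TableMapReduceUtil.initTableMapperJob(
        "weblogs", new Scan(), RowShapeMapper.class,
        Text.class, Text.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```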
Finally, in the User Defined tab, assign a Name and Value for each property shown in the table below to configure the scan performed by the TableInputFormat:
| Name | Value |
|---|---|
| hbase.mapred.inputtable | The name of the HBase table to read from. (Required) |
| hbase.mapred.tablecolumns | A space-delimited list of columns in ColFam:ColName format. To read all of the columns from a family, omit the ColName. (Required) |
| hbase.mapreduce.scan.cachedrows | The number of rows for caching that is passed to scanners. (Optional) |
| hbase.mapreduce.scan.timestamp | A time stamp used to filter columns with a specific time stamp. (Optional) |
| hbase.mapreduce.scan.timerange.start | The starting time stamp used to filter columns within a given time range. (Optional) |
| hbase.mapreduce.scan.timerange.end | The ending time stamp used to filter columns within a given time range. (Optional) |
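As a concrete illustration, the snippet below sets the same name/value pairs programmatically on a Hadoop Configuration. The table name, column list, and numeric values are placeholders; in the Pentaho MapReduce job entry these pairs are entered in the User Defined tab instead.

```java
import org.apache.hadoop.conf.Configuration;

public class ScanProperties {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Required: the HBase table to read from (placeholder name).
    conf.set("hbase.mapred.inputtable", "weblogs");
    // Required: space-delimited ColFam:ColName list; a bare family name
    // (here "raw") reads every column in that family.
    conf.set("hbase.mapred.tablecolumns", "info:url info:status raw");
    // Optional: number of rows passed to scanners for caching.
    conf.set("hbase.mapreduce.scan.cachedrows", "500");
    // Optional: restrict the scan to a time range (epoch milliseconds).
    conf.setLong("hbase.mapreduce.scan.timerange.start", 1262304000000L);
    conf.setLong("hbase.mapreduce.scan.timerange.end", 1293840000000L);
  }
}
```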