Steps extend and expand the functionality of Pentaho Data Integration (PDI) transformations. You can use the following steps in PDI.
Steps: A - F
Name | Category | Description |
---|---|---|
Abort | Flow | Abort a transformation. |
Add a Checksum | Transform | Add a checksum column for each input row. |
Add constants | Transform | Add one or more constants to the input rows. |
Add sequence | Transform | Get the next value from a sequence. |
Add value fields changing sequence | Transform | Add a sequence depending on field value changes. Each time the value of at least one field changes, PDI resets the sequence. |
Add XML | Transform | Encode several fields into an XML fragment. |
AMQP Consumer | Streaming | Pull streaming data from an AMQP broker or clients through an AMQP transformation. |
AMQP Producer | Streaming | Publish messages in near-real-time to an AMQP broker. |
Analytic query | Statistics | Execute analytic queries over a sorted dataset (LEAD/LAG/FIRST/LAST). |
Annotate stream | Flow | Refine your data for the Streamlined Data Refinery by creating measures, link dimensions, or attributes on stream fields. |
Append streams | Flow | Append two streams in an ordered way. |
ARFF output | Data Mining | Write data in ARFF format to a file. |
Automatic Documentation Output | Output | Generate documentation automatically based on input in the form of a list of transformations and jobs. |
Avro Input | Big Data | Decode binary or JSON Avro data and extract fields from the structure it defines, either from flat files or incoming fields. |
Avro input (deprecated) | Deprecated | Replaced by Avro Input. |
Avro Output | Big Data | Serialize data from the PDI data stream into Avro binary or JSON format, then write it to a file. |
Block this step until steps finish | Flow | Block this step until selected steps finish. |
Blocking step | Flow | Block flow until all incoming rows have been processed. Subsequent steps only receive the last input row to this step. |
Calculator | Transform | Create new fields by performing simple calculations. |
Call DB Procedure | Lookup | Get back information by calling a database procedure. |
Call Endpoint | Pentaho Server | Call API endpoints from the Pentaho Server within a PDI transformation. |
Cassandra Input | Big Data | Read from a Cassandra column family. |
Cassandra Output | Big Data | Write to a Cassandra column family. |
Catalog Input | Catalog | Read CSV text file formats of a Pentaho Data Catalog resource that is stored in a Hadoop Distributed File System (HDFS) or Amazon S3 ecosystem, and then output the data as table rows that can be used by a transformation. |
Catalog Output | Catalog | Encode CSV text file formats by using the schema that is defined in PDI to create or replace a data resource in Pentaho Data Catalog and add metadata to the data resource. |
Change file encoding | Utility | Change file encoding and create a new file. |
Check if a column exists | Lookup | Check if a column exists in a table on a specified connection. |
Check if file is locked | Lookup | Check if a file is locked by another process. |
Check if webservice is available | Lookup | Check if a webservice is available. |
Clone row | Utility | Clone a row as many times as needed. |
Closure Generator | Transform | Generate a closure table using parent-child relationships. |
Combination lookup/update | Data Warehouse | Update a junk dimension in a data warehouse. Alternatively, look up information in this dimension. The primary key of a junk dimension consists of all its fields. |
Concat Fields | Transform | Concatenate multiple fields into one target field. The fields can be separated by a separator and the enclosure logic is completely compatible with the Text File Output step. |
Copybook Input | Discovery | Read binary data files that are mapped by a fixed-length COBOL copybook definition file. |
Copy rows to result | Job | Write rows to the executing job. The information will then be passed to the next entry in this job. |
CouchDB Input | Big Data | Retrieve all documents from a given view in a given design document from a given database. |
Credit card validator | Validation | Determine whether a credit card number is valid using the LUHN10 (MOD-10) algorithm, and which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), and so on). See the Luhn sketch after this table. |
CSV File Input | Input | Read from a simple CSV file input. |
Data Grid | Input | Enter rows of static data in a grid, usually for testing, reference, or demo purposes. |
Data Validator | Validation | Validate data passing through the step against a set of rules. |
Database join | Lookup | Execute a database query using stream values as parameters. |
Database lookup | Lookup | Look up values in a database using field values. |
De-serialize from file | Input | Read rows of data from a data cube. |
Delay row | Utility | Output each input row after a delay. |
Delete | Output | Permanently removes a row from a database. |
Detect empty stream | Flow | Output one empty row if the input stream is empty (i.e., when the input stream does not contain any rows). |
Discover metadata from a text file | Input | Determine the structure of delimited text files. |
Dimension lookup/update | Data Warehouse | Update a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension. |
Dummy (do nothing) | Flow | Do nothing. This step is useful, however, when testing things or in situations where you want to split streams. |
Dynamic SQL row | Lookup | Execute a dynamic SQL statement built in a previous field. |
Edi to XML | Utility | Convert an Edifact message to XML to simplify data extraction. |
Elasticsearch REST Bulk Insert | Bulk loading | Perform bulk inserts into Elasticsearch. |
Email messages input | Input | Read POP3/IMAP server and retrieve messages. |
ESRI Shapefile Reader | Input | Read shape file data from an ESRI shape file and linked DBF file. |
ETL metadata injection | Flow | Inject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions. |
Execute a process | Utility | Execute a process and return the result. |
Execute Row SQL Script | Scripting | Execute an SQL statement or file for every input row. |
Execute SQL Script | Scripting | Execute an SQL script, optionally parameterized using input rows. |
Extract to rows | Input | Parse hierarchical data type fields coming from a previous step. |
File exists (Step) | Lookup | Check if a file exists. |
Filter Rows | Flow | Filter rows using simple equations. |
Fixed file input | Input | Read from a fixed file input. |
Formula | Scripting | Calculate a formula using Pentaho's libformula. |
Fuzzy match | Lookup | Find approximate matches to a string using matching algorithms. Read a field from the main stream and output the approximate value from the lookup stream. |
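The Credit card validator step above relies on the LUHN10 (MOD-10) check. As a minimal sketch of that algorithm (plain Java with digits-only input assumed, not PDI code), the idea is to double every second digit from the right, subtract 9 from any product above 9, and require the total to be divisible by 10:

```java
// Minimal Luhn (MOD-10) sketch; assumes the input string contains digits only.
public class LuhnCheck {

    static boolean isValidLuhn(String number) {
        int sum = 0;
        boolean doubleIt = false;
        // Walk the digits right to left, doubling every second one.
        for (int i = number.length() - 1; i >= 0; i--) {
            int d = number.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // equivalent to summing the digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidLuhn("79927398713")); // true: a standard Luhn test number
        System.out.println(isValidLuhn("79927398710")); // false
    }
}
```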
Steps: G - L
Name | Category | Description |
---|---|---|
Generate random credit card numbers | Input | Generate random valid (Luhn check) credit card numbers. |
Generate random value | Input | Generate random value. |
Generate Rows | Input | Generate a number of empty or equal rows. |
Get data from XML | Input | Get data from XML file by using XPath. This step also allows you to parse XML defined in a previous field. |
Get File Names | Input | Get file names from the operating system and send them to the next step. |
Get files from result | Job | Read filenames used or generated in a previous entry in a job. |
Get Files Rows Count | Input | Get the number of rows in files. |
Get ID from slave server | Transform | Retrieve unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file. |
Get records from stream | Streaming | Return records that were previously generated by another transformation in a job. |
Get repository names | Input | List detailed information about transformations and/or jobs in a repository. |
Get rows from result | Job | Read rows from a previous entry in a job. |
Get Session Variables | Pentaho Server | Retrieve the value of a session variable. |
Get SubFolder names | Input | Read a parent folder and return all subfolders. |
Get System Info | Input | Get information from the system like system date, arguments, etc. |
Get table names | Input | Get table names from database connection and send them to the next step. |
Get Variables | Job | Determine the values of certain (environment or Kettle) variables and put them in field values. |
Google Analytics (deprecated) | Deprecated | Fetch data from a Google Analytics account. Replacement step is Google Analytics v4. Note: This step will only work for Universal Analytics 360 customers until July 1, 2024. |
Google Analytics v4 | Input | Fetch data from a Google Analytics account. |
Greenplum Load | Bulk loading | Bulk load Greenplum data. |
Group By | Statistics | Build aggregates in a group-by fashion. This works only on sorted input. If the input is not sorted, only rows with identical consecutive group keys are handled correctly (see the sketch after this table). |
GZIP CSV Input | Input | Read in parallel from a GZIP CSV file. |
Hadoop File Input | Big Data | Read data from a variety of different text-file types stored on a Hadoop cluster. |
Hadoop File Output | Big Data | Write data to a variety of different text-file types stored on a Hadoop cluster. |
HBase Input | Big Data | Read from an HBase column family. |
HBase Output | Big Data | Write to an HBase column family. |
HBase row decoder | Big Data | Decode an incoming key and HBase result object to a mapping. |
Hierarchical JSON input | Input | Load JSON data into PDI from a previous step or from a file. |
Hierarchical JSON Output | Output | Convert hierarchical data from a previous step into JSON format. |
HL7 Input | Input | Read data from HL7 data streams. |
HTTP client | Lookup | Call a web service over HTTP by supplying a base URL, optionally with parameters set dynamically. |
HTTP Post | Lookup | Call a web service request over HTTP by supplying a base URL, optionally with parameters set dynamically. |
Identify last row in a stream | Flow | Mark the last row. |
If field value is null | Utility | Set a field value to a constant if it is null. |
Infobright Loader | Bulk loading | Load data to an Infobright database table. |
Ingres VectorWise Bulk Loader | Bulk loading | Interface with the Ingres VectorWise Bulk Loader "COPY TABLE" command. |
Injector | Inline | Inject rows into the transformation through the Java API. |
Insert / Update | Output | Update or insert rows in a database based upon keys. |
Java Filter | Flow | Filter rows using Java code. |
JMS Consumer | Streaming | Receive messages from a JMS server. |
JMS Producer | Streaming | Send messages to a JMS server. |
Job Executor | Flow | Run a PDI job, passing parameters and rows. |
Join Rows (cartesian product) | Joins | Output the cartesian product of the input streams. The number of output rows is the product of the number of rows in the input streams. |
JSON Input | Input | Extract relevant portions out of JSON structures (file or incoming field) and output rows. |
JSON output | Output | Create JSON block and output it in a field to a file. |
Kafka consumer | Streaming | Run a sub-transformation that executes according to message batch size or duration, letting you process a continuous stream of records in near-real-time. |
Kafka Producer | Streaming | Publish messages in near-real-time across worker nodes where multiple, subscribed members have access. |
Kinesis Consumer | Streaming | Extract data from a specific stream located within the Amazon Kinesis Data Streams service. |
Kinesis Producer | Streaming | Push data to an existing region and stream located within the Amazon Kinesis Data Streams service. |
Knowledge Flow | Data Mining | Execute a Knowledge Flow data mining process. |
LDAP Input | Input | Read data from LDAP host. |
LDAP Output | Output | Perform Insert, upsert, update, add or delete operations on records based on their DN (Distinguished Name). |
LDIF Input | Input | Read data from LDIF files. |
Load file content in memory | Input | Load file content in memory. |
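Several steps in these tables (Group By above, and Unique Rows later) behave as expected only on sorted input because they detect group boundaries when the key changes between consecutive rows. A minimal sketch of that consecutive-key behavior (plain Java with illustrative field names, not PDI code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of consecutive-key grouping: an aggregate is emitted each time the
// key changes, so identical keys that are not adjacent form separate groups.
public class ConsecutiveGroupBy {

    record Row(String key, double value) {}

    static List<Row> sumByConsecutiveKey(List<Row> rows) {
        List<Row> out = new ArrayList<>();
        String currentKey = null;
        double sum = 0;
        for (Row r : rows) {
            if (currentKey != null && !currentKey.equals(r.key())) {
                out.add(new Row(currentKey, sum)); // key changed: flush the group
                sum = 0;
            }
            currentKey = r.key();
            sum += r.value();
        }
        if (currentKey != null) {
            out.add(new Row(currentKey, sum));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sorted input: one aggregate per key.
        System.out.println(sumByConsecutiveKey(List.of(
                new Row("a", 1), new Row("a", 2), new Row("b", 5))));
        // Unsorted input: "a" is aggregated twice because its rows were not adjacent.
        System.out.println(sumByConsecutiveKey(List.of(
                new Row("a", 1), new Row("b", 5), new Row("a", 2))));
    }
}
```

This is why sorting the stream first (for example with the Sort rows step) matters for Group By, while Memory Group By avoids the requirement by holding groups in memory.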
Steps: M - R
Name | Category | Description |
---|---|---|
Mail | Utility | Send e-mail. |
Mail Validator | Validation | Check if an email address is valid. |
Mapping | Mapping | Run a mapping (sub-transformation), use MappingInput and MappingOutput to specify the fields interface. |
Mapping Input Specification | Mapping | Specify the input interface of a mapping. |
Mapping Output Specification | Mapping | Specify the output interface of a mapping. |
MapReduce Input | Big Data | Enter key-value pairs from Hadoop MapReduce. |
MapReduce Output | Big Data | Exit key-value pairs, then push them into Hadoop MapReduce. |
MaxMind GeoIP Lookup | Lookup | Look up an IPv4 address in a MaxMind database and add fields such as geography, ISP, or organization. |
Memory Group By | Statistics | Build aggregates in a group by fashion. This step doesn't require sorted input. |
Merge Join | Joins | Join two streams on a given key and outputs a joined set. The input streams must be sorted on the join key. |
Merge Rows (diff) | Joins | Merge two streams of rows, sorted on a certain key. The two streams are compared and the equals, changed, deleted and new rows are flagged. |
Metadata structure of stream | Utility | Read the metadata of the incoming stream. |
Microsoft Access input | Input | Read data from a Microsoft Access file. |
Microsoft Access Output | Output | Store records into an MS-Access database table. |
Microsoft Excel Input | Input | Read data from Excel and OpenOffice Workbooks (XLS, XLSX, ODS). |
Microsoft Excel Output | Output | Store records into an Excel (XLS) document with formatting information. |
Microsoft Excel Writer | Output | Write or append data to an Excel file. |
Modified Java Script Value | Scripting | Run JavaScript programs (and much more). |
Modify values from a single row | Input | Build complex hierarchical data. |
Modify values from grouped rows | Input | Modify hierarchical data to form nested JSON key-value pairs. |
Mondrian Input | Input | Execute and retrieve data using an MDX query against a Pentaho Analysis OLAP server (Mondrian). |
MonetDB Agile Mart | Agile | |
MonetDB Bulk Loader | Bulk loading | Load data into MonetDB by using their bulk load command in streaming mode. |
MongoDB Execute | Big Data | Connect to a MongoDB cluster and execute Mongo shell-style commands. |
MongoDB Input | Big Data | Read all entries from a MongoDB collection in the specified database. |
MongoDB Output | Big Data | Write to a MongoDB collection. |
MQTT Consumer | Streaming | Pull streaming data from an MQTT broker or clients through an MQTT transformation. |
MQTT Producer | Streaming | Publish messages in near-real-time to an MQTT broker. |
Multiway Merge Join | Joins | Join multiple streams. This step supports INNER and FULL OUTER joins. |
MySQL Bulk Loader | Bulk loading | Load data over a named pipe (not available on MS Windows). |
Null if... | Utility | Set a field value to null if it is equal to a constant value. |
Number range | Transform | Create ranges based on a numeric field. |
OLAP Input | Input | Execute and retrieve data using an MDX query against any XML/A OLAP datasource using olap4j. |
OpenERP object delete (deprecated) | Deprecated | Delete data from the OpenERP server using the XMLRPC interface with the 'unlink' function. |
OpenERP object input (deprecated) | Deprecated | Retrieve data from the OpenERP server using the XMLRPC interface with the 'read' function. |
OpenERP object output (deprecated) | Deprecated | Update data on the OpenERP server using the XMLRPC interface and the 'import' function. |
Oracle Bulk Loader | Bulk loading | Use Oracle Bulk Loader to load data. |
ORC Input | Big Data | Read field data from ORC files into a PDI data stream. |
ORC Output | Big Data | Serialize data from the PDI data stream into the ORC file format, then write it to a file. |
Output steps metrics | Statistics | Return metrics for one or several steps. |
Parquet Input | Big Data | Decode Parquet data formats and extract fields from the structure it defines. |
Parquet Output | Big Data | Map fields within data files and choose where you want to process those files. |
Pentaho Reporting Output | Output | Execute an existing report file (.prpt). |
PostgreSQL Bulk Loader | Bulk loading | Bulk load PostgreSQL data. |
Prioritize streams | Flow | Prioritize streams in an ordered way. |
Process files | Utility | Process one file per row (copy, move, or delete). This step only accepts a filename as input. |
Properties Output | Output | Write data to a properties file. |
Property Input | Input | Read data (key, value) from properties files. |
Python Executor | Scripting | Map upstream data from a PDI input step or execute a Python script to generate data. When you send all rows, Python stores the dataset in a variable that kicks off your Python script. |
Query metadata from a database | Metadata Discovery | Retrieve metadata from a database connection. |
Query HCP | Input | Use the Metadata Query Engine (MQE) to query your Hitachi Content Platform (HCP) repository for objects, their URLs, and system metadata properties. |
R script executor | Statistics | Execute an R script within a PDI transformation. |
Read metadata from HCP | Input | Identify an HCP object by its URL path, then specify a target annotation name to read. |
Read metadata from Copybook | Metadata Discovery | Read a binary fixed-length copybook definition file and output the file and column descriptor information as fields to PDI rows. |
Read metadata | Catalog | Search and retrieve metadata in the Pentaho Data Catalog that is associated with specific data resources that are registered in Data Catalog. |
Regex Evaluation | Scripting | Evaluate regular expressions. This step uses a regular expression to evaluate a field. It can also extract new fields out of an existing field with capturing groups. |
Replace in String | Transform | Replace all occurrences of a word in a string with another word. |
Reservoir Sampling | Statistics | Sample a fixed number of rows from the incoming stream (see the reservoir-sampling sketch after this table). |
REST client step | Lookup | Consume RESTful services. Representational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which web services are viewed as resources and can be identified by their URLs. |
Row Denormaliser | Transform | Denormalise rows by looking up key-value pairs and by assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields. |
Row Flattener | Transform | Flatten consecutive rows based on the order in which they appear in the input stream. |
Row Normaliser | Transform | Normalise de-normalised information. |
RSS Input | Input | Read RSS feeds. |
RSS Output | Output | Write RSS streams. |
Rule Executor | Scripting | Execute a rule against each row (using Drools). |
Rule Accumulator | Scripting | Execute a rule against a set of rows (using Drools). |
Run SSH commands | Utility | Run SSH commands and return the result. |
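The Reservoir Sampling step keeps a fixed-size, uniformly random sample of a stream whose total length is not known in advance. A minimal sketch of the underlying idea (Algorithm R, plain Java, not PDI code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling sketch: after the reservoir of size k is full, the n-th
// incoming row replaces a random reservoir slot with probability k/n.
public class ReservoirSample {

    static <T> List<T> sample(Iterable<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (T row : stream) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(row);                        // fill the reservoir first
            } else {
                long j = (long) (rnd.nextDouble() * seen); // uniform index in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, row);           // keep with probability k/seen
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) rows.add(i);
        System.out.println(sample(rows, 5, new Random(42))); // 5 rows chosen uniformly
    }
}
```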
Steps: S - Z
Name | Category | Description |
---|---|---|
S3 CSV Input | Input | Read from an S3 CSV file. |
S3 File Output | Output | Export data to a text file on an Amazon Simple Storage Service (S3). |
Salesforce bulk operation | Bulk loading | Perform bulk operations on Salesforce objects. |
Salesforce Delete | Output | Delete records in a Salesforce module. |
Salesforce Input | Input | Read information from Salesforce. |
Salesforce Insert | Output | Insert records in a Salesforce module. |
Salesforce Update | Output | Update records in a Salesforce module. |
Salesforce Upsert | Output | Insert or update records in a Salesforce module. |
Sample rows | Statistics | Filter rows based on the line number. |
SAP input (deprecated) | Deprecated | Read data from SAP ERP, optionally with parameters. |
SAS Input | Input | Read files in sas7bdat (SAS) native format. |
Select Values | Transform | Select or remove fields in a row. Optionally, set the field meta-data: type, length and precision. |
Send message to Syslog | Utility | Send message to Syslog server. |
Serialize to file | Output | Write rows of data to a data cube. |
Set Field Value | Transform | Replace the value of a field with the value of another field. |
Set Field Value to a Constant | Transform | Replace the value of a field with a constant. |
Set files in result | Job | Set filenames in the result of this transformation. Subsequent job entries can then use this information. |
Set Session Variables | Pentaho Server | Set the value of a session variable. |
Set Variables | Job | Set environment variables based on a single input row. |
SFTP Put | Experimental | Upload a file or a stream file to a remote host via SFTP. |
Shared dimension | Flow | Refine your data for the Streamlined Data Refinery through the creation of dimensions which can be shared. |
Simple Mapping (sub-transformation) | Mapping | Turn a repetitive, re-usable part of a transformation (a sequence of steps) into a mapping (sub-transformation). |
Single Threader | Flow | Execute a sequence of steps in a single thread. |
Socket reader | Inline | Read a socket. A socket client that connects to a server (Socket Writer step). |
Socket writer | Inline | Write a socket. A socket server that can send rows of data to a socket reader. |
Sort rows | Transform | Sort rows based upon field values (ascending or descending). |
Sorted Merge | Joins | Merge rows coming from multiple input steps, provided these rows are themselves sorted on the given key fields. |
Split field to rows | Transform | Split a single string field by delimiter and create a new row for each split term. |
Split Fields | Transform | Split a single field into more than one. |
Splunk Input | Transform | Read data from Splunk. |
Splunk Output | Transform | Write data to Splunk. |
SQL File Output | Output | Output SQL INSERT statements to a file. |
Stream lookup | Lookup | Look up values coming from another stream in the transformation. |
SSTable Output | Big Data | Write to a filesystem directory as a Cassandra SSTable. |
String Operations | Transform | Apply certain operations like trimming, padding, and others to a string value. |
Strings cut | Transform | Cut out a snippet of a string. |
Switch / Case | Flow | Switch a row to a certain target step based on the case value in a field. |
Synchronize after merge | Output | Perform insert/update/delete in one go based on the value of a field. |
Table Agile Mart | Agile | |
Table Compare | Utility | Compare the data from two tables (provided they have the same layout) and log the differences found between them. |
Table exists | Lookup | Check if a table exists on a specified connection. |
Table Input | Input | Read information from a database table. |
Table Output | Output | Write information to a database table. |
Teradata Fastload Bulk Loader | Bulk loading | Bulk load Teradata Fastload data. |
Teradata TPT Insert Upsert Bulk Loader | Bulk loading | Bulk load via TPT using the tbuild command. |
Text File Input | Input | Read data from a text file in several formats. This data can then be passed to your next step(s). |
Text File Output | Output | Write rows to a text file. |
Transformation Executor | Flow | Run a PDI transformation, setting parameters and passing rows. |
Unique Rows | Transform | Remove duplicate rows and leave only unique occurrences. This works only on sorted input. If the input is not sorted, only consecutive duplicate rows are handled correctly. |
Unique Rows (HashSet) | Transform | Remove duplicate rows and leave only unique occurrences by using a HashSet. |
Univariate Statistics | Statistics | Compute some simple stats based on a single input field. |
Update | Output | Update data in a database table based upon keys. |
User Defined Java Class | Scripting | Program a step using Java code. |
User Defined Java Expression | Scripting | Calculate the result of a Java expression using Janino (see the sketch after this table). |
Value Mapper | Transform | Map values of a certain field from one value to another. |
Vertica Bulk Loader | Bulk loading | Bulk load data into a Vertica table using their high performance COPY feature. |
Web services lookup | Lookup | Look up information using web services (WSDL). |
Write metadata to HCP objects | Output | Write custom metadata fields to a Hitachi Content Platform object. |
Write metadata | Catalog | Add new metadata to metadata in the Pentaho Data Catalog that is associated with specific data resources. |
Write to log | Utility | Write data to log. |
XBase input | Input | Read records from an XBase type of database file (DBF). |
XML Input Stream (StAX) | Input | Process very large and complex XML files very fast. |
XML Join | Joins | Join a stream of XML-Tags into a target XML string. |
XML Output | Output | Write data to an XML file. |
XSD Validator | Validation | Validate XML source (files or streams) against XML Schema Definition. |
XSL Transformation | Transform | Transform XML stream using XSL (eXtensible Stylesheet Language). |
Yaml Input | Input | Read YAML sources (files or streams), parse them, convert them to rows, and write these to one or more outputs. |
Zip File | Utility | Create a standard ZIP archive from the data stream fields. |
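The User Defined Java Expression step compiles an expression with Janino and evaluates it per row. A minimal sketch of that pattern using the Janino library directly (the field names and expression below are illustrative assumptions; inside PDI the step binds input fields to expression parameters for you):

```java
import org.codehaus.janino.ExpressionEvaluator;

// Sketch of compiling and evaluating a Java expression with Janino.
public class JaninoExpressionDemo {

    public static void main(String[] args) throws Exception {
        ExpressionEvaluator ee = new ExpressionEvaluator();
        // Declare the parameters the expression may reference (illustrative names).
        ee.setParameters(new String[] {"quantity", "unitPrice"},
                         new Class[] {int.class, double.class});
        ee.setExpressionType(double.class);
        ee.cook("quantity * unitPrice");                    // compile the expression once
        Object total = ee.evaluate(new Object[] {3, 9.99}); // evaluate it for one "row"
        System.out.println(total);
    }
}
```

Compiling once and evaluating per row is what keeps this approach fast compared with interpreting the expression on every row, which is the same trade-off the step description implies.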