Input tab

Pentaho Data Integration

Version
9.3.x
Part Number
MK-95PDIA003-15



Use the options in this tab to define your input source for the Snowflake COPY INTO command:

Option Description
Source Choose from one of the following input source types:
S3
The input source is an S3 bucket.
Snowflake Staging Area
The input source is files on a Snowflake staging area.

Click Select to specify the file, folder, prefix, or variable of the S3 bucket or staging location to use as the input for the Snowflake COPY INTO command. See "Syntax" in the Snowflake documentation for more details on specifying this option.
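
As a rough illustration of how the selected source might surface in the generated statement, the sketch below builds the FROM portion of a COPY INTO command. The function name from_clause and the bucket and stage names are hypothetical, and the statement PDI actually generates may differ. The only Snowflake conventions relied on are that an external S3 location is written as a quoted s3:// URL and that a stage is referenced with a leading @.

```python
def from_clause(source_type: str, location: str) -> str:
    """Return a FROM clause for a Snowflake COPY INTO statement.

    source_type: "S3" or "Snowflake Staging Area" (labels of the Source option).
    location:    bucket path or stage path, without a scheme or '@' prefix.
    """
    if source_type == "S3":
        # External S3 locations are written as a quoted URL.
        return f"FROM 's3://{location}'"
    if source_type == "Snowflake Staging Area":
        # Stages are referenced with a leading '@'.
        return f"FROM @{location}"
    raise ValueError(f"Unsupported source type: {source_type}")


# Hypothetical bucket and stage names, for illustration only.
print(from_clause("S3", "my-bucket/sales/2023/"))
print(from_clause("Snowflake Staging Area", "my_stage/sales/"))
```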

What file type is your source? Select the file type of the input source. You can select one of the following types:
Delimited text
The input source is character-delimited UTF-8 text.
Avro
The input source is a file in the Avro data serialization format.
JSON
The input source is a JavaScript Object Notation (JSON) data file containing a set of either objects or arrays.
ORC
The input source is an Optimized Row Columnar (ORC) file containing Hive data. See the Administer Pentaho Data Integration and Analytics document for further configuration information when using Hive with Spark on AEL.
Parquet
The input source is a Parquet file of nested data structures in a flat columnar format.
XML
The input source is a file in XML format.
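
For orientation, the file types above roughly line up with the TYPE keyword of a Snowflake file format. The sketch below is an assumed mapping for illustration only. The dictionary and function names are hypothetical, and PDI may build its statement differently.

```python
# Assumed mapping from the tab's file type labels to Snowflake's
# FILE_FORMAT TYPE keywords. Delimited text corresponds to CSV.
FILE_TYPE_KEYWORDS = {
    "Delimited text": "CSV",
    "Avro": "AVRO",
    "JSON": "JSON",
    "ORC": "ORC",
    "Parquet": "PARQUET",
    "XML": "XML",
}


def type_option(label: str) -> str:
    """Return the TYPE option for a file type label shown in the tab."""
    if label not in FILE_TYPE_KEYWORDS:
        raise ValueError(f"Unknown file type: {label}")
    return f"TYPE = {FILE_TYPE_KEYWORDS[label]}"


print(type_option("Delimited text"))
```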
Compression Select the type of compression applied to your input source:
  • None
  • Auto
  • BZIP2
  • GZIP
  • Deflate
  • Raw deflate
  • Brotli
  • Zstd
For Parquet files, the Compression options are:
  • None
  • Auto
  • Snappy
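
The compression labels above do not all match the keywords Snowflake accepts for its COMPRESSION file format option (for example, BZIP2 is written BZ2). The following sketch shows one assumed translation. The table and function names are hypothetical, and the Parquet table is limited to the three choices this tab offers.

```python
# Assumed translation of tab labels to Snowflake COMPRESSION keywords.
GENERAL_COMPRESSION = {
    "None": "NONE",
    "Auto": "AUTO",
    "BZIP2": "BZ2",
    "GZIP": "GZIP",
    "Deflate": "DEFLATE",
    "Raw deflate": "RAW_DEFLATE",
    "Brotli": "BROTLI",
    "Zstd": "ZSTD",
}

# Parquet offers a shorter list in this tab.
PARQUET_COMPRESSION = {
    "None": "NONE",
    "Auto": "AUTO",
    "Snappy": "SNAPPY",
}


def compression_option(file_type: str, label: str) -> str:
    """Return the COMPRESSION option for a file type and tab label."""
    table = PARQUET_COMPRESSION if file_type == "Parquet" else GENERAL_COMPRESSION
    if label not in table:
        raise ValueError(f"{label!r} is not offered for {file_type} files")
    return f"COMPRESSION = {table[label]}"


print(compression_option("Delimited text", "BZIP2"))
print(compression_option("Parquet", "Snappy"))
```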

Depending on what file type you selected for the What file type is your source option, the following file settings appear at the bottom of this tab:

File Type File Settings
Delimited text Specify the following settings for a delimited text file:
Leading rows to skip
Specify the number of rows to use as an offset from the beginning of the file. This option is useful to skip header lines.
Delimiter
Specify the character used to separate data fields. The default value is a semicolon (;).
Quote character
Specify the character used to enclose a data field. The default value is a double quotation mark (").
Remove quotes
Select one of the following values to indicate whether quotation characters should be removed from a data field during the bulk load:
  • Yes: Remove the quotation characters.
  • No: Retain the quotation characters.
Empty as null
Select one of the following values to indicate whether empty data values should be set to null during the bulk load:
  • Yes: Set empty data values to null.
  • No: Leave data values as empty.
Trim whitespace
Select one of the following values to indicate whether to remove leading and trailing whitespace from the data during the bulk load:
  • Yes: Remove the whitespace.
  • No: Retain the whitespace.
Note: For delimited text files, you must have a table in your database with all the columns you need defined.
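
To make the delimited text settings more concrete, the sketch below assembles a Snowflake FILE_FORMAT clause from them. The option keywords (SKIP_HEADER, FIELD_DELIMITER, FIELD_OPTIONALLY_ENCLOSED_BY, EMPTY_FIELD_AS_NULL, TRIM_SPACE) are Snowflake CSV format options, but their pairing with the settings in this tab is an assumption, as is treating Remove quotes as a switch between the quote character and NONE. The function name is hypothetical. Only the delimiter and quote defaults come from this page; the other defaults are arbitrary.

```python
def csv_file_format(
    skip_rows: int = 0,
    delimiter: str = ";",           # default stated for Delimiter
    quote: str = '"',               # default stated for Quote character
    remove_quotes: bool = True,     # arbitrary default for this sketch
    empty_as_null: bool = True,     # arbitrary default for this sketch
    trim_whitespace: bool = False,  # arbitrary default for this sketch
) -> str:
    """Build a FILE_FORMAT clause for a delimited text source."""

    def sql_string(value: str) -> str:
        # Double any single quote so the value is a valid SQL literal.
        return "'" + value.replace("'", "''") + "'"

    def keyword(flag: bool) -> str:
        return "TRUE" if flag else "FALSE"

    # Assumption: quotes are stripped only when an enclosing character is declared.
    enclosed_by = sql_string(quote) if remove_quotes else "NONE"
    options = [
        "TYPE = CSV",
        f"SKIP_HEADER = {skip_rows}",
        f"FIELD_DELIMITER = {sql_string(delimiter)}",
        f"FIELD_OPTIONALLY_ENCLOSED_BY = {enclosed_by}",
        f"EMPTY_FIELD_AS_NULL = {keyword(empty_as_null)}",
        f"TRIM_SPACE = {keyword(trim_whitespace)}",
    ]
    return "FILE_FORMAT = (" + " ".join(options) + ")"


# Skip one header line and keep the stated defaults.
print(csv_file_format(skip_rows=1))
```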
Avro No additional settings.
JSON
Ignore UTF8 errors
Select one of the following values to indicate whether to ignore UTF-8 errors in the data during the bulk load:
  • Yes: Ignore UTF8 errors.
  • No: Do not ignore UTF8 errors.
Allow duplicate elements
Select one of the following values to indicate whether to allow duplicate elements in the data during the bulk load:
  • Yes: Allow duplicate elements.
  • No: Do not allow duplicate elements.
Note: Snowflake uses only the last duplicate value and discards the others.
Strip null values
NULL values are stored as null in JSON files. Select one of the following values to indicate whether to delete NULL values from the data during the bulk load:
  • Yes: Strip the NULL values.
  • No: Store the NULL values in a variant column.
Parse octal numbers
Select one of the following values to indicate whether to parse octal numbers during the bulk load:
  • Yes: Parse octal numbers.
  • No: Do not parse octal numbers.
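
The JSON settings above can be pictured as a Snowflake FILE_FORMAT clause. In the sketch below, IGNORE_UTF8_ERRORS, ALLOW_DUPLICATE, STRIP_NULL_VALUES, and ENABLE_OCTAL are Snowflake JSON format options. Their one-to-one pairing with the tab settings is an assumption, and the function name is hypothetical. The last lines show the "last duplicate wins" behavior using Python's own json module, which happens to resolve repeated keys the same way.

```python
import json


def json_file_format(
    ignore_utf8_errors: bool,
    allow_duplicates: bool,
    strip_null_values: bool,
    parse_octal: bool,
) -> str:
    """Build a FILE_FORMAT clause for a JSON source."""

    def keyword(flag: bool) -> str:
        return "TRUE" if flag else "FALSE"

    options = [
        "TYPE = JSON",
        f"IGNORE_UTF8_ERRORS = {keyword(ignore_utf8_errors)}",
        f"ALLOW_DUPLICATE = {keyword(allow_duplicates)}",
        f"STRIP_NULL_VALUES = {keyword(strip_null_values)}",
        f"ENABLE_OCTAL = {keyword(parse_octal)}",
    ]
    return "FILE_FORMAT = (" + " ".join(options) + ")"


print(json_file_format(False, True, True, False))

# Local analogy only: Python's parser also keeps the last value of a repeated key.
print(json.loads('{"id": 1, "id": 2}'))
```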
ORC

Additional file settings for ORC files.

Parquet

Additional file settings for Parquet files.

XML
Ignore UTF8 errors
Select one of the following values to indicate whether to replace UTF-8 encoding errors during the bulk load:
  • Yes: Replace invalid UTF-8 sequences with Unicode character U+FFFD.
  • No: Invalid UTF-8 sequences produce an encoding error (default).
Preserve space
Select one of the following values to indicate whether to preserve leading and trailing spaces in element content during the bulk load:
  • Yes: Preserve spaces.
  • No: Do not preserve spaces (default).
Strip outer element
Select one of the following values to indicate whether to remove the outer XML element and expose the second-level elements as separate documents during the bulk load:
  • Yes: Remove the outer XML element.
  • No: Do not remove the outer XML element (default).
Enable Snowflake data
Select one of the following values to indicate whether to enable recognition of Snowflake semi-structured data tags from the data during the bulk load:
  • Yes: Enable recognition of Snowflake tags (default).
  • No: Disable recognition of Snowflake tags.
Auto convert
Select one of the following values to indicate whether to convert numeric and Boolean values from text to native representation during the bulk load:
  • Yes: Convert numeric and Boolean values (default).
  • No: Do not convert numeric and Boolean values.
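
Two of the XML settings are phrased positively in this tab (Enable Snowflake data, Auto convert), whereas the corresponding Snowflake XML format options are phrased negatively (DISABLE_SNOWFLAKE_DATA, DISABLE_AUTO_CONVERT). The sketch below shows that inversion under the assumption that the settings map directly onto those options. The function name is hypothetical, and the defaults follow the ones marked "(default)" above.

```python
def xml_file_format(
    ignore_utf8_errors: bool = False,
    preserve_space: bool = False,
    strip_outer_element: bool = False,
    enable_snowflake_data: bool = True,
    auto_convert: bool = True,
) -> str:
    """Build a FILE_FORMAT clause for an XML source."""

    def keyword(flag: bool) -> str:
        return "TRUE" if flag else "FALSE"

    options = [
        "TYPE = XML",
        f"IGNORE_UTF8_ERRORS = {keyword(ignore_utf8_errors)}",
        f"PRESERVE_SPACE = {keyword(preserve_space)}",
        f"STRIP_OUTER_ELEMENT = {keyword(strip_outer_element)}",
        # The two options below are negated relative to the tab settings.
        f"DISABLE_SNOWFLAKE_DATA = {keyword(not enable_snowflake_data)}",
        f"DISABLE_AUTO_CONVERT = {keyword(not auto_convert)}",
    ]
    return "FILE_FORMAT = (" + " ".join(options) + ")"


# All defaults, then a variant that strips the outer element.
print(xml_file_format())
print(xml_file_format(strip_outer_element=True, auto_convert=False))
```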
Note: If you have unstructured data, you must have a variant column in your database table to store the data for the following file types:
  • JSON
  • ORC
  • Parquet
  • XML
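
As a minimal sketch of the variant column requirement in the note above, the helper below generates a CREATE TABLE statement with a single VARIANT column. The helper name, the table name raw_events, and the column name payload are hypothetical. A real target table would usually carry additional columns.

```python
import re


def variant_table_ddl(table: str, column: str = "payload") -> str:
    """Return DDL for a table with one VARIANT column to receive loaded data."""
    for name in (table, column):
        # Accept only simple unquoted identifiers to keep the sketch safe.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_$]*", name):
            raise ValueError(f"Invalid identifier: {name!r}")
    return f"CREATE TABLE {table} ({column} VARIANT)"


print(variant_table_ddl("raw_events"))
```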