Processing unstructured data

Use Pentaho Data Catalog

Version
10.1.x
Part Number
MK-95PDC000-02

Perform the following steps to process unstructured data and delimited files:

  1. Select the unstructured resource you want to investigate in Data Canvas.
    This can be a file or a folder.
  2. Click Process.
    The Choose Process pane opens with Metadata Ingest, Data Discovery, and Data Identification options.
    Unstructured data processing options
  3. In the Metadata Ingest card, click Start to begin the metadata ingestion.
    You can view the status of the Metadata Ingest process on the Manage Workers page.
  4. To perform the data discovery, click the Data Discovery card.
    The Data Discovery page opens with the following options for configuring data discovery and content scanning:
    Note: When configuring data discovery, it is recommended to use the default settings as they are suitable for most situations.
    | Section | Field | Description |
    |---|---|---|
    | Checksum Calculation | Compute Checksum of document content | Calculates checksums for each file. |
    | Document Metadata | Extract document properties | Collects additional document properties from the file, such as the owner, page count, and number of paragraphs. Applies only to Office365 or PDF files. |
    | Content Scan for String Detection | Detect if the string exists | If a value from the applied dictionary exists in the file, applies the actions defined in the dictionary and returns true in the metadata store (MDS). |
    | | Detect the string count | If a value from the applied dictionary exists in the file, returns the aggregate count of the dictionary values within the file in the metadata store and applies the actions defined in the dictionary. |
    | String Detection | Add Dictionary | Select and add available dictionaries to use in string detection and to apply the actions specified in the dictionary. Note: The string detection process ignores the rules defined in the dictionaries. |
    | Data Profiling | Treat First Row as Header (only for delimited files) | When you set this flag, the Data Discovery step treats the first row of the data as a header and uses its values as the column names in the profiled data. If you don't set the flag, the Data Discovery step assigns default names such as column-0, column-1, column-2, and so on to the profiled data. |
    | Advanced Options | Files Modified More Than Day(s) Ago | Filters file processing by modification timestamp. |
    | | Files Accessed More Than Day(s) Ago | Filters file processing by access timestamp. |
    | | Include File Extensions | Specify the document extensions, such as .pdf, .doc, .txt, and so on. Profiling is performed only for the specified extensions. Leave empty to use all supported extensions. |
    | | Restrict Processing to Max File Size of | Files larger than this amount are skipped. For example, 100 MB. |
    | | File Processing Threads | Number of processing threads for file processing per job (keep this low if running many jobs). |
    | | Persistence Threads | Number of persistence writing threads per job (keep this low if running many jobs). |
    | | Include Patterns* | Specifies global patterns to apply during profiling. |
    | | Exclude Patterns* | Specifies global patterns to exclude during profiling. |
    Note: If files or folders match both include and exclude patterns, profiling excludes those files or folders.

    * For more information about patterns and limitations, see the Java documentation.

  5. Click Start.
    You can view the status of the Data Discovery process on the Manage Workers page.
  6. (Optional) To perform data identification on delimited files, click the Data Identification card.
    Important: You must complete the Data Discovery process before proceeding with the Data Identification process. If the Data Discovery process has not been completed, Data Catalog highlights it as Required. You can start the Data Discovery process from the Data Identification card by clicking Start.


  7. Click Select Methods, select the Dictionaries and Patterns, click Apply, and then click Start.
    You can view the status of the Data Identification process on the Manage Workers page.
  8. Go to Data Canvas and select the processed file to view its properties.
The unstructured data is processed, and the document properties are displayed in the Document Properties pane.
Note: The properties displayed vary according to the type of unstructured data selected.
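
The Advanced Options in step 4 together decide, file by file, whether a file is profiled. The Python sketch below illustrates one way those rules could interact. It is not Data Catalog code: the DiscoveryFilter class, the should_process function, and all parameter names are hypothetical. Pattern matching uses Python's fnmatchcase, whose wildcard rules may differ from the Java pattern syntax the product refers to, and the strict "more than N days" comparison is also an assumption. The one rule taken directly from this page is that a path matching both an include pattern and an exclude pattern is excluded.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List, Optional


@dataclass
class DiscoveryFilter:
    """Hypothetical stand-in for the Advanced Options fields (not a product API)."""
    include_extensions: List[str] = field(default_factory=list)  # empty = all extensions
    max_file_size_bytes: Optional[int] = None                    # None = no size limit
    modified_more_than_days: Optional[int] = None                # None = no age filter
    include_patterns: List[str] = field(default_factory=list)
    exclude_patterns: List[str] = field(default_factory=list)


def should_process(path: str, size_bytes: int, age_days: float,
                   flt: DiscoveryFilter) -> bool:
    """Return True if the file passes every configured filter."""
    # Exclude wins: a path matching both include and exclude patterns is skipped.
    if any(fnmatchcase(path, p) for p in flt.exclude_patterns):
        return False
    # If include patterns are set, the path must match at least one of them.
    if flt.include_patterns and not any(
        fnmatchcase(path, p) for p in flt.include_patterns
    ):
        return False
    # Extension check; accepts entries written as "pdf" or ".pdf".
    if flt.include_extensions:
        ext = PurePosixPath(path).suffix.lower().lstrip(".")
        allowed = {e.lower().lstrip(".") for e in flt.include_extensions}
        if ext not in allowed:
            return False
    # Files larger than the configured maximum are skipped.
    if flt.max_file_size_bytes is not None and size_bytes > flt.max_file_size_bytes:
        return False
    # Assumption: only files modified strictly more than N days ago pass.
    if (flt.modified_more_than_days is not None
            and age_days <= flt.modified_more_than_days):
        return False
    return True


if __name__ == "__main__":
    example = DiscoveryFilter(
        include_extensions=[".pdf", "txt"],
        max_file_size_bytes=100 * 1024 * 1024,  # 100 MB, as in the table's example
        include_patterns=["reports/*"],
        exclude_patterns=["reports/draft*"],
    )
    for candidate in ("reports/q1.pdf", "reports/draft-q1.pdf", "reports/q1.docx"):
        print(candidate, should_process(candidate, 1024, 10, example))
```

Checking exclude patterns first keeps the precedence rule in a single place, so a later change to the include logic cannot accidentally let an excluded file through.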