Processing unstructured data

Version: 10.0.x
Part Number: MK-95PDC000-00

To view the properties of unstructured data, you must first run Metadata Ingest and Data Profiling on it.

Perform the following steps to process the unstructured data:
  1. Select the unstructured resource you want to investigate in Data Canvas.
    This can be a file or a folder.
  2. Click Process.
    The Choose Process pane opens with Metadata Ingest and Data Profiling options.
  3. In the Metadata Ingest tile, click Start to begin the metadata ingest process.
    You can view the status of metadata ingest on the Manage Workers page.
  4. To perform data profiling, click the Data Profiling tile.
    The Profiling page opens with the following options to configure data profiling:
    Field                                   Description
    Ingest Properties                       Parses document metadata from files.
    Compute Checksum                        Calculates a checksum for each file.
    Files Modified More Than Day(s) Ago     Filters file processing by modification timestamp.
    Files Accessed More Than Day(s) Ago     Filters file processing by access timestamp.
    Extensions                              File extensions to profile, such as pdf, doc, or txt. Press Enter to add each value. Leave the field empty to profile all extensions.
    Additional File Processing Threads      Number of threads used for file processing per job. Keep this value low if you run many jobs.
    Persistence Threads                     Number of threads used for persistence writes per job. Keep this value low if you run many jobs.
    Supported Max File Size                 Files larger than this size are skipped. Example: 100 MB
    Note: The default settings are recommended because they are suitable for most situations.
  5. Click Start.
    You can view the status of data profiling on the Manage Workers page.
  6. Go to Data Canvas and select an unstructured document to view its properties.
The document properties are displayed in the Document Properties pane.
Note: The properties displayed will vary according to the type of unstructured data selected.
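
The effect of the profiling filters can be easier to reason about with a small local sketch. The Python below is an illustration only and is not how Pentaho Data Catalog is implemented: the function names `select_files` and `sha256_of`, the parameter names, and the choice of SHA-256 are all assumptions made for this example. It also assumes that a "more than N days ago" filter keeps only files whose timestamp is older than N days, and that an empty extension list means every extension is accepted.

```python
import hashlib
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def select_files(root, extensions=(), modified_days=None,
                 accessed_days=None, max_bytes=None, now=None):
    """Yield files under root that pass every configured filter.

    Hypothetical helper: mirrors the profiling options conceptually,
    not the product's actual behavior.
    """
    now = time.time() if now is None else now
    # Normalize "pdf" and ".pdf" to the same form; empty set means "all".
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        info = path.stat()
        if wanted and path.suffix.lower().lstrip(".") not in wanted:
            continue
        # Assumed semantics: keep only files modified more than N days ago.
        if (modified_days is not None
                and now - info.st_mtime <= modified_days * SECONDS_PER_DAY):
            continue
        # Assumed semantics: keep only files accessed more than N days ago.
        if (accessed_days is not None
                and now - info.st_atime <= accessed_days * SECONDS_PER_DAY):
            continue
        # Files larger than the maximum size are skipped.
        if max_bytes is not None and info.st_size > max_bytes:
            continue
        yield path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `select_files("/data/docs", extensions=["pdf"], modified_days=30, max_bytes=100 * 1024 * 1024)` would yield only PDF files under the hypothetical `/data/docs` folder that were last modified more than 30 days ago and are no larger than 100 MB. Each result could then be passed to `sha256_of` to compute its checksum.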