Processing data

Use Pentaho Data Catalog

Version
10.1.x
Part Number
MK-95PDC000-02
With Data Catalog processing, you can extract meaningful insights and ensure the effective use of your data. The main stages of data processing are:
  1. Metadata Ingest
  2. Data Profiling (for structured data) and Data Discovery (for unstructured data)
  3. Data Identification for structured data, including delimited files

Metadata Ingest

The Metadata Ingest step updates Data Catalog to reflect current metadata changes. It scans the data source for files that are new or modified since the last run and updates the existing metadata. It also removes metadata for deleted files, ensuring that Data Catalog accurately represents the data source.
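The incremental logic described above can be sketched as follows. This is a minimal illustration of the concept, not Data Catalog's implementation; the `FileMeta` record and `ingest` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class FileMeta:
    path: str
    modified: float  # epoch seconds of last modification

def ingest(catalog: dict[str, FileMeta], scanned: list[FileMeta]) -> dict[str, FileMeta]:
    """Return an updated catalog: add new files, refresh modified ones,
    and drop entries whose files no longer exist in the source."""
    current = {f.path for f in scanned}
    # Remove metadata for files deleted from the data source.
    updated = {p: m for p, m in catalog.items() if p in current}
    for f in scanned:
        prev = updated.get(f.path)
        if prev is None or f.modified > prev.modified:  # new or modified since last run
            updated[f.path] = f
    return updated
```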

Data Profiling and Data Discovery

The Data Profiling and Data Discovery steps analyze structured and unstructured data, respectively.

Data Profiling
The process in which Data Catalog examines structured data within JDBC data sources and gathers statistics about the data. It profiles data in the cluster and uses its algorithms to compute detailed properties, including field-level data quality metrics, data statistics, and data patterns.
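As an illustration of the kinds of field-level metrics such profiling produces, the sketch below computes simple quality statistics and a character pattern for one column. It assumes the common convention of masking digits as `9` and letters as `A`; the `profile_column` function is a hypothetical example, not the product's algorithm.

```python
import re
from collections import Counter

def profile_column(values: list) -> dict:
    """Compute simple field-level quality metrics, statistics, and the
    most common character pattern (digits -> 9, letters -> A)."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_null
    )
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),       # quality metric
        "distinct_count": len(set(non_null)),            # data statistic
        "min_length": min((len(v) for v in non_null), default=0),
        "max_length": max((len(v) for v in non_null), default=0),
        "top_pattern": patterns.most_common(1)[0][0] if patterns else None,
    }
```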
Data Discovery
In this process, Data Catalog examines unstructured data by scanning file contents to compile data statistics, which involves the following steps:
  • Calculating checksums to identify duplicates.
  • Extracting document properties from Office365 and PDF files.
  • Using dictionaries to scan documents for specific strings and keywords, triggering predefined actions.
  • Profiling data within the cluster to ascertain detailed attributes, including quality metrics, statistics, and patterns for delimited files.
These processes ensure a thorough understanding and assessment of both structured and unstructured data, setting a solid foundation for subsequent analysis.
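Two of the discovery steps above, checksum-based duplicate detection and dictionary scanning, can be sketched as below. This is an illustrative example only; it assumes SHA-256 checksums and whole-word matching, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files: dict) -> list:
    """Group file paths whose contents share the same SHA-256 checksum."""
    by_hash = defaultdict(list)
    for path, content in files.items():
        by_hash[hashlib.sha256(content).hexdigest()].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]

def scan_for_terms(text: str, dictionary: set) -> set:
    """Return the dictionary terms found in the document text; a match
    could then trigger a predefined action, such as applying a tag."""
    words = set(text.lower().split())
    return {term for term in dictionary if term.lower() in words}
```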

Data Identification

Data Identification is an essential process that helps you manage your structured data, including delimited files. It involves tagging data to make it easier to search, retrieve, and analyze. By associating dictionaries and data patterns with tables and columns, you can ensure that data is appropriately categorized and easily accessed when needed.

CAUTION:
You must run Data Profiling (for structured data) or Data Discovery (for unstructured data) before proceeding with any Data Identification activities.
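The idea of associating data patterns with columns can be sketched as follows. The rules, the match threshold, and the `identify_column` function are assumptions made for the example, not Data Catalog's built-in dictionaries or logic.

```python
import re

# Hypothetical pattern rules mapping a tag to a regular expression.
RULES = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def identify_column(values: list, threshold: float = 0.8) -> list:
    """Tag a column when a rule matches at least `threshold` of its values."""
    tags = []
    for tag, pattern in RULES.items():
        if values and sum(bool(pattern.match(v)) for v in values) / len(values) >= threshold:
            tags.append(tag)
    return tags
```

Tagging at a threshold rather than requiring every value to match tolerates the occasional malformed entry that profiling typically surfaces.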

Usage Statistics

Note: The Usage Statistics process is available only for Microsoft SQL Server and Oracle databases, and only when auditing is enabled in those databases.

When processing Microsoft SQL Server or Oracle databases, Data Catalog provides an additional feature that gathers usage statistics and stores them in the Business Intelligence Database (BIDB). During this process, the Entity Usage Worker job fetches various usage metrics from an audit database, such as how many times an entity is read, written, and altered, along with the timestamps, and stores them in the Entity Usage Statistic View collection within BIDB. You can use this repository to analyze and visualize the data with third-party BI tools. For more information, see Business Intelligence Database.
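Conceptually, the worker's roll-up of audit events resembles the sketch below. The event shape (`entity`, `action`, `timestamp`) and the `aggregate_usage` function are assumptions for illustration, not the actual BIDB schema.

```python
from collections import defaultdict

def aggregate_usage(audit_events: list) -> dict:
    """Roll up audit events into per-entity read/write/alter counts,
    recording the latest timestamp seen for each action."""
    stats = defaultdict(lambda: {"read": 0, "write": 0, "alter": 0, "last": {}})
    for e in audit_events:
        entry = stats[e["entity"]]
        entry[e["action"]] += 1
        last = entry["last"].get(e["action"])
        if last is None or e["timestamp"] > last:
            entry["last"][e["action"]] = e["timestamp"]
    return dict(stats)
```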