Data inventory

Use Pentaho Data Catalog


Behind Data Catalog’s self-service user interface is an engine that profiles the data repository and enriches it by propagating terms created by users. Data Catalog identifies the formats of the resources and profiles their contents, securely creating an inventory of the data assets in the data warehouse.

Most of the data-curating process entails writing code to profile and graph data. Data Catalog automates this process, improving the productivity of data engineers and data scientists.
Data profiling is the process in which Data Catalog examines file data and gathers statistics about it. Data Catalog profiles data in the cluster and uses its algorithms to compute detailed properties, including field-level data quality metrics and data statistics. The resulting inventory includes rich metadata for delimited files, JSON files, and Parquet files, as well as files compressed with supported compression algorithms such as gzip.
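To make the idea of field-level statistics concrete, the following is a minimal Python sketch of profiling a delimited file, optionally compressed with gzip. It is not Data Catalog's profiling algorithm. The function names (open_text, profile_rows, profile_file) and the chosen statistics are assumptions made for illustration only.

```python
import csv
import gzip
import io
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def profile_rows(rows):
    """Compute simple per-field statistics from an iterable of dict rows."""
    stats = {}
    for row in rows:
        for field, value in row.items():
            if field is None:
                # csv.DictReader stores surplus columns under the key None; skip them.
                continue
            s = stats.setdefault(field, {"count": 0, "empty": 0, "values": Counter()})
            s["count"] += 1
            if value is None or value.strip() == "":
                s["empty"] += 1
            else:
                s["values"][value] += 1
    return {
        field: {
            "count": s["count"],
            "empty": s["empty"],
            "distinct": len(s["values"]),
            "most_common": s["values"].most_common(1)[0][0] if s["values"] else None,
        }
        for field, s in stats.items()
    }


def profile_file(path, delimiter=","):
    """Profile a delimited file on disk (hypothetical helper for this sketch)."""
    with open_text(path) as handle:
        return profile_rows(csv.DictReader(handle, delimiter=delimiter))


# Usage with an in-memory sample instead of a real file.
sample = io.StringIO("id,email\n1,a@x.com\n2,\n3,a@x.com\n")
report = profile_rows(csv.DictReader(sample))
print(report["email"])
```

The empty-value count and the distinct-value count stand in for the kind of data quality metrics a profiler might report. A production profiler would also infer data types and handle files too large to hold distinct values in memory.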
Sensitive data discovery
Sensitive data residing in the data cluster presents a sizable liability if it is not protected and managed. Data Catalog’s algorithms identify sensitive data throughout the data cluster as part of profiling, with minimal additional overhead. Identification is the first step, and often the hardest, in protecting sensitive data, because you cannot protect sensitive data unless you know where it resides. Data Catalog identifies sensitive data and facilitates the next step of protecting it through masking, encryption, or quarantine.
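As a rough illustration of identification followed by masking, the Python sketch below flags fields whose values mostly match a known pattern and then masks a value. It does not reflect how Data Catalog detects sensitive data. The patterns, the 0.5 threshold, and the function names (find_sensitive_fields, mask) are all assumptions chosen for the example.

```python
import re

# Illustrative patterns only; real detection needs validation and far broader coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_sensitive_fields(rows, threshold=0.5):
    """Return {field: label} where at least `threshold` of non-empty values match a pattern."""
    hits = {}
    totals = {}
    for row in rows:
        for field, value in row.items():
            if not value:
                continue
            totals[field] = totals.get(field, 0) + 1
            for label, pattern in PATTERNS.items():
                if pattern.fullmatch(value.strip()):
                    key = (field, label)
                    hits[key] = hits.get(key, 0) + 1
    return {
        field: label
        for (field, label), n in hits.items()
        if n / totals[field] >= threshold
    }


def mask(value, keep=4):
    """Replace all but the last `keep` characters with asterisks."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


rows = [
    {"name": "Ann", "ssn": "123-45-6789", "contact": "ann@example.com"},
    {"name": "Bob", "ssn": "987-65-4321", "contact": "n/a"},
]
flagged = find_sensitive_fields(rows)
print(flagged)
print(mask(rows[0]["ssn"]))
```

Flagging a whole field from the share of matching values, rather than judging each value alone, keeps a single stray match from marking a column as sensitive. Masking is only one of the protections the text mentions; encryption and quarantine would replace the mask step.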