Step 2: Profile data

Get started with Pentaho Data Catalog

After you ingest the schema for a data source, the information available in Data Catalog is limited to the database metadata. Data profiling adds to it by examining the data in the selected data objects and collecting statistics and informative summaries about that data. Results become available almost immediately, as each individual column, table, or schema is processed.
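To make the idea of "statistics and informative summaries" concrete, the following is a minimal sketch of column-level profiling in plain Python. It is an illustration only and does not reflect how Data Catalog implements profiling; the function names `profile_column` and `profile_table`, the chosen statistics, and the sample rows are all assumptions made for this example.

```python
from collections import Counter


def profile_column(values):
    """Return simple summary statistics for one column of values.

    Hypothetical helper for illustration; not a Data Catalog API.
    """
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def profile_table(rows, columns):
    """Profile every column of a table given as a list of row tuples."""
    return {
        name: profile_column([row[i] for row in rows])
        for i, name in enumerate(columns)
    }


# Made-up sample data: an "id" column and a "country" column with one null.
rows = [(1, "US"), (2, "DE"), (3, None), (4, "US")]
profile = profile_table(rows, ["id", "country"])
```

Each column is summarized independently here, which mirrors the idea that results can be reported as each column or table finishes rather than only at the end of the whole run.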

Data profiling is a prerequisite for most data analysis processes within Data Catalog. If the data profile is not valid, you must re-profile the data before proceeding with any data identification activities.
Tip: As a best practice, keep the scope of your selection reasonable. For example, do not try to process 100,000 tables at once, because profiling can take some time depending on the nature of the data. Use the default settings on the Configure Data Profiling page, as they are suitable for most situations.