In Data Catalog, you can view metadata in graphical formats, such as value histograms and unique value counts, to help you analyze data quickly. You can also view sample values and profiled samples.
To open a data type profile, navigate to the column in the resource you want to view and click it to explore the field-level data.
When viewing column details, you can see the resource field-level metadata along with data analysis, cardinality for fields, and sample values. To show metadata in the resource field, you need native access to the resource or metadata level as governed by the RBAC settings for your user role.
Depending on the selected resource level or data element, you can view different summaries of information, including the following resource metrics:
- Description
- Displays a description of the resource that is imported from the source. You can contribute resource information to the knowledge base by writing content and including links to other articles in Data Catalog. To edit the description, click Edit Description, which opens a dialog box where you can format the text using tools like bold, italic, underline, and strikeout. You can also align text, insert code blocks, and add links as needed.
- System Information
- When you select an unstructured file, this pane displays the timestamps for file creation, modification, and last access. Note: In certain file systems, when a file's modification date is earlier than its creation date, certain APIs, like the SMB network client, may display the more recent date as the modification date.
- Statistics
- When you select a table, you can view the Field Count and Row Count statistics. The following table identifies the key details available in the Statistics pane when you select a column in a table to view:

| Feature | Description |
| --- | --- |
| Null Count | The number of entries that are null. |
| Cardinality | The number of unique values in a field. A low cardinality indicates many repeated values. |
| HLL | An estimate of the cardinality of the data, with a margin of error of roughly 2%. |
| Blank Count | The number of entries that are blank. |
| Min Width | The minimum character count of a value in the column. |
| Max Width | The maximum character count of a value in the column. |
| Avg Width | The average character count of a value in the column. |

- Data Patterns
- In Data Catalog, data pattern analysis offers recommendations based on detected patterns and their frequency. These recommendations include regular expressions (RegEx) at different levels of pattern-matching precision: loose, moderate, and strict. Data Catalog gives you the flexibility to choose the most appropriate patterns. Simplifying the patterns by focusing on just the characters 'A', 'a', 'n', and 's' reveals the underlying data patterns more clearly. After obtaining a set of simplified patterns along with their respective frequency counts, candidate regular expressions can be generated. The following options demonstrate possible regular expressions tailored to the desired level of strictness:

| Pattern | Description |
| --- | --- |
| `^\w{2}\d{5}$` | Loose pattern: This pattern is less strict and excludes the last value in the example, with 80% confidence. |
| `^[K]\w\d{5}$` | Strict first letter and five digits: This expression maintains strict criteria for the first letter while allowing for variability in the subsequent characters. |
| `^[K]\w\d{5,6}$` | Loose on the second character: This pattern ensures 100% confidence but introduces flexibility for the second character. |
| `^[K][A,L,T,W]\d{5,6}$` | More strict pattern: This expression imposes stricter conditions while maintaining 100% confidence. Note that commas inside a character class are literal, so this class also accepts a comma as the second character; `[ALTW]` is the tighter form. |
| `^[A-Z][A-Z]\d{5,6}$` | Another 100% confidence pattern that differs in its structure. |

CAUTION: If your user role does not grant access to the field or viewing level of the information, the Data Patterns pane does not appear.

- Sample Data
- Shows random values for the field, along with their frequency and distribution, when viewing a column. Text names and values are truncated after 200 characters. You can identify resources that have been sample-profiled and other resource-level information.
- To view this pane, your role must allow Sample Data Access through native system permissions. If your user role has administrative privileges, you can configure these values. If not, contact your administrator for details.
- Properties panel
- Displays a summary of the resource properties, like the last update timestamp, name, version, and type of the resource.
- Business Terms panel
- Lists associated business terms for the resource. You can also click Add Term to open the Business Terms dialog box and add terms to the resource. For more information, see the Administer Pentaho Data Catalog document.
- Tags panel
- Lists the tags associated with the resource. You can also click and start adding tags, such as “quality:45” (the key should be unique), to the resource, which helps identify it by tagged keywords.
- Custom Properties panel
- Lists the first five custom properties associated with the resource. Custom properties refer to user-defined metadata attributes or fields that can be associated with various data assets, such as databases, tables, files, or documents, to provide additional context and information about those assets. To add a custom property, click Add Custom Property and provide the required information. In addition, go to the Properties tab to see the complete list of custom properties added to the resource.
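
To make the Statistics pane metrics more concrete, the following is a minimal Python sketch of how such column statistics could be computed. It is not Data Catalog's implementation. The function name `column_stats` is hypothetical, and the sketch assumes that a blank entry is an empty or whitespace-only string, that widths are measured over non-null, non-blank values, and that cardinality is counted exactly with a set rather than estimated with HLL.

```python
from typing import Optional, Sequence


def column_stats(values: Sequence[Optional[str]]) -> dict:
    """Compute illustrative column statistics (hypothetical helper)."""
    # Assumption: None represents a null entry.
    non_null = [v for v in values if v is not None]
    # Assumption: a blank entry is an empty or whitespace-only string.
    non_blank = [v for v in non_null if v.strip() != ""]
    # Assumption: widths are measured only over non-null, non-blank values.
    widths = [len(v) for v in non_blank]
    return {
        "null_count": len(values) - len(non_null),
        "blank_count": len(non_null) - len(non_blank),
        # Exact distinct count; an HLL value would only approximate this.
        "cardinality": len(set(non_blank)),
        "min_width": min(widths) if widths else 0,
        "max_width": max(widths) if widths else 0,
        "avg_width": sum(widths) / len(widths) if widths else 0.0,
    }


# Made-up sample column, for illustration only.
stats = column_stats(["KA12345", "KA12345", "", None, "KW123456"])
print(stats)
```

In this made-up column, the repeated value lowers the cardinality relative to the number of entries, which is the "many repeated values" signal described for the Cardinality metric.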
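
The Data Patterns description can be illustrated with a short Python sketch that simplifies values into pattern strings, counts them, and measures how many values each candidate expression matches. This is an approximation under stated assumptions, not the product's algorithm: the mapping of 'A' to uppercase letters, 'a' to lowercase letters, 'n' to digits, and 's' to any other character is assumed, the helper names `simplify` and `match_rate` are hypothetical, and the sample values are invented to resemble the example in the table.

```python
import re
from collections import Counter


def simplify(value: str) -> str:
    """Reduce a value to a simplified pattern string (assumed mapping)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")  # assumption: 'n' stands for a digit
        elif ch.isalpha():
            # assumption: 'A' is uppercase, 'a' is any other letter
            out.append("A" if ch.isupper() else "a")
        else:
            out.append("s")  # assumption: 's' is any other character
    return "".join(out)


def match_rate(pattern: str, values: list) -> float:
    """Fraction of values matched by the pattern (a rough 'confidence')."""
    rx = re.compile(pattern)
    return sum(1 for v in values if rx.search(v)) / len(values)


# Invented sample values; the last one has six digits instead of five.
samples = ["KA12345", "KL12345", "KT12345", "KW12345", "KA123456"]

# Frequency count of the simplified patterns.
print(Counter(simplify(v) for v in samples))

# Compare a loose candidate with one that allows five or six digits.
for candidate in (r"^\w{2}\d{5}$", r"^[K]\w\d{5,6}$"):
    print(candidate, match_rate(candidate, samples))
```

With these invented samples, the loose expression misses the six-digit value and matches four of five values, while the `{5,6}` variant matches all five, mirroring the 80% and 100% confidence levels listed in the table.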