Data Catalog profiles structured data in a variety of file formats, including Avro, ORC (Optimized Row Columnar), Parquet, CSV (Comma-Separated Values), TSV (Tab-Separated Values), XLS, and XLSX. The profiling process generates statistical and intermediate data required by other analytic processes; downstream processes such as data flow and foreign key detection consume this intermediate data.
The intermediate data generated for each column includes the following (illustrative sketches of each item appear after the list):
- Roaring Bitset: A bitmap of the hash values for all entries in the column.
- HyperLogLog (HLL): Provides an estimate of the cardinality of the data, with a margin of error of roughly 2%.
- Data Pattern Analysis: Performs a rudimentary data pattern analysis using dimensional reduction, tracking the most frequently occurring patterns.
- Data Quality Pre-Analysis: Using the Data Pattern Analysis results, Data Catalog performs a statistical estimation of the data quality, summarized as an overall percentage as well as a heat map for each data pattern. Data Catalog also recommends the regular expressions (RegEx) that are the most probable matches.
- Statistics: Data Catalog gathers the following statistics when examining all the data:
  - Minimum and maximum values (for numeric columns)
  - Widest and narrowest (non-null) string widths
  - Null count
  - Total row count
- Data Sampling: Data Catalog takes a controlled sampling of the data so that the samples are consistently chosen across different columns.
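To illustrate the bitset idea, the sketch below hashes every column entry to a 32-bit key and stores the keys in a plain Python set. This is only a stand-in: a production profiler would keep the same keys in a compressed Roaring bitmap, and `crc32` here stands in for whatever hash function Data Catalog actually applies. Overlap between the bitsets of two columns is the kind of signal that downstream foreign key detection relies on.

```python
import zlib


def column_hash_bitset(values):
    """Build a bitset of 32-bit hashes for a column's entries.

    A plain set of integers stands in for a compressed Roaring bitmap here.
    """
    bitset = set()
    for value in values:
        if value is None:
            continue
        # Hash every entry to a stable 32-bit key (crc32 is used only for illustration).
        bitset.add(zlib.crc32(str(value).encode("utf-8")))
    return bitset


# Overlap between two columns' bitsets hints at candidate foreign key relationships.
orders_customer_id = column_hash_bitset(["C001", "C002", "C003"])
customers_id = column_hash_bitset(["C001", "C002", "C003", "C004"])
containment = len(orders_customer_id & customers_id) / len(orders_customer_id)
print(f"containment: {containment:.2f}")  # 1.00 -> every order key exists in customers
```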
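The next sketch is a minimal HyperLogLog, assuming 2**12 registers and omitting the small- and large-range corrections of the full algorithm. It is not Data Catalog's implementation, only a demonstration of how a fixed-size register array yields a cardinality estimate within a few percent.

```python
import hashlib


class HyperLogLogSketch:
    """Minimal HyperLogLog for cardinality estimation.

    2**12 registers give a standard error of about 1.04 / sqrt(4096) ~= 1.6%;
    the small- and large-range corrections of the original paper are omitted.
    """

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the value.
        h = int.from_bytes(hashlib.sha1(str(value).encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                  # top p bits select a register
        remaining = h & ((1 << (64 - self.p)) - 1)  # low 64-p bits
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - remaining.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias correction for m >= 128
        return int(alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers))


hll = HyperLogLogSketch()
for i in range(100_000):
    hll.add(f"customer-{i}")
print(hll.estimate())  # close to 100000, typically within ~2%
```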
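One common way to do pattern analysis by dimensional reduction is to collapse every character to a class symbol (digits to N, letters to A) and count the resulting pattern strings. The sketch below uses that assumption, then derives a rough quality percentage from conformance to the dominant pattern and turns that pattern into a candidate regular expression; the exact reduction, scoring, and RegEx selection that Data Catalog uses are not described here.

```python
import re
from collections import Counter


def reduce_to_pattern(value):
    """Collapse a value to a coarse pattern: digits -> 'N', letters -> 'A', rest kept."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "N", str(value)))


def profile_patterns(values):
    """Count reduced patterns, estimate a quality percentage, and suggest a regex."""
    patterns = Counter(reduce_to_pattern(v) for v in values if v is not None)
    total = sum(patterns.values())
    dominant, dominant_count = patterns.most_common(1)[0]
    # Treat conformance to the dominant pattern as a rough quality score.
    quality_pct = 100.0 * dominant_count / total
    # Turn the dominant pattern back into a candidate regular expression.
    regex = "^" + re.escape(dominant).replace("N", r"\d").replace("A", "[A-Za-z]") + "$"
    return patterns, quality_pct, regex


phones = ["415-555-0100", "415-555-0101", "bad value", "650-555-0199"]
patterns, quality_pct, regex = profile_patterns(phones)
print(patterns)     # Counter({'NNN-NNN-NNNN': 3, 'AAA AAAAA': 1})
print(quality_pct)  # 75.0
print(regex)        # ^\d\d\d\-\d\d\d\-\d\d\d\d$
```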
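The statistics listed above can all be gathered in a single pass over the column, as the following sketch shows; the dictionary layout and type handling are illustrative, not Data Catalog's internal representation.

```python
def column_statistics(values):
    """Single pass over a column gathering the statistics listed above."""
    stats = {
        "row_count": 0,
        "null_count": 0,
        "min_value": None,   # numeric columns
        "max_value": None,
        "min_width": None,   # narrowest non-null string width
        "max_width": None,   # widest non-null string width
    }
    for value in values:
        stats["row_count"] += 1
        if value is None:
            stats["null_count"] += 1
            continue
        if isinstance(value, (int, float)):
            stats["min_value"] = value if stats["min_value"] is None else min(stats["min_value"], value)
            stats["max_value"] = value if stats["max_value"] is None else max(stats["max_value"], value)
        else:
            width = len(str(value))
            stats["min_width"] = width if stats["min_width"] is None else min(stats["min_width"], width)
            stats["max_width"] = width if stats["max_width"] is None else max(stats["max_width"], width)
    return stats


print(column_statistics([3, 7, None, 1]))                    # min 1, max 7, 1 null, 4 rows
print(column_statistics(["NY", "California", None, "TX"]))   # widths 2 and 10
```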
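For consistent sampling, one straightforward approach is to choose a single, seeded set of row positions and apply it to every column, so sampled values from different columns still line up row by row. The fixed seed below is an assumption used to make the selection repeatable, not a documented Data Catalog parameter.

```python
import random


def consistent_sample_indices(row_count, sample_size, seed=42):
    """Pick one set of row positions so every column is sampled at the same rows."""
    rng = random.Random(seed)  # fixed seed -> repeatable, controlled sample
    return sorted(rng.sample(range(row_count), min(sample_size, row_count)))


# Columns of the same table, stored separately (column-oriented layout assumed).
names = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]
ages = [36, 45, 41, 72, 68]

indices = consistent_sample_indices(len(names), sample_size=3)
name_sample = [names[i] for i in indices]
age_sample = [ages[i] for i in indices]  # same rows as name_sample, so values stay aligned
```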