I previously discussed Generating Star Schemas from Data Vault, which served as the core foundation of our semantic layer. With a data vault containing all the entities and their relationships, we can generate dimension tables directly from entities (hubs/satellites) and fact tables from events (transactional links).
However, a manual operation is still required. Most events only contain references to the entities they directly belong to. For analysis purposes, indirectly referenced entities often need to be identified and added to the generated fact tables. This is usually handled by ETL or ELT processes. Our semantic layer lets users navigate the entity relationship graph to specify the paths to the entities to be included in the generated fact tables; the system then creates the corresponding ELT jobs. It works well but adds cognitive load for users and creates opportunities for errors.
Generating Fact Tables with Indirect References
First, let's illustrate the approach with a simplified real-world example. Imagine we want to measure the power usage of manufacturing equipment in a factory. We can use smart electricity meters to collect the measurements. The payloads uploaded from these meters, each containing the meter's ID and a power consumption reading, serve as the event data we will use to create our star schema.
Several business systems manage the equipment's asset records (such as the mapping between electricity meters and equipment, the equipment's location and organization, its operator, etc.). We have replicated these records from the business systems to build a data vault containing all the entities and relationships necessary for the analysis.
In the diagram below, the green paths starting from the event "evt_meter" in the graph must be specified to bring equipment, organization, and equipment type to the generated fact table.
This approach supports incremental data loading and can be triggered by incoming batches of meter events written into "evt_meter." The joins along these green paths dominate the performance overhead and can be managed adequately. The relationship graph does impose some limitations but also offers several options for eliminating loops and duplicates.
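To make this concrete, here is a minimal sketch of how such a loading query could be generated from a specified path. All table and column names besides "evt_meter" are hypothetical, and every hop is assumed to be many-to-one so the joins cannot expand rows.

```python
# Hypothetical link tables along the green path; each hop is
# (link_table, shared_key, resolved_key).
PATH = [
    ("lnk_meter_equipment", "meter_id", "equipment_id"),
    ("lnk_equipment_org", "equipment_id", "org_id"),
]

def build_fact_sql(event_table: str, path) -> str:
    """Emit SQL that resolves indirect references along the path."""
    select = [f"{event_table}.*"]
    joins = []
    prev = event_table
    for link, shared_key, resolved_key in path:
        joins.append(f"LEFT JOIN {link} ON {link}.{shared_key} = {prev}.{shared_key}")
        select.append(f"{link}.{resolved_key}")  # indirect key lands in the fact
        prev = link
    return "SELECT " + ", ".join(select) + f"\nFROM {event_table}\n" + "\n".join(joins)

print(build_fact_sql("evt_meter", PATH))
```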
What is the Unified Star Schema?
The primary intended use of the Unified Star Schema is to enable self-service BI. As its name suggests, the main idea is to use a single "unified" star schema to serve queries. So, instead of having many fact tables each surrounded by some dimensions, a single Puppini bridge table sits in the middle, surrounded by all dimensions and facts.
The Puppini bridge table contains the primary key columns of all dimension and fact tables. As for rows, it is the "union" of all dimension and fact tables. Naturally, it is sparse and filled with many null values.
One tweak is that for rows coming from fact tables, any "indirect" foreign keys also need to be filled in appropriately. The dash-edged areas in the diagram above show the "indirect" foreign keys filled in for the original fact rows.
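As a toy illustration (all names and rows made up), the bridge is the union of every table's rows over the superset of their key columns:

```python
# Key columns are the superset of all tables' primary keys (made up here).
KEY_COLUMNS = ["equipment_id", "org_id", "meter_event_id"]

dim_equipment = [{"equipment_id": "equip1"}, {"equipment_id": "equip2"}]
dim_org = [{"org_id": "org1"}]
# A fact row carries its own key plus the filled-in "indirect" foreign keys.
fct_power = [{"meter_event_id": "e1", "equipment_id": "equip1", "org_id": "org1"}]

bridge = [
    {col: row.get(col) for col in KEY_COLUMNS}           # absent keys become None
    for row in (*dim_equipment, *dim_org, *fct_power)    # "union" of all tables
]
for row in bridge:
    print(row)
```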
This method addresses several commonly seen problems when using dimensional modeling.
First, fan and chasm traps[1]. The Unified Star Schema natively prevents both kinds of traps from happening. No data can ever be duplicated because it uses union instead of join.
Second, business users usually do not fully comprehend the details of all schemas. It is challenging to get queries correct, especially with multiple fact tables involved (drill-across). The second advantage of the Unified Star Schema is its increased ease of use. Queries always start from the Puppini bridge table, followed by one-level joins to any needed dimensions and facts. A single fixed query pattern makes things considerably easier for business users. Not only are the queries more straightforward to write, but they are also free of traps.
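A sketch of that fixed shape, with hypothetical table and column names:

```python
# The single fixed pattern: start from the bridge, then one-level joins
# out to whatever dimensions and facts the question needs (names made up).
query = """
SELECT dim_org.org_name, SUM(fct_power.power_kwh) AS total_kwh
FROM unified_bridge
LEFT JOIN dim_org   ON dim_org.org_id = unified_bridge.org_id
LEFT JOIN fct_power ON fct_power.meter_event_id = unified_bridge.meter_event_id
GROUP BY dim_org.org_name
"""
print(query)
```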
Third, data-loss problems. Because all dimension keys are also included as separate rows in the bridge table, aggregation does not lose data, even for dimension members without any transaction or event data in the queried range.
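A toy demonstration with made-up rows: the bridge keeps a row for org2 even though it has no events, so org2 still shows up in the aggregate.

```python
# Toy bridge rows (keys only, names made up). org2 has no events, but it
# still has its own dimension row in the bridge.
bridge = [
    {"org_id": "org1", "meter_event_id": "e1"},   # fact row, org key filled in
    {"org_id": "org1", "meter_event_id": None},   # dimension row for org1
    {"org_id": "org2", "meter_event_id": None},   # dimension row for org2
]
fct_power = {"e1": 10.0}  # meter_event_id -> power reading

totals: dict[str, float] = {}
for row in bridge:
    totals.setdefault(row["org_id"], 0.0)          # every org appears
    if row["meter_event_id"] is not None:          # join out to the fact table
        totals[row["org_id"]] += fct_power[row["meter_event_id"]]
print(totals)  # {'org1': 10.0, 'org2': 0.0} -- org2 is not lost
```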
Finding and Saving Indirect References Automatically
You may have noticed the operation common to the previous two sections: finding and saving indirect references. The Puppini bridge table got me thinking about automating this operation.
Initially, we let users manually specify the green arrows in the first diagram. This was not trivial, though. With dozens of entities and all the relationships among them, users often felt lost in the graph, and most of the time it became trial and error.
Thanks to our data vault approach, all the information necessary to find correlated entities is available. So, we later automated this manual operation. Instead of asking users to choose which entities to include, all entities reachable via any number of links are included by default. Every event table generates a fact table with all usable dimensions filled in.
It's important to note that our system supports all relationship cardinalities: one-to-one, many-to-one, one-to-many, and many-to-many. Assuming we start from the left, one-to-one and many-to-one relationships work fine since one row on the left matches, at most, one row on the right. No row "expansion" is possible, i.e., one event row can never become more than one row in the generated fact table.
However, one-to-many and many-to-many relationships may expand rows. For example, a power usage measure of 10 could be duplicated multiple times, summing to more than 10 (which is, of course, wrong).
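A toy sketch of the problem, with made-up names: a naive one-to-many join copies the reading to each matching row, and the total is inflated.

```python
# One meter connected to two machines: a naive join duplicates the 10 kWh
# reading, so the total becomes 20. All names are made up.
event = {"meter_id": "meter1", "power_kwh": 10.0}
meter_to_equipment = {"meter1": ["equip1", "equip2"]}

fact_rows = [
    {"equipment_id": eq, "power_kwh": event["power_kwh"]}  # blind copy per match
    for eq in meter_to_equipment[event["meter_id"]]
]
print(sum(row["power_kwh"] for row in fact_rows))  # 20.0, not the true 10.0
```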
We provide an option for these two relationship cardinalities to specify an "allocation"[2] rule (at the per-measure level). You specify a percentage for each "n"-side row to distribute and allocate the measure's amount. Let's see a few use cases:
An electricity meter is connected to and used by multiple machines. The power usage reported by the meter must be properly "split" among the connected machines; otherwise, it would be double-counted. In this case, the percentages across the "n" sides should sum to 100%.
A sensor measures the indoor temperature. The reading is used for all equipment placed in the same room. In this case, 100% is set for each "n" side.
This option provides the flexibility to support all kinds of use cases. When generating the fact table, the measure is multiplied by the allocation percentage for each target row. Let's see a many-to-many example:
In the above diagram, meter1's power usage reading is evenly split between equip1 and equip2. This is the case of a single meter used by multiple pieces of equipment. As another interesting case, equip1 uses two meters (potentially from two power sources), so its power usage equals the sum of its shares from meter1 and meter2 (20 + 5 = 25).
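As a rough sketch of the allocation arithmetic (the readings are assumed here: meter1 reads 40, split evenly at 20 per machine, and meter2 reads 5, allocated entirely to equip1, reproducing the 20 + 5 = 25 above):

```python
# Allocation rules per meter: meter -> [(equipment, percentage), ...].
# Readings are assumed: meter1 = 40 kWh, meter2 = 5 kWh.
readings = {"meter1": 40.0, "meter2": 5.0}
allocation = {
    "meter1": [("equip1", 0.5), ("equip2", 0.5)],  # split evenly across machines
    "meter2": [("equip1", 1.0)],                   # equip1's second power source
}

usage: dict[str, float] = {}
for meter, targets in allocation.items():
    for equipment, pct in targets:
        # Multiply the measure by the allocation percentage per target row.
        usage[equipment] = usage.get(equipment, 0.0) + readings[meter] * pct
print(usage)  # {'equip1': 25.0, 'equip2': 20.0}
```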
Semantic Layer with Unified Star Schema
Our automated fact table generation is similar to Puppini bridge table generation: both find and save indirect foreign keys in fact tables. The difference is that our approach keeps the fact tables independent and does not mix them physically.
When I first learned about the Unified Star Schema, one immediate concern was the join between the (very large) bridge table and the (large) fact table(s). The bridge table contains only primary keys, so it must join the fact table(s) to get to the measures. Proper partitioning and fine-tuning of the bridge table seem challenging. One approach is to include the measures in the bridge table as well, effectively duplicating all fact tables.[3][4] Another approach is to have multiple physical stage tables, one per fact table, and then create a view that unions all the stage tables into the final bridge.
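A rough sketch of that second option, with hypothetical table names:

```python
# One physical stage table per source table, each materialized with the
# full superset of key columns so they union cleanly, hidden behind a
# single bridge view. All names here are hypothetical.
stage_tables = [
    "bridge_stage_dim_equipment",
    "bridge_stage_dim_org",
    "bridge_stage_fct_power",
]
view_sql = "CREATE VIEW unified_bridge AS\n" + "\nUNION ALL\n".join(
    f"SELECT * FROM {t}" for t in stage_tables
)
print(view_sql)
```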
That is where a semantic layer shines. Implementation details do not matter to users. Internally, we use multiple physical fact tables, and the semantic layer knows how to generate correct multi-pass SQL against those tables when doing drill-across queries. There are no fan or chasm traps and no data loss, because the semantic layer knows how to avoid them.
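As a toy illustration of the multi-pass idea (the second "output" fact is invented for the example): aggregate each fact table separately, then merge the passes on the shared dimension, so the two facts are never joined row to row and no fan trap can occur.

```python
# Two facts sharing the org dimension; the "output" fact is made up here.
power = [("org1", 10.0), ("org1", 15.0), ("org2", 7.0)]  # (org_id, kWh)
output = [("org1", 100), ("org2", 40), ("org2", 60)]     # (org_id, units made)

def roll_up(rows):
    """One aggregation pass over a single fact table."""
    acc = {}
    for key, value in rows:
        acc[key] = acc.get(key, 0) + value
    return acc

pass1, pass2 = roll_up(power), roll_up(output)           # one pass per fact
merged = {
    org: (pass1.get(org, 0), pass2.get(org, 0))          # merge on shared dim
    for org in sorted(pass1.keys() | pass2.keys())
}
print(merged)  # {'org1': (25.0, 100), 'org2': (7.0, 100)}
```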
Conclusion
The Unified Star Schema is quite interesting. Its novel idea of the Puppini bridge table eliminates some of the most common problems in dimensional modeling and enables a smoother self-service BI experience.
Inspired by the Puppini bridge table idea, our star schema generation is now fully automated and produces fact tables with complete dimensional references. This shows the beauty of the data vault.
A semantic layer provides the best of both worlds. To users, it shields the internal details and helps avoid the most commonly encountered problems when they directly write queries against tables. Implementation-wise, it can use the most optimized schema without worrying whether its users can use it effectively.
1. https://showmethedata.blog/sql-traps-unified-star-schema
2. From the Kimball method's "allocated facts."
3. I feel that might be a viable solution, particularly if the original fact tables can be safely dropped.
4. https://www.youtube.com/watch?v=ZxjhplaQWRA mentions a scenario of using the Puppini bridge with Power BI, which seemed to have problems without measures included.