Originally authored by Damir Sudarevic
Logging tables are usually at the periphery of a design effort, sometimes added as an after-thought. Usually, a logging table has a large number of rows and many optional columns, most containing NULLs. In addition, because ALTER TABLE statements become expensive in a production system, these tables often include some kind of a serialized array (JSON, CSV) for all the optional fields. In otherwise well engineered systems, these logging tables stick out as an embarrassing mess.
A logging table may look something like this one, a few mandatory and many optional (1,2,3…X) columns.
One way to tackle the problem is to design away storage of nulls by vertically partitioning the table – quite efficient if the DB supports join elimination, also known as table elimination.
Join elimination is a feature of the query optimizer, which excludes tables from the execution plan – provided they are not needed to obtain the result set.
To illustrate join elimination, suppose we have a User table with an optional UserHeight column, expected to be NULL for most of the records.
Storage of nulls can be designed away; in this example, only records with known UserHeight have a row in the User_2 table.
So, if we have a query,
select FirstName , LastName , UserHeight from User_1 as a left join User_2 as b on b.UserID = a.UserID ;
then the query plan looks like:
If a query does not need any columns from the second table, like:
select FirstName from User_1 as a left join User_2 as b on b.UserID = a.UserID ;
then the plan does not include the second table.
It is important to note that the first table is always included, for example:
select UserHeight from User_1 as a left join User_2 as b on b.UserID = a.UserID ;
The above uses only columns from the second table; however, the plan is the same as in the first case.
It may not be obvious that this feature properly resolves views, for example:
create view vUser as select a.UserID as UserID , FirstName , LastName , UserHeight from User_1 as a left join User_2 as b on b.UserID = a.UserID ;
select FirstName , LastName , UserHeight from vUser ;
the result is as expected:
And, if there are no columns from the second table in the select list:
select FirstName from vUser ;
then only the first table is included in the execution plan.
The principle may now be applied to the logging table problem. Instead of the first logging example, we could use something like:
and create a view:
create view vLog as select a.LogID as LogID , LogTime , LogCategory , Optional_1 , Optional_2 , Optional_3 , Optional_X from Log as a left join Log_1 as a1 on a1.LogID = a.LogID left join Log_2 as a2 on a2.LogID = a.LogID left join Log_3 as a3 on a3.LogID = a.LogID left join Log_X as aX on aX.LogID = a.LogID ;
For example, if a query requires only Log and Log_3 tables then:
select LogID , LogTime , Optional_3 from vLog ;
The plan includes only these two tables.
To add a new column (property), the following has to be done:
- Create a new table with LogID and the new column (not very expensive to create an empty table).
- Modify the view to include the new table (fairly simple).
- Modify the logging code (has to be done in any case).
For this technique to work, the DB optimizer must support table elimination, and the number of optional tables should be manageable. What is manageable? My guess would be something like 10, 20, 50 – simply try it out. All newer versions of MS SQL Server, Oracle, DB2, PostgeSQL support the feature, but MySQL 5.x still does not; MariaDB does.About Damir Sudarevic
digs data, likes cats.