Monthly Archives: January 2011

Referential Integrity (Part 2: triggers and other methods)

Last week we discussed foreign key constraints as the way to implement referential integrity. Today I’d like to focus on the other implementation approaches.

First of all, let’s try to outline the situations when foreign key constraints would not work for you

  1.  If you need to reference the table involved in partition switch. SQL Server simply does not allow it.
  2. When you have the situation when detail and master data can be inserted/deleted out of order. For example, in my system we have 2 streams of transactional data and one stream is referencing another. Those streams are collected and processed differently so we cannot guarantee that master data row would be inserted prior detail data row. So foreign key constraints are not the options for us.
  3. When additional index on detail (referencing) column is not appropriate. This one is interesting. There are some cases when you don’t want to maintain/support another index on the table. This approach would not work very well if you want to delete detail rows same time with the master rows. If this requirement is not critical, you can purge detail rows once per night even if it forces scan(s) of detail table). In some cases it makes sense.

How to accomplish it? First, you can handle it in the client code. Generally this solution is not really good and could easily become nightmare, especially in the case if  system does not have dedicated data access layer/tier code. With data access layer (especially if it’s done on the database side via stored procedures) it’s not so simple. On one hand it gives you all control possible you don’t have with the triggers. On the other, you need to make sure that there are no code, especially legacy code, that does not use data access layer/tier code. And you also need to be sure that same would be true at the future. Again, in some cases it could make sense. It depends.

Second, obvious method, is using triggers. Frankly I don’t see any benefits of using triggers in compare with actual foreign key constraints in the case, if you have deletion statement in the trigger. Although, something like that can make sense (it uses the same tables created last week):

As you can see, trigger simply inserts list of deleted OrderIds to the queue. Next, you can have sql server job running during off-peak hours that deletes the data from detail table.

That example covers the case with deletion of the master rows. As for detail (referencing) side, there are a couple things you can do. First is the trigger:

Second is using user-defined function and check constraint.

This approach could be better than trigger because it does not fire the validation if OrderId has not been changed.

In any case – to summarize:

  •  Referential integrity is generally good.
    • It makes sure that data is clean
    • It helps optimizer in some cases
    • It helps to detect the errors on the early stages
  • Referential integrity always introduces performance implications regardless of implementation. Although in most part of the systems those implications are minor. If you cannot afford to have referential integrity implemented in your system, always enable it during DEV/QA stages. Be careful and disable it for the performance testing though because foreign key constraints could change the plan
  • Use foreign keys unless you have specific cases described above
  • Do not use referential integrity in the code/data access tier only UNLESS you have absolute control over the database and absolutely sure that nothing would change at the future. And think twice about it.

Referential Integrity (Part 1: Foreign Keys)

If you asked the database developer about referential integrity, everybody would agree that it’s “must have” feature. Implementation though is the different question. There are quite a few ways how you can implement it and, of course, it’s impossible to say which way is the right one. As usual, it depends.

I’d like to talk about referential integrity in general and discuss pros and cons of the different implementation methods. I’m not trying to cover every possible implementation method and focus on a few most obvious ones.

So first of all, why referential integrity is good? Quite a few obvious reasons:

  1. It makes sure that data is clean
  2. It helps query optimizer (in some cases)
  3. It helps to detect some bugs in the code on the early stages.

As for the negative side, it always introduces some performance implications. In most part of the cases those implications are minor. Although in some cases you need to take them into the consideration.

So let’s take a look at the different implementation approaches. First, foreign key constraints. Probably most known method. When you use this method, everything is automatic. SQL Server checks the existence of the master (referenced) row when detail (referencing) row is inserted. And vice verse. Now let’s think about the implications. Let’s create the tables and populate it with some data.

Now let’s create the foreign key constraint.

Now let’s enable “display actual execution plan” and insert data to the OrderItems table.

Obviously SQL Server needs to make sure that master row exists. So it adds clustered index seek on the Orders table to the picture. Now let’s try deletion of the master row:

Oops. Clustered index SCAN on the OrderItems table. This is the most common mistake database developers do – when you create the foreign key, SQL Server requires to have unique index on the referenced (Master) column. But it does not automatically create the index on the referencing (Detail) column. And now think if you have millions of the rows in OrderItems table.. Ohh..

Let’s add the index and try again:

As you can see, non-clustered index seek. Better now. So at the end, when you have foreign keys in the database, every deletion from the master table introduces index seek in the detail table. Every insert into the detail table introduces index seek in the master table. Plus you need to maintain the index on the detail table. Most likely, those issues would not be critical unless the system is really big and under heavy load. But you need to be aware of it.

Another limitation we need to be aware of is partition switch. You cannot switch the partition if table is referenced by other tables. So if this is the case, you need to implement other approaches we will discuss next week.

Code is available for download

Sunday T-SQL Tip: Purging Data

Last week we discussed a few things we have to keep in mind implementing sliding window scenario. Unfortunately table partitioning available in SQL Server Enterprise edition only. How should we purge the transactional data with other editions of SQL Server? Obviously there is no such thing as the golden bullet. Same time there is one particular scenario a lot of the systems have – transactional table with clustered index on identity column. Let’s see how we can optimize the purge process in such case.

Assuming we have a table with identity ID column and DateCreated datetime column. Assuming you need to purge the data based on that DateCreated column. It could happen daily, monthly – i.e. with some time interval. So let’s create such table and populate it with 1M rows. Let’s create an index on DateCreated column.

Now let’s try to purge data a few times and see the results. Please ignore begin tran/rolback – only purpose of those is to preserve the data between the test runs.

As you can see, purging about 40% of the rows takes about 3.5 seconds. There are 2 possible execution plans based on the number of the rows we need to delete – either using clustered index scan (our 40% case) or non-clustered index seek (if % of the row is smaller).

In any case, let’s think about it in more details. In both cases, SQL Server needs to process a lot of data and acquire/hold U/X locks for the duration of the execution. Not really good in terms of concurrency.

What can we do in order to improve it? In the table design like that our DateCreated column is increasing/populating same way with identity. So instead of deleting data based on DateCreated column (that most likely uses non-clustered index seek), let’s get ID of the row we want to keep (i.e min(ID) where DateCreated > @). Next step, instead of deleting everything at once, we can delete data in batches (10-100K rows each) in the individual transactions. In such case locks would be kept only within the batch deletion.

Let’s take a look. Again, please ignore outer begin tran/rollback.

As you can see, it runs about 10% faster but more importantly, it reduces the blocking in the table. Let’s take a look at the plan.

Non-surprisingly it uses clustered index seek/delete. But as you see – a lot of time wasted on NCI maintenance. And here is the interesting point. Does system really need to have the index on DateCreated for any other purpose than purge? If this is the case, we can safely drop it. Yes, it would take more time to find initial @MaxRecId but on other hand this would be either (S) locks or even no-locks at all if read uncommitted is acceptable. And in those cases we more concern about the locking instead of the execution time.

So let’s give it a try without the index.

As you can see, it runs more than 2 times faster. This solution and design are not always good. First of all, making CI on identity column on the large transactional table is bad idea by itself. But I saw a lot of systems and designs like that. In such case, this solution could benefit the purge process.

Code is available here.