Category Archives: SQL Server 2005

Referential Integrity (Part 1: Foreign Keys)

If you asked database developers about referential integrity, everybody would agree that it is a "must have" feature. Implementation, though, is a different question. There are quite a few ways to implement it and, of course, it's impossible to say which one is the right one. As usual, it depends.

I'd like to talk about referential integrity in general and discuss the pros and cons of different implementation methods. I'm not trying to cover every possible method; I will focus on a few of the most obvious ones.

So first of all, why is referential integrity good? There are quite a few obvious reasons:

  1. It keeps the data clean.
  2. It helps the query optimizer (in some cases).
  3. It helps to detect some bugs in the code at an early stage.

On the negative side, it always introduces some performance implications. In most cases those implications are minor, although sometimes you need to take them into consideration.

So let's take a look at the different implementation approaches. First, foreign key constraints – probably the best-known method. When you use this method, everything is automatic: SQL Server checks the existence of the master (referenced) row when a detail (referencing) row is inserted, and vice versa. Now let's think about the implications. Let's create the tables and populate them with some data.
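The original script is in the download; a minimal sketch of what it may look like (table and column names are my assumptions):

create table dbo.Orders
(
    OrderId int not null identity(1,1),
    OrderDate datetime not null,
    constraint PK_Orders primary key clustered(OrderId)
);

create table dbo.OrderItems
(
    OrderItemId int not null identity(1,1),
    OrderId int not null,
    Quantity int not null,
    constraint PK_OrderItems primary key clustered(OrderItemId)
);

-- populate the master table with some rows
insert into dbo.Orders(OrderDate)
select top 1000 getdate()
from sys.all_objects;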

Now let’s create the foreign key constraint.
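Assuming the schema above, the constraint could look like this:

alter table dbo.OrderItems
add constraint FK_OrderItems_Orders
foreign key(OrderId) references dbo.Orders(OrderId);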

Now let's enable "display actual execution plan" and insert data into the OrderItems table.
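For example (the values are arbitrary):

insert into dbo.OrderItems(OrderId, Quantity)
values(1, 5);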

Obviously SQL Server needs to make sure that the master row exists, so it adds a clustered index seek on the Orders table to the plan. Now let's try to delete a master row:
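Something along these lines:

delete from dbo.Orders
where OrderId = 100;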

Oops. A clustered index SCAN on the OrderItems table. This is the most common mistake database developers make – when you create a foreign key, SQL Server requires a unique index on the referenced (master) column, but it does not automatically create an index on the referencing (detail) column. Now imagine that you have millions of rows in the OrderItems table.

Let’s add the index and try again:
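A sketch of the missing index (the name is mine):

create nonclustered index IDX_OrderItems_OrderId
on dbo.OrderItems(OrderId);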

As you can see, a non-clustered index seek. Much better. So in the end, when you have foreign keys in the database, every deletion from the master table introduces an index seek in the detail table, and every insert into the detail table introduces an index seek in the master table. Plus, you need to maintain the index on the detail table. Most likely those issues will not be critical unless the system is really big and under heavy load, but you need to be aware of them.

Another limitation to be aware of is partition switch. You cannot switch a partition if the table is referenced by other tables. If that is the case, you need to implement one of the other approaches we will discuss next week.

Code is available for download

Sunday T-SQL Tip: Purging Data

Last week we discussed a few things to keep in mind when implementing a sliding window scenario. Unfortunately, table partitioning is available in SQL Server Enterprise Edition only. How should we purge transactional data with other editions of SQL Server? Obviously there is no silver bullet. At the same time, there is one particular scenario a lot of systems share – a transactional table with a clustered index on an identity column. Let's see how we can optimize the purge process in that case.

Assume we have a table with an identity ID column and a DateCreated datetime column, and that we need to purge the data based on that DateCreated column. It could happen daily, monthly – i.e. at some time interval. So let's create such a table, populate it with 1M rows and create an index on the DateCreated column.
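The original script is in the download; a sketch of the setup under those assumptions (names and data distribution are mine):

create table dbo.Data
(
    ID int not null identity(1,1),
    DateCreated datetime not null,
    Placeholder char(100) null,
    constraint PK_Data primary key clustered(ID)
);

create nonclustered index IDX_Data_DateCreated on dbo.Data(DateCreated);

-- populate with 1M rows, spreading DateCreated over roughly the last 100 days
insert into dbo.Data(DateCreated)
select top 1000000 dateadd(minute, -(row_number() over (order by (select null))) / 7, getdate())
from sys.all_objects as o1 cross join sys.all_objects as o2;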

Now let's try to purge the data a few times and see the results. Please ignore the begin tran/rollback – their only purpose is to preserve the data between test runs.
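The straightforward purge could look like this (the 60-day boundary is an arbitrary choice that removes roughly 40% of the rows in my sketch):

begin tran
    delete from dbo.Data
    where DateCreated < dateadd(day, -60, getdate())
rollback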

As you can see, purging about 40% of the rows takes about 3.5 seconds. There are two possible execution plans depending on the number of rows we need to delete – either a clustered index scan (our 40% case) or a non-clustered index seek (if the percentage of rows is smaller).

In any case, let's think about it in more detail. In both cases, SQL Server needs to process a lot of data and acquire/hold U/X locks for the duration of the execution. Not really good in terms of concurrency.

What can we do to improve it? With a table design like that, the DateCreated column increases the same way the identity column does. So instead of deleting data based on the DateCreated column (which most likely uses a non-clustered index seek), let's get the ID of the first row we want to keep (i.e. min(ID) for rows where DateCreated is greater than the purge boundary). Next, instead of deleting everything at once, we can delete data in batches (10-100K rows each) in individual transactions. That way, locks are held only for the duration of each batch.

Let's take a look. Again, please ignore the outer begin tran/rollback.
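A sketch of the batched approach (variable names, boundary and batch size are mine):

begin tran -- only to preserve the data between test runs
    declare @MaxRecId int

    -- the first row we want to keep; the only place where DateCreated is needed
    select @MaxRecId = min(ID)
    from dbo.Data
    where DateCreated >= dateadd(day, -60, getdate())

    while 1 = 1
    begin
        -- in the real process each batch runs in its own transaction,
        -- so locks are held only for the duration of the batch
        delete top (10000) from dbo.Data
        where ID < @MaxRecId

        if @@rowcount = 0
            break
    end
rollback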

As you can see, it runs about 10% faster, but more importantly, it reduces blocking in the table. Let's take a look at the plan.

Not surprisingly, it uses a clustered index seek/delete. But as you can see, a lot of time is spent on non-clustered index maintenance. And here is the interesting point: does the system really need the index on DateCreated for any purpose other than the purge? If the purge is the only consumer, we can safely drop it. Yes, it would take more time to find the initial @MaxRecId, but on the other hand that lookup takes only (S) locks, or even no locks at all if read uncommitted is acceptable. And in those cases we care more about locking than about execution time.

So let’s give it a try without the index.


As you can see, it runs more than two times faster. This solution and design are not always good – first of all, putting the clustered index on an identity column in a large transactional table is a questionable idea by itself. But I have seen a lot of systems designed like that, and in such cases this approach can benefit the purge process.

Code is available here.

SQL Server and Table Partitioning (Part 3: Implementing sliding window partitioning)

As we already discussed, implementing a sliding window scenario is one of the biggest benefits you can get from table partitioning. Technically this is quite a simple process – when the time comes, you have to do two different actions:

  1. Create a new partition for the future data.
  2. Switch the oldest partition out to a temporary table. After that you can either truncate or archive that data, possibly even switching the temporary table in as a new partition of a historical table.

There are quite a few resources that show how to achieve that from the technical standpoint – you can see an example on TechNet. I'm not going to focus on the implementation details; instead I will talk about a few less obvious aspects of the implementation.

Usually we implement partitions on a date basis: we want to purge data daily, monthly or yearly. The problem with this approach is that the partitioning column needs to be included in the clustered index (and, as a result, in every non-clustered index). That means an extra 4-8 bytes per row from the storage perspective and a less efficient index because of the bigger row size -> more data pages -> more IO. Obviously there are cases when nothing can be done and you have to live with it. Although, in a lot of cases, those transactional tables already have an auto-incremented column as part of the row and of the clustered index. In that case there is a great chance that the date column (the partitioning column candidate) increments the same way as the auto-incremented ID column. If so, you can think about partitioning by ID instead. This introduces a little more work – most likely you'll need to maintain another table with (Date, ID) correlations – although it pays off because of the smaller storage requirements and better performance. Of course, such an assumption cannot guarantee that a row always ends up in the right partition. If, for example, an insert is delayed, the ID could be generated after the split. But in most cases you can live with it.

The second thing is creating the new partition. Let's take a look at an example. First, let's create a partitioned table and populate it with some data.
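The original script is in the download; a minimal sketch under the same assumptions (object names are mine):

create partition function pfData(int)
as range right for values(100000);

create partition scheme psData
as partition pfData all to ([primary]);

create table dbo.Data
(
    ID int not null,
    Placeholder char(100) null,
    constraint PK_Data primary key clustered(ID) on psData(ID)
);

-- populate with 150,000 rows so that both partitions have data
insert into dbo.Data(ID)
select top 150000 row_number() over (order by (select null))
from sys.all_objects as o1 cross join sys.all_objects as o2;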

This table has two partitions – the left one with ID <= 100,000 and the right one with ID > 100,000. Let's assume we want to create another partition for future data (let's use ID = 200,000 as the split point).
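The split itself could look like this:

set statistics io on

alter partition scheme psData next used [primary];
alter partition function pfData() split range(200000);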

At this point SQL Server needs to find out whether some data has to be moved to the new partition. As you can see, it produces a lot of reads. Now let's try to create another partition with 300,000 as the split point.

As you can see – no IO in that case. So the rule is to always keep one additional empty partition reserved. If you decide to partition by ID as we already discussed, make sure that:

  1. You reserve a wide enough ID range in the second right-most partition to cover the whole time period.
  2. If ID is an identity column, insert a dummy row with identity insert enabled after the split in order to start placing new data in the new partition (see the sketch below). Obviously you'll end up with holes in the sequence.
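A minimal sketch of that trick, assuming the table from the example above and that ID is defined as an identity column in the real table:

-- after splitting at 200,000, bump the identity so that new rows land in the new partition
set identity_insert dbo.Data on;
insert into dbo.Data(ID, Placeholder) values(200000, 'Dummy row');
set identity_insert dbo.Data off;

-- the dummy row can be removed; the identity value continues from 200,000
delete from dbo.Data where ID = 200000;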

The last thing I'd like to mention is locking. A partition switch is a metadata operation, but it requires a schema modification (Sch-M) lock. This lock is not compatible with the schema stability (Sch-S) locks SQL Server issues during query compilation/execution. If you have an OLTP system with a lot of concurrent activity, make sure the code around the partition switch/split handles lock timeouts correctly.

Source code is available here

Also, check this post – it could give you some tips on how to decrease the possibility of deadlocks during partition/metadata operations.

SQL Server and Table Partitioning (Part 2: When partitioning is your enemy)

Last week we started to discuss table partitioning and covered a few cases when partitioning is really useful. Today I want to focus on the downsides.

First, the column used in the partition function needs to be included in the clustered index (and, as a result, in all non-clustered indexes). Unless this column is ALREADY part of the clustered index, you're increasing the non-clustered index row size. And think about it – you typically partition a large table with millions of rows (what is the point of partitioning small tables?). With that number of rows, index size matters, from both the storage perspective and index performance.

Another interesting thing: unless you have the rare case where you don't want an index to be aligned, the partitioning column has to be included in the index definition. That does not increase the row size (the clustered index values are part of the non-clustered index rows anyway), but it can break the indexes you're using to enforce uniqueness of the data.

Next, you cannot do a partition switch if the table is replicated in SQL Server 2005. Fortunately, SQL Server 2008 addresses this issue.

Next, you cannot do a partition switch if the table is referenced by other tables. Typically this is not an issue for large transactional tables, but still – it can happen.

And now let's look at the worst one. When an index is aligned, its data is distributed across multiple partitions, and the regular TOP operator does not work efficiently anymore. Let's see it in an example. First, let's create the table and populate it with some data.
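The original script is in the download; a sketch that matches the column names used later in the post (everything else is an assumption):

create table dbo.Data
(
    ID int not null identity(1,1),
    DateCreated datetime not null,
    DateModified datetime not null,
    Placeholder char(500) null
);

create unique clustered index IDX_Data_ID on dbo.Data(ID);

-- the index the demo relies on: sorted by DateModified and ID
create nonclustered index IDX_Data_DateModified_ID on dbo.Data(DateModified, ID);

-- populate with 250,000 rows; dates are spread over the first half of 2010
insert into dbo.Data(DateCreated, DateModified)
select dateadd(minute, ID, '2010-01-01'), dateadd(minute, ID, '2010-01-01')
from
(
    select top 250000 row_number() over (order by (select null)) as ID
    from sys.all_objects as o1 cross join sys.all_objects as o2
) as IDs;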

Now let's run a select that gets the TOP 100 rows:
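For example (the predicate value is arbitrary):

select top 100 ID, DateModified
from dbo.Data
where DateModified > '2010-06-01'
order by DateModified, ID;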

As you can see, the plan is straightforward – TOP and INDEX SEEK.

Now let’s partition that table by DateCreated.
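A sketch of the partitioning step, following the assumptions above (boundary dates are arbitrary):

create partition function pfDateCreated(datetime)
as range right for values('2010-03-01', '2010-05-01');

create partition scheme psDateCreated
as partition pfDateCreated all to ([primary]);

-- recreate both indexes as aligned (partitioned) indexes;
-- DateCreated becomes the third column of the non-clustered index
create unique clustered index IDX_Data_ID on dbo.Data(ID, DateCreated)
with (drop_existing = on) on psDateCreated(DateCreated);

create nonclustered index IDX_Data_DateModified_ID on dbo.Data(DateModified, ID, DateCreated)
with (drop_existing = on) on psDateCreated(DateCreated);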

And run the same select.

Surprisingly, it uses a clustered index scan. Let's try to force SQL Server to use the index:
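For example, with an index hint:

select top 100 ID, DateModified
from dbo.Data with (index = IDX_Data_DateModified_ID)
where DateModified > '2010-06-01'
order by DateModified, ID;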

As you can see, it introduces 100 times more reads. Let's think about it for a minute. Why does this happen?

Think about the first example with the non-partitioned table. The index data is sorted by DateModified and ID.

So when you run a select like the one above, SQL Server only needs to locate the first row in the index and scan the index forward for the next 100 rows. Sweet and simple.

Now let's think about the partitioned case. We partition the index by DateCreated – the third column in the index. This column controls which partition an index row is placed in, and the index is sorted by DateModified and ID only within each partition. See the example below – the row with DateModified = 2010-01-02 can end up in the first partition because its placement is controlled by DateCreated.

Now let's think about how SQL Server runs the select if it uses the index. For each partition of the table it finds the first row that matches the predicates and scans forward through the entire partition up to the end. Next, it merges the results from all partitions, sorts them and only then selects the TOP N rows. As you can see, this introduces a huge performance hit. I wish SQL Server were smart enough to scan only N rows from each partition, but unfortunately that's not the case – it scans everything.

Unfortunately there is no simple workaround. This is something you need to live with.

Update (2011-12-18) – Well, there is a workaround that works in some cases. Check it out here: http://aboutsqlserver.com/2011/12/18/sunday-t-sql-tip-select-top-n-using-aligned-non-clustered-index-on-partitioned-table/

Update (2012-03-11) – Another, cleaner and more generic workaround – http://aboutsqlserver.com/2012/03/11/sunday-t-sql-tip-aligned-non-clustered-indexes-on-partitioned-table-again/

Source code is available for download.

Next week we will talk about the sliding window implementation.

SQL Server Table Partitioning (Part 1: When partitioning is your friend)

If you're a lucky user of SQL Server 2005/2008 Enterprise Edition, there is a good chance that you already have table partitioning implemented. Granted, it is one of the best new features introduced in SQL Server 2005, and there was a lot of buzz when it was released. At the same time, it introduces a few issues and limitations you need to be aware of before implementing it.

I'd like to talk about it in more detail. There will be four parts in this series. Today I'm going to be positive and show a few examples where partitioning benefits you. Next time I'll show when partitioning is not the best candidate and demonstrate a few interesting side effects you can run into. After that, I'll talk about a specific implementation case – the sliding window – and show a couple of tricks that can make it more efficient. And finally, I'll focus on a few other techniques that can replace partitioning if you're using other editions of SQL Server.

So what is table partitioning? There are tons of documentation and white papers (http://msdn.microsoft.com/en-us/library/ms345146%28SQL.90%29.aspx) that can give you a lot of implementation-level details, but let's keep it simple. Partitioning is internal storage separation for subsets of the data in an entity. Think about it as one "virtual" table backed by multiple "internal" tables that store and combine the data seamlessly for the client. Again, I'm not going to focus on implementation and internal details – let's think about it from the logical standpoint.

Let's think about specific functional use cases where it's beneficial.

  1. Sliding window scenario. Assume you have a production database and you need to keep one year's worth of data in a table. Every first day of the month you want to purge the data that is more than a year old. Without partitioning, you have to use a delete statement, which is time consuming and introduces extensive locking in the database (there are a couple of ways to avoid that, though). With partitioning, you simply switch the oldest partition out to a temporary table and truncate it. This is a metadata operation that takes milliseconds. We will talk about it in more detail later.
  2. Production/historical tables pattern. Think about the previous example when you need to move data to a historical table in the same database. It is a similar approach – you essentially move the partition from the production table to the historical one.
  3. Bulk data load. Sometimes it works in the opposite direction: the data is loaded into a temporary table and then that table is switched in as a new partition. This minimizes locking during the load. It has limited use, though – only when the system loads data rarely. You don't want to do it on a daily basis and end up with hundreds of partitions.

The biggest benefit, though, is manageability. Different partitions can reside on different filegroups, which means you can:

  1. Put old, rarely accessed data on slower drives.
  2. Make historical filegroups read-only (this helps with the backup strategy and improves performance because SQL Server does not need to obtain locks).
  3. Do a piecemeal restore, and so on.
  4. Do index maintenance operations at the partition level. You cannot do an online rebuild of a single partition, though.

Performance, though, is an interesting question. SQL Server (especially 2008) has improved algorithms for parallel execution on partitioned tables. This is definitely a good thing for reporting/data warehouse type systems. As for OLTP, it should not matter much – if an OLTP system constantly accesses data from multiple partitions in one statement, most likely there are some issues in the system. At the same time, there are some performance implications I'll show you next time.

So partitioning is good when you actually need partitioning. Otherwise it makes the situation worse. We will talk about some performance implications next time.

‘A’ = N’A’ or how to kill the index seek

Let's start with an example. Create a table, populate it and create an index on a varchar column.
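The script is in the download; a minimal sketch (table and column names are mine):

create table dbo.Data
(
    ID int not null identity(1,1),
    VarcharCol varchar(48) not null,
    Placeholder char(100) null,
    constraint PK_Data primary key clustered(ID)
);

create nonclustered index IDX_Data_VarcharCol on dbo.Data(VarcharCol);

-- populate with 100,000 rows; the values are simply the row numbers as strings
insert into dbo.Data(VarcharCol)
select top 100000 convert(varchar(48), row_number() over (order by (select null)))
from sys.all_objects as o1 cross join sys.all_objects as o2;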


Next, let's run two selects. The first one uses a varchar parameter. The plan is perfect.
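Something like this:

declare @VarcharParam varchar(5)
set @VarcharParam = '10000'

select ID, VarcharCol
from dbo.Data
where VarcharCol = @VarcharParam;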

Second, let's use a unicode (nvarchar) parameter.
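The same query, but with an nvarchar parameter:

declare @NVarcharParam nvarchar(5)
set @NVarcharParam = N'10000'

select ID, VarcharCol
from dbo.Data
where VarcharCol = @NVarcharParam;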

Oops – an index scan. It does not look like a big deal – who is going to use a unicode parameter to search a varchar field? But…

We deal with client applications. Unfortunately, modern development languages implement strings as unicode, and if you don't specify the parameter type, it will be sent as unicode.

Let’s take a look:
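The original post shows a .Net snippet here. As a T-SQL approximation of what such a client sends when the parameter type is left unspecified (the parameter name @UntypedParam comes from the post; everything else is an assumption):

-- what the client effectively executes when the parameter is left untyped
exec sp_executesql
    N'select ID, VarcharCol from dbo.Data where VarcharCol = @UntypedParam',
    N'@UntypedParam nvarchar(5)',
    @UntypedParam = N'10000';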

As you can see, the .Net application defines @UntypedParam as nvarchar(5), and it produces 21 times more reads. Another thing worth mentioning – the size of the parameter matches the string length. As a result, if the next statement uses "100" as the parameter value, it will be nvarchar(3) – a new compilation, potentially an additional plan in the cache and so on.

Keep that in mind – always specify the parameter type and size in the client application.

Script is available for download here. Source code is available here

Sunday T-SQL Tip: Equality predicate on large varchar column

Sometimes you need to run a query with an equality predicate on a large text column. It does not happen often, but it happens. Unfortunately, sometimes it's impossible or impractical to create an index on such a column. First, the size of the column could be greater than 900 bytes (so the index cannot be created). Another possibility – you have a lot of data and simply don't want to spend a lot of space on such an index.

How do we optimize the query in such a case? Let's take a look at an example. First, let's create the table and add some data. As you can see, LargeField has a size of 1080 bytes.
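The script is in the download; a sketch under those assumptions (names and row count are mine):

create table dbo.Data
(
    ID int not null identity(1,1),
    LargeField varchar(1080) not null,
    constraint PK_Data primary key clustered(ID)
);

-- populate with 65,000 rows; every value is unique and close to the maximum length
insert into dbo.Data(LargeField)
select top 65000 convert(varchar(16), row_number() over (order by (select null))) + replicate('a', 1000)
from sys.all_objects as o1 cross join sys.all_objects as o2;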

Now let's run the select. It uses a clustered index scan – 9150 reads, quite bad.

An index on LargeField cannot be created – 900 bytes is the limit for an index key.

The solution is a persisted computed column that uses the CHECKSUM() function to generate a hash of the string. Let's take a look:
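A minimal sketch of the technique (the column and index names are mine):

-- persisted computed column with the hash, plus an index on it
alter table dbo.Data add LargeFieldHash as checksum(LargeField) persisted;
create nonclustered index IDX_Data_LargeFieldHash on dbo.Data(LargeFieldHash);

-- the query keeps the predicate on LargeField because hash collisions are possible
declare @Param varchar(1080)
set @Param = '123' + replicate('a', 1000)

select ID
from dbo.Data
where LargeFieldHash = checksum(@Param) and LargeField = @Param;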


There are two things to keep in mind. First, the hash is not guaranteed to be unique, so the statement still needs to include the predicate on the original field. Second, unicode and regular strings obviously return different hashes. This is generally not a problem, but make sure the client application specifies the correct parameter data type.

Source code is available for download

Sunday T-SQL Tip: Insert/Update pattern

Let's think about a system that collects and processes data. Quite often processing happens after the data has been inserted into the system. Typically the rows have a set of columns that are inserted as NULL; the processing routine loads the rows, updates those columns and saves the rows back.

It looks simple, but this pattern introduces major page split/fragmentation issues. Think about it – SQL Server tries to fill pages up to 100%, so when you update the rows, the pages don't have enough free space to hold the new data. Let's look at an example:

First, let's create the table. There are only three columns in this table – an identity, a char column as a placeholder for the columns populated during the insert stage, and a post-processing placeholder.
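A sketch of such a table (names and sizes are mine):

create table dbo.Data
(
    ID int not null identity(1,1),
    Placeholder char(100) not null,
    PostProcPlaceholder varchar(255) null,
    constraint PK_Data primary key clustered(ID)
);

-- insert stage: the post-processing column is left NULL
insert into dbo.Data(Placeholder)
select top 65536 'Initial data'
from sys.all_objects as o1 cross join sys.all_objects as o2;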

Second, let's check the physical index statistics. The index is practically perfect.
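For example:

select index_id, alloc_unit_type_desc, page_count,
    avg_fragmentation_in_percent, avg_page_space_used_in_percent
from sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.Data'), null, null, 'DETAILED');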

Now let's execute the update and populate the post-processing placeholder column, then check the statistics again – terrible fragmentation and a lot of wasted space.

Obviously the workaround is simple – reserve the space for the post-processing columns at the insert stage.
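A sketch of the workaround under the same assumptions – fill the post-processing column with a dummy value of the expected size at the insert stage and overwrite it later:

insert into dbo.Data(Placeholder, PostProcPlaceholder)
select top 65536 'Initial data', replicate(' ', 255)
from sys.all_objects as o1 cross join sys.all_objects as o2;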

Source code is available here

Sunday T-SQL Tip: Union vs. Union All

What could be simpler than union? Probably “union all”.

"Union" removes all duplicates from the combined row set; "union all" simply concatenates the row sets. Obviously, if the original row sets are unique, the results are the same. But the plans are not.

Take a look. Let's create two tables with some data.
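A sketch of the setup (names and row counts are mine; the two row sets do not overlap):

create table dbo.Data1
(
    ID int not null,
    Value varchar(32) not null,
    constraint PK_Data1 primary key clustered(ID)
);

create table dbo.Data2
(
    ID int not null,
    Value varchar(32) not null,
    constraint PK_Data2 primary key clustered(ID)
);

insert into dbo.Data1(ID, Value)
select top 10000 row_number() over (order by (select null)), 'Data1'
from sys.all_objects as o1 cross join sys.all_objects as o2;

insert into dbo.Data2(ID, Value)
select top 10000 100000 + row_number() over (order by (select null)), 'Data2'
from sys.all_objects as o1 cross join sys.all_objects as o2;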


Now let’s do the union and union all and see the plans.
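Assuming the tables above:

select ID, Value from dbo.Data1
union
select ID, Value from dbo.Data2;

select ID, Value from dbo.Data1
union all
select ID, Value from dbo.Data2;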

Union introduces an expensive "Distinct Sort" operator. As a result, the execution plan with union is almost two times more expensive than the one with union all. Of course, your mileage may vary – in large and complex queries the higher cost of the union could be hidden, but in any case it's more expensive. So don't use it if the row sets cannot have duplicates.

Speaking of duplicates, there is another thing to keep in mind. When union removes duplicates from the combined row sets, it does not matter where the duplicates come from, so you don't need to use distinct in the individual selects. And it affects the plan too. Look here:
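For example, the distinct below is redundant – union removes duplicates across the combined set anyway:

select distinct ID, Value from dbo.Data1
union
select distinct ID, Value from dbo.Data2;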


Source code is available for download from here

SQL Server Statistics – update and management #2

Last time we saw when SQL Server creates statistics and what database properties affect the statistics update behavior. Today let's talk about what triggers a statistics update and how statistics can be updated manually.

As we know, statistics are based on the table data, and obviously they become outdated when the underlying data changes. As you can guess, SQL Server tracks the number of modifications in the table and updates the statistics when the number of modifications reaches a threshold. Let's see what that threshold is. Of course, auto update statistics needs to be enabled at the database level.

For regular tables, an update is triggered:

  1. When the row count changes from 0 to non-zero
  2. When the table has fewer than 500 rows and there were more than 500 changes on the leading column of the statistics
  3. When the table has more than 500 rows and there were more than 500 + 20% of the number of rows changes on the leading column of the statistics

For temporary tables there is one additional threshold at 6 rows. This behavior can be altered with the OPTION (KEEP PLAN) query hint.

As a side note, table variables do not keep statistics at all, which can produce quite inefficient plans in some cases. On the other hand, statistics updates on temporary tables can introduce an extensive number of recompilations. If that is the case, it can be addressed with the OPTION (KEEPFIXED PLAN) query hint, which prevents query recompilation based on statistics updates.

Statistics are also updated during an index rebuild. They are not updated during an index reorganize operation, though.

A few more methods:

  1. The UPDATE STATISTICS statement allows you to update either a single statistics object or all statistics on the table. You can control the sampling method there (see the sketch after this list).
  2. sp_updatestats updates all statistics in the database.
  3. A statistics update can also be done via a maintenance plan.
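A quick sketch of the first two options (the table and index names are hypothetical):

-- update a single statistics object (here, the one behind an index) with a full scan
update statistics dbo.Data IDX_Data_DateCreated with fullscan;

-- update all statistics on the table using the default sampling
update statistics dbo.Data;

-- update all statistics in the database
exec sp_updatestats;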

As I mentioned before, in most cases the default behavior is fine, although it's useful to understand how it works internally.