Category Archives: SQL Server 2005

Cautionary tale about triggers, version store and fragmentation

A while ago triggers were slow. Very slow. In the versions prior to SQL Server 2005, triggers scanned transaction log in order to build internal inserted and deleted tables. That approach has been changed in SQL Server 2005 when triggers started to use version store – similarly to what optimistic locking is using. That helps with performance but same time introduces a few interesting issues. Let’s take a look.

First of all, let’s create a table and populate it with some data. I would like to point that I’m using tempdb in that example mainly to show that this behavior is completely independent from transaction isolation levels and optimistic locking database options.

Now let’s take a look at what we have in the index (click on the image to open it in the different tab)

As you can see, we have 65 bytes rows without any fragmentation. So far so good. Now let’s update Value column.

We updated fixed-width column. Row size has not changed. No fragmentation here. As the next step let’s create a trigger.

As you see, trigger itself is empty – we don’t even access inserted/deleted tables there. Now let’s update our table again

Still the same thing. But if we look at performance monitor counters – we can see that update now generates version store activity.

As the next step I’d like to add LOB column to the table. I’m using varchar(max) but you can replace it with nvarchar, binary, xml or clr data type. It would be the same.


Column is nullable so alteration here is pure metadata operation. And now let’s run the update statement.

Oops – now we have 14 bytes version store pointer added to the row. That introduces heavy fragmentation (new versions are larger so they don’t fit into the pages) and double our storage space. Well, a little bit unexpected development. Generally speaking this is very similar problem with what we have with data modifications when we use optimistic isolation levels.

OK, it happens with LOB columns. Now let’s check the case when ROW OVERFLOW is possible. Let’s drop LOB column and add to varchar(5000) columns here.


Again, we rebuilt the index – 14 bytes pointers are went away. Now let’s run update statement again.

Same thing – extra 14 bytes and fragmentation.

And let’s prove that trigger is the problem. First, let’s drop the trigger and rebuild the index

And next – run update again.

As you can there are no 14 bytes version store pointer nor fragmentation.

At last, let’s take a look at the very interesting thing – let’s see what happens when we have ON DELETE trigger. Generally speaking, when we delete the row, only thing that SQL Server initially does is mark row as the “ghost” row. Just the change in the header. But with the trigger the situation could be different.

You can see, that in our case we ended up with page split on delete!

This behavior adds extra arguments to the discussion why triggers are bad in general. Triggers add extra load to tempdb. And in case if you have the possibility of row overflow or LOB allocations, you’d end up with 14 bytes version store pointer and extra fragmentation.

Source code is available for download

P.S. I’d like to thank Paul White for his help with that post.

Locking in Microsoft SQL Server (Part 14) – Deadlocks during DDL operations (alteration, partition switch, etc)

Today I’d like us to talk about one practical problem related with the locking – how to deal with deadlocks during DDL operations. For example, table and partition functions alteration, partition switch and others. This problem is very common in OLTP environment when table is constantly querying by the multiple sessions. One of the common examples there is sliding window pattern when the table has been partitioned based on some time interval (daily, weekly, monthly, etc) and data has been purged on the regular basis by switching partition to the temporary table. Let’s try to understand why we have deadlocks here and what we can do to solve the problem.

First of all, as the pre-requirement, let’s assume that we implemented our partitioning correctly. Most important factor is that our operation does not require any data movement nor scans. For example, we don’t split ranges in partition function in the way that requires  physically move the data to another partition (e.g. we are pre-allocating empty partition rather than splitting the range with the value at the middle of the partition that has some data). In that case partition operations are purely metadata operations and only schema locks are involved.

Another thing to avoid is mixing DML and DDL statements in one transaction. Let’s think about following example (let’s ignore page locks and schema stability (SCH-S) locks for simplicity sake):

  1. Session 1 starts transaction and updates 1 row in the table. E.g. at this stage we will have (X) lock on the row and (IX) lock on the table
  2. Session 2 is trying to select data from the table and being blocked by Session 1. So now we have Session 2 waiting for (S) lock on the row and also have (IS) lock acquired on the table
  3. Now session 1 is trying to implement partition switch. For that operation (SCH-M) lock needs to be acquired on the table and it would be blocked by (IS) lock from the Session 2.

Classic deadlock. But let’s assume that this is not the case. What happens when we have multiple locks on the same object? Generally speaking, lock requests would be serialized. In some cases existing locks could be even re-used. Let’s see that. First, let’s create a table and add a few records there.  Next, let’s have session 1 to read one row in repeatable read mode. As we remember it would hold (S) lock on the resource. Again, for simplicity sake let’s look at the row (key) level locks only

Now let’s try to update the row in another session. This session will try to acquire (X) lock on the resource and would be blocked because (X) lock is not compatible with (S) lock.

Now let’s try to read the same row in read committed mode from the 3rd session. As we can see everything works without blocking.

Let’s try to look at what locks were acquired using SQL Profiler.

As you see session acquired intent (IS) locks but did not acquire shared (S) lock. There is (S) lock from the session 1 already which guarantees that row has not been modified by uncommitted transactions. No reasons to acquire another (S) lock because in read committed isolation level (S) locks on the rows have been released immediately.

But what will happen if we try to read the row in repeatable read mode again? In that isolation level session needs to keep the (S) lock until end of transaction and it’s a reason why it’s trying to acquire the lock.

As you can see – it’s being blocked because there is (X) lock in the queue. So in order to be granted lock needs to be compatible with the all locks on that resource – does not matter granted or now.

It’s worth to mention that the first scenario, when session 3 ran in read committed mode and did not acquire the lock on the resource, can be considered as internal optimization. In some cases SQL Server still acquires another shared (S) lock there even if there is another shared (S) lock held. In such case the query would be blocked similarly to repeatable read transaction isolation level.

Keeping all this in mind we can ask – why would we have deadlocks if locks are serialized? Would not the session with schema modification (sch-m) lock wait till other locks are released and proceed from there? Yes and no. Let’s put one other thing to the mix – lock partitioning. This is quite confusing term and does not have anything to do with table partitioning. In nutshells when server has 16 or more CPUs (think about dual quad-core with hyperthreading enabled) SQL Server starts to partition intent (IS, IX, IU) and schema stability (SCH-S) object locks on per-cpu basis. Those (intent and schema-stability) locks would be acquired and stored within 1 (or a few) lock partitions only. And when another session needs to acquire S, X, U or SCH-M locks on the object, those locks would have to be acquired across all lock partitions. As I mentioned, this behavior is on when system has 16 or more CPUs and cannot be
disabled. So let’s take a look at one of the scenarios:

  1. Session 1 executes the complex query that acquired (IS) and/or (SCH-S) lock on the lock partition = 5. (just an example)
  2. Session 2 tries to alter the table and starts to acquire (SCH-M) locks on each individual lock partition. It successfully acquired locks on partitions 1-4 and blocked on partition 5 by (IS) or (SCH-S) lock held by Session 1
  3. Session 1 tries to acquire another (IS) and/or (SCH-S) lock on lock partition = 2. It’s blocked by (SCH-M) lock held by Session 2.

Classic deadlock again.

Unfortunately there is very little we can do about it. Lock partitioning cannot be disabled with documented approaches. There is the undocumented trace flag T1229 that does the trick although using undocumented trace flags are dangerous and not recommended in production.

In case if you have dedicated data access tier (which is always the great idea) the good option would be serializing access to the table via application locks. Main idea of this method is that any code that accessing the object is acquiring application locks. DDL code acquires exclusive lock and DML code acquires shared lock. Locks would not be partitioned and as result requests would be serialized. Let’s take a look. First – let’s define the stored procedure that alter the metadata.

This is just an example. Obviously you need to think about timeouts, exception handling, retry logic and other things – but hopefully it would give you an idea.

Now let’s try to create the stored procedure that reads the data from the table. But first, we need to define what should we do in case if we are blocked and have timeouts. Of course, we can return the error code to the client but another method that can be useful in some cases is to return empty result set. In order to do that let’s create another (empty) table of the same structure with original one.

Now let’s define the stored procedure.

As you can see, if SP cannot obtain application lock within 3 seconds, it reads data from the empty table and basically returns empty result set to the client. Otherwise it would read the main table. Obviously the biggest “downside” of this method – it would not work well in the case if there is no dedicated data access tier in place and object accessed from the various places. Well, in such case old classic approach with try/catch and retry logic is your friend. And, of course, we can use SET DEADLOCK_PRIORITY to reduce the chance that the session with DDL statement would be chosen as the deadlock victim.

Source code is available for download 

Next: When Transaction Start

Table of content

Statement level recompilation with OPTION (RECOMPILE)

Today I’d like us to talk about statement level recompilation and how it could help us in a few particular scenarios. Such as parameter sniffing, search with optional parameters and filtered indexes. But first of all, let’s talk why do we need to worry about that at all.

Let’s look at extremely oversimplified picture. When we submit the query to SQL Server, one of the first things SQL Server is doing is compiling the query. Compilation itself is not cheap, especially in case of complex queries. As result, SQL Server tries to cache execution plan and reuse it when the same query runs again. The interesting thing happens in case if query has  parameters. SQL Server is looking at actual parameter values during compilation/recompilation stage and using them for cardinality estimations. This called parameter sniffing. As result, the plan (which will be cached) would be optimal for specific set of parameters provided during compilation stage. Next time, when SQL Server reuses the cached plan, there is the chance that plan would not be good for another set of values.

Let’s look at specific example. Let’s think about company that is doing some business internationally although most part of the customers are in USA. Let’s create the table and populate it with some data with ~99% of the customers in US. In addition to that let’s create an index on Country column. Images below are clickable.

Now let’s execute the query that selects customers for particular country. Most part of the client libraries would generate code like that.

When we run the first statement with @Country = ‘Canada’, SQL Server chooses to use non-clustered index on Country and perform key lookup. This makes sense – we have only ~1% of canadian customers – value is selective enough. But the problem is that plan has been cached. And when we run select for @Country = ‘USA’, it re-uses the same plan which is extremely inefficient. Don’t be confused by query cost – let’s look at Statistics IO:

Just to prove inefficiency – let’s run the query again but use constant instead of parameter. You can see that SQL Server chooses clustered index scan which introduces ~50 times less logical reads in compare with original non-clustered index seek.

The same thing happens if we use stored procedures.

Now we can see the opposite effect – when we run this SP with @Country = ‘USA’, it generates clustered index scan. And reuses the same plan for Canadian customers. And this is real problem if we have data that distributed unevenly. Something forces recompilation and if we are “lucky enough” to have first call with untypical set of parameters – the inefficient plan would be cached and used for all queries.

One of the ways to solve the problem is to use statement level recompile. If we add OPTION (RECOMPILE) to our code, SQL Server would not bother to cache the plan – it would recompile it every time optimizing it for the current set of parameters. Let’s take a look:

Obviously the downside of that particular method would be recompilation cost. Recompilation would happen every time the query executes.

Another area when statement level recompilation could be quite useful is the case when we need to select data based on optional (or dynamic) parameter set. I’ve already blogged about it a long time ago and today want to show you another way of doing that with statement level recompilation. But let’s first take a look at the problem:

As we can see, because plan is cached SQL Server is unable to generate the plan that would be valid regardless of parameter value. So clustered index scan is the option here. Now, if we add OPTION (RECOMPILE) to the statement, it would change the picture – as you see, it generates optimal plan in every case.

Is it better than the method with dynamic SQL I demonstrated in the old post – it depends. One of the downsides I have with recompilation is that plan would not be cached (which is, of course, expected) and would not be present in results of sys.dm_exec_query_stats DMV. I’m using this view all the time during performance tuning to find most expensive queries. Of course, I can catch those statements with other tools but that is less convinient for me.

Last example I’d like to show you related to filtered indexes. Let’s assume that our table hosts Customer data for both, Businesses and Consumers. We can have specific columns that belong only to specific category – granted this is questionable design but I saw it more than a few times. Systems like that could have different reports for different types of the customers and one of the ways to optimize those reports is to create covered filtered indexes based on customer type. Something like that:

But what if we want to have the shared code – assuming we would like to create SP that returns us customer by type and by name:

If we look at the filtered indexes we have – those are would be the perfect match for index seek. The problem is that SQL Server cannot generate and cache plan that relies on the filtered index. Choice of the index would depend on the @CustomerType parameter. And recompilation could help us here:

It’s questionable if OPTION (RECOMPILE) here is better than IF statement. But the point I want to make is if you’re using filtered indexes and filter value is provided as parameter, you should avoid plan caching if you want filtered indexes to be used.

Source code is available for download

Locking in Microsoft SQL Server (Part 15) – When transaction starts

What is the right answer to the question: “When does my explicit transaction start”? The answer looks obvious – when I run “begin tran” statement. Well, not exactly. Yes, BEGIN TRAN statement marks the logical point when you start the transaction in your code. But does it do anything? Let’s see.

First, let’s create the test database for simplicity sake. Next, let’s create the table in this database and after that look at transaction log. We will use non-documented but well known function fn_dblog. Again, for simplicity let’s not filter any content – just see entire log there sorted by LSN in descending order together with record count.

As we can see, we have 445 records there and the last one ends on e3:000b. Now let’s open another session and run BEGIN TRAN there. Even with PRINT statement.

Now let’s look at transaction log again. We can see – nothing changes – still 445 records with the last one ends on e3:000b.

Now let’s insert one row there.

And now we finally have new entries in transaction log.

25 more records. And the first one inserted after e3:000b is LOP_BEGIN_XACT which corresponds to begin transaction. Now let’s commit the transaction and look at Books OnlineAlthough BEGIN TRANSACTION starts a local transaction, it is not recorded in the transaction log until the application subsequently performs an action that must be recorded in the log, such as executing an INSERT, UPDATE, or DELETE statement. An application can perform actions such as acquiring locks to protect the transaction isolation level of SELECT statements, but nothing is recorded in the log until the application performs a modification action.

And that’s what we just saw. Well, that’s interesting (and documented) behavior but there is another very interesting aspect related with SNAPSHOT transactions. As we know, SNAPSHOT isolation level provides transaction level consistency. The key point – what is the starting point for consistency. Let’s take a look. First, let’s enable snapshot isolation level on database level.

Next, let’s create another table and insert 1 row there.

Now let’s start transaction and look at DMV that shows us active snapshot transactions

As we can see – nothing. Now let’s open another session and insert another row to this table

Now let’s come back to the first session where we have active transaction and select from that table.

As you see – we have 2 rows. And if we look at active snapshot transactions again, we would have our transaction listed.


Now, let’s come back to the second session and delete all data

In snapshot transaction we still see 2 rows in that table.

So that’s interesting – we have transactional level consistency in snapshot transaction but that consistency starts at the point when we access the data for the first time, not at the moment of BEGIN TRAN statement. We does not necessarily need to access the same table – just any data to have our LSN for snapshot transaction logged. Let’s see that. As we remember, TEST2 table is empty now. Again, commit transaction first.

Let’s run transaction again and read data from another table (TEST):

As we can see, transaction is enlisted already. Now let’s go to the second session and add row to TEST2 table.

And now come back to our original session and select from TEST2.

No rows. As it happened to be at the moment when we ran select against another table.

This is interesting behavior that has not been fully documented. If we need to have consistency started from the moment of BEGIN TRAN, we have to access the data immediately after we issued this statement. Keep in mind though that from that moment SQL Server keeps all old versions of the rows from all tables in the version store. So long-running snapshot transactions increase the load and size of TEMPDB.

Source code is available for download.

Next: Monitoring Blocked Processes with Event Notifications

Table of content:

Locking in Microsoft SQL Server (Part 13 – Schema locks)

We did not talk much about schema locks while back when we were discussing locking. Those locks are a little bit different than other locks – they are acquired on objects (and metadata) level and protecting the metadata. Think about it from the following prospective – you don’t want to have 2 sessions altering the table simultaneously. And you don’t want the table being dropped when you are selecting data from there.

There are 2 types of schema locks. SCH-S – schema stability locks. Those locks are kind of “shared” locks and acquired by DML statements and held for duration of the statement. They are compatible with each other and with other lock types (S, U, X, I*). The caveat is that those locks acquired regardless of transaction isolation level – so you’d have those locks even when your queries are running in read uncommitted or snapshot isolation levels. Let’s take a look.

First, let’s create the table and populate it with some data. You, perhaps, need to adjust number of rows in the table based on your hardware.

Second, let’s run long-running select in read uncommitted mode (I even use NOLOCK hint to make it a little bit more “visible”).

And while it’s running, let’s run the statement that shows currently acquired locks (click on the image to open it in the new window). That statement is basically using sys.dm_tran_locks DMV – I just added a few other things to make it easier to see/understand. You can get it in the script file (see the link below).

And here are the results:

As you can see, even if statement uses READ UNCOMMITTED transaction isolation level, we still have SCH-S acquired.

Second type of the lock is schema modification lock – SCH-M. This lock type is acquired by sessions that are altering the metadata and live for duration of transaction. This lock can be described as super-exclusive lock and it’s incompatible with any other lock types including intent locks. Let’s take a look. Let’s run the previous select again and in another session let’s try to alter the table – add new column.

Even if adding nullable column is pure metadata operation, as we can see below it’s blocked.

I’m pretty sure you saw that behavior when you tried to update metadata in production on the live system 🙂 And interesting thing, that while you have SCH-M lock waiting, you can easily get other SCH-S locks acquired on the same object without any problems. Let’s take a look – again, let’s run first select, next alter table statement, and finally run the third select (I renamed CTE to CTE2 here):

As you see, there are 2 granted SCH-S locks and one SCH-M lock request waiting for both sessions. This is a reason why your DDL statement can wait for quite a long time on the system under heavy load.

The biggest possible issue with schema modification lock is that it’s incompatible even with intent locks. Let’s take a look – same scenario as before, but let’s run INSERT statement in the third session.

And let’s check the locks.

Insert statement is trying to place intent lock (IX) on the table and it’s incompatible with SCH-M lock. So insert is blocked. Kind of interesting situation when select in read uncommitted isolation level blocks insert. And if we think about it, there are more operations counted as metadata modifications than we think. For example, index rebuild would do exactly the same thing. And of course partitioning related operation.

If we think about SCH-M locks, they are acquired on the object level – deadlock possibilities are endless. And unfortunately, that’s extremely annoying when you’re dealing with operations related to table partitioning. Well, there is no easy way to solve that problem – good error handling could help though. I will show some examples shortly.

Source code is available for download

Next: Deadlocks during DDL operations (alteration, partition switch, etc)

Table of content

Sunday T-SQL Tip: Aligned non-clustered indexes on partitioned table – again..

If you read my blog for some time, you could see that I’m a bit addicted to that subject. But I saw quite a few cases when people started to have performance issues and very suboptimal plans with non-clustered indexes after they partitioned the tables. I’ve already blogged about that problem in general here as well as demonstrated one of the workarounds for one particular case. Today I want to show how to make that workaround a little bit more clean and clear as well as generalize that approach a little bit.

Before we begin, please take a few minutes to read and refresh my previous post. It defines the problem and explains why we have the issue with the index. You also need the script to recreate the data.

First, I’d like to show another method how we can achieve the similar goal. Similarly to the previous example we would like to force index seek on each individual partition using $Partition function and merge the results. But instead of union data from multiple CTEs we can use sys.partition_range_values view. Main idea is to find how many partitions do we have there and then use cross apply with filter on $Partition function. Something like that (click on the image to open it in the different window):

As you can see, the plan looks quite different than before but at the end it achieves the same results with the method with multiple CTEs. And it has the benefit – technically with that approach you don’t need to be limited by number of partitions or, better say, large number of partitions don’t generate very large select statement. On the other hand, there is the potential problem with statistics

As you can see, SQL Server expected to execute index seek just 2 times but we have 6 partitions here. Well, even if sysobjvalues table has column level statistics on value column, it would not help us much here. We can, of course, try to help SQL by removing join with sys.partition_functions; but still.. It could be incorrect.

If we know that we have static and small number of partitions we can do something like that – just use numbers:

Of course, even if number of partitions is large, there is always dynamic SQL if you want to use it. You can construct numbers in the Boundaries CTE with that.

That $Partition function is the great way to force index seek within the each partitions. And I would like to show you another example that can be beneficial in the case when table is partitioned by “values”, not by “intervals”. E.g. when every partition boundary specifies actual and single value and the # of distinct partition values is the same with the boundaries. Sounds a bit confusing? Let’s see the example. Assuming you’re collecting the data by postal/zip codes (some transactions in the areas) and partition transaction table by postal code. Something like below:

As you can see, table has aligned non-clustered index on TranAmount. Now, let’s say we would like to find max transaction amount per postal code. On the physical level we would like to get Max TranAmount per partition. Below are 2 select statements – first is the classic approach. Second one is using $Partition function.

Now let’s look at the plans. First select just scans entire index – 2.2M rows needs to be processed

Second select introduces separate top 1 “scan” on each partition.

And you can compare Statistics IO here.

Again, $Partition function works great – just be very careful with that implementation. You need to make sure that SQL Server is unable to generate a good plan before you do any tricks. Otherwise you can end up in the worse situation than with the standard method. And of course, it’s nightmare to support. Or, perhaps, you want to call it job security 🙂

Source code is available for download

Store Custom Fields/Attributes in Microsoft SQL Server Database (Part 2 – Name/Value pairs)

Last time we discussed 2 design patterns that can be used when you store custom attributes in SQL Server database. Today I’d like to talk about another pattern known as Name/Value pairs and sometimes called as Entity-Attribute-Values.

This pattern is very old and well known. Something like that (click on the image to open it in the new window):

I’m pretty sure that 75% of developers tried to use it in one way or another. I’ve seen quite a few different implementations. I even saw the implementation where entire database consisted of just 2 tables: Objects, with 2 columns – ID and ObjectType, and Attributes – similarly to what we saw above with exception that value was the string (it was prior to sql_variant days). And system even worked – kind of, in development and QA. Funniest thing – that system had even been sold and first customer was a wholesales company that replaced their existing point-of-sale system. Poor customers and happy consultants who were hired to “solve” the problem.. 🙂

There is one killing factor though – you cannot store more than 8000 bytes in sql_variant. So (max) data types cannot be supported. If this is not the deal breaker – the design looks very flexible (and in fact it is). The general problem here is the attribute access cost. Classic approach produces 1 join per attribute. Something like that:

Could be inner join, could be outer – depends on the situation and design. But besides a lot of joins there is another problem. Developers are lazy. Everytime they need to write the statement like that, they would use cut and paste (see the nice red undeline above). And you can imagine amount of errors it could introduce.

Of course, when we talk about client application, we can select all attributes to the client as the rowset and pivot (remember this magic word) data there:

Unfortunately that would not solve the problem when we need to sort/page/filter by the attributes nor, more importantly, help us with reports. And customers demand reports.

I’m not going to analyze  that approach based on criteria we specified. I’ll show you the examples how attribute access cost in the implementation based on joins kills that. But there is another way. With SQL 2005 and above, you can use PIVOT which is part of T-SQL. So let’s take a look. First, let’s create Articles and ArticleAttributes table

As you probably noticed, I replaced Attribute Name with Index. This will add a little bit more complexity to the code but same time it saves us storage space. We are going to save a lot of records in that table. And that’s usual “It depends” question – is additional complexity worth that. There is also very good idea to keep some kind of “Metadata” table that stores some information about attributes and types. This is essential in case if you store attribute indexes but it also helps even in case if you store Attribute Names.

Now let’s populate it with the data

 

Let’s enable IO statistics and execution plan and see how it behaves when we need to access the data. First – classic approach with Joins:

 

 

 

 

 

 

 

 

As you can see this introduces the plan with a lot of joins and quite a lot of IO. Now let’s try to reshape the query to use PIVOT.

As you can see – that’s far far better. You can play with the shape of the query if you want to change execution plan – for example approach below gives you nested loop instead of merge join.

As you can see the difference in IO is dramatic.

Let’s play with a couple other scenarios. What if we want to search for specific value in one of the attributes? Well, we need to create the index in such case.

The biggest problem here is the size of the key. With index it cannot exceed 900 bytes. Value (which is SQL Variant) can go far above that. So we basically have 2 choices. First either do not include Value to the index (or have it as included column) or perhaps, use filtered index and  disable the (index) search for some fields. Even if first option does not look very promising, there is one thing to consider. Are there any other criteria for the search? If all your use cases include some additional columns in the query it could make sense to push those columns to Attributes table and make them part of the index. As the real-life example, assuming you’re collecting data and all your queries include time range. In such case you can push ATime column to Attributes table and make the index as (AttrIndex, ATime) include(Value). While it uses the range scan, it could be acceptable because of additional filter on ATime that limits number of records.

Another scenario – sorting/paging. Assuming you want to display 1 page of data (10 rows). You can do something like that:

Let’s go through the list of criteria for this approach:

  1. Multiple schemas support – Yes.
  2. Expandability. Yes.
  3. Online schema change. Yes. Although if customer needs to be able to change data type of the attribute some work is required.
  4. Storage cost. Medium to High depend on indexes and use cases
  5. Attribute access overhead. 1 join + PIVOT overhead
  6. Search-friendly. Yes with extra index. Possible issues with large values (900 bytes key size limitation)
  7. Sorting/Paging friendly. Same as above.
  8. Upfront knowledge about data schema. Required. Client needs to know about the schema in order to build PIVOT statement. On the server side dynamic SQL could be required

And the last one is the biggest limitation of the design. While it offers very good performance, you have to babysit the solution. You need to think about use cases to design queries and indexes. You need to maintain indexes – you’ll get excessive fragmentation there.

Next time will do some performance comparison of the various methods

Source code is available for download

P.S. I want to thank Vladimir Zatuliveter (zatuliveter _at_ gmail _dot_com) for his help with preparation of this post.

Store Custom Fields/Attributes in Microsoft SQL Server Database (Part 1)

Regardless how good and flexible the system is, it’s practically impossible to design it in the way that satisfies all customers. Don’t take me wrong – if you have internal development team that works on internal system, you could be fine. But as long as you start to sell the solution or, even better, design the hosting solution for the multiple customers – you are stuck. There is always some customization involved.

One of the very common examples of customization is custom attributes customer wants to store. For example, let’s think about shopping cart system and Article table there. If you put some time trying to define Article attributes you can end up with quite extensive set. Size, Weight, Dimension, Color.. So far so good. But one day sales department close the deal with auto part store and now you have to deal with cylinders, trim types, engine, battery amp-hours and other funny stuff. Next day the company closes the deal with grocery store and you have to deal with nutrition information.

Unfortunately there are no perfect ways to solve the problem. I’m going to show a few obvious and no-so-obvious design patterns that you can use and outline scenarios when those patterns are useful. There will be 3 posts on the suject:

  1. Today we will talk about storing attributes in the separate columns and about storing them in XML
  2. We will talk about Name/Value table  – there are 2 approaches – very very bad and very very interesting
  3. We will do some performance testing and storage analysis for those approaches.

But first, let’s define the set of criteria we are going to use evaluating the patterns:

  1. Multiple schema support. Can this solution be used in hosted environments where you have multiple different “custom fields” schema? For example, customer-specific attributes in the system that stores data from the multiple customers (remember auto parts shop and grocery store)
  2. Expandability. Does solution offer unlimited expandability in terms of numbers and types of the attributes?
  3. Online schema change. E.g. can schema be modified online with active users in the system?
  4. Storage cost. How much storage solution requires?
  5. Attribute access cost
  6. Search-friendliness. How easy is to search for specific value in the attributes scope
  7. Sorting/Paging friendliness. How easy is to sort by specific attribute and display specific page based on row number
  8. Upfront knowledge of the data schema. What client needs to know about attributes while selecting data.

And before we begin, as the disclaimer. I’m going to show a few patterns but by any means there are other solutions available. Every system is unique and you need to keep your own requirements in mind while choosing the design. Not the best pattern in general could be the best one for specific system and specific requirements.

Pattern 1: One column per attribute

This is probably one of the most common pattern you can find especially in the old systems. There are 3 most common ways how it’s get implemented. In first one, you predefine the set of the custom columns of different types up front and customer is limited by that predefined subset. Something like that (let’s call it 1.a):

Alternatively, there are the systems that dynamically alter the table when customer needs to create the new attribute. Something like that (let’s call it 1.b):

As you can see, I mentioned that rebuilding of clustered index is the good idea. It reduces the index fragmentation due increase of the row size. And if you drop the attribute, you’d need to reclaim the space.

Third variation (let’s call it 1.c) is very similar to 1.b with exception that it stores custom attributes in the separate table:

Let’s evaluate them:

  1.  Multiple schema support. It would work with the multiple schema if/when you have predefined set of the attributes (1.a) from above. Of course, in that case all customers will have the limitation on maximum number of attributes per type but it could be acceptable in some systems. In case if you dynamically alter the table only one schema could be supported. Of course, you can do some tricks with that – for example keep multiple tables (one per customer) or reuse the attribute columns created by other customers but either of those approaches would introduce a lot of complexity and management overhead. It’s simply not worth it.
  2. Expandability. 1.a obviously is not expandable. At least automatically. 1.b and 1.c offer practically unlimited expandability (subject of SQL Server limitations on max row size and max number of columns).
  3. Online schema change. 1.a does not require any physical schema changes. 1.b and 1.c require SCH-M lock acquired on the table during table alteration (which is basically exclusive table access) as well as user should have appropriate rights to execute the ALTER TABLE statement.
  4. Storage cost. 1.a – it increases the size of the row in Articles table by the size of all fixed-width data types used by attributes plus at least 2 bytes per variable width attribute regardless if attributes are used or not (see it in more details). This could be OK as long as the table is not transactional (does not store a lot of data) and we are not going crazy with total number of attributes we predefined, but still – it needs to be considered. Row size matters. 1.b and 1.c are much more efficient in that regard – attributes are created only when needed.
  5. Attribute access cost. 1.a and 1.b – no overhead at all. Attribute is in the regular column in the row. 1.c – there is the extra join between the tables
  6. Search-friendliness. Generally this would introduce the search clause like: where (CustomText1 = @P1) or  (CustomText2 = @P1) ..  Usually those patterns lead to clustered index scans unless there are predicates selective enough to utilize non clustered index. So this is more or less the question if system even need to allow search like that without any additional filters on other columns. One other thing to keep in mind – you need to be careful dealing with various data types and possible conversion errors.
  7. Sorting/Paging friendliness. That pattern is extremely friendly for sorting and paging as long as there are some primary filters that limit number of rows to sort/page. Otherwise attribute either needs to be indexed or scan would be involved.
  8. Upfront knowledge about the data schema generally is not required. While select * is not the best practice, it would work perfectly when you need to grab entire row with the attributes.

The biggest benefits of that design are simplicity and low access cost. I would consider it for the system that require single data schema (box product) or when multiple data schema would work with limited number of attributes (1.a). While it can cover a lot of systems, in general it’s not flexible enough. I would also be very careful with that pattern in case if we need to add attributes to transactional tables with millions or billions of rows. You don’t want to alter those tables on the fly nor have storage overhead introduced by predefined attributes.

One other possible option is to use SPARSE columns with 1.a and 1.b. SPARSE columns are ordinary columns that optimized for the storage of NULL values. That will technically allow you to predefine bigger set of the attributes than with regular columns without increasing the size of the row. Could be very useful in some cases. Same time you need to keep in mind that not null SPARSE column takes more space than regular column. Another important thing that tables with SPARSE columns cannot be compressed which is another very good way to save on the storage space.

And the last note about the indexing. If you need to support search and/or sort on every attribute you need to either limit the number of rows to process or index every (or most commonly used) attributes. While large number of indexes is not very good thing in general, in some system it’s perfectly OK (especially with filtered indexes that do not index NULL values), as long as you don’t have millions  of rows in the table nor very heavy update activity. Again, this is from “It depends” category.
Pattern 2. Attributes in XML

Well, that’s self-explanatory :). Something like that:

Let’s dive into the details.

  1. Multiple schema support. Not a problem at all. You can store whatever you want. As long as it’s the valid XML
  2. Expandability. The same. Not a problem at all. Just keep the valid XML and you’re golden
  3. Online schema change. Easy. There is no schema on the metadata level. Well, you can, of course, define XML Schema and it will help with performance but again, consider pros and cons of this step.
  4. Storage cost. And now we started to talk about negative aspects. It uses good amount of space. XML is basically LOB – SQL Server does not store it in plan text – there are some minor compression involved but still. It uses a lot of space. And if we need to index XML column, it would require even more space
  5. Attribute access cost. Heh, and this is another big one. That’s great that SQL Server has build-in XML support. But performance is far from ideal. We will do some performance testing in Part 3 of our discussion but trust me – shredding XML in SQL Server is slow and CPU intensive.
  6. Search-friendliness. Well, you need to shred it before the search – it’s very slow. XML Indexes would help but it’s still slower than regular columns and introduce huge storage overhead.
  7. Sorting/Paging friendly. Same as above. You need to shred data first.
  8. Upfront knowledge about data schema. All data stored in 1 column. So client does not need to jump through any hoops to access the data. But of course, it needs to know how to parse it.

Bottom line – storing custom attributes in XML is the perfect solution in terms of flexibility. Unfortunately it’s very storage-hungry and most importantly very slow in terms of performance. That solution is the best if the main goal is simple attribute storage and displaying/editing very small amount of rows on the client. For example Article detail page on the web site. Although if you need to shred, sort, filter the large number of rows – you’ll have performance issues even with XML Indexes.

Next time we will talk about Name/Value table that deserves the separate post.

Locking in Microsoft SQL Server (Part 12 – Lock Escalation)

I hope everyone had the great holiday season! 🙂

Today I’d like us to talk about Lock Escalation in Microsoft SQL Server. We will cover:

  1. What is Lock Escalation?
  2. How Lock Escalations affects the system
  3. How to detect and troubleshoot Lock Escalations
  4. How to disable Lock Escalation

What is Lock Escalation?
All of us know that SQL Server uses row level locking. Let’s think about scenario when system modifies the row. Let’s create the small table and insert 1 row there and next check the locks we have. As usual every image is clickable.

As you can see there are 4 locks in the picture. shared (S) lock on the database – e.g. indication that database is in use. Intent exclusive (IX) lock on the table (OBJECT) – e.g. indication that one of the child objects (row/key in our case) has the exclusive lock. Intent exclusive (IX) lock on the page – e.g. same indication about child object (row/key) exclusive lock. And finally exclusive (X) lock on the key (row) we just inserted.

Now let’s insert another row in the different session (let’s keep the original Session 1 transaction uncommitted).

When we check the locks we will see that there are 8 locks – 4 per session. Both sessions ran just fine and don’t block each other. Everything works smooth – that great for the concurrency. So far so good. The problem though is that every lock takes some memory space – 128 bytes on 64 bit OS and 64 bytes on 32 bit OS). And memory is not the free resource. Let’s take a look at another example. I’m creating the table and populating it with 100,000 rows. Next, I’m disabling the lock escalation on the table (ignore it for now) and clear all system cache (don’t do it in production). Now let’s run the transaction in repeatable read isolation level and initiate the table scan.

Transaction is not committed and as we remember, in repeatable read isolation level SQL Server holds the locks till end of transaction. And now let’s see how many locks we have and how much memory does it use.


As you can see, now we have 102,780 lock structures that takes more than 20MB of RAM. And what if we have a table with billions of rows? This is the case when SQL Server starts to use the process that called “Lock Escalation” – in nutshell, instead of keeping locks on every row SQL Server tries to escalate them to the higher (object) level. Let’s see how it works.

First we need to commit transaction and clear the cache. Next, let’s switch lock escalation for Data table to AUTO level (I’ll explain it in details later) and see what will happen if we re-run the previous example.

As you can see – just 2 locks and only 1Mb of RAM is used (Memory clerk reserves some space). Now let’s look what locks do we have:

As you can see there is the same (S) lock on the database and now we have the new (S) shared lock on the table. No locks on page/row levels are kept. Obviously concurrency is not as good as it used to be. Now, for example, other sessions would not be able to update the data on the table – (S) lock is incompatible with (IX) on the table level. And obviously, if we have lock escalation due data modifications, the table would hold (X) exclusive lock – so other sessions would not be able to read the data either.

The next question is when escalation happens. Based on the documentation, SQL Server tries to escalate locks after it acquires at least 5,000 locks on the object. If escalation failed, it tries again after at least 1,250 new locks. The locks count on index/object level. So if Table has 2 indexes – A and B you have 4,500 locks on the index A and 4,500 locks on the index B, the locks would not be escalated. In real life, your mileage may vary – see example below – 5,999 locks does not trigger the escalation but 6,999 does.

How it affects the system?

Let’s re-iterate our first small example on the bigger scope. Let’s run the first session that updates 1,000 rows and check what locks are held.

As you see, we have intent exclusive (IX) locks on the object (table) and pages as well as various (X) locks on the rows. If we run another session that updates completely different rows everything would be just fine. (IX) locks on table are compatible. (X) locks are not acquired on the same rows.


Now let’s trigger lock escalation updating 11,000 rows.

As you can see – now the table has exclusive lock. So if you run the session 2 query from above again, it would be blocked because (X) lock on the table held by session 1 is incompatible with (IX) lock from the session 2.

When it affects us? There are 2 very specific situations

  1. Batch inserts/updates/deletes. You’re trying to import thousands of the rows (even from the stage table). If your import session is lucky enough to escalate the lock, neither of other sessions would be able to access the table till transaction is committed.
  2. Reporting – if you’re using repeatable read or serializable isolation levels in order to have data consistent in reports, you can have (S) lock escalated to the table level and as result, writers will be blocked until the end of transaction.

And of course, any excessive locking in the system can trigger it too.

How to detect and troubleshoot Lock Escalations

First of all, even if you have the lock escalations it does not mean that it’s bad. After all, this is expected behavior of SQL Server. The problem with the lock escalations though is that usually customers are complaining that some queries are running slow. In that particular case waits due lock escalations from other processes could be the issue. If we look at the example above when session 2 is blocked, and run the script (as the session 3) that analyzes sys.dm_tran_locks DMV, we’d see that:

I’m very heavy on the wait statistics as the first troubleshooting tool (perhaps heavier than I need to be 🙂 ). One of the signs of the issues with lock escalations would be the high percent of intent lock waits (LCK_M_I*) together with relatively small percent of regular non-intent lock waits. See the example below:

In case if the system has high percent of both intent and regular lock waits, I’d focus on the regular locks first (mainly check if queries are optimized). There is the good chance that intent locks are not related with lock escalations.

In addition to DMVs (sys.dm_tran_locks, sys.dm_os_waiting_tasks, sys.dm_os_wait_stats, etc), there are Lock Escalation Profiler event and Lock Escalation extended event you can capture. You can also monitor performance counters related with locking and create the baseline (always the great idea)

Last but not least, look at the queries. As I mentioned before in most part of the cases excessive locking happen because of non-optimized queries. And that, of course, can also trigger the lock escalations.

How to disable Lock Escalation

Yes, you can disable Lock Escalations. But it should be the last resort. Before you implement that, please consider other approaches

  1. For data consistency for reporting (repeatable read/serializable isolation levels) – switch to optimistic (read committed snapshot, snapshot) isolation levels
  2. For batch operations consider to either change batch size to be below 5,000 rows threshold or, if it’s impossible, you can play with lock compatibility. For example have another session that aquires IS lock on the table while importing data. Or use partition switch from the staging table if it’s possible

In case if neither option works for you please test the system before you disable the lock escalations. So:

For both SQL Server 2005 and 2008 you can alter the behavior on the instance level with Trace Flags 1211 and 1224. Trace flag 1211 disables the lock escalation in every cases. In case, if there are no available memory for the locks, the error 1204 (Unable to allocate lock resource) would be generated. Trace flag 1224 would disable lock escalations in case if there is no memory pressure in the system. Although locks would be escalated in case of the memory pressure.

With SQL Server 2005 trace flags are the only options you have. With SQL Server 2008 you can also specify escalation rules on the table level with ALTER TABLE SET LOCK_ESCALATION statement. There are 3 available modes:

  1. DISABLE – lock escalation on specific table is disabled
  2. TABLE (default) – default behavior of lock escalation – locks are escalated to the table level.
  3. AUTO – if table is partitioned, locks would be escalated to partition level when table is partitioned or on table level if table is not partitioned

Source code is available for download

Part 13 – Schema locks

Table of content

Sunday T-SQL Tip: select top N using aligned non-clustered index on partitioned table

Almost one year ago I blogged about table partitioning in Microsoft SQL Server. I mentioned one specific case when table partitioning hurt the performance – case when you need to select top N rows using aligned non-clustered index. I said that there is no good workarounds for this particular case. Well, there is one. Kind of.

First, let’s take a look at the original problem. I adjusted the script I used an year ago a little bit. First, assuming we have non-partitioned table with clustered index on ID and non-clustered index on DateModified date. Let’s create that table an populate it with some data (if you click on the images below those would be opened in the new browser window).

Now let’s say we need to select top 100 rows based on DateModified column. This is quite typical scenario you’re using in production systems when you need to export and/or process the data.

As long as table is not partitioned, you can see that plan is very good. Basically SQL Server looks up the first row in the non-clustered index for specific DateModified value and do the ordered scan for the first 100 rows. Very efficient. Now, let’s partition the table based on DateCreated on quarterly basis.

And now – let’s run that statement again. As you can see, SQL Server started to use CI scan with SORT Top N. I explained why it happened in the previous post.

If we force SQL Server to use the index, the plan would be even worse in this particular case.

Although If you have the huge transactional table and # of rows with DateModified > ? is relatively small, the plan above could be more efficient than CI scan but SCAN/SORT TOP N would always be there.

Is there solution to this problem? Well, yes and no. I don’t know if there is generic solution that would work in all cases, although if you table has limited number of partitions and packet size is not huge there is one trick you can do.

Let’s take a look at the picture that shows how non-clustered index is aligned.

I just copied it from the old post, so dates are a little bit off. SQL Server cannot use the same efficient plan with non-partitioned/non-aligned index because data could reside on the different partitions. Although, we can still use ordered index scan within each partition. And next, if we select top N rows from each partition independently, union them all and next sort them all together and grab top N rows, we will have what we need. And we can do it using $Partition function. Let’s take a look:

Each PData CTE uses $Partition function that limits data search within the single partition so SQL Server can use ordered index scan there. In fact, it would be very similar to what we had when we did the select against non-partitioned table. Next, AllData CTE merges all results from PData CTEs and sort them based on DateModified and ID – returning top 100 rows. Last select joins the data from the main table with IDs returned from AllData CTE. One very important point I want to stress – as you can see, PData/AllData CTEs don’t select all columns from the table but only columns from the non clustered index. Data from the clustered index selected based on the join in the main select. This approach limits CTE operation to use index only and avoids unnecessary key lookups there.

If we look at result set, we can see that data is basically selected from partition 3 and 4.

And now let’s look at the execution plan.

As you can see, red rectangles represent PData CTEs. There is no key lookups until very last stage and those lookups are done only for 100 rows. One other thing worth to mention that SQL Server is smart enough to perform SORT as part of Concatenation operator and illuminate unnesesary rows there. As you can see, only 1 row is returned as part of PData5 – SQL Server does not bother to get other 99 rows.

This particular example has the data distributed very evenly (which usually happens with DateCreated/DateModified pattern). Generally speaking, cost of the operation will be proportional to the number of partitions multiplied by packet size. So if you have the table with a lot of partitions, that solution would not help much. On the other hand, there are usually some tricks you can use. Even in this particular case you don’t need to include PData6 to the select. This partition is empty. Also, you can put some logic in place – perhaps create another table and store most recent DateModified value per partition. In such case you can dynamically construct the select and exclude partitions where data has not been recently modified.

As the disclaimer, that solution is not the silver bullet especially if you have a lot of partitions and need to select large data packet. But in some cases it could help. And PLEASE TEST IT before you put it to production

Source code is available for download

UPDATE (2012-03-11): Look at that post to see how to implement that particular example in a different way