Category Archives: SQL Server 2008

Covering indexes

We already know the structure of clustered and non-clustered indexes. We also know the main criteria SQL Server uses when deciding whether to use a non-clustered index.

Today I want to talk about the covering index technique. Let’s start with the definition.

A covering index for a query is an index that has all of the values the query needs in its leaf rows and, as a result, does not require a key/bookmark lookup of the actual table data.

Let’s see an example. First of all, let’s create the table and populate it with some data.
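The original script is in the download; below is a minimal sketch of what that setup could look like. The dbo.Orders table, its columns, the index name and the data distribution are my assumptions for illustration only – the reads and costs quoted below come from the original data set.

create table dbo.Orders
(
	OrderId int not null identity(1,1),
	CustomerId int not null,
	OrderDate datetime not null,
	Amount money not null,
	Placeholder char(100) null,
	constraint PK_Orders primary key clustered(OrderId)
)
go

create nonclustered index IDX_Orders_CustomerId on dbo.Orders(CustomerId)
go

-- populate the table with some test data: 100 customers, 100 orders each
declare
	@I int = 0
begin tran
	while @I < 10000
	begin
		insert into dbo.Orders(CustomerId, OrderDate, Amount)
		values(@I % 100 + 1, dateadd(day, @I % 365, '2010-01-01'), (@I % 50) + 1)

		select @I += 1
	end
commit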

Now let’s run 2 selects – one that returns all columns for the customer with CustomerId = 100 and a second one that returns CustomerId and OrderId only.
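With the assumed schema above, the two selects could look like this:

set statistics io on

-- select 1: returns all columns - requires a key lookup for every row found in the index
select OrderId, CustomerId, OrderDate, Amount, Placeholder
from dbo.Orders
where CustomerId = 100

-- select 2: returns only the columns present in the non-clustered index - no key lookup
select CustomerId, OrderId
from dbo.Orders
where CustomerId = 100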

As you can see, the first select produces a plan with an index seek and a key lookup. The second one does not do the key lookup at all. Remember, a non-clustered index stores the clustered index key values in its rows, so OrderId is already there. No need for the key lookup.

Two other things I would like to mention. First, you can see that the cost of the key lookup is 98% of the cost of the first select. Second, the second select is about 50 times less expensive than the first one.

What should we do if our typical select needs the total amount of the orders placed by a customer, grouped by date? Obviously we’re filtering (or grouping) by CustomerId and OrderDate in such a case. We don’t really need Amount as an additional key column in the index because we’re not using it in the predicates. Fortunately SQL Server has a way to add other columns and store them in the index leaf. You can do it with the INCLUDE clause of the CREATE INDEX statement.

Below is a picture which shows the structure of this index (on some abstract tables).

Let’s check how it works in the example with the orders table:
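Assuming the same hypothetical dbo.Orders table from above, the query could look like this:

set statistics io on

select OrderDate, sum(Amount) as TotalAmount
from dbo.Orders
where CustomerId = 100
group by OrderDate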

The plan is an index seek, a key lookup and then grouping. As you can see it produces 114 logical reads.

Now let’s create the index and run select one more time.
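A covering index for that query keys on CustomerId and OrderDate and stores Amount in the leaf rows via INCLUDE (the index name is mine, for illustration):

create nonclustered index IDX_Orders_CustomerId_OrderDate
on dbo.Orders(CustomerId, OrderDate)
include(Amount)
go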

Simple index seek with 2 reads. Great improvement.

The last example I want to show you – let’s try to select the max order amount.
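Something like:

select max(Amount) as MaxOrderAmount
from dbo.Orders

Because Amount is included in the leaf rows of the covering index sketched above, SQL Server can answer this by scanning that much smaller index instead of the clustered index.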

As you can see in the first plan – it uses a non-clustered index scan. SQL Server cannot use that index for a seek operation – but it can scan the index. Because the index leaf row size is smaller than the clustered index row size, it’s more efficient. Look at the IO statistics.

This is one of the easiest optimization techniques you can use in the system – eliminating key lookups with covering indexes.

One warning though – extra included columns increase the index leaf row size, so the index will require more pages and be less efficient. Don’t make it extremely wide.

You can download the code from here

Sunday T-SQL Tip: How to generate “Pseudo-identity” values

There are cases when you don’t want to create a physical identity column in the table and want to generate a similar value manually. There are a few methods to do that. Let’s look at 2 of them.
The first method is a Counters table. Basically you’re creating a table like that:
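The original script isn’t shown here; a minimal sketch of such a table could be (the names are my assumptions):

create table dbo.Counters
(
	CounterName varchar(64) not null,
	CounterValue int not null,
	constraint PK_Counters primary key clustered(CounterName)
)
go

insert into dbo.Counters(CounterName, CounterValue) values('Orders', 0)
go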

When you need a new value, you simply read the current one and update it with the new one. It can be wrapped into a stored procedure like that:
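A sketch of such a wrapper – a single UPDATE both increments the counter and returns the new value, so there is no window between the read and the write:

create procedure dbo.GetNewCounterValue
(
	@CounterName varchar(64),
	@Value int output
)
as
begin
	set nocount on

	-- atomic read + increment: the exclusive row lock taken by the UPDATE
	-- serializes concurrent callers until the transaction completes
	update dbo.Counters
	set @Value = CounterValue = CounterValue + 1
	where CounterName = @CounterName
end
go

Usage: declare @NewId int; exec dbo.GetNewCounterValue 'Orders', @NewId output;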

The beauty of this method is the locking – the update statement places an exclusive lock on the row, so while the transaction is active, no other session is able to update the same row. Be careful though with the SNAPSHOT isolation level – it would produce an update conflict exception during simultaneous access rather than serialize the access.

Let’s test that:

The second method is to use an identity, but from another, dummy table:
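A sketch of such a dummy table (the table name and the placeholder column are my assumptions):

create table dbo.PseudoIdentity
(
	ID int not null identity(1,1),
	Placeholder tinyint null
)
go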

Let’s see how we can get the single value:
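Getting a single value is just an insert followed by SCOPE_IDENTITY():

declare @NewId int

insert into dbo.PseudoIdentity(Placeholder) values(null)
select @NewId = scope_identity()

select @NewId as NewId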

And next – the group of values:
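For a group of values, one option is to insert the required number of rows in a single statement and capture the generated identities with the OUTPUT clause (a sketch; 10 values here):

declare @NewIds table (ID int not null)

insert into dbo.PseudoIdentity(Placeholder)
output inserted.ID into @NewIds(ID)
select top (10) null
from master.dbo.spt_values

select ID from @NewIds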

Obviously this method does not protect you from gaps in the values.

You can download the source code from here

When SQL Server uses non-clustered indexes

Last time we took a look at the structure of the indexes. Now let’s talk about index usage in general.

Assume you run a query that requires a clustered index scan. Obviously it requires SQL Server to read every data page of the clustered index for processing. At the same time, the number of physical reads would be much less than the number of logical reads. SQL Server uses a technique called “read-ahead”. In a nutshell, SQL Server reads up to 64 data pages in a single read operation. So when SQL Server needs the next data page, it most likely is already present in the buffer cache. It still introduces a logical read but does not produce a physical read.

Now let’s think about an operation that scans a non-clustered index. As we know, SQL Server needs to traverse the clustered index b-tree for every key/bookmark lookup operation. Even though the read-ahead technique works for the non-clustered index itself, the actual table (clustered index) data could reside anywhere in the data files. So these lookups are basically random IO operations.

Let’s see an example. First of all, let’s create the following table:
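The original script is in the download; a sketch that roughly matches the numbers below could be (table, column and index names are my assumptions; the char(400) placeholder is only there to make the clustered index rows wide):

create table dbo.Data
(
	ID int not null identity(1,1),
	Name char(3) not null,
	Placeholder char(400) null,
	constraint PK_Data primary key clustered(ID)
)
go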

Let’s evenly distribute the data in the Name column using the combinations AAA, AAB, … ZZZ, adding 5 rows per combination, and create a non-clustered index on that field.
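The population could be done like this (a sketch; it generates every 3-letter combination from AAA to ZZZ, 5 rows each, which gives 26 * 26 * 26 * 5 = 87880 rows):

declare @Letters table (Letter char(1) not null)

declare
	@I int = 0
while @I < 26
begin
	insert into @Letters(Letter) values(char(65 + @I))
	select @I += 1
end

insert into dbo.Data(Name)
select l1.Letter + l2.Letter + l3.Letter
from @Letters l1
	cross join @Letters l2
	cross join @Letters l3
	cross join (values(1),(2),(3),(4),(5)) five(n)
go

create nonclustered index IDX_Data_Name on dbo.Data(Name)
go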

Let’s see some statistics.

As you can see, the table has 87880 rows. The clustered index leaf level has 4626 pages and the non-clustered index leaf level has 296 pages. Let’s run the tests.

First test. Let’s select rows with Name = ‘ABC’
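Using the assumed table from above:

set statistics io on

select *
from dbo.Data
where Name = 'ABC'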

As you can see, it returns only 5 rows and leads to 18 logical reads. SQL Server uses the non-clustered index in this case.

Second test. Let’s select rows with name starting with ‘AB’

As you can see, it returns 130 rows and produces 410 logical reads. SQL Server still uses the non-clustered index. Another thing worth mentioning is that 99% of the cost of the query is the key lookup.

Next test. Let’s select rows with names starting with ‘A’. Basically it returns 1/26th of the rows, which is about 3.8%.

As you can see, SQL Server decided to do the clustered index scan which leads to 4644 logical reads.

Let’s run the same select, forcing SQL Server to use the non-clustered index.
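Forcing the non-clustered index with an index hint (using the assumed index name from above):

select *
from dbo.Data with (index = IDX_Data_Name)
where Name like 'A%'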

As you can see, it produces 10375 logical reads – more than twice as many as the clustered index scan, and without any help from read-ahead.

So you see – the threshold at which the non-clustered index seek with lookups becomes more expensive than a clustered index scan is really small. SQL Server will not use a non-clustered index with key/bookmark lookups if it expects the iterator to return more than a few percent of the total rows in the table.

Where does it leave us? Make the index highly selective.

Code can be downloaded from here

Sunday T-SQL Tip: Application Locks

There is a set of tasks where you need to serialize access to some T-SQL code. Assume you have multiple instances of a data processing application running simultaneously and do not want them to load the same data for processing. Or you want to perform some bulk operation and want to stop clients from inserting new data during that period.

Obviously one of the options is to use transactions in the SERIALIZABLE isolation level. The problem with that – it could be more restrictive than you want. Suppose you don’t want to block access to the data but rather introduce something similar to the CriticalSection you have in client development.

Fortunately SQL Server has a set of stored procedures you can use for such purposes: sp_getapplock and sp_releaseapplock. sp_getapplock allows you to obtain a shared or exclusive “user” lock in the transaction or session context. If you run this SP in transaction scope, the lock is released at the end of the transaction; otherwise it is held until the session ends.

Let’s see it in action. Let’s run the following statement from 2 different sessions
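A sketch of what that statement could look like (the resource name, timeout and WAITFOR delay are my assumptions). The second session blocks inside sp_getapplock until the first one commits:

begin tran
	declare @ResultCode int

	-- try to obtain an exclusive "user" lock bound to the transaction;
	-- the call waits up to 60 seconds before giving up
	exec @ResultCode = sp_getapplock
		@Resource = 'MyCriticalSection',
		@LockMode = 'Exclusive',
		@LockOwner = 'Transaction',
		@LockTimeout = 60000

	if @ResultCode >= 0 -- 0 = granted, 1 = granted after waiting
	begin
		print 'Lock acquired: ' + convert(varchar(32), sysdatetime())
		waitfor delay '00:00:10' -- simulate some work
	end
commit -- the applock is released here together with the transaction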

And here are the results from both sessions:

As you can see it does the trick.

Update (2013-05-08): This post provides more details about the different methods of serialization available in SQL Server.

Sunday T-SQL Tip: Context_Info

One of the interesting but not widely known features in SQL Server is Context_Info. This is basically connection/session-specific 128-byte storage.

You can set it with the SET CONTEXT_INFO statement. As for getting the value – there are a few methods available:
1. The Context_Info() function
2. The context_info column in the sys.dm_exec_requests and sys.sysprocesses views. This value is updated immediately within the batch
3. The context_info column in the sys.dm_exec_sessions view. This value is not updated until the end of the batch

Let’s see that in action:
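A sketch of setting and reading the value:

declare @Ctx varbinary(128)

-- SET CONTEXT_INFO accepts a variable or constant, not an arbitrary expression
select @Ctx = convert(varbinary(128), 'Some session flag')
set context_info @Ctx

-- read it back within the same session
select context_info() as CtxFunction

-- note: per point 3 above, sys.dm_exec_sessions will not reflect
-- the new value until this batch completes
select context_info
from sys.dm_exec_sessions
where session_id = @@spid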


What can you do with that? Funny things, like recursion, for example:
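One purely illustrative sketch of that kind of trick (the procedure name and depth limit are made up) – a procedure that keeps its recursion depth in context_info instead of a parameter:

create procedure dbo.ShowRecursion
as
begin
	set nocount on

	declare
		@Depth int,
		@Ctx varbinary(128)

	-- the first 4 bytes of context_info hold the current depth (0 if never set)
	select @Depth = isnull(convert(int, substring(context_info(), 1, 4)), 0)

	print 'Recursion level: ' + convert(varchar(10), @Depth)

	if @Depth < 5
	begin
		select @Ctx = convert(varbinary(128), @Depth + 1)
		set context_info @Ctx
		exec dbo.ShowRecursion -- calls itself; the depth travels via context_info
	end
end
go

exec dbo.ShowRecursion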

As for a “more” practical example – we have a 3rd party system we have to integrate with our database and applications. The integration is done with triggers – when a table is updated, the trigger modifies the 3rd party database and vice versa. In a few specific cases we should not propagate the changes – and in such cases context_info is ideal. We set this value before the update and check it in the trigger. If the flag is set, the propagation does not run.
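A simplified sketch of that pattern (the table, trigger and flag value are made up for illustration):

create trigger trgMyTableUpdate on dbo.MyTable
after update
as
begin
	set nocount on

	-- skip the propagation when the session has set the "do not propagate" flag
	if substring(context_info(), 1, 1) = 0x01
		return

	-- ... propagate the changes to the 3rd party database here ...
end
go

-- a caller that does not want its change propagated sets the flag first
set context_info 0x01

update dbo.MyTable
set SomeColumn = SomeColumn
where ID = 1

-- clear the flag afterwards
set context_info 0x00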

Sunday T-SQL tip: Uniqueness of nullable field

Assume you have a table with a nullable field. What if you need to support uniqueness on not-null values only? For example, you can have a customers table with optional SSNs – not all customers provide it, but if an SSN is provided, you want to make sure it’s unique.

A regular unique index is not an option – SQL Server would not allow you to have more than 1 NULL value.

SQL Server 2008 provides a great feature – filtered indexes – you can create a unique index on the not-null values only.
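A sketch with a hypothetical dbo.Customers table:

create table dbo.Customers
(
	CustomerId int not null,
	CustomerName varchar(64) not null,
	SSN varchar(11) null,
	constraint PK_Customers primary key clustered(CustomerId)
)
go

-- unique only across the rows where SSN has a value; any number of NULLs is allowed
create unique nonclustered index IDX_Customers_SSN
on dbo.Customers(SSN)
where SSN is not null
go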

Now when you insert a duplicate value – it fails:

With SQL Server 2005 that’s not an option, although you can use an indexed view:
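A sketch of that workaround – a schema-bound view over the not-null SSNs with a unique clustered index on it (view and index names are mine):

create view dbo.vCustomersSSN
with schemabinding
as
	select SSN
	from dbo.Customers
	where SSN is not null
go

create unique clustered index IDX_vCustomersSSN_SSN
on dbo.vCustomersSSN(SSN)
go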

Now when you insert a duplicate, it violates the unique clustered index on the view and fails.

What is the optimal size for variable width columns?

Yesterday we had a discussion during the Tampa Bay SQL Server User Group meeting regarding the size of variable width columns. The question was whether it matters how you define such columns – for example, if a column holds 100 characters, should it be defined as varchar(100), varchar(1000) or varchar(max)?

The answer is not so simple. From the storage perspective it does not really matter – the row stores the actual data size plus 2 bytes in the offset array. The question though is how it affects performance.

When the Query Optimizer generates the query plan, it needs to estimate how much memory is needed for the query execution. It estimates the number of rows returned by each specific iterator (valid statistics are the key) as well as the row size. And this is where the column size comes into play. SQL Server does not know if a varchar(100) column holds on average 1 character or 100 characters. So it uses a simple assumption – 50% of the declared column size, and 4000 bytes for (max) columns.

A correct estimate is especially critical for sort and hash operations – if the operation cannot be done in memory, SQL Server flushes the data to tempdb and does the operation there. And this is a major performance hit. Let’s see this in action.

Let’s create a table with 2 varchar fields – 100 and 210 characters – and populate it with 1M rows, putting 100 characters into both varchar fields.
create table dbo.Varchars
(
	ID int not null,
	Field1 varchar(100) not null,
	Field2 varchar(210) not null,
	primary key (ID)
)
go

declare
	@I int = 0
begin tran
	while @I < 1000000
	begin
		insert into dbo.Varchars(ID, Field1, Field2)
		values(@I, REPLICATE(CHAR(65 + @I % 26), 100), REPLICATE(CHAR(65 + @I % 26), 100))

		select @I += 1
	end
commit

Now let’s run a few tests. Let’s run SQL Profiler and trace the “Sort Warnings” event. That event basically tells us when data is flushed to tempdb for sorting purposes. Let’s run this simple query:
declare @A varchar(1000)
select @A = Field1
from dbo.Varchars
where ID < 42000
order by Field1

Let’s look at the query plan – as you can see, the row size is estimated incorrectly.

We can also see the Sort Warning in SQL Profiler.

Finally – we can see that the query uses 6656KB of memory.

Let’s run the second query:
declare @A varchar(1000)
select @A = Field2
from dbo.Varchars
where ID < 42000
order by Field2

Same query plan but different row size.

No Sort Warnings in SQL Profiler. As for memory, this query uses 8088KB.

Finally, let’s run both queries 100 times in a loop and compare the execution statistics:

As you can see, the first query is 25% slower than the second one. This is quite a simple example – with complex queries the difference could be dramatic.

So on one hand it looks like a good idea to always define variable width columns really wide. On the other hand it introduces a different problem – you don’t want to overestimate the row size either. That leads to bigger memory consumption, and waits for large memory grants can be longer than for small ones. I would suggest defining those columns at about 2 times the actual average data size, so the row size estimate is as accurate as possible.

Sunday T-SQL tip: A few words about Table-Valued Parameters (TVP)

During my presentation at the SQL Saturday #40 event I mentioned that one of the biggest improvements in SQL Server 2008 from the development standpoint is table-valued parameter (TVP) support. It not only simplifies development – think about passing a table variable as a parameter from within T-SQL code as well as from client applications. It also gives a great performance improvement, especially if you need to save a large data packet. Think about data processing tasks where a client application loads a large record set, processes it and saves it back in one transaction.

Let’s do some tests. Let’s create the table:

Table creation statement
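The original statement is shown as an image in the post; a hypothetical table it might resemble (names and columns are my assumptions):

create table dbo.DataRecords
(
	ID int not null,
	Value varchar(32) not null,
	constraint PK_DataRecords primary key clustered(ID)
)
go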

First, let’s use the classical approach – use SqlCommand and insert records in a loop:

Classical approach

Now let’s use TVP. First of all we need to create a type:

TVP
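A sketch of such a type, matching the hypothetical table above:

create type dbo.tvpDataRecords as table
(
	ID int not null,
	Value varchar(32) not null,
	primary key(ID)
)
go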

The next step is creating the stored procedure which we will use to insert new records:

Insert procedure
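A sketch of the procedure (names are my assumptions):

create procedure dbo.InsertDataRecords
(
	@Records dbo.tvpDataRecords readonly -- TVP parameters must be declared READONLY
)
as
begin
	set nocount on

	-- the whole packet is saved with a single set-based statement
	insert into dbo.DataRecords(ID, Value)
	select ID, Value
	from @Records
end
go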

Next, let’s modify the client code to use the TVP and the stored procedure:

TVP approach

It is worth mentioning that the client code needs to create and populate a DataTable object to assign to the parameter. Obviously it requires some time; on the other hand, this is client code – the transaction has not been started yet.
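For completeness, here is how the same procedure could be called directly from T-SQL with a table variable of that type (the values here are made up):

declare @Records dbo.tvpDataRecords

insert into @Records(ID, Value)
values(1, 'First'), (2, 'Second'), (3, 'Third')

exec dbo.InsertDataRecords @Records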

Finally, let’s run a few tests:

For a package of 5,000 records, the average execution time of the classical method on my laptop is around 2,600-2,700 milliseconds. The TVP approach takes only about 180 milliseconds, of which 60 milliseconds is DataTable creation. So the actual transaction takes only around 130 milliseconds – about 20 times faster than the classical method.

For a package of 50,000 records, the average execution times are 20,500 ms and 1,300 ms respectively, with 300 ms spent on DataTable creation.

As you can see, it’s not only about the speed – the transaction duration is about 20 times shorter. Huge difference.

Obviously, if you’re on SQL Server 2005, TVPs are not an option, although there are a couple of methods which could help you improve the classical approach. I will show them to you in a few days.