
SQL Server code-name “Denali” T-SQL enhancements

No SQL Tips today. A couple notes from PASS Summit 2010 where Microsoft announced SQL Server code name “Denali”. There are a few interesting enhancements in T-SQL for that version.

1. Exception handling

SQL Server finally enhances exception handling to make it more useful. No, there is still no such thing as a “FINALLY” block, but there is a new THROW statement that allows you to throw and re-throw exceptions. An error raised with THROW always has severity 16, and it works across code blocks – you can throw an error from an inner stored procedure and catch it in the outer stored procedure’s CATCH block. A good enhancement compared with RAISERROR.
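Based on the “Denali” CTP documentation, a minimal sketch of both forms of THROW (the error number and message are illustrative):

```sql
BEGIN TRY
    -- Any error here transfers control to the CATCH block
    SELECT 1 / 0;    -- divide-by-zero error
END TRY
BEGIN CATCH
    -- Log, roll back, etc., then re-throw the original error
    -- so the caller's CATCH block can handle it as well
    THROW;
END CATCH;
GO

-- Raising a new error: error number (50000 or higher), message, state.
-- Severity is always 16 and cannot be specified.
THROW 50001, N'Order validation failed.', 1;
```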
2. Sequences

A nice-to-have feature, even if it’s easy to implement a similar mechanism today with a separate table.
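For illustration, a sequence-based replacement for the usual “counters table” approach might look like this (the names are mine, not from the announcement):

```sql
-- Create a sequence of integers starting from 1
CREATE SEQUENCE dbo.OrderIdSequence
    AS INT
    START WITH 1
    INCREMENT BY 1;
GO

-- Consume the next value; NEXT VALUE FOR can also be used
-- directly in INSERT ... VALUES or as a column default
DECLARE @NewOrderId INT = NEXT VALUE FOR dbo.OrderIdSequence;
SELECT @NewOrderId;
```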


3. Contained databases

SQL Server “Denali” introduces a “partial” self-contained database mode. The long-term roadmap is to be able to create a database that is fully self-contained and can be moved/deployed from one server to another and “just works”. That would include logins, linked servers, service broker settings, etc. “Denali” technically supports only PARTIAL mode, which disconnects logins from the server (you will be able to create a database user with a password, and SQL Server will authenticate against that user). Obviously, in this mode you will not be able to access any resources outside of the database. Another thing about this mode – objects in tempdb are created in the same collation as the database.
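A sketch of what this looks like in the “Denali” CTP (the database and user names are illustrative):

```sql
-- The instance must allow contained database authentication first
EXEC sp_configure 'contained database authentication', 1;
RECONFIGURE;
GO

CREATE DATABASE MyContainedDb
    CONTAINMENT = PARTIAL;
GO

USE MyContainedDb;
GO

-- A contained user is authenticated by the database itself;
-- no server-level login is required
CREATE USER AppUser WITH PASSWORD = N'Str0ng_Pa$$word1';
```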

In addition, it introduces DMVs that show breaches of the contained mode. Those include dynamic SQL, references to other databases, linked servers, service broker queues, etc. A good feature, but not really useful for an ISV who needs to support multiple SQL Server versions.
4. Better paging support

T-SQL will support selects like this:
<code>SELECT ..
FROM Table
ORDER BY Field
OFFSET 100 ROWS
FETCH NEXT 50 ROWS ONLY</code>

5. Ability to specify the shape of results from EXECUTE statement

You will be able to specify the format of the result sets that the EXECUTE statement returns. Interestingly enough, Microsoft mentioned that you will be able to do it with sp_executesql, but there is nothing about it in the documentation yet. I hope it will be supported, especially because best practices suggest using sp_executesql instead of EXEC.
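A sketch of the announced syntax (the procedure, parameter, and column names are assumptions):

```sql
-- Force EXECUTE to return the result set in a well-defined shape
EXEC dbo.GetOrders @CustomerId = 100
WITH RESULT SETS
(
    (
        OrderId   INT           NOT NULL,
        OrderDate DATETIME      NOT NULL,
        Amount    DECIMAL(9,2)  NOT NULL
    )
);
```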

There are a few more features, including UTF-16 support, different methods of metadata discovery, and some enhancements in full-text search, but nothing really major.

We will see – I hope the new version of SQL Server will be really good.

Sql Server Statistics – update and management #2

Last time we saw when SQL Server creates statistics and what database properties affect statistics update behavior. Today let’s talk about what triggers a statistics update and how statistics can be updated manually.

As we know, statistics is based on the table data, and obviously it becomes outdated when the underlying data changes. As you can guess, SQL Server tracks the number of modifications in the table and updates statistics when that number reaches a threshold. Let’s see what the threshold is. Of course, auto update statistics needs to be enabled on the database level.

For regular tables, the update is triggered:

  1. When the row count changes from 0 to non-zero
  2. When the table has 500 rows or less and there were more than 500 changes on the leading column of the statistics
  3. When the table has more than 500 rows and the number of changes on the leading column of the statistics is more than 500 + 20% of the number of rows

For temporary tables there is one additional threshold at 6 rows. This behavior can be altered with the OPTION (KEEP PLAN) query hint.

As a side note, table variables do not keep statistics at all, which can produce quite inefficient plans in some cases. On the other hand, statistics updates on temporary tables can introduce an excessive number of recompilations. If that is the case, it can be addressed with the OPTION (KEEPFIXED PLAN) query hint, which prevents query recompilations based on statistics updates.
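For illustration, both hints in place (the temporary table and its columns are assumptions):

```sql
-- Hypothetical temporary table used by the queries below
CREATE TABLE #Orders (OrderId INT, CustomerId INT, Amount DECIMAL(9,2));
DECLARE @CustomerId INT = 1;

-- KEEP PLAN relaxes the aggressive recompilation threshold for temp tables
SELECT OrderId, Amount
FROM #Orders
WHERE CustomerId = @CustomerId
OPTION (KEEP PLAN);

-- KEEPFIXED PLAN prevents recompilations caused by statistics updates
SELECT OrderId, Amount
FROM #Orders
WHERE CustomerId = @CustomerId
OPTION (KEEPFIXED PLAN);
```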

Statistics is also updated during an index rebuild. It is not updated during an index reorganize operation though.

A few more methods:

  1. The UPDATE STATISTICS statement allows you to update either a single statistics object or all statistics on the table. You can control the sampling method there
  2. sp_updatestats updates all statistics in the database
  3. Statistics update can be done via a maintenance plan
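For illustration (the table and statistics names are assumptions):

```sql
-- Update a single statistics object, scanning every row
UPDATE STATISTICS dbo.Orders IDX_Orders_CustomerId WITH FULLSCAN;

-- Update all statistics on the table with 50 percent sampling
UPDATE STATISTICS dbo.Orders WITH SAMPLE 50 PERCENT;

-- Update all out-of-date statistics in the current database
EXEC sp_updatestats;
```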

As I mentioned before, in most cases the default behavior is fine, although it’s useful to understand how it works internally.

Sql Server Statistics – update and management #1

As we already know, SQL Server relies on statistics during the optimization stage. Obviously, up-to-date and accurate statistics is the key. Let’s see when and how SQL Server manages it.

First of all, there are 3 properties on the database level (Options tab).

“Auto Update Statistics” – controls whether statistics is updated automatically. If this option is enabled, SQL Server detects when statistics is outdated and updates it during query execution. In most cases this is the desired behavior. One possible example of when you might choose to disable it: data in the table is fairly stable and updates arrive in multiple batches nightly. In such a case you can disable the auto update and update statistics manually after all batches are loaded.

“Auto Update Statistics Asynchronously” – by default, statistics is updated synchronously. That means that when the system executes a query and detects that statistics is outdated, it updates the statistics and recompiles the query. This can introduce an additional delay during the statistics update phase, and queries with low timeout settings can time out. If you decide to use asynchronous statistics update, SQL Server generates the plan based on the outdated statistics and updates the statistics in another thread.

“Auto Create Statistics” – by default, SQL Server creates statistics automatically. You should leave it as is in most cases.

By default, when you create an index, SQL Server automatically creates statistics with the same name as the index.

Sometimes SQL Server creates statistics on the column level as well, especially if there is a join operation on a non-indexed column.

Based on the naming convention you can see that this is automatically created statistics on column 03 of the object with ID 0425A276 (hex). Let’s see that:
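The output in the original post is a screenshot; a query along these lines decodes the naming convention (_WA_Sys_&lt;column ordinal in hex&gt;_&lt;object id in hex&gt;) for auto-created statistics:

```sql
SELECT
    o.name AS table_name,
    c.name AS column_name,
    s.name AS stats_name  -- e.g. _WA_Sys_00000003_0425A276
FROM sys.stats s
    JOIN sys.stats_columns sc ON
        s.object_id = sc.object_id AND s.stats_id = sc.stats_id
    JOIN sys.columns c ON
        sc.object_id = c.object_id AND sc.column_id = c.column_id
    JOIN sys.objects o ON
        s.object_id = o.object_id
WHERE s.auto_created = 1;
```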

Next time we will talk about when statistics is updated automatically and how to update it manually.

SQL Server Statistics or When SQL Server uses non-clustered indexes #2

About a month ago we found that SQL Server does not use non-clustered non-covering indexes if it expects to perform the key lookup for more than a few percent of the total table rows. The cost of a key lookup is really high – these are random IO operations – so it’s cheaper to use a clustered index or table scan instead. The main question is how SQL Server estimates how many rows will be returned. Obviously, if you have a unique index and an equality predicate, the answer is simple. But what about the other cases?

In order to do the estimation, SQL Server maintains statistics on the indexes and database columns. Good and up-to-date statistics allow the Query Optimizer to assess the cost of the various execution plans and choose the least expensive one.

Let’s take a look at the statistics in more detail. First, let’s create a table, populate it with some data, and create an index on it. As you can see, the index has values evenly distributed between 0 and 999.

Second, let’s refresh the statistics (we technically don’t need to do that) and run the DBCC SHOW_STATISTICS command.
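For illustration (the table and index names are assumptions):

```sql
-- Refresh the statistics, then look at them
UPDATE STATISTICS dbo.MyTable IDX_MyTable_Value WITH FULLSCAN;
DBCC SHOW_STATISTICS ('dbo.MyTable', 'IDX_MyTable_Value');
```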

It returns 3 result sets.

Let’s take a look at the first one:

This result set contains the main information. It includes:

  1. When statistics was updated
  2. How many rows are in the index
  3. How many rows have been sampled during statistics creation
  4. What is the average key length
  5. Is this a string index?
  6. Is this a filtered index (SQL 2008 only)?

The 2 most important columns are the time of the last statistics update and the number of rows sampled. The first one shows how up-to-date the statistics is, and the second one how complete, or accurate, it is. The last one is interesting. On one hand, we don’t want to scan an entire (large) table/index for statistics creation. On the other hand, we would like to have accurate statistics. If the data is distributed evenly, a partial scan works just fine; with uneven data distribution it can introduce performance issues. One example of such a case is a hosting solution where one huge table stores the data for multiple customers/accounts. A small account obviously has less data than a large account, so a plan that is efficient in one case can introduce issues in another.

Let’s look at the second result set.

This result set shows the density (1.0 / # of unique values) and the key length for each combination of keys in a composite index. Don’t forget that technically every non-clustered index is a composite one – the leaf rows (and non-leaf rows for non-unique indexes) include the clustered key values.

The last result set is the most interesting one. It is called the histogram and shows the actual distribution of the data. As a side note, only the leftmost column values are included there.

So what does it mean?

  1. Range_Hi_Key – the upper-bound value for a step
  2. Range_Rows – estimated number of rows in the step with value < Range_Hi_Key. In other words, the number of rows with value > the previous Range_Hi_Key and < the current Range_Hi_Key
  3. EQ_Rows – estimated number of rows with value = Range_Hi_Key
  4. Distinct_Range_Rows – estimated number of distinct values within the step, excluding the upper-bound value
  5. Avg_Range_Rows – average number of rows per distinct value within the step (Range_Rows / Distinct_Range_Rows)

Now let’s see what happens if we run the select:

Let’s see what SQL Server does in such a case. The set of values is within the step with Range_Hi_Key = 13.

There are 500 rows in the step, 5 distinct values, and 100 rows per value. Assuming the data is distributed evenly, SQL Server estimates 400 rows to be returned, which matches the actual number of rows.

So a comparison between the estimated and the actual number of rows is one of the first things to check when troubleshooting an inefficient plan. It’s completely OK if the values are slightly off – for example, estimated 400 rows vs. actual 1000 – but if the difference is really big, most likely there are some problems with the statistics.

The code can be downloaded from here

Composite Indexes

Last time we defined when a single-column index can be used for index SEEK operations. Now let’s talk about composite indexes.

Obviously, the first and most important question is when a composite index can be used for SEEK operations. Let’s take a look at an example. Let’s create a table with 250,000 rows and a composite index on it.

Now let’s run a few tests:
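The tests in the original post are shown as screenshots; a sketch of what they could look like (the schema and names are assumptions):

```sql
-- Hypothetical schema matching the example
CREATE TABLE dbo.MyTable (
    ID INT NOT NULL PRIMARY KEY CLUSTERED,
    Column1 INT NOT NULL,
    Column2 INT NOT NULL,
    Column3 INT NOT NULL
);
CREATE NONCLUSTERED INDEX IDX_MyTable_Column1_Column3
ON dbo.MyTable (Column1, Column3);

-- Index seek: predicates on the leftmost column are SARGs
SELECT ID FROM dbo.MyTable WHERE Column1 = 10;
SELECT ID FROM dbo.MyTable WHERE Column1 = 10 AND Column3 = 20;
SELECT ID FROM dbo.MyTable WHERE Column1 BETWEEN 10 AND 20;

-- Clustered index scan: no predicate on the leftmost column
SELECT ID FROM dbo.MyTable WHERE Column3 = 20;
```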

As you can see, SQL Server uses an index seek every time the predicate on Column1 (the leftmost column) is a SARG. The last select, where we don’t have a predicate on this column, produces a clustered index scan.

So it’s that simple – a composite index can be used for an index SEEK as long as the predicates on the leftmost index columns are SARGs. Well, in real life it’s a little bit more complicated – you should not forget about the really small threshold for non-clustered index usage. If SQL Server expects the iterator to return more than a few percent of the total table rows, the index will not be used.

This leads us to quite an interesting conclusion: if you have an index on (Column1, Column3), you don’t need a separate index on (Column1). Granted, seeks on the second single-column index would be faster because of the smaller row size, but in most cases the performance benefits are not worth the cost of the index maintenance.

Code can be downloaded from here

Index Scan and Index Seek

Both operations utilize the indexes. But what is the difference?

The Scan operator (on a clustered or non-clustered index) scans the entire index – all data pages and all rows – and applies the predicate to every row.

The Seek operator processes only a subset of the rows in the index (0..N) – only the rows that qualify.

Assume you have a table:

Now let’s run the select.

As you see, it shows clustered index scan. You can think about that in the following way:

Now let’s create the index on ADate column.

A seek operation does not need to process the entire index (the rows there are already sorted), so SQL Server finds the first row with ADate = ‘2010-10-01’, processes the rows until ADate reaches ‘2010-10-02’, and stops.

Obviously the index seek operation is much more efficient. But not all predicates are seekable. What is seekable:
ADate = '2010-10-01'
ADate < '2010-10-02'
ADate between '2010-10-01' and '2010-10-05'
OrderId in (1,3,5)
VarcharField like 'A%'

Basically predicate is seekable if Query Optimizer can map it to the subset of the rows in the index.

Now let’s think about the opposite examples:

Abs(OrderId) = 1 – non-seekable predicate. Fortunately, you can convert it to: OrderId in (-1, 1)
OrderId + 1 = 10 – can be converted to: OrderId = 9
DateAdd(day,7,OrderDate) > GetDate() – can be converted to: OrderDate > DateAdd(day,-7,GetDate())
datepart(year,OrderDate) = 2010 – can be converted to: OrderDate >= '2010-01-01' and OrderDate < '2011-01-01'
VarcharField like '%A%' – here you’re out of luck.

So as you can see, functions and calculations on the column make the predicate non-seekable. Avoid that if possible.
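A quick way to verify this is to compare the plans and IO for both forms of the same predicate (the table is a hypothetical dbo.Orders with an index on OrderDate):

```sql
SET STATISTICS IO ON;

-- Non-seekable: a function on the column forces a scan
SELECT OrderId
FROM dbo.Orders
WHERE datepart(year, OrderDate) = 2010;

-- Seekable rewrite of the same predicate: an index seek
SELECT OrderId
FROM dbo.Orders
WHERE OrderDate >= '2010-01-01' AND OrderDate < '2011-01-01';
```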

You can download the code from here

Covering indexes

We already know the structure of the clustered and non-clustered indexes. We also know what is the main criteria for SQL Server to use non-clustered index.

Today I want to talk about covering indexes technique. Let’s start with the definition.

A covering index for a query is an index that has all the required values in the index leaf rows and, as a result, does not require a key/bookmark lookup of the actual table data.

Let’s see the example. First of all, let’s create the table and populate it with some data

Now let’s try to run 2 selects – one that returns all columns for the customer with id = 100, and a second one that returns customerid and orderid only.

As you can see, the first select produces a plan with an index seek and a key lookup. The second one does not do the key lookup at all. Remember, a non-clustered index has the clustered index values in its rows, so OrderId is there. No need for the key lookup.

Two other things I would like to mention. First, you can see that the cost of the key lookup is 98% of the cost of the first select. Second, the second select is about 50 times less expensive than the first one.

What should we do if our typical select needs to get the total amount of the orders placed by a customer, grouped by date? Obviously, we’re filtering (or grouping) by customerid and date in such a case. We don’t really need amount as an additional key column in the index because we’re not using it in the predicates. Fortunately, SQL Server has a way to add other columns and store them in the index leaf. You can do it with the INCLUDE clause of the CREATE INDEX statement.
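The index from the original post is shown as a screenshot; it would look along these lines (the table and column names are assumptions based on the example):

```sql
-- CustomerId and OrderDate are index keys and can be used for seeks;
-- Amount is stored only in the leaf rows, which makes the index
-- covering for the "total amount per date" query without widening the keys
CREATE NONCLUSTERED INDEX IDX_Orders_CustomerId_OrderDate
ON dbo.Orders (CustomerId, OrderDate)
INCLUDE (Amount);
```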

Below is a picture which shows the structure of this index (on some abstract tables). Click here to open it in a new window.

Let’s check how it works in the example with the orders table:

The plan is an index seek, a key lookup, and then grouping. As you can see, it produces 114 logical reads.

Now let’s create the index and run select one more time.

Simple index seek with 2 reads. Great improvement.

The last example I want to show you – let’s try to select the max order amount.

As you can see in the first plan, it uses a non-clustered index scan. SQL Server cannot use that index for a seek operation – but it can scan the index. Because the index leaf row size is smaller than the clustered index row size, it’s more efficient. Look at the IO statistics.

This is one of the easiest optimization techniques you can use in the system – eliminating key lookups with a covering index.

One warning though – extra included columns increase the leaf row size, so the index will require more pages and be less efficient. So don’t make it extremely wide.

You can download the code from here

When SQL Server uses non-clustered indexes

Last time we took a look at the structure of the indexes. Now let’s talk about index usage in general

Assume you run a query that requires a clustered index scan. Obviously, it requires SQL Server to read every data page of the clustered index for processing. At the same time, the number of physical reads is much less than the number of logical reads, because SQL Server uses a technique called “read-ahead”. In a nutshell, SQL Server reads up to 64 data pages in a single read operation, so the next time SQL Server needs the next data page, it is most likely already in the buffer cache. It still introduces a logical read but does not produce a physical read.

Now let’s think about an operation that does lookups through a non-clustered index. As we know, SQL Server needs to traverse the clustered index b-tree for every key/bookmark lookup operation. Even though the read-ahead technique works for the non-clustered index itself, the actual table (clustered index) data can reside anywhere in the data files, so these are basically random IO operations.

Let’s see the example. First of all, let’s create the following table:

Let’s evenly distribute the data in the Name column with the combinations AAA, AAB, … ZZZ, adding 5 rows per combination, and create a non-clustered index on that field.

Let’s see some statistics.

As you can see the table has 87880 rows. Clustered index leaf level has 4626 pages and non-clustered index leaf level has 296 pages. Let’s do the tests.

First test. Let’s select rows with Name = ‘ABC’

As you see, it returns only 5 rows and leads to 18 logical reads. SQL Server uses the non-clustered index in this case.

Second test. Let’s select rows with name starting with ‘AB’

As you see, it returns 130 rows and produces 410 logical reads. SQL Server still uses the non-clustered index. Another thing worth mentioning is that 99% of the cost of the query is the key lookup.

Next test. Let’s select rows with names starting with ‘A’. Basically, it returns 1/26th of the rows, which is about 3.8%.

As you can see, SQL Server decided to do the clustered index scan which leads to 4644 logical reads.

Let’s run the same select forcing SQL Server to use the non-clustered index.
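The forced-index select could look like this (the index and table names are assumptions):

```sql
-- The index hint overrides the optimizer's choice
SELECT ID, Name
FROM dbo.MyTable WITH (INDEX (IDX_MyTable_Name))
WHERE Name LIKE 'A%';
```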

As you can see, it produces 10375 logical reads – more than 2 times the clustered index scan. And without any help from read-ahead.

So you see – the threshold at which a non-clustered index scan becomes more expensive is really small. SQL Server will not use a non-clustered index with key/bookmark lookups if it expects the iterator to return more than a few percent of the total rows in the table.

Where does it leave us? Make the index highly selective.

Code can be downloaded from here

Indexes structure

Ok, enough about tables for now. Let’s talk about indexes. As all of us know, there are 2 types of the indexes in SQL Server:

1st – Clustered indexes. There is one such index per table, and it basically specifies the order in which data is stored in the table. For example, if the table has a clustered index on an integer field, the data will actually be sorted by that integer field. Please don’t be confused – there is still no such thing as a default sorting order for queries – the order of the rows SQL Server returns depends on the execution plan, which could be different from a clustered index scan.

SQL Server does not require you to create clustered indexes – tables without them are called heap tables. We will talk about such tables later.

The 2nd type of indexes is non-clustered indexes. SQL Server allows up to 249 non-clustered indexes per table in SQL 2005 and 999 in SQL 2008 (thanks to Chirag for pointing this out). This is “just an index”.

If you think about a book, the page number is the clustered index. The alphabetical annotation at the end of the book is the non-clustered index.

So what exactly is an index? It is a B-tree. Let’s see what that is.

The image above shows a small table with ID as the primary key (and clustered index). As a side note, SQL Server creates a clustered index on the primary key field by default.

The leaf level (the bottom one) contains the actual table data sorted by ID. As you can see, the data pages are linked into a double-linked list so SQL Server can scan the index in both directions.

The levels above the leaf level are called “intermediate levels”. Every index row on those levels points to a separate data page in the level below. There can be several intermediate levels depending on the table size, but the top (root) level of the index always has 1 data page.

So let’s see how it actually works. Assume you want to select the record with ID = 50. SQL Server starts from the root level and finds that the first row contains ID=1 and the second row contains ID=57. That means the row with ID=50 is located on the data page that starts with ID=1 on the next level of the index. So the next step is analyzing the first data page on the intermediate level, which contains IDs from 1 to 50. SQL Server finds the row with ID=50 and jumps to the leaf-level page with the actual data.

Now let’s look at the non-clustered index. Assume we have an index on the Name field.

The structure of the index is exactly the same, with the exception that the leaf level does not contain table data but the values of the clustered index key. It does not really matter if you specify ID in the index definition – it will be there. For heap tables, the leaf level contains the actual RID – the row id, which consists of FileId:PageNumber:RowNumber. The annotation at the end of a book is a good example: it does not include the actual paragraph from the book, but the page # (the clustered index).

Let’s see how SQL Server works when it uses a non-clustered index for lookups on a table with a clustered index. As you can see, first it needs to find the ID of the row(s) and then perform a clustered index lookup to obtain the actual table data. This operation is called “key lookup”, or “bookmark lookup” in previous editions of SQL Server.

2 things I would like us to remember:
1. A non-clustered index has the value of the clustered index key at the leaf level
2. As a result, when SQL Server uses a non-clustered index for a lookup, it needs to traverse the clustered index to get the actual data row.

This introduces interesting performance implications that we will talk about next time.

Vertical Partitioning as the way to reduce IO

Let’s think about the following scenario. Assume we have a table with a lot of data, where some of the large fields are rarely accessed, while at the same time there are constant scans on a subset of the fields.

A good example is a large Articles catalog in a system where customers can search by various columns/attributes. Let’s assume an Article has a description field which holds 1000-2000 bytes on average. Customers are searching on part of the article name, item number, price, etc., and 98% of the customer activity is searching and browsing the results. Let’s say that only 2% of requests open the article properties and look at the description.

Assume all those searches lead to clustered index scans. As we already know, we want to minimize the number of IO operations, so we want to make the row as small as possible. Let’s see the example:

Now let’s run the select based on the part of the item number:

Obviously, one way to improve the situation is to create a non-clustered index on all fields which could be in the predicate list. The problem with this approach is that in most cases SQL Server would not be able to do an index seek. It can still do an index scan, but the price of this option really depends on the number of records in the result set – key/bookmark lookups can make it really expensive. Alternatively, you could create a covering index and include every field you need for browsing – in our case Name, Item Number, Price, and OtherFieldsShownInBrowse – but that would make the index leaf rows very big. In some cases, though, it is still acceptable and better than the solution I will show below.

Let’s change the schema and split Articles table vertically:

To simplify the situation – let’s create the view and join both tables.

Now, let’s run the previous select against view:

As we can see, the situation is even worse – even though we don’t need the Description attribute, SQL Server joins both tables. It does that because of the inner join: SQL Server needs to filter out all records from ArticleMain that do not have corresponding records in ArticleDescriptions.

Let’s change it to the outer join:
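The changed view from the original post is shown as a screenshot; it would be along these lines (the column lists are assumptions, the table names follow the post):

```sql
ALTER VIEW dbo.Articles
AS
SELECT
    m.ArticleId, m.Name, m.ItemNumber, m.Price,
    d.Description
FROM dbo.ArticleMain m
    LEFT OUTER JOIN dbo.ArticleDescriptions d ON
        d.ArticleId = m.ArticleId;
```

Because ArticleDescriptions.ArticleId is unique (it is the key of the split-off table), the optimizer can eliminate the outer join entirely when a query against the view does not reference Description.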

Now let’s run select again:

Now when we run the select, we see that it scans the ArticleMain table only. The results are much better. If we need to select fields from both tables (for example, when the user wants to see the properties), it will do the join.

We can even create an INSTEAD OF trigger on the view to cover insert and update operations and simplify the client development if needed.

Again, I don’t want to say that this method is right in every case. As a bare minimum you should have:
1. A lot of data in the table
2. Scans with predicates which do not support INDEX SEEK operations and return a lot of data
3. Researched whether a regular (covering) index would be more beneficial than this solution.

But in some cases vertical partitioning would help.