Monthly Archives: October 2010

Sunday T-SQL Tips: Inserted, Deleted tables and OUTPUT clause (Part 3 – Merge and insert of the master/detail data)

Last Sunday we saw what we can do with the MERGE statement and OUTPUT clause. Today I want to show you the practical angle of this feature.

Let’s assume you want to insert packet of master/detail rows in the situation when Master table has ID as identity field.


This is the trivial task if you have only one master row – you can accomplish it with insert statement, get the identity with @@identity or SCOPE_IDENTITY() and next perform insert of the detail rows. The situation is more complicated if you have multiple master-detail rows in the packets. In such case MERGE and OUTPUT would help. First of all, you need to have “internal” unique ID to link master and detail rows in the packets. Let’s see that:

Next, you need to insert the data to the master table with the MERGE operator and get both – internal ID and inserted (identity) ID from the packet. After that you can use simple insert statement.

Code can be downloaded from here

SQL Server Statistics or When SQL Server uses non-clustered indexes #2

In about a months ago we found that SQL Server does not use non-clustered non-covered indexes in the case, if it expects that it will need to perform the key lookup for more than a few percent of the total table rows. The cost of key lookup is really high, this is random IO operations so it’s cheaper to use the clustered index or table scan instead. The main question is how SQL Server estimates how many rows would be returned. Obviously if you have unique index and equal predicate, the answer is simple. But what about the other cases?

In order to be able to do the estimation, SQL Server maintains the statistics on the indexes and database columns. Good and up to date statistics allow Query Optimizer to assess the cost of the various execution plans and choose the least expensive one.

Let’s take a look at the statistics in more details. First, let’s create a table, populate it with some data and create the index on this table. As you can see the index has the values evenly distributed between 0 and 999.

Second, let’s refresh the statistics (we technically don’t need to do that) and call the dbcc_showstatistics stored procedure.

This procedure returns 3 result sets.

Let’s take a look at the first one:

This result set contains the main information. It includes:

  1. When statistics was updated
  2. How many rows are in the index
  3. How many rows have been sampled during statistics creation
  4. What is the average key length
  5. Is this the string index?
  6. Is this the filtered index (in SQL 2008 only).

The 2 most important columns are: what is the time of the last statistics update and how many columns were sampled. The first one shows how up-to-date is the statistics and the second one how full or accurate is the statistics. The last one is interesting. On one hand we don’t want to scan entire (large) table/index for the statistics creation. On other hand we would like to have an accurate statistics. So if data is distributed evenly, the partial scan would work just fine. In other cases it could introduce performance issues during uneven data distribution. One of the examples of such case is the hosting solution where one huge table stored the data for the multiple customers/accounts. Small account obviously has less data than large account. So plan that could be effective in one case could introduce the issues in another.

Let’s look at the second result set.

This result set shows the density (1.0 / # of unique values) and the length for the combination of the keys in the composite indexes. Don’t forget that technically every non-clustered index is the composite one – the leaf row (and non-leaf rows for non-unique indexes) includes clustered key values.

The last result set is the most interesting one. It calls the histogram and shows the actual distribution of the data. As the side note – only leftmost column values are included there.

So what does it mean?

  1. Range_Hi_Key – this is the upper-bound value for a step
  2. Range_Rows – Estimated number of rows in the step with Value < Range_Hi_Key. On other word – number of rows with value > Previus_Range_Hi_Key and < Range_High_Key
  3. EQ_Row – Estimated number of rows with value = Range_Hi_Key
  4. Distinct_Range_Rows – Estimated # of distinct rows
  5. Avg_Range_Rows – Estimated # of rows with duplicate values

Now let’s see what happens if we run the select:

Let’s see what SQL Server does in such case. The set of the values are within the step with Range_Hi_Key = 13.

There are 500 rows in the step, 5 distinct values and 100 rows per value. Assuming that data is distributed evenly, SQL Server estimates 400 rows to be returned. That matches the actual number of the rows.

So comparison between Estimated # of Rows and Actual # of rows is one of the first things need to be done during troubleshooting of inefficient plans. It’s completely OK if values are slightly off – for example estimated # = 400 and actual # = 1000, although if the difference is really big most likely there are some problems with the statistics.

The code can be downloaded from here

Sunday T-SQL Tips: Inserted, Deleted tables and OUTPUT clause (Part 2 – Merge statement)

Last Sunday we saw how we can intercept new and old values from the row during insert/update/delete operations with OUTPUT clause. Even if it’s very powerful mechanism, there is one very important limitation. You can intercept only values from the row of the table that gets updated.

So, for example, if you have:

you can intercept the values from DestTable rows but you cannot get any values from SourceTable rows.

In SQL Server 2005 you’re stuck. Fortunately SQL Server 2008 has the solution for that problem – Merge operator.

So what is MERGE? In short – it’s insert/update/delete operator  – all-in-one. Assuming you have a source table and need to do that batch update based on another table. So you need to delete rows on the source table that do not exist in the second table, insert new rows from the second table and update modified rows. Let’s see that:

So Refresh process should delete row with ID = 1 from the SourceTable, insert row with ID = 7 and update rows with ID = 3 and 6.

Let’s see that in action. SQL 2005 approach is to use separate statements:

As you can see it works. But there are 3 statements. Longer locking, side effects based on the isolation levels, etc etc etc.

Now let’s do the same thing with MERGE operator:

As you can see – one operator and that’s all. Very useful and very very very convinient. But that’s not all – let’s see what can be done with the OUTPUT clause there:

It’s not only allowing you to capture old and new version of the row, it also allows to capture ACTION and the field from the Source (DataTable). The last one cannot be done with the regular insert/update/delete statements.

We will see what benefits will it give us next Sunday.

The code can be downloaded here

Composite Indexes

Last time we defined when single column index can be used for the index SEEKs. Now let’s talk about composite indexes.

Obviously, the first, and most important question – when composite index can be used for the SEEK operations. Let’s take a look at the example. Let’s create the table with 250,000 rows and create composite index on such table.

Now let’s run a few tests:

As you can see, SQL Server uses index seek each time when predicate on the Column1 (leftmost column) is SARG. The last select where we don’t have the predicate on this column produces the clustered index scan.

So it’s that simple – composite index can be used for the index SEEK as long as predicates on the index leftmost columns are SARG. Well, in real life it’s a little bit more complicated – you should not forget about really small threshold for non-clustered index usage. If SQL Server expects that iterator returns more than a few % of total table rows, index would not be used.

This leads us to quite interesting conclusion. If you have an index on (Column1, Column3), you don’t need to have the separate index on (Column1). Granted, seeks on the second single-column index would be faster because of the smaller row size, but performance benefits you gain in most part of the cases do not worth the price of the index maintenance.

Code can be downloaded from here

Sunday T-SQL Tips: Inserted, Deleted tables and OUTPUT clause (Part 1 – insert/delete/update statements)

During the next a few Sundays I’m planning to talk about one of the under-appreciated constructs in T-SQL – OUTPUT clause. This clause can help you a lot when you need to develop set-based code or convert old cursor-based code into the set operations.

Output clause works together with inserted and deleted system tables. Every developer who wrote at least one DML Trigger familiar with those tables. Those are 2 tables that SQL Server populates and manages automatically. Inserted table contains the new version of the row values. Deleted tables contains the old version of the row values. As you understand inserted table has the data during insert and update operations. Deleted table has the data during update and delete operations.

Let’s see that in the action. First let’s create a table:

Now let’s assume we want to implement the audit trigger. Below is very simple and straightforward approach how we can do that:

Now let’s test the solution.

Below is the XMLs

Anyway, the good and very useful thing that inserted and deleted tables are available not only with the triggers but with insert/update/delete statements. Let’s take a look. First – insert:

As you can see – you have access to generated identity values. You can see that it  works the same way with updates and deletes.

One particularly interesting thing is what happen if statement is rolled back. Let’s populate data to the table again and create the trigger which rollbacks the transaction

Now let’s run the update statement:

As you can see – statement is rolled back – the data has not been changed in the original table but deleted/inserted table were created and populated. Output clause also worked. Such behavior opens the door to the interesting possibilities, for example audit on rollback.

You can download the script from here. Next Sunday we will talk about OUTPUT clause with MERGE statement. This combination is much more powerful than OUTPUT with insert/update/delete.

SQL Saturday 49

I’m going to speak on SQL Saturday #49 this Saturday (10/16/2010) in Orlando. 1 session this time – “Locking and Blocking for Developers”. This is revised version of the presentation I did in Miami – I will put more details to resource waits and bottleneck troubleshooting.

Please stop by if you’re planning to attend.

See you there!

Dmitri

Index Scan and Index Seek

Both operations utilize the indexes. But what is the difference?

Scan operator (on clustered or non-clustered indexes) scans entire index – all data pages and all rows and applies the predicate on every row.

Seek operator “process” only subset of the rows in the index (0..N) – only rows that qualifies.

Assume you have a table:

Now let’s run the select.

As you see, it shows clustered index scan. You can think about that in the following way:

Now let’s create the index on ADate column.

Seek operation does not require to process entire index (rows there are sorted already), so SQL Server finds the first row with ADate = ‘2010-10-01’ and processes the rows until ADate reaches ‘2010-10-02’ and stops.

Obviously index seek operation is much more efficient. But not all predicates are Seekable. What is seekable:
ADate = '2010-10-01'
ADate < '2010-10-02'
ADate between '2010-10-01' and '2010-10-05'
OrderId in (1,3,5)
VarcharField like 'A%'

Basically predicate is seekable if Query Optimizer can map it to the subset of the rows in the index.

Now let’s think about opposite example:

Abs(OrderId) = 1 – Non-seekable predicate. Fortunately you can convert it to: OrderId in (-1, 1)
OrderId + 1 = 10 – Can be converted to: OrderId = 9
DateAdd(day,7,OrderDate) > GetDate() – Can be converted to: OrderDate > DateAdd(day,-7.GetDate())
datepart(year,OrderDate) = 2010 – Can be converted to: OrderDate between '2010-01-01' and '2010-12-31'
VarcharField like '%A%' – here you’re out of luck.

So as you see, functions, calculations on the column makes predicate non-seekable. Avoid that if possible

You can download the code from here

Sunday T-SQL Tip: APPLY operator

One of the new features of SQL 2005 is APPLY operator. Based on books online APPLY is:

The APPLY operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query. The table-valued function acts as the right input and the outer table expression acts as the left input. The right input is evaluated for each row from the left input and the rows produced are combined for the final output. The list of columns produced by the APPLY operator is the set of columns in the left input followed by the list of columns returned by the right input.

A bit confusing if you read it for a first time. Let’s try to clarify it. Think about APPLY the same way as about the JOIN. The difference is that JOIN tables are independent from each other but APPLY is dependent from the left source. There are 2 types of APPLY – CROSS APPLY (think about it as about it as about inner join) and OUTER apply (outer join)

Let’s see it in the example. This will use the orders table created earlier (code can be downloaded from here).

Let’s create the Customers table and populate it with the data

Assuming we want to return the result set that returns 2 most recent orders per customer. Let’s create inline table-valued functions which can do that.

Now let’s write select with cross apply – again think about it as about inner join that joins customer data (source) with 2 rows per customer produced per table-valued function.

Here it is. As you can see – it’s quite simple. What’s interesting about it – you don’t really need to use the function – you can simply put dependent select into the FROM cause. Look below:

You can download the code from here

Covering indexes

We already know the structure of the clustered and non-clustered indexes. We also know what is the main criteria for SQL Server to use non-clustered index.

Today I want to talk about covering indexes technique. Let’s start with the definition.

Covering index for the query is the index that has all required values in the index leaf rows and as result does not require key/bookmark lookup of the actual table data.

Let’s see the example. First of all, let’s create the table and populate it with some data

Now let’s try to run 2 selects – one that return all columns for the customer with id = 100 and second one that returns customerid and orderid only.

As you can see, the first select produces the plan with index seek and key lookup. The second one does not do the key lookup at all. Remember, non-clustered index has the values from the clustered index in the rows so OrderId is there. No needs for the key lookup.

2 Other things I would like to mention. First, you can see that cost of key lookup is 98% of the cost of the first select. Another thing that second select is about 50 times less expensive than the first one.

What should we do if our typical select requires to get a total amount of the orders placed by the customer grouped by date. Obviously we’re filtering (or grouping) by customerid and date in such case. We don’t really need to have amount as additional column in the index because we’re not using it in the predicated. Fortunately SQL Server has a way to add other columns and store it in the index leaf. You can do it with INCLUDE cause of the CREATE INDEX statement.

Below is the picture which show the structure of this index (on some of abstract tables). Click here to open it in the new window.

Let’s check how it works in the example with the orders table:

Plan is index seek, key lookup and next grouping. As you can see it produces 114 logical reads.

Now let’s create the index and run select one more time.

Simple index seek with 2 reads. Great improvement.

Last example I want to show you – let’s try to select max order amount.

As you can see in the first plan – it uses non-clustered index scan. SQL Server cannot use that index for seek operation – but it can scan the index. Because index leaf size is smaller than clustered index row size, it’s more efficient. Look at io statistics

This is one of the easiest optimization techniques you can use in the system – illuminating key lookup with the covering index.

One warning though – extra included columns in the index increase the leaf size, so index will require more pages and be less efficient. So don’t make it extremely wide.

You can download the code from here

Sunday T-SQL Tip: How to generate “Pseudo-identity” values

There are some cases when you don’t want to create physical identity column in the table and want to generate the similar value manually. There are a few methods how to do that. Let’s look at 2 of them.
The first method is the Counters table. Basically you’re creating the table like that:

When you need the new value, you simply get the current one and update it with the new one. It could be wrapped up to the stored procedure like that:

The beauty of this method is the locking – update statement places the exclusive lock on the row so until transaction is active, no other sessions would be able to update the same row. Be careful though with SNAPSHOT isolation level – it would produce an exception during simultaneous access rather than serialize the access.

Let’s test that:

Second method is using identity but from another dummy table:

Let’s see how we can get the single value:

And next – the group of values:

Obviously this method would not protect from the gaps in the values.

You can download the source code from here