Monthly Archives: November 2010

Sunday T-SQL Tip: Insert/Update pattern

Let’s think about the system that collects and processes the data. Quite often processing happens after data have been inserted into the system. Typically the rows have set of the columns inserted as NULL. Processing routine loads the rows, update those columns and save rows back.

Looks simple but this pattern introduces major page split/fragmentation issues. Think about that – SQL Server tried to fill pages up to 100%. So when you update the rows, pages don’t have enough free space to keep the data. Let’s look at the example:

First – let’s create the table. There are only 3 columns in this table – identity, char column as the placeholder for the columns populated during insert stage and post processing placeholder.

Second, let’s check the physical index statistics. Index is practically perfect.

Now let’s execute update and populate post processing placeholder column. Now check the statistics – terrible fragmentation and a lot of space wasted.

Obviously the workaround is simple – reserve the space for the post processing columns on the insert stage.

Source code is available here

Identity vs. GUID as the clustered index

Leave a reply

As we remember from the last week – there are 3 requirements for the clustered index. It needs to be unique, static and narrow. Let’s think about identity and GUID – what is more appropriate for that.

Uniqueness. Both – Identity and GUID are unique. Identity is unique in the scope of the table. GUID is unique in the scope of the database.
Static. Well, there is no particular reasons to modify either of those values. So they are static
Narrow. This is the key difference. Identity is typically int (4 bytes) or BigInt (8 bytes). GUID is 16 bytes. As we remember, it also means that non-clustered rows would be 8-12 bytes bigger with GUID. Identity is the clear winner in such case

Let’s add one other factor into consideration.

Identity is increasing monotonously. It means that SQL Server fills data pages one-by-one. That’s typically not a bad thing – it introduces high fillfactor and reduces the fragmentation unless you have the table with really high number of inserts. In such case the system will have contention in the extents allocation. The number varies based on the multiple factors – row size, hardware, etc. I would not worry about that unless I have at least a few hundred inserts per second. In such case I would like to have the clustered index that distributes the load across entire table. (And let’s be realistic – in such case I want to have clustered index that covers most critical queries).

GUID could behave similarly to Identity when it generated with NewSequentialId() or random when generated with NewId(). This is probably the most important difference besides the size. So technically, you can alter the behavior with one default constraint, although again, I’m not sure that GUID is a good choice for the transactional tables anyway.

So use GUID if:

You need to have database-wide uniqueness and don’t want to use any other workarounds
Table is small and row size does not really matter
Want to be able to alter value distribution and as result write pattern (again, this is questionable solution)

Otherwise use identity.

Sunday T-SQL Tip: Union vs. Union All

Leave a reply

What could be simpler than union? Probably “union all”.

“Union” removes all duplicates from the combined row set. Union all simply concatenate the row sets. Obviously, if original row sets are unique, results would be the same. But plans are not.

Take a look here. Let’s create 2 tables with some data.

Now let’s do the union and union all and see the plans.

Union introduces another expensive “Distinct Sort” operator. As result the execution plan with union is almost 2 times more expensive than with union all. Of course, your mileage may vary – in the case of the large and complex queries the higher cost of “Union” operator could be hidden but in any case, it’s more expensive. So don’t use it if rowsets don’t have duplicates.

Speaking of the duplicates – there is another thing you need to keep in mind. When union removes the duplicates from combined rowsets, it does not really matter where duplicates come from. So you don’t need to use distinct as part of the selects. And it affects the plan too. Look here:

Source code is available for download from here

Primary Key and Clustered Index

Leave a reply

Primary key looks similar to the clustered index but it’s not the same.

Primary key is the logical concept. Clustered index is the physical concept. By default SQL Server creates the clustered index on the table primary key. But this is default behavior only

How to choose correct clustered index? Obviously it would be beneficial for the system if clustered index covers most frequent/important queries. Unfortunately in real life it’s not always possible. Although there are 3 criteria you need to use. Those criteria are straightforward if you remember the structure of the indexes. I will post 2 old images here:

Clustered index:

Non-clustered index:

So what are the criteria?

Clustered index needs to be unique. If you don’t define the clustered index as unique, SQL Server adds 4 bytes hidden “uniquifier” to the row. This is the method that helps SQL Server to distinguish between the rows with the same clustered index values (think about non-clustered index rows for example). If you need to create clustered index on non-unique column(s), I would suggest to add another identity column as the last column in the index. Same storage size but it gives you better control. You will be able to update row via clustered index values for example.
Clustered index needs to be static. Remember, clustered index controls actual data sorting order. If you change it, SQL Server needs to move the row to the different place in the table. Besides that it needs to update all non-clustered index rows with the new clustered index values. Very expensive operations
Clustered index needs to be narrow. Again, it exists in the every non-clustered leaf (and for non-unique indexes non-leaf) rows. Large clustered index means large non-clustered index value -> less rows per page -> more data pages -> more IO operations. So less efficient non-clustered indexes. And more storage space, of course

Those 3 factors are important. You have to keep them in mind when you design the table.

Update (2011-11-24). I covered Unifuifiers in much greater depth in that post

SQL Server code-name “Denali” T-SQL enchancements

Leave a reply

No SQL Tips today. A couple notes from PASS Summit 2010 where Microsoft announced SQL Server code name “Denali”. There are a few interesting enhancements in T-SQL for that version.

1. Exception handling

SQL Server finally enhanced exception handling to make it more or less useful. No, there are still no such thing like “Finally” block but there is THROW operator that allows to throw and re-throw exceptions. It always comes with severity 16 and works from within code blocks – you can throw an error from the inner stored procedure and catch it in the outer stored procedure catch block. Good enhancement in compare with raiserror
2. Sequences

Nice to have feature even if it’s easy to implement similar mechanism right now with the separate table.

3. Contained databases

SQL Server “Denali” introduces “Partial” self-contained database mode. The long time roadmap is to be able to create the database which is self-contained and can be moved/deployed from one server to another and “just works”. It should include logins, linked servers, sql broker settings, etc. “Denali” technically support only “PARTIAL” mode that disconnects logins from the database (You will be able to create database user with password and SQL Server will be able to authenticate against this user). Obviously in this mode you will not be able to access any resources outside of the database. Another thing in this mode – objects in tempdb would use/be created in the same collation with the database.

In addition to that it introduces the DMVs show the breaches for the contained mode. It would include dynamic sql, references to other databases, linked servers, service broker queues, etc. Good feature, but not really useful for ISV who needs to support multiple SQL Server versions.
4. Better paging support

T-SQL will support selects like that:
<code>select ..
from Table
Order By Field
OFFSET 100 ROWS
FETCH NEXT 50 ROWS ONLY </code>

5. Ability to specify the shape of results from EXECUTE statement

You will be able to specify the format of the result sets that EXECUTE statement returns. Interestingly enough, Microsoft mentioned that you will be able to do it with sp_executesql but there is nothing about it in the documentation. Hope it would be supported especially because best practices suggest to use sp_executesql instead of EXEC

There are a few more features including UTF-16 support, different methods of metadata discovery, some enhancements in full-text search but nothing is really major

Will see and hope that new version of SQL Server would be really good.

Sql Server Statistics – update and management #2

Leave a reply

Last time we saw when SQL Server creates statistics and what database properties affects statistics update behavior. Today let’s talk what triggers statistics update and how can it be updated manually.

As we know, statistics is based on the table data. And obviously statistics is outdated when underlying data is changed. As you can guess at this point, SQL Server tracks the number of the modifications in the table and updates statistics when number of the modifications reach the threshold. Let’s see what is the threshold. Of course, auto update statistics needs to be enabled on the database level.

For the regular tables update is triggered when:

When row count changes from 0 to non-zero
When table has less than 500 rows and there were more than 500 updates on the leading column of the stats
When table has more than 500 rows and there were more than 500+20% of the number of rows changes on the leading column of the stat

For the temporary table there is one additional threshold at 6 rows. This behavior can be altered with OPTION (KEEP PLAN) query hint.

As the side note, temporary table variables do not keep statistics. That can produce quite inefficient plans in some cases. On the other hand update statistics on the temporary tables can introduce extensive number of recompilations. If this is the case, it could be fixed with OPTION (KEEPFIXED PLAN) query hint that prevents query recompilations based on statistics update.

Statistics also updates during index rebuild stage. It does not update during index reorganize operation through..

A few more methods:

UPDATE STATISTICS statement allows you to update either single statistics or all statistics on the table. You can control sampling method there
sp_updatestats update all statistics in the database
Statistics update could be done via maintenance plan

As I mentioned before, in most part of the cases default behavior is OK. Although it’s useful to understand how it works internally.

Sunday T-SQL Tips: Inserted, Deleted tables and OUTPUT clause (Part 4 – insert of the master/detail data with SQL SERVER 2005)

Leave a reply

Last Sunday we saw how we can insert packet of the master/detail rows with the multiple master rows with MERGE statement. Obviously this approach would not work with SQL Server 2005. So what can be done there?

The problem that SQL Server does not guarantee the order in which rows are inserted with insert/select statement. It does not guarantee that with insert/select..order by either. Although based on the blog post from SQL Server Engine Team, insert/select..order by guarantees the order in what identity values are generated. It says: “INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted “.

This is something we can deal with. So let’s start with the same tables we had last Sunday:

The structure of the master and detail packages are slightly different:
1. Ids table does not have internalId column anymore
2. InternalID columns values in @MasterData and @DetailData should start with 1. This is important!

So we know the order in what identity values are generated. If our InternalIds in the tables are the same (started with 1), we can use it the approach below.

This should do the work. Code can be downloaded here

Sql Server Statistics – update and management #1

Leave a reply

As we already know, SQL Server relies on statistics on optimization stage. Obviously, up-to-date and accurate statistics is the key. Let’s see when and how SQL Server manages it.

First of all, there are 3 properties on the database level (Options tab)

“Auto Update Statistics” – controls if statistics is updating automatically. If this option is enabled, SQL Server detects if statistics is outdated and updates it during the query execution. In most part of the cases this is desired behavior. One of the possible examples when you can choose to disable that – if data in the table is fairly stable and updates are coming in the multiple batches nightly. In such case you can disable the auto update and update it manually after all batches are loaded.

“Auto Update Statistics Asynchronously” – By default, statistics is updating synchronously. It means when system executes the query and detects that statistics is outdated, it updates the statistics and recompiles the query. This could introduce additional delay during statistics update phase and queries with low timeout settings could timeout. If you decide to use asynchronous statistics update, sql server generates the plan based on the outdated statistics and updates statistics in another thread.

“Auto Create Statistics” – By default SQL Server creates statistics automatically. You should leave it as is in most part of the cases.

By default, when you create an index, SQL Server automatically creates the statistics with the same name with index.

Sometimes SQL Server can create statistics on column level especially if there is the join operation on non-indexed column

Based on the name convention you can see that this is automatically created statistics on the column 03 on the object with ID 0425A276 (hex). Let’s see that: