
Sunday T-SQL Tip: Context_Info

One of the interesting but not widely known features in SQL Server is CONTEXT_INFO. It is basically connection/session-specific 128-byte storage.

You can set it with the SET CONTEXT_INFO statement. As for getting the value, there are a few methods available:
1. The CONTEXT_INFO() function
2. The context_info column in the sys.dm_exec_requests and sys.sysprocesses views. These values are updated immediately, within the batch
3. The context_info column in the sys.dm_exec_sessions view. This value is not updated until the end of the batch

Let’s see that in action:
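Here is a minimal sketch (the value itself is arbitrary):

declare @Ctx varbinary(128) = convert(varbinary(128), 'Some flag')

-- set the value for the current session
set context_info @Ctx

-- read it back with the function
select context_info() as context_info_value

-- read it from the DMVs
select context_info from sys.dm_exec_requests where session_id = @@spid
select context_info from sys.dm_exec_sessions where session_id = @@spid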


What can you do with that? Funny things, like recursion, for example:
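For instance, a stored procedure can use context_info as a recursion-depth counter. This is just a sketch with made-up names:

create procedure dbo.DoRecursion
as
begin
    -- the first 4 bytes of context_info hold the current depth
    declare
        @Depth int = isnull(convert(int, substring(context_info(), 1, 4)), 0),
        @Ctx varbinary(128)

    print 'Recursion level: ' + convert(varchar(10), @Depth)

    if @Depth < 5
    begin
        set @Ctx = convert(binary(4), @Depth + 1)
        set context_info @Ctx
        exec dbo.DoRecursion
    end
end
go

exec dbo.DoRecursion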

As for a more practical example: we have a third-party system we have to integrate with our database and applications. The integration is done with triggers – when a table is updated, the trigger modifies the third-party database and vice versa. In a few specific cases we should not propagate the changes, and context_info is ideal for that. We set the value before the update and check it in the trigger. If the flag is set, the trigger does not propagate the change.
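A sketch of that pattern (the table, column and flag names here are hypothetical):

create trigger trg_SomeTable_AU on dbo.SomeTable
after update
as
begin
    set nocount on

    -- skip the propagation when the caller raised the SKIP_SYNC flag
    if substring(context_info(), 1, 9) = convert(varbinary(9), 'SKIP_SYNC')
        return

    /* ... propagate the changes to the 3rd party database here ... */
end
go

-- calling code: raise the flag, do the update, clear the flag
declare @Ctx varbinary(128) = convert(varbinary(128), 'SKIP_SYNC')
set context_info @Ctx

update dbo.SomeTable set SomeColumn = 'New value' where SomeKey = 1

set context_info 0x0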

Vertical Partitioning as the way to reduce IO

Let's think about the following scenario. Assume we have a table with a lot of data. Some of the large fields in this table are rarely accessed, while at the same time there are constant scans on a subset of the other fields.

A good example is a large Articles catalog in a system where customers can search by various columns/attributes. Let's assume the Article has a description field which holds 1,000-2,000 bytes on average. Customers search on part of the article name, item number, price, and so on, and 98% of the customer activity is searching and browsing the results. Let's say that only 2% of requests open the article properties and look at the description.

Assume all those searches lead to clustered index scans. As we already know, we want to minimize the number of IO operations, so we want to make the row as small as possible. Let's see the example:
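Something like this (the column definitions are my assumptions based on the description above):

create table dbo.Articles
(
    ArticleId int not null,
    Name varchar(64) not null,
    ItemNumber varchar(32) not null,
    Price money not null,
    OtherFieldsShownInBrowse char(200) not null,
    Description varchar(2000) not null,
    constraint PK_Articles primary key clustered(ArticleId)
)
go

-- populate it with some test data
;with N(C) as (select 1 union all select C + 1 from N where C < 1000)
insert into dbo.Articles(ArticleId, Name, ItemNumber, Price, OtherFieldsShownInBrowse, Description)
select C, 'Article ' + convert(varchar(10), C), 'ITM-' + convert(varchar(10), C),
    C % 100, 'Browse data', replicate('a', 1500)
from N
option (maxrecursion 0)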

Now let's run a select based on a part of the item number:
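For example (a leading wildcard, so an index seek is not possible):

set statistics io on

select ArticleId, Name, ItemNumber, Price, OtherFieldsShownInBrowse
from dbo.Articles
where ItemNumber like '%25%'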

Obviously, one of the ways to improve the situation is to create a non-clustered index on all fields which could be in the predicate list. The problem with this approach is that in most cases SQL Server would not be able to do an index seek. It can still do an index scan, but the price of that option really depends on the number of records in the result set – key/bookmark lookups can make it really expensive. Alternatively, you can create a covering index and include every field you need for the browse – in our case Name, ItemNumber, Price and OtherFieldsShownInBrowse – but it would make the index leaf rows very big. In some cases, though, it is still acceptable and better than the solution I will show below.

Let's change the schema and split the Articles table vertically:
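The split could look like this (again, the exact definitions are my assumptions):

create table dbo.ArticleMain
(
    ArticleId int not null,
    Name varchar(64) not null,
    ItemNumber varchar(32) not null,
    Price money not null,
    OtherFieldsShownInBrowse char(200) not null,
    constraint PK_ArticleMain primary key clustered(ArticleId)
)

create table dbo.ArticleDescriptions
(
    ArticleId int not null,
    Description varchar(2000) not null,
    constraint PK_ArticleDescriptions primary key clustered(ArticleId)
)
go

insert into dbo.ArticleMain(ArticleId, Name, ItemNumber, Price, OtherFieldsShownInBrowse)
select ArticleId, Name, ItemNumber, Price, OtherFieldsShownInBrowse from dbo.Articles

insert into dbo.ArticleDescriptions(ArticleId, Description)
select ArticleId, Description from dbo.Articles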

To simplify the situation, let's create a view that joins both tables:
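For example (the view name is arbitrary):

create view dbo.vArticles
as
    select m.ArticleId, m.Name, m.ItemNumber, m.Price, m.OtherFieldsShownInBrowse, d.Description
    from dbo.ArticleMain m join dbo.ArticleDescriptions d on
        m.ArticleId = d.ArticleId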

Now, let's run the previous select against the view:
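The same search, now against the view:

select ArticleId, Name, ItemNumber, Price, OtherFieldsShownInBrowse
from dbo.vArticles
where ItemNumber like '%25%'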

As we can see, the situation is even worse – even though we don't need the Description attribute, SQL Server joins both tables. It does that because of the inner join: SQL Server needs to filter out all records from ArticleMain which do not have corresponding records in ArticleDescriptions.

Let's change it to an outer join:
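Something like this:

alter view dbo.vArticles
as
    select m.ArticleId, m.Name, m.ItemNumber, m.Price, m.OtherFieldsShownInBrowse, d.Description
    from dbo.ArticleMain m left outer join dbo.ArticleDescriptions d on
        m.ArticleId = d.ArticleId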

Now let's run the select again:

Now when we run the select, we see that it scans the ArticleMain table only. The results are much better. If we need to select fields from both tables (for example, when the user wants to see the properties), it would do the join.

We can even create an INSTEAD OF trigger on the view to cover insert and update operations and simplify client development if needed.

Again, I don't want to say that this method is right in every case. As a bare minimum you should have:
1. A lot of data in the table
2. Scans with predicates which do not support index seek operations and return a lot of data
3. Evidence that a regular (covering) index would not be more beneficial than this solution

But in some cases vertical partitioning would help.

Sunday T-SQL tip: Uniqueness of nullable field

Assume you have a table with a nullable field. What if you need to enforce uniqueness on non-null values only? For example, you can have a customers table with optional SSNs – not all customers provide one, but if an SSN is provided, you want to make sure it's unique.

A regular unique index is not an option – SQL Server would not allow you to have more than one null value.

SQL Server 2008 provides a great feature – the filtered index – which lets you create the index on the non-null values only:
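A sketch of the approach (the customers table here is simplified):

create table dbo.Customers
(
    CustomerId int not null primary key,
    Name varchar(64) not null,
    SSN varchar(11) null
)
go

create unique nonclustered index IDX_Customers_SSN
on dbo.Customers(SSN)
where SSN is not null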

Now when you insert a duplicate value, it fails:
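For example:

insert into dbo.Customers(CustomerId, Name, SSN) values(1, 'John Doe', '123-45-6789')
insert into dbo.Customers(CustomerId, Name, SSN) values(2, 'Jane Doe', null)
insert into dbo.Customers(CustomerId, Name, SSN) values(3, 'Jack Doe', null)

-- this one violates IDX_Customers_SSN and fails
insert into dbo.Customers(CustomerId, Name, SSN) values(4, 'Jim Doe', '123-45-6789')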

With SQL Server 2005 this is not an option, although you can use an indexed view:
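A sketch of the SQL Server 2005 workaround, using the same dbo.Customers table (without the filtered index):

create view dbo.vCustomerSSN
with schemabinding
as
    select SSN
    from dbo.Customers
    where SSN is not null
go

create unique clustered index IDX_vCustomerSSN_SSN on dbo.vCustomerSSN(SSN)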

Now when you insert a duplicate, it violates the unique clustered index on the view and fails.

LOB Storage

Let's talk about how SQL Server stores LOB data. Let's start with restricted-length large objects – objects which are less than or equal to 8,000 bytes. This includes varchar, nvarchar, varbinary, CLR data types and sql_variant.

Let's try to create a table with large varchar columns and insert some data:
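Something like this (the table name and the exact value sizes are my own):

create table dbo.TestLob
(
    ID int not null,
    Field1 varchar(4000),
    Field2 varchar(4000),
    Field3 varchar(4000),
    Field4 varchar(4000)
)
go

insert into dbo.TestLob(ID, Field1, Field2, Field3, Field4)
values(1, replicate('a', 2200), replicate('b', 2200), replicate('c', 2200), replicate('d', 2200))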

As you can see, the table was created successfully and a row slightly bigger than 8,800 bytes has been inserted. Let's check what data pages we have for this table:
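One way to check is to look at the allocation units (total_pages includes the IAM pages):

select au.type_desc, au.total_pages, au.used_pages, au.data_pages
from sys.allocation_units au join sys.partitions p on
    au.container_id = p.partition_id
where p.object_id = object_id(N'dbo.TestLob')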

As you can see, there are two types of pages – IN_ROW_DATA and ROW_OVERFLOW_DATA. You need to deduct one page (the IAM page) from the total count of each allocation unit, so in this case you see one page for IN_ROW_DATA (populated with the Field1, Field2 and Field3 column data) and one page for ROW_OVERFLOW_DATA (for the Field4 column data).

So restricted-length LOBs are stored either completely in-row, or on row-overflow pages if they do not fit. In the second case, the row holds a 24-byte pointer plus 2 bytes in the offset array.

Let's see what happens if you update the row:
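For example, let's shrink Field4 so the total row size drops slightly below 8,000 bytes:

update dbo.TestLob
set Field4 = replicate('d', 1350)
where ID = 1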

Now the row size is slightly less than 8,000 bytes, so the row "should" fit on a single page. It still uses the row-overflow page, though – SQL Server does not bother to move the data back for such a small reduction. In real life the threshold is about 1,000 bytes. Let's see that:
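This time let's shrink the value by more than 1,000 bytes (I am shrinking Field4 again here; the exact column used in the original demo is my assumption – re-run the allocation unit query afterwards):

update dbo.TestLob
set Field4 = replicate('d', 300)
where ID = 1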

As you can see, once we shrank the data by more than 1,000 bytes, the row-overflow page is gone.

For unrestricted-length data (text, ntext, image), the situation is different. SQL Server also stores it on its own set of pages, with a 16-byte (by default) pointer in the row. The data itself is organized in a B-Tree manner, similar to regular indexes, with pointers to the actual blocks of data. I don't show the examples here, but you get the idea. The type of the data pages would be TEXT_MIXED if the page shares data from multiple rows, or TEXT_DATA if the entire chunk of data on the page belongs to a single value.

It is also worth mentioning the "text in row" table option. This option controls whether some part of the LOB data is stored in the row.
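The option is set with sp_tableoption; the value is either ON/OFF or the in-row limit in bytes (the table name here is a placeholder):

exec sp_tableoption 'dbo.SomeTextTable', 'text in row', '256'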

And finally, the (max) types. This is quite simple: if the size of the data is less than or equal to 8,000 bytes, it is stored like a restricted-length object (in-row, or on row-overflow pages if it does not fit). If it is bigger, it is stored as unrestricted-length data.

You can create quite big rows with a lot of data on row-overflow and TEXT_MIXED/TEXT_DATA pages. But don't forget that SQL Server will need to read those extra page(s) when accessing the row, which is not good in terms of performance.

Sunday T-SQL tip: Temporary table variables and transaction context

Recently I've been asked if it's possible to partially commit a transaction. Interestingly enough, the answer is not simple. SQL Server does not support nested transactions, although there are a few tricks you can use.

The first is savepoints. They allow you to save the transaction state at some point, roll back to that point, and commit everything you did prior to it. Obviously this only helps you to commit the first (beginning) part of the transaction.

Let's see it in action. First, let's create a small table:
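Something like this (the names are arbitrary):

create table dbo.Data
(
    ID int not null,
    Value varchar(32) not null
)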

Next, let's see savepoints in action:
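A minimal sketch:

begin tran
    insert into dbo.Data(ID, Value) values(1, 'Committed')

    save transaction SavePoint1

    insert into dbo.Data(ID, Value) values(2, 'Rolled back')

    -- undo everything done after the savepoint...
    rollback transaction SavePoint1
-- ...and commit everything done before it
commit

select * from dbo.Data -- only the first row is there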


Good. Now assume we would like to do the opposite task – commit the "tail" actions after a specific point. Unfortunately, you cannot do it with regular T-SQL statements. Fortunately there is a trick – table variables. They do not participate in the transaction context. Let's see it in action:
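A sketch of the trick, using the same dbo.Data table:

begin tran
    declare @T table(ID int not null, Value varchar(32) not null)

    insert into dbo.Data(ID, Value) values(3, 'Modification attempt')

    -- capture the data we want to keep
    insert into @T(ID, Value)
    select ID, Value from dbo.Data where ID = 3

-- the insert into dbo.Data is rolled back, but @T keeps its rows
rollback

-- save the captured data in a separate transaction
insert into dbo.Data(ID, Value)
select ID, Value from @T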

As you can see, you can create a table variable, populate it with data, roll back the transaction, and then save the data in a separate transaction. It is worth mentioning that regular temporary tables do not work that way – they fully support the transaction context.

One practical usage of this scenario is auditing on rollback. You can save the required data into a table variable, roll back the data modification attempt, and write the audit information to the audit table.

P.S. I just wonder if anybody reads this 🙂

Hidden facts about table alteration

Last time we saw how row size affects performance during scan operations. I hope you have not already gone and altered the columns in your tables. Please don't blame me if you did – most likely it made the situation even worse. So let's talk about table alteration today.

There are 3 types of the alterations:

1. Metadata only. This type of alteration can be done based on metadata alone. For example, it happens when you add a new nullable column, increase the size of a variable-width column, change a not null column to nullable, etc.

2. Metadata change with a data check. For example, if you need to change a nullable column to not null, SQL Server needs to check that there are no rows with null values in this column and then update the metadata. Decreasing the size of a variable-width column is another example – SQL Server needs to check that there are no values larger than the new size.

3. Every row needs to be rebuilt. Examples are adding a new not null column, some data type changes, and so on.

Unfortunately, one thing is not commonly known – altering a table never (I repeat, never) decreases the row size. When you drop a column, SQL Server removes the column from the metadata but does not reclaim the space or rebuild the rows. When you change a column type from int to tinyint, for example, the actual storage size remains intact – SQL Server just checks the domain value on insert/update. When you increase the size of a field (for example, change int to bigint), SQL Server creates another, bigint, column and keeps the old int column space intact.

Let's see that in detail. Let's create a table:
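Something like this (the layout is my assumption, chosen to be consistent with the offsets quoted below; the offsets can be examined through the undocumented sys.system_internals_partition_columns view, so the exact mapping may vary between versions):

create table dbo.AlterDemo
(
    IntColumn int not null,
    Int2Column int not null,
    BigIntColumn bigint not null,
    CharColumn char(10) not null
)
go

select c.name, ipc.leaf_offset, ipc.max_inrow_length
from sys.system_internals_partition_columns ipc
    join sys.partitions p on ipc.partition_id = p.partition_id
    join sys.columns c on c.object_id = p.object_id and c.column_id = ipc.partition_column_id
where p.object_id = object_id(N'dbo.AlterDemo')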

First, let’s drop Int2Column
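Using the sketch table above:

alter table dbo.AlterDemo drop column Int2Column

-- re-run the offsets query above to see the result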

As you can see, the column definition is gone although all the offsets remain the same. Four bytes are simply wasted.

Next, let's alter the bigint column to tinyint:
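Again with the sketch table (re-run the offsets query afterwards):

alter table dbo.AlterDemo alter column BigIntColumn tinyint not null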

As you can see, the column now accepts only 1 byte (max_inrow_length) although it still uses 8 bytes per its offset. So another 7 bytes are wasted.

And finally let’s alter IntColumn from int to bigint.
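With the sketch table:

alter table dbo.AlterDemo alter column IntColumn bigint not null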

As you can see, even though the IntColumn column_id is still equal to 1, SQL Server "moves" the column to the end of the row – the new offset is 30. The data row now has 15 bytes wasted (offsets 4-11 and 13-19).

How do you "fix" that? Well, you need to rebuild the clustered index. SQL Server will reclaim all the space and rebuild the rows when you do that. With heap tables you're out of luck – the only workaround is to build a temporary clustered index (which rebuilds the table) and drop it afterwards.
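For a table with a clustered index that is simply (the index and table names here are hypothetical):

alter index PK_MyTable on dbo.MyTable rebuild

-- for a heap, such as dbo.AlterDemo above: create and drop a temporary clustered index
create clustered index IDX_Temp on dbo.AlterDemo(IntColumn)
drop index IDX_Temp on dbo.AlterDemo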

The biggest problem with that is that a clustered index rebuild is a time-consuming operation which locks the table. If you have a huge transaction table, it would be nearly impossible to accomplish. So be careful when you design the table structure.

Sunday T-SQL tip: How to select data with unknown parameter set

Please check this SP. Does it look familiar?
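A sketch of the pattern (the table, index and parameter names are placeholders):

create table dbo.Employees
(
    EmployeeId int not null primary key,
    FirstName varchar(64) not null,
    LastName varchar(64) not null
)

create index IDX_Employees_LastName on dbo.Employees(LastName)
go

create procedure dbo.SearchEmployees
(
    @FirstName varchar(64) = null,
    @LastName varchar(64) = null
)
as
begin
    set nocount on

    select EmployeeId, FirstName, LastName
    from dbo.Employees
    where
        (@FirstName is null or FirstName = @FirstName) and
        (@LastName is null or LastName = @LastName)
end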

Easy to implement. Easy to support. Only minor problem – it never uses indexes. Never. Period.

Let's check that with a simplified example.
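For example, calling the sketch procedure above with just one parameter (even though LastName is indexed, the plan is a scan):

exec dbo.SearchEmployees @LastName = 'Smith'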

For those of us who came from the imperative development world, it does not look right. How come SQL Server cannot do the simple optimization and determine that the parameter is not null? Well, it does not. It does not know this during the compilation stage, so it scans the entire table and applies both predicates to every row, even though the first predicate needs to be checked only once.

So how do we deal with that? The most optimal way is to write a separate select for every parameter combination. Something like this:
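Sticking with the two-parameter sketch, the body of the procedure becomes:

if @FirstName is null and @LastName is null
    select EmployeeId, FirstName, LastName from dbo.Employees
else if @FirstName is not null and @LastName is null
    select EmployeeId, FirstName, LastName from dbo.Employees where FirstName = @FirstName
else if @FirstName is null and @LastName is not null
    select EmployeeId, FirstName, LastName from dbo.Employees where LastName = @LastName
else
    select EmployeeId, FirstName, LastName from dbo.Employees
    where FirstName = @FirstName and LastName = @LastName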

Every select would be optimized separately. The brute-force approach requires 2^N selects where N is the number of parameters, which introduces limitations when N > 1..3. In real life it can be slightly better, because if you can force an index seek with one predicate, you should not really worry about the others. So it could be done in the following way:
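For example, branching only on the parameter that has a supporting index (LastName in our sketch):

if @LastName is not null
    select EmployeeId, FirstName, LastName
    from dbo.Employees
    where LastName = @LastName and (@FirstName is null or FirstName = @FirstName)
else
    select EmployeeId, FirstName, LastName
    from dbo.Employees
    where (@FirstName is null or FirstName = @FirstName)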

It still requires quite a few different selects and is hard to maintain.

Another solution is dynamic SQL:
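A sketch with sp_executesql, again as the body of the sketch procedure, building the statement only from the parameters that were supplied:

declare
    @Sql nvarchar(max) = N'select EmployeeId, FirstName, LastName from dbo.Employees where 1 = 1'

if @FirstName is not null
    select @Sql += N' and FirstName = @FirstName'
if @LastName is not null
    select @Sql += N' and LastName = @LastName'

exec sp_executesql @Sql,
    N'@FirstName varchar(64), @LastName varchar(64)',
    @FirstName = @FirstName, @LastName = @LastName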

It can lead to compilations of the dynamic SQL for SP calls with new parameter combinations. Well, this is the downside of that method. Fortunately, there is a good chance that the plan will already be in the cache, especially if the SP is called often enough, so the compilation would not be expensive.

Why Row Size Matters

We have already seen that correct data types can decrease the row size. Why is it important? Well, obviously one of the reasons is the database and backup file sizes, although this is not the only, and frankly not the main, reason.

When SQL Server reads or writes data, the data page needs to be in memory. If the data page is not present in the buffer pool, SQL Server needs to read the page from disk. Similarly, when a data page has been modified, SQL Server needs to write this page to disk. Physical I/O is one of the most expensive operations. This leads to a simple conclusion:

Bigger Row Size = Fewer Rows Per Page = More Data Pages Per Table = More I/O Operations = Performance Degradation

Unfortunately, I/O is not the only thing affected. RAM and buffer pool size are limited. The more data pages need to be processed/read from disk, the more "old" pages are flushed out of the buffer pool.

As a side note, this is one of the biggest problems with non-optimized queries. Those queries are not only slow by themselves, they also flood the buffer pool with unnecessary data and compromise system performance in general. There are two reliable performance counters you can see in Performance Monitor (perfmon.exe) under the "SQLServer:Buffer Manager" group: Page life expectancy, which indicates how long a data page stays in the cache (the production value should be > 300 seconds), and Buffer cache hit ratio, which indicates the percentage of cases when the requested data page is already in the cache (the production value should be > 96-97%). If you monitor those counters and see low values there, most likely the problem is non-optimized queries (assuming, of course, that you have enough RAM available for SQL Server).

Let's forget about the buffer pool for a minute and just focus on the I/O cost. Let's create two tables and populate them with 50,000 records each:
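Something like this (the column definitions are my own, picked to roughly match the row sizes mentioned below):

create table dbo.LargeRows
(
    ID int not null,
    Data char(2000) not null
)

create table dbo.SmallRows
(
    ID int not null,
    Data char(10) not null
)
go

declare @I int = 0

begin tran
while @I < 50000
begin
    insert into dbo.LargeRows(ID, Data) values(@I, 'Large row')
    insert into dbo.SmallRows(ID, Data) values(@I, 'Small row')

    select @I += 1
end
commit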

The LargeRows table row size is a little bit more than 2,015 bytes, so only 4 rows fit per page. The SmallRows table row size is just 22 bytes. Let's see how many data pages each table has:
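One way to check it:

select object_name(object_id) as table_name, in_row_data_page_count
from sys.dm_db_partition_stats
where object_id in (object_id(N'dbo.LargeRows'), object_id(N'dbo.SmallRows'))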

As we can see, the LargeRows table uses 12,500 data pages while SmallRows uses 149.

Now let's run some tests and see how fast a table scan is:
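For example, with I/O and time statistics enabled:

set statistics io on
set statistics time on

select count(*) from dbo.LargeRows
select count(*) from dbo.SmallRows

set statistics io off
set statistics time off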

As you can see, the scan of the SmallRows table is about 45 times faster than the scan of LargeRows.

Obviously, we rarely scan an entire table in real life, although the above holds true for partial scans as well. So always use appropriate data types.

Sunday T-SQL tip: Cross joins – when Cartesian product helps

Last Sunday I wrote about cross joins. Today I want to show another example where cross joins could be useful.

Assume you have two tables in the database. One is Units. The other is UnitType1Settings, which contains a set of settings for units with UnitType = 1. See below:
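A sketch of the schema (the column definitions are my assumptions):

create table dbo.Units
(
    UnitId int not null primary key,
    UnitType int not null,
    Name varchar(64) not null
)

create table dbo.UnitType1Settings
(
    UnitId int not null,
    SettingName varchar(64) not null,
    SettingValue varchar(256) not null,
    primary key (UnitId, SettingName)
)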

As a disclaimer, this design is far from optimal. Please don't use it in your system. Even though I'm using it only as an example, I have seen similar approaches in real production systems.

In any case, there were about 20 settings per unit, so the developers implemented a trigger like the one below:
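A sketch of that trigger (only 3 of the ~20 settings are shown; the setting names and default values are made up):

create trigger trg_Units_AI on dbo.Units
after insert
as
begin
    set nocount on

    declare @UnitId int

    declare curUnits cursor local fast_forward for
        select UnitId from inserted where UnitType = 1

    open curUnits
    fetch next from curUnits into @UnitId
    while @@fetch_status = 0
    begin
        -- one insert per setting - about 20 of them in the real system
        insert into dbo.UnitType1Settings(UnitId, SettingName, SettingValue)
        values(@UnitId, 'Setting1', 'Default1')

        insert into dbo.UnitType1Settings(UnitId, SettingName, SettingValue)
        values(@UnitId, 'Setting2', 'Default2')

        insert into dbo.UnitType1Settings(UnitId, SettingName, SettingValue)
        values(@UnitId, 'Setting3', 'Default3')

        fetch next from curUnits into @UnitId
    end
    close curUnits
    deallocate curUnits
end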

Obviously this is not the best implementation: a cursor, and 20 separate inserts. And that's not even talking about triggers in general.

Let's test it with an insert statement which inserts 20 units, 10 of them with UnitType = 1:
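For example:

;with Numbers(N) as
(
    select 1 union all
    select N + 1 from Numbers where N < 20
)
insert into dbo.Units(UnitId, UnitType, Name)
select N, case when N % 2 = 0 then 1 else 2 end, 'Unit ' + convert(varchar(10), N)
from Numbers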

The original trigger produces 30 (and in real life, 200) separate inserts, with a plan like this:

So how can you optimize that (assuming you want to keep the trigger in place)? The first step is to get rid of the cursor and do something like this:
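For example, one insert ... select per setting (still scanning inserted once per setting):

alter trigger trg_Units_AI on dbo.Units
after insert
as
begin
    set nocount on

    insert into dbo.UnitType1Settings(UnitId, SettingName, SettingValue)
    select UnitId, 'Setting1', 'Default1' from inserted where UnitType = 1

    insert into dbo.UnitType1Settings(UnitId, SettingName, SettingValue)
    select UnitId, 'Setting2', 'Default2' from inserted where UnitType = 1

    insert into dbo.UnitType1Settings(UnitId, SettingName, SettingValue)
    select UnitId, 'Setting3', 'Default3' from inserted where UnitType = 1
end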

If you look at the plan, you can see:

Better, but it still requires multiple scans (20 of them) of the inserted table. And this is where a cross join can help:
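A sketch of the cross join version:

alter trigger trg_Units_AI on dbo.Units
after insert
as
begin
    set nocount on

    insert into dbo.UnitType1Settings(UnitId, SettingName, SettingValue)
    select i.UnitId, s.SettingName, s.SettingValue
    from inserted i cross join
    (
        select 'Setting1' as SettingName, 'Default1' as SettingValue union all
        select 'Setting2', 'Default2' union all
        select 'Setting3', 'Default3'
    ) s
    where i.UnitType = 1
end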

This implementation basically builds a rowset with the 3 (20) settings and then simply does a Cartesian product with all units with UnitType = 1.

As you can see in the plan, it uses only one scan of the inserted table, which is much more efficient than the previous implementations.

What is the optimal size for variable width columns?

Yesterday we had a discussion during the Tampa Bay SQL Server User Group meeting regarding the size of variable-width columns. The question was whether it matters how you define such columns – for example, if a column holds 100 characters, should it be defined as varchar(100), varchar(1000) or varchar(max)?

The answer is not so simple. From the storage perspective it does not really matter – the row stores the actual data plus 2 bytes in the offset array. The question, though, is how it affects performance.

When the Query Optimizer generates a query plan, it needs to estimate how much memory is needed for the query execution. It estimates the number of rows returned by each specific iterator (valid statistics are the key) as well as the row size. And this is where the column definition comes into play. SQL Server does not know whether a varchar(100) column holds 1 character or 100 characters on average, so it uses a simple assumption – 50% of the declared column size, and 4,000 bytes for (max) columns.

A correct estimate is especially critical for sort and hash operations – if the operation cannot be done in memory, SQL Server spills the data to tempdb and does the operation there, which is a major performance hit. Let's see this in action.

Let's create a table with two varchar fields – 100 and 210 characters – and populate it with 1M rows, putting 100 characters into both varchar fields.
create table dbo.Varchars
(
    ID int not null,
    Field1 varchar(100) not null,
    Field2 varchar(210) not null,
    primary key (ID)
)
go

declare
    @I int = 0

begin tran
while @I < 1000000
begin
    insert into dbo.Varchars(ID, Field1, Field2)
    values(@I, REPLICATE(CHAR(65 + @I % 26), 100), REPLICATE(CHAR(65 + @I % 26), 100))

    select @I += 1
end
commit

Now let's run a few tests. Let's run SQL Profiler and trace the "Sort Warnings" event. That event basically tells us when data is spilled to tempdb for sorting purposes. Let's run a simple query:
declare @A varchar(1000)
select @A = Field1
from dbo.Varchars
where ID < 42000
order by Field1

Let's look at the query plan – as you can see, the row size is estimated incorrectly.

We can also see the Sort Warning event in SQL Profiler.

Finally, we can see that the query uses 6,656KB of memory.

Let’s run the second query:
declare @A varchar(1000)
select @A = Field2
from dbo.Varchars
where ID < 42000
order by Field2

The query plan is the same, but the estimated row size is different.

There are no Sort Warnings in SQL Profiler. As for memory, this query uses 8,088KB.

Finally, let's run each query 100 times in a loop and compare the execution statistics:
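A minimal sketch of the comparison loop (the measurement approach here is my own):

declare
    @I int = 0,
    @A varchar(1000),
    @Start datetime = getdate()

while @I < 100
begin
    select @A = Field1
    from dbo.Varchars
    where ID < 42000
    order by Field1

    select @I += 1
end

select datediff(ms, @Start, getdate()) as [Field1, ms]

-- repeat the same loop with Field2 and compare the numbers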

As you can see, the first query is about 25% slower than the second one. This is quite a simple example – with complex queries the difference can be dramatic.

So on one hand it looks like a good idea to always define variable-width columns really wide. On the other hand, it introduces a different problem – you don't want to overestimate the row size either. That leads to bigger memory consumption, and waits for large memory grants can be longer than for small ones. I would suggest defining those columns at about two times the actual average data size, so the row size estimate is as accurate as possible.