Archive for the ‘T-SQL’ Category

Views in Microsoft SQL Server (Part 1 - “Joins hell”)

Thursday, February 24th, 2011

I don’t like views. I like layers of abstractions, dedicated tiers in the system but I don’t like views. It seems kind of strange - regular views are basically the abstraction on the metadata layer. Nothing less, nothing more. It gives you some benefits in terms of security, some benefits of the code abstraction. I just think that views introduce more problems than they solves.

Today I’d focus on one specific aspect of the views - I call it “joins hell”. Remember “dll hell” from the good old days? Maybe not too old, by the way.

Let’s create a couple tables and populate it with the data.

Let’s assume we want to have a view that joins both tables and returns client name in addition to orders column.

Simply enough. Natural enough especially if you come from imperative development world. You have the view as the abstraction and you can use it everywhere where you need to access orders. So let’s select the row including client name for one specific order.

Good. As you can see, it introduces 2 CI seek + join. You cannot do any better than that. Now, let’s say, you need to populate the list of the order numbers. So let’s run it again selecting OrderId and OrderNumber only (I’m still filtering out 1 single row but it hardly matters).

Oops - still join with the clients table even if you don’t need any columns from there. Well, if you think about that, it makes sense. If you use inner join in the view, SQL Server needs to make sure that every order has corresponding client and filter out orders without valid ClientId.

So first thing how to workaround it - use outer join instead.

As you can see, in this example, SQL Server does not care, if order does not have corresponding client. It seeks in Orders table only. On other hand, outer join limits the choices optimizer has. It cannot switch between inner and outer record sets when processes nested loops and so on. So technically this is not the best option.

Another method is to add foreign key constraint (don’t forget to add the index on Orders.ClientID column to avoid performance hit). If you run the original select you see, that SQL eliminates the join. By the way, if you decide to run demo scripts - make sure to use regular db. TempDb does not have joins elimination.


Looks like it solves our problems but unfortunately there are some cases when foreign key constraints cannot be created. And there are some cases when SQL Server cannot eliminate the joins even with constraints. Look here (click to open the plan on the different window):

So we don’t know if SQL will be able to eliminate the joins all the times even with all foreign keys in place. What we know for sure, that SQL does not do that if you have multi-column foreign keys. Let’s see that. First let’s create the same Orders and Clients tables with additional Int1 column.

Now let’s create the view and run the same select and see the plan.

As you can see, SQL still does the join. This is one of “design limitations” SQL Server has - no join elimination with multi-column foreign keys.

And if you think about that, this is very simple case. I saw some systems with the views that includes 20+ joins and at the end every select against them introduces a lot of extra IO you don’t need. So I always ask myself - are those minor benefits views give you worth the efforts?

Source code is available for download.

Sunday T-SQL Tip: Server Side paging with Microsoft SQL Server (part II)

Sunday, February 13th, 2011

Last week we saw how we can optimize server side paging with Microsoft SQL Server. The method works just fine although there is one other standard requirement for such tasks. Typically, when you want to display just one page of data in UI, you also need to provide an ability to navigate through the pages and, as result, you should know total number of the pages.

There are 2 ways how to accomplish it from the coding prospective. Obviously the simplest one is to have another select that counts # of rows and returns it either as the second result set or as output variable. It’s too trivial to show here. Let’s talk about the situation when we need to return this total row count as additional column in the result sets.

Let’s use the same tables from the last week. You can achieve this with the following SQL:

 

Let’s take a look at the plan. As you can see, SQL Server scans index twice. One time for RowCnt CTE. Another time for the paging. It worth to mention that because of the index and sorting order, paging part needs to scan only first 150 rows. Alternatively, if you change the sorting order, you’ll see:

Same full index scan that time.

If you ask me what method is better, I’d say it’s more or less the matter of the personal preferences. There is no visible performance benefits in the either case. Obviously additional column introduces additional network traffic but it also is not the huge amount of data. So technically it would not matter if you move it to the separate select. But there are some cases when you have to support legacy code and need to return RowCnt in the main result set. In such case, second CTE with cross join introduces the elegant solution for such task.

Source code is available for download

Sunday T-SQL Tip: Server Side paging with Microsoft SQL Server

Sunday, February 6th, 2011

Let’s assume you have UI that allows to see subset of the data based on pages, with sorting by wide set of columns in both ascending/descending orders. This is probably one of the worst tasks for database developers and administrators - quite easy to implement with ROW_NUMBER() function, but very hard to tune.

Let’s take a look. First of all, let’s create the tables and populate it with some data.

Now, let’s try to return rows #100-150 sorted by OrderDate ASC. I’m disabling parallelism to make it a little bit simpler to analyze. So let’s start with the “standard” approach. Unfortunately SQL Server does not allow to use ranking functions in WHERE clause so we need to use either CTE or subselect. Click on the images below to open them in the separate window.

Now let’s take a look at the plan. I’m using SQL Sentry Plan Explorer - great tool - check it out if you are not familiar with it

In order to calculate Row_Number, SQL Server needs to read entire row set and next sort it. Obviously performance is not so great. So let’s think how we can improve that. One of the first things there - we don’t want to do the join on “pre-sorting stage”. So first of all, let’s move join from CTE to the main select:

As you can see, now we need to perform join with Clients table only for 51 rows. So it looks better. Let’s try to dive a little bit deeper into details.

We have 2 major problems. One is obvious (Red circle) - this is clustered index scan. We will talk about it in a minute. Second one (blue circles) is much less obvious. It’s sorting. Sorting is blocking operation - it needs to collect entire input first. So in our case, it’s 100,000 rows and 14,7M of data. Well, first of all, it’s memory and CPU consuming, but more importantly, if statistics is not ideal and SQL Server underestimates # of input rows, there is the good chance that it would produce sort in tempdb which is also major performance hit. How can we improve it? Let’s massage our SQL a little bit:

At the first glance, it introduce plan not as good as the previous one. Although, if you take a look at the data size, you’ll see, that SORT operator requires 8 times less memory. As result, it would be less CPU intensive and will execute faster. At the end, it can give you huge performance improvements in the case, when output row is large.

Now let’s think about IO. Let’s create the index on OrderDate.

Obviously it helps. It removes sort operator - SQL Server just need to scan first 150 rows from the index. The problem is that this index covers only specific case - one column and one ASC sorting order. If you change it to DESC, SQL Server still needs to sort the data.

Well, if you have 10 different columns user can sort data in both ASC/DESC orders, you need to create 20 different indexes to cover all possible cases. It’s not possible, of course. On other hand, if there are just a couple combinations that users typically use, it could make sense to create a couple indexes to cover them.

One other thing worth to mention, instead of using clustered index scan, SQL Server will use non-clustered index scan which is far more efficient as long as sorting column is there. So it could make sense to at least create one index with all columns included to help that select. For example, if you have the system that always filter by client id and filter by other order columns, it could make sense to create index like that:

Next week I’ll show how to add total row count as part of the same statement.

Source code is available for download

Sunday T-SQL Tip: Purging Data

Sunday, January 9th, 2011

Last week we discussed a few things we have to keep in mind implementing sliding window scenario. Unfortunately table partitioning available in SQL Server Enterprise edition only. How should we purge the transactional data with other editions of SQL Server? Obviously there is no such thing as the golden bullet. Same time there is one particular scenario a lot of the systems have - transactional table with clustered index on identity column. Let’s see how we can optimize the purge process in such case.

Assuming we have a table with identity ID column and DateCreated datetime column. Assuming you need to purge the data based on that DateCreated column. It could happen daily, monthly - i.e. with some time interval. So let’s create such table and populate it with 1M rows. Let’s create an index on DateCreated column.

Now let’s try to purge data a few times and see the results. Please ignore begin tran/rolback - only purpose of those is to preserve the data between the test runs.

As you can see, purging about 40% of the rows takes about 3.5 seconds. There are 2 possible execution plans based on the number of the rows we need to delete - either using clustered index scan (our 40% case) or non-clustered index seek (if % of the row is smaller).

In any case, let’s think about it in more details. In both cases, SQL Server needs to process a lot of data and acquire/hold U/X locks for the duration of the execution. Not really good in terms of concurrency.

What can we do in order to improve it? In the table design like that our DateCreated column is increasing/populating same way with identity. So instead of deleting data based on DateCreated column (that most likely uses non-clustered index seek), let’s get ID of the row we want to keep (i.e min(ID) where DateCreated > @). Next step, instead of deleting everything at once, we can delete data in batches (10-100K rows each) in the individual transactions. In such case locks would be kept only within the batch deletion.

Let’s take a look. Again, please ignore outer begin tran/rollback.

As you can see, it runs about 10% faster but more importantly, it reduces the blocking in the table. Let’s take a look at the plan.

Non-surprisingly it uses clustered index seek/delete. But as you see - a lot of time wasted on NCI maintenance. And here is the interesting point. Does system really need to have the index on DateCreated for any other purpose than purge? If this is the case, we can safely drop it. Yes, it would take more time to find initial @MaxRecId but on other hand this would be either (S) locks or even no-locks at all if read uncommitted is acceptable. And in those cases we more concern about the locking instead of the execution time.

So let’s give it a try without the index.


As you can see, it runs more than 2 times faster. This solution and design are not always good. First of all, making CI on identity column on the large transactional table is bad idea by itself. But I saw a lot of systems and designs like that. In such case, this solution could benefit the purge process.

Code is available here.

‘A’ = N’A’ or how to kill the index seek

Thursday, December 9th, 2010

Let’s start with the example. Create table, populate it and create an index on varchar column.


Next, let’s run 2 selects. First one with varchar parameter. Plan is perfect.

Second, let’s use unicode parameter.

Oops - index scan. It does not look like the big deal - who is going to use unicode for varchar field search. But..

We deal with the client applications. Unfortunately, modern development languages implement String as the unicode. And if you don’t specify the parameter type, it would be unicode.

Let’s take a look:

As you can see, .Net application defines @UntypedParam as nvarchar(5). And it produces 21 times more reads. Another thing worth to mention - the size of the parameter matches the string length. As result, if next statement uses “100″ as the parameter value, it would be nvarchar(3) - recompilation, potentially additional plan in the cache and so on.

Keep that in mind - always specify parameter type and size at the client application.

Script is available for download here. Source code is available here

Sunday T-SQL Tip: Equality predicate on large varchar column

Sunday, December 5th, 2010

Sometimes you need to do the query based on equality predicate on large text column. It does not happen quite often but it happens.Unfortunately sometimes it’s impossible to create the index on such column. First, the size of the column could be greater than 900 bytes (so index would not be created). Another possibility - you have a lot of data and simply don’t want to use a lot of space for such index.

How to optimize the query in such case? Let’s take a look at the example. First, let’s create the table and add some data there. As you can see, LargeField would have a size of 1080 bytes.

Now let’s try to run the select with the clustered index scan. 9150 reads - quite bad.

Index cannot be created - 900 bytes is the limitation.

The solution is persisted calculated column with CHECKSUM() function that generates hash of the string. Let’s take a look:


There are 2 things we need to keep in mind. First - hash is not guaranteed to be unique. So statement still needs to include the predicate on the field. Second, unicode and regular strings obviously return different hash. It should not be generally a problem although make sure that client application specify correct parameter data type.

Source code is available for download

Sunday T-SQL Tip: Insert/Update pattern

Sunday, November 28th, 2010

Let’s think about the system that collects and processes the data. Quite often processing happens after data have been inserted into the system. Typically the rows have set of the columns inserted as NULL. Processing routine loads the rows, update those columns and save rows back.

Looks simple but this pattern introduces major page split/fragmentation issues. Think about that - SQL Server tried to fill pages up to 100%. So when you update the rows, pages don’t have enough free space to keep the data. Let’s look at the example:

First - let’s create the table. There are only 3 columns in this table - identity, char column as the placeholder for the columns populated during insert stage and post processing placeholder.

Second, let’s check the physical index statistics. Index is practically perfect.

Now let’s execute update and populate post processing placeholder column. Now check the statistics - terrible fragmentation and a lot of space wasted.

Obviously the workaround is simple - reserve the space for the post processing columns on the insert stage.

Source code is available here

Sunday T-SQL Tip: Union vs. Union All

Sunday, November 21st, 2010

What could be simpler than union? Probably “union all”.

“Union” removes all duplicates from the combined row set. Union all simply concatenate the row sets. Obviously, if original row sets are unique, results would be the same. But plans are not.

Take a look here. Let’s create 2 tables with some data.


Now let’s do the union and union all and see the plans.

Union introduces another expensive “Distinct Sort” operator. As result the execution plan with union is almost 2 times more expensive than with union all. Of course, your mileage may vary - in the case of the large and complex queries the higher cost of “Union” operator could be hidden but in any case, it’s more expensive. So don’t use it if rowsets don’t have duplicates.

Speaking of the duplicates - there is another thing you need to keep in mind. When union removes the duplicates from combined rowsets, it does not really matter where duplicates come from. So you don’t need to use distinct as part of the selects. And it affects the plan too. Look here:


Source code is available for download from here

Sunday T-SQL Tips: Inserted, Deleted tables and OUTPUT clause (Part 4 - insert of the master/detail data with SQL SERVER 2005)

Sunday, November 7th, 2010

Last Sunday we saw how we can insert packet of the master/detail rows with the multiple master rows with MERGE statement. Obviously this approach would not work with SQL Server 2005. So what can be done there?

The problem that SQL Server does not guarantee the order in which rows are inserted with insert/select statement. It does not guarantee that with insert/select..order by either. Although based on the blog post from SQL Server Engine Team, insert/select..order by guarantees the order in what identity values are generated.  It says: “INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted “.

This is something we can deal with. So let’s start with the same tables we had last Sunday:

The structure of the master and detail packages are slightly different:
1. Ids table does not have internalId column anymore
2. InternalID columns values in @MasterData and @DetailData should start with 1. This is important!

So we know the order in what identity values are generated. If our InternalIds in the tables are the same (started with 1), we can use it the approach below.

This should do the work. Code can be downloaded here 

Sunday T-SQL Tips: Inserted, Deleted tables and OUTPUT clause (Part 3 - Merge and insert of the master/detail data)

Sunday, October 31st, 2010

Last Sunday we saw what we can do with the MERGE statement and OUTPUT clause. Today I want to show you the practical angle of this feature.

Let’s assume you want to insert packet of master/detail rows in the situation when Master table has ID as identity field.


This is the trivial task if you have only one master row - you can accomplish it with insert statement, get the identity with @@identity or SCOPE_IDENTITY() and next perform insert of the detail rows. The situation is more complicated if you have multiple master-detail rows in the packets. In such case MERGE and OUTPUT would help. First of all, you need to have “internal” unique ID to link master and detail rows in the packets. Let’s see that:

Next, you need to insert the data to the master table with the MERGE operator and get both - internal ID and inserted (identity) ID from the packet. After that you can use simple insert statement.

Code can be downloaded from here