Monthly Archives: February 2011

Views in Microsoft SQL Server (Part 1 – “Joins hell”)

I don’t like views. I like layers of abstractions, dedicated tiers in the system but I don’t like views. It seems kind of strange – regular views are basically the abstraction on the metadata layer. Nothing less, nothing more. It gives you some benefits in terms of security, some benefits of the code abstraction. I just think that views introduce more problems than they solves.

Today I’d focus on one specific aspect of the views – I call it “joins hell”. Remember “dll hell” from the good old days? Maybe not too old, by the way.

Let’s create a couple tables and populate it with the data.

Let’s assume we want to have a view that joins both tables and returns client name in addition to orders column.

Simply enough. Natural enough especially if you come from imperative development world. You have the view as the abstraction and you can use it everywhere where you need to access orders. So let’s select the row including client name for one specific order.

Good. As you can see, it introduces 2 CI seek + join. You cannot do any better than that. Now, let’s say, you need to populate the list of the order numbers. So let’s run it again selecting OrderId and OrderNumber only (I’m still filtering out 1 single row but it hardly matters).

Oops – still join with the clients table even if you don’t need any columns from there. Well, if you think about that, it makes sense. If you use inner join in the view, SQL Server needs to make sure that every order has corresponding client and filter out orders without valid ClientId.

So first thing how to workaround it – use outer join instead.

As you can see, in this example, SQL Server does not care, if order does not have corresponding client. It seeks in Orders table only. On other hand, outer join limits the choices optimizer has. It cannot switch between inner and outer record sets when processes nested loops and so on. So technically this is not the best option.

Another method is to add foreign key constraint (don’t forget to add the index on Orders.ClientID column to avoid performance hit). If you run the original select you see, that SQL eliminates the join. By the way, if you decide to run demo scripts – make sure to use regular db. TempDb does not have joins elimination.


Looks like it solves our problems but unfortunately there are some cases when foreign key constraints cannot be created. And there are some cases when SQL Server cannot eliminate the joins even with constraints. Look here (click to open the plan on the different window):

So we don’t know if SQL will be able to eliminate the joins all the times even with all foreign keys in place. What we know for sure, that SQL does not do that if you have multi-column foreign keys. Let’s see that. First let’s create the same Orders and Clients tables with additional Int1 column.

Now let’s create the view and run the same select and see the plan.

As you can see, SQL still does the join. This is one of “design limitations” SQL Server has – no join elimination with multi-column foreign keys.

And if you think about that, this is very simple case. I saw some systems with the views that includes 20+ joins and at the end every select against them introduces a lot of extra IO you don’t need. So I always ask myself – are those minor benefits views give you worth the efforts?

Source code is available for download.

Sunday T-SQL Tip: Server Side paging with Microsoft SQL Server (part II)

Last week we saw how we can optimize server side paging with Microsoft SQL Server. The method works just fine although there is one other standard requirement for such tasks. Typically, when you want to display just one page of data in UI, you also need to provide an ability to navigate through the pages and, as result, you should know total number of the pages.

There are 2 ways how to accomplish it from the coding prospective. Obviously the simplest one is to have another select that counts # of rows and returns it either as the second result set or as output variable. It’s too trivial to show here. Let’s talk about the situation when we need to return this total row count as additional column in the result sets.

Let’s use the same tables from the last week. You can achieve this with the following SQL:

 

Let’s take a look at the plan. As you can see, SQL Server scans index twice. One time for RowCnt CTE. Another time for the paging. It worth to mention that because of the index and sorting order, paging part needs to scan only first 150 rows. Alternatively, if you change the sorting order, you’ll see:

Same full index scan that time.

If you ask me what method is better, I’d say it’s more or less the matter of the personal preferences. There is no visible performance benefits in the either case. Obviously additional column introduces additional network traffic but it also is not the huge amount of data. So technically it would not matter if you move it to the separate select. But there are some cases when you have to support legacy code and need to return RowCnt in the main result set. In such case, second CTE with cross join introduces the elegant solution for such task.

Source code is available for download

Sunday T-SQL Tip: Server Side paging with Microsoft SQL Server

Let’s assume you have UI that allows to see subset of the data based on pages, with sorting by wide set of columns in both ascending/descending orders. This is probably one of the worst tasks for database developers and administrators – quite easy to implement with ROW_NUMBER() function, but very hard to tune.

Let’s take a look. First of all, let’s create the tables and populate it with some data.

Now, let’s try to return rows #100-150 sorted by OrderDate ASC. I’m disabling parallelism to make it a little bit simpler to analyze. So let’s start with the “standard” approach. Unfortunately SQL Server does not allow to use ranking functions in WHERE clause so we need to use either CTE or subselect. Click on the images below to open them in the separate window.

Now let’s take a look at the plan. I’m using SQL Sentry Plan Explorer – great tool – check it out if you are not familiar with it

In order to calculate Row_Number, SQL Server needs to read entire row set and next sort it. Obviously performance is not so great. So let’s think how we can improve that. One of the first things there – we don’t want to do the join on “pre-sorting stage”. So first of all, let’s move join from CTE to the main select:

As you can see, now we need to perform join with Clients table only for 51 rows. So it looks better. Let’s try to dive a little bit deeper into details.

We have 2 major problems. One is obvious (Red circle) – this is clustered index scan. We will talk about it in a minute. Second one (blue circles) is much less obvious. It’s sorting. Sorting is blocking operation – it needs to collect entire input first. So in our case, it’s 100,000 rows and 14,7M of data. Well, first of all, it’s memory and CPU consuming, but more importantly, if statistics is not ideal and SQL Server underestimates # of input rows, there is the good chance that it would produce sort in tempdb which is also major performance hit. How can we improve it? Let’s massage our SQL a little bit:

At the first glance, it introduce plan not as good as the previous one. Although, if you take a look at the data size, you’ll see, that SORT operator requires 8 times less memory. As result, it would be less CPU intensive and will execute faster. At the end, it can give you huge performance improvements in the case, when output row is large.

Now let’s think about IO. Let’s create the index on OrderDate.

Obviously it helps. It removes sort operator – SQL Server just need to scan first 150 rows from the index. The problem is that this index covers only specific case – one column and one ASC sorting order. If you change it to DESC, SQL Server still needs to sort the data.

Well, if you have 10 different columns user can sort data in both ASC/DESC orders, you need to create 20 different indexes to cover all possible cases. It’s not possible, of course. On other hand, if there are just a couple combinations that users typically use, it could make sense to create a couple indexes to cover them.

One other thing worth to mention, instead of using clustered index scan, SQL Server will use non-clustered index scan which is far more efficient as long as sorting column is there. So it could make sense to at least create one index with all columns included to help that select. For example, if you have the system that always filter by client id and filter by other order columns, it could make sense to create index like that:

Next week I’ll show how to add total row count as part of the same statement.

Source code is available for download