Category Archives: SQL Server 2005

CLR vs. T-SQL: Performance considerations

I am pretty sure that all of us read or even participated in quite a few heated discussions about Common Language Runtime (CLR) code in Microsoft SQL Server. Some people state that CLR code works faster than T-SQL, others oppose them. Although, as with the other SQL Server technologies, there is no simple answer to that question. Both technologies are different in nature and should be used for the different tasks. T-SQL is the interpreted language, which is optimized for set-based logic and data access. CLR, on the other hand, produces compiled code that works the best for imperative procedural-style code.

Even with imperative code, we need to decide if we want to implement it in CLR or as the client-side code, perhaps running on the application servers. CLR works within SQL Server process. While, on one hand, it eliminates network traffic and can provide us the best performance due to the “closeness” to the data, CLR adds the load to the SQL Server. It is usually easier and cheaper to scale out application servers rather than upgrading SQL Server box.

There are some cases when we must use CLR code though. For example, let’s think about the queries that performing RegEx evaluations as part of the where clause. It would be inefficient to move such evaluations to the client code and there is no regular expressions support in SQL Server. So CLR is the only choice we have. Although, in the other cases, when procedural-style logic can be moved to the application servers, we should consider such option. Especially when application servers are residing closely to SQL Server and network latency and throughput are not an issue.

Today we will compare performance of the few different areas of CLR and T-SQL. I am not trying to answer the question – “what technology is better”. As usual it fits into “It depends” category. What I want to do is looking how technologies behave in the similar tasks when they can be interchanged.

Before we begin, let’s create the table and populate it with some data.

01. Test table

As the first step, let’s compare the user-defined functions invocation cost. We will use the simple function that accepts the integer value as the parameter and returns 1 in case if that value is even. We can see CLR C# implementation below.

02. Invocation overhead: CLR code

As we can see, there are the attributes specified for each function. Those attributes describes different aspects of UDF behavior and can help Query Optimizer to generate more efficient execution plans. I would recommend specifying them explicitly rather than relying on default values.

One of the attributes – DataAccess – indicates if function performs any data access. When this is the case, SQL Server calls the function in the different context that will allow access to the data. Setting up such context introduces additional overhead during the functional call, which we will see in a few minutes.

T-SQL implementation of those functions would look like that:

03. Invocation overhead: T-SQL code

Let’s measure average execution time for the statements shown below. Obviously, different hardware leads to the different execution time although trends would be the same.

04. Invocation overhead: Test script

Each statement performs clustered index scan of dbo.Numbers table and checks if Num column is even for every row from the table. For CLR and T-SQL scalar user-defined functions, that introduces the actual function call. Inline multi-statement function, on the other hand, performed the calculation inline without function call overhead.

05. Invocation overhead: Avg. Execution Time

As we can see, CLR UDF without data access context performs about four times faster comparing to T-SQL scalar function. Even if establishing data-access context introduces additional overhead and increases execution time, it is still faster than T-SQL scalar UDF implementation.

The key point here though is than in such particular example the best performance could be achieved if we stop using the functions at all rather than converting T-SQL implementation to CLR UDF. Even with CLR UDF, the overhead of the function call is much higher than inline calculations.

Unfortunately, this is not always the case. While we should always think about code refactoring as the option, there are the cases when CLR implementation can outperform inline calculations even with all overhead it introduced. We are talking about mathematical calculations, string manipulations, XML parsing and serialization – to name just a few. Let’s test the performance of the functions that calculate the distance between two points defined by latitude and longitude.

06. Distance Calculation: CLR

07: Calculating Distance: Scalar T-SQL UDF

08. Calculating Distance: Inline function

09. Calculating Distance: Test script

10. Calculating Distance: Execution Time

We can see that CLR UDF runs almost two times faster comparing to inline table-valued functions and more than five times faster comparing to T-SQL scalar UDF. Even with all calling overhead involved.

Now let’s look at the data access performance. The first test compares performance of the separate DML statements from T-SQL and CLR stored procedures. In that test we will create the procedures that calculate the number of the rows in dbo.Numbers table for specific Num interval provided as the parameters. We can see the implementation below

11. Data Access: CLR

12. Data Access: T-SQL

Table below shows the average execution time for stored procedure with the parameters that lead to 50,000 individual SELECT statements. As we can see, data access from CLR code is much less efficient and works about five times slower than data access from T-SQL.

13. Data Access: Individual Statements – execution time

Now let’s compare performance of the row-by-row processing using T-SQL cursor and .Net SqlDataReader class.

14. SqlDataReader vs. cursor: CLR

15. SqlDataReader vs. cursor: T-SQL

16. SqlDataReader vs. Cursor: Execution time

As we can see, SqlDataReader implementation is faster.

Finally, let’s look at the performance of CLR aggregates. We will use standard implementation of the aggregate that concatenates the values into comma-separated string.

17. Building CSV list: Aggregate

As with user-defined functions, it is extremely important to set the attributes that tell Query Optimizer about CLR Aggregate behavior and implementation. This would help to generate more efficient execution plans and prevent incorrect results due to optimization. It is also important to specify MaxByteSize attribute that defines the maximum size of the aggregate output. In our case, we set it to -1 which means that aggregate could hold up to 2GB of data.

Speaking of T-SQL implementation, let’s look at the approach that uses SQL variable to hold intermediate results. That approach implements imperative row-by-row processing under the hood.

As another option let’s use FOR XML PATH technique. It is worth to mention that this technique could introduce different results by replacing XML special characters with character entities. For example, if our values contain < character, it would be replaced with &lt; string.

Our test code would look like that:

18. Building CSV list: T-SQL

When we compare the performance on the different row set sizes, we would see results below

19. Building CSV list: Execution time

As we can see, CLR aggregate has slightly higher startup cost comparing to T-SQL variable approach although it quickly disappears on the larger rowsets. Performance of both: CLR aggregate and FOR XML PATH methods linearly depend on the number of the rows to aggregate while performance of SQL Variable approach degrade exponentially. SQL Server needs to initiate the new instance of the string every time it concatenates the new value and it does not work efficiently especially when it needs to be populated with the large values.

The key point I would like to make with that example is that we always need to look at the options to replace imperative code with declarative set-based logic. While CLR usually outperforms procedural-style T-SQL code, set-based logic could outperform both of them.

While there are some cases when choice between technologies is obvious, there are the cases when it is not clear. Let us think about scalar UDF that needs to perform some data access. Lower invocation cost of CLR function can be mitigated by higher data access cost from there. Similarly, inline mathematical calculations in T-SQL could be slower than in CLR even with all invocation overhead involved. In those cases, we must test different approaches and find the best one which works in that particular case.

Source code is available for download

Next: CLR: Security considerations

Locking in Microsoft SQL Server (Part 18) – Key lookup deadlock

Today I would like us to talk about the special case of the deadlock called key lookup deadlock. This deadlock can happen when multiple sessions are reading and updating the same rows simultaneously. Let us look at the example.

As the first step, let us create the table with the clustered and nonclustered indexes. Nonclustered index has one included column. We are inserting 256 rows there keeping clustered and nonclustered key values the same – from 1 to 256.

Creating the table and populating it with the data

Now let us run two sessions in parallel. In the first session, we are updating the column that included to the nonclustered index using clustered key value in where clause.

Session 1 code

As we can guess, this session will use clustered index seek operation in the execution plan.

Session 1 execution plan

The second session will read the same row using nonclustered key value

Session 2 code

Because Col1 is not part of the nonclustered index, we would have nonclustered index seek and key lookup operations in the execution plan:

Session 2 execution plan

Both statements are running in the loop just to emulate concurrent access to the data. In just a few seconds, we will have the deadlock and session with select would be chosen as the deadlock victim.

Deadlock error

At the first glance, this looks confusing. Both sessions are dealing with the same row. We would expect to have the blocking cases due to exclusive (X) and shared (S) lock incompatibility for the duration of the transaction although we do not expect the deadlock. However, even if we are dealing with the single row, there are two indexes involved.

Let us take a look what locks SQL Server acquires when the table has the multiple indexes. First, let us update the column, which does not belong to nonclustered index, and see what row-level locks will be held.

Updating column that is not part of the nonclustered index

As we see, there is only one exclusive (X) lock on the clustered index. Col1 is not part of nonclustered index and as result, SQL Server does not need to update it and acquire the lock there.

Let us see what happen, if we update the column, which is included to the nonclustered index.

Updating column that is included to the nonclustered index

As we see, now we have two locks in place – one on each index key. And the point here is that we run such update, SQL Server would lock the row in one index first and another index after that. The sequence depends on the execution plan and in our case it would acquire exclusive (X) lock on the clustered index first.

Similarly, our select statement also acquires two shared (S) locks. First, it would lock the row in non-clustered index and then acquire the lock on the clustered index during key lookup operation.

That should give us the idea why we have the deadlock. Both statements are running simultaneously. In the first step, update statement acquires exclusive (X) lock on the row in the clustered index and select statement acquires shared (S) lock on the row in the nonclustered index

Key lookup deadlock: Step 1

After that, update statement is trying to acquire the exclusive (X) lock on the nonclustered index row and being blocked because there is the shared (S) lock held. Same thing happens with select statement, which is trying to acquire shared (S) lock on the clustered index row and being blocked because of the exclusive (X) lock held. Classic deadlock.

Key lookup deadlock: Step 2

To prove that, we can run the statement that shows the current row-level locks immediately after we run our original two sessions. If we are lucky, we can catch the state when both sessions are blocked before deadlock monitor task wakes up and terminate one of the sessions.

Row-level locks in time of the deadlock

There are a few ways that can help to eliminate the deadlock. First option would be eliminating key lookup operation by adding Col1 as included column to the index. In such case, select statement does not need to access the data from the clustered index, which will solve the problem. Unfortunately, that solution increases the size of the nonclustered index key row and introduce additional overhead during data modifications and index maintenance.

Another approach would be switching to optimistic isolation levels where writers do not block readers. While it can help to solve blocking and deadlocking issues, it would also introduce additional tempdb overhead and increases the fragmentation.

Finally, we can refactor the code and separate nonclustered index seek and key lookup operations to two separate selects

Workaround: separating NCI seek and key lookup

Workaround: Execution plan

Both select statements are working on the single index scope and as result would not hold shared (S) locks on the both indexes simultaneously when we are using read committed transaction isolation level. Although, this solution would not work in repeatable read and serializable isolation levels where shared (S) locks held until the end of the transaction.

Source code is available for download

Next: Concurrency model in in-memory OLTP (Hekaton)

Table of content

Locking in Microsoft SQL Server (Part 17) – Implementing Critical Section / Mutexes in T-SQL

Today I’d like us to discuss how we can implement analog of Critical Section (or Mutex) in T-SQL. One of the tasks when it could be beneficial is when we need to prevent the multiple sessions from reading the data simultaneously. As the example let’s think about the system which collects some data and does some kind of post processing after data is inserted.

One of the typical implementation in such architecture would be having the farm of the application servers that do the post processing. We usually need to have more than one server in such scenario for scalability and redundancy reasons. The key problem here is how to prevent the different servers from reading and processing the same data simultaneously. There are a few ways how we can do it. One approach would be using central management server that loads and distributes the data across processing servers. While it could help with the scalability we will need to do something to make that server redundant. Alternatively we can use some sort of distributed cache solution. We could load the data there and every server grabs and processes the data from the cache. That approach could be scalable and work great although distributed cache is not the easy thing to implement. There are the few (expensive) solutions on the market though if you don’t mind to spend money.

There are of course, other possibilities but perhaps the easiest approach from the coding standpoint would be implementing application servers in the stateless manner and do the serialization while reading the data in T-SQL.

Let’s create the table we can use in our exercises and populate it with some data.

A couple things here. First of all, we need to handle the situations when application server crashes and make sure that data would be loaded again after some time by another app server. This is a reason why we are using ReadForProcessing datetime column rather than the simple Boolean flag.

I’d also assume that system wants to read data in FIFO (first in, first out) order as much as possible and after processing is done the data would be moved into another table and deleted from the original RawData table. This is the reason why there is no indexes but clustered primary key. If we need to keep the data in the same table we can do it with additional Boolean flag, for example Processed bit column, although we will need to have another index. Perhaps:

create nonclustered index IDX_RawData_ReadForProcessing
on dbo.RawData(ReadForProcessing)
include(Processed)
where Processed = 0

In addition to the index we also need to assign default value to ReadForProcessing column to avoid ISNULL predicate in the where clause to make it SARGable. We can use some value from the past. 2001-01-01 would work just fine.

In either case, after we read the data for the processing we need to update ReadyForProcessing column with the current (UTC) time. The code itself could look like that:

DataPacket CTE is using ordered clustered index scan. It would stop scanning immediately after read 10 rows (TOP condition). Again, we are assuming that data is moved to another table after the processing so it would be efficient. We are updating the timestamp same time when we read it and saving the package for the client in the temporary table @Result using output clause

The problem here is the race condition when two or more sessions are starting to read and update the data simultaneously. Our update statement would obtain shared (S) locks during select in CTE and after that use update (U) and exclusive (X) locks on the data to be updated.

Obviously different sessions would not be able to update the same rows simultaneously – one session will hold exclusive (X) lock on the row while other sessions would be blocked waiting for shared (S) or update (U) lock. In the first case (shared (S) lock), it’s not a problem – the blocked session will read new (updated) value of ReadForProcessing column as soon as the first session releases the exclusive (X) lock. But in the second case the second session will update (and read) the row the second time. Simplified version of the process is shown below.

At the first step both sessions read the row acquiring and releasing shared (S) locks. Both sessions evaluate the predicate (isnull(ReadForProcessing,’2001-01-01′) < dateadd(minute,-1,GetUtcDate())) and decided to update the row. At this point one of the sessions acquires update(U) and then exclusive (X) lock while other session is blocked.

After the first session releases the exclusive (X) lock, the second session updates the same row.

How can we avoid that? We can create another resource and acquire exclusive lock on that resource before update statement from within the transaction. As we remember, exclusive (X) locks held till the end of transaction, so we will use it as the serialization point. Let’s take a look how it works by creating another table as the lock resource. Now, if we start transaction, we can obtain exclusive table lock. Again, exclusive (X) locks held till the end of transaction, so other sessions would be blocked trying to acquire the lock on the table. As result, execution of our update statement would be serialized.

We can test that approach by running this SP from the multiple sessions simultaneously. There is the artificial delay which we are using during the testing just to make sure that we have enough time to run SP in the different session.

While that approach works, there is another, better way to accomplish the same task. We can use application locks. Basically, application locks are just the “named” locks we can issue. We can use them instead of locking table.

That would lead to the same results.

Application locks are also very useful when we need to implement some code that alters the database schema (for example alter partition function) in the systems that are running under load all the time (24×7). Our DDL statements can issue shared application locks while DDL statements acquire exclusive application locks. This would help to avoid deadlocks related to the lock partitioning. You can see the post about lock partitioning with more details about the problem and implementation.

Although, if we talk about specific task of serialization of the reading process, we don’t need critical section at all. We can use the locking hints instead.

As you can see, there are two locking hints in the select statement. UPDLOCK hint forces SQL Server using update (U) locks rather than shared (S) ones. Update locks are incompatible with each other so multiple sessions would not be able to read the same row. Another hint – READPAST – tells SQL Server to skip the locked rows rather than being blocked. Let’s modify our stored procedure to use that approach.

I’m adding some code to the procedure to emulate race condition. In one session we will run the stored procedure with @UseDelay = 1. In another with @UseDelay = 0. Both of those sessions will start to execute the main update statement roughly at the same time.

This method works even more efficiently than the “critical section” approach. Multiple sessions can read the data in parallel.

Well, I hope that we achieved two goals today. First – we learned how to implement critical section and/or mutexes in T-SQL. But, more importantly, I hope that it taught us that in some cases, the “classic” approach is not the best and we need to think out of the box. Even when this thinking involved the standard functional available in SQL Server.

Source code is available for download

Table of content

Next: Key lookup deadlock

 

Locking in Microsoft SQL Server (Part 16) – Monitoring Blocked Processes Report with Event Notifications

UPDATE 2018-08-01: New and redesigned version of the code is available here

As we already know it’s very easy to capture blocked process report by using SQL Traces. That method though has a few limitations. First of all, it means we need to have SQL Trace up and running all the time. And SQL Trace, especially the client one, introduces the overhead on SQL Server. Another big problem is that we need to monitor traces on the regular basis. And in case if we had the blocking from within the stored procedures (e.g. session input buffer contains SP reference only), we would need to use sql handles and get the estimate execution plan from the plan cache. Nothing guarantees that plan would be there by the time when we start troubleshooting the blocking problem. Of course, we can set up an alert with SQL Agent and get the notification when blocking
occurs although it would still mean that we have to do our job manually.

As another option we can use Event Notification for BLOCKED_PROCESS_REPORT event. This approach would utilize Service Broker so we would be able to create activation stored procedure and parse blocking report there. Let’s take a look at that.

First of all, we need to decide where to store the data. While we can put the table to the user database, I’d prefer to use separate utility database for the data collection. Let’s do that:

At that point we would have blocked process report events going to dbo.BlockedProcessNotificationQueue service broker queue. Assuming, of course, that we have blocked process threshold option set.

Obviously we do not want to have those messages sitting in the queue – it’s kind of defeating the purpose of having the process automated. What I would like to do at this point is shredding event data and putting it to the table for analysis. Another important factor is that blocked process monitor would generate separate events for the same blocking condition every time it wakes up. For example, if we have blocking process threshold set to 5 seconds, we can get five events until our query times out after 30 seconds. Ideally I’d like to combine those events into the single one to have analysis simplified. So let’s create the table to store the data.

This table stores the information about both – blocked and blocking processes. Although blocking information can be misleading in case if blocking session currently executes the different batch or even waiting for the next batch to be executed – table would store the current state rather than info at the time when blocking occurs. In any cases, from the blocking process standpoint the most interesting attributes are:

  1. Process Status – is it running, sleeping or suspended? If it’s sleeping, it could be the sign that client does not work with transations correctly – either did not commit one of the nested transactions or, perhaps, mixed them with UI activity. Suspended status could be the sign of the blocking chain which is another story
  2. TranCount – if it’s more than one, it would tell us that we have nested transactions and again, perhaps, client does not handle them correctly.

In any case, we will have full report stored and can access it if needed. And of course, we can modify the table and add extra attributes if we want to.

Now it’s the time to put the activation procedure in place. I’m going to cheat a little bit here – click at the link to the source code at the end of the post to see it.

There are two things I’d like to mention though. First one is how we get the query plans.  For the blocked process we are trying to get it from sys.dm_exec_requests first. This is the only bullet-proof way to get the real plan but it would work only if the statement is still blocked when activation SP executes. If this is not the case we are using sys.dm_exec_query_stats DMV. There are a couple challenges though. First, there is the chance that plan would not be there – for example in case of the memory pressure. Alternatively we have the situation when there are multiple plans due recompilation. We are trying to guess the right one by filtering based on the blocking time but that method is not always working. So no guarantees. For the blocking process we are always using sys.dm_exec_query_stats picking up the top (random) plan.

Another thing is how we are looking up if there are other events for the same blocking. Basically stored procedure is trying to match various columns in the merge statement – perhaps even more than needed – but in either case I’d rather have duplicate records than incorrect information.

Last step we need to do is setting up the security. That step is kind of optional in case if we are storing the data in the user database but in our case, when we create the blank database and set up everything under “dbo” user it’s required. When Service Broker activates the stored procedure under that security context (EXECUTE AS OWNER), dbo has enough rights to deal with the database object. But that user also needs to have the rights to query system DMV. As result, we need to create the certificate in the both, EventMonitoring and master databases, create the login from the certificate, grant this login “view server state” and “authenticate server” rights and finally sign the stored procedure with the certificate. Or, perhaps, mark the database as Trustworthy 🙂

And now it’s time for the testing. Let’s create the small table and populate it with a few records.

Next, let’s place exclusive (X) lock on one of the rows in the first session.

In another session let’s introduce the table scan in read committed isolation level.

If we query the service broker queue we would see that there are a few events there. Our queue does not have automatic activation yet.

And finally let’s alter the queue to enable the activation.

Next, let’s query the table.

As we can see, there is the single record in the table now – exactly what we need. This approach is, of course, customizable. You can collect other statistics by changing the implementation. Hope, that script would be the great starting point

Next: Implementing Critical Sections / Mutexes in T-SQL

Table of content

 

Data Partitioning – Scaling-Out (Part 3: Factors to consider)

Of course, there are plenty other ways to scale-out the system in addition to horizontal partitioning and data sharding. Although in every case we have the same set of factors we need to consider before making the final decision of separating the data between the physical servers. Let’s talk about them today.

Development cost:
Scaling-out would increase development cost. The system becomes more complicated. Often there are additional tiers (application server, data warehouse, etc.) that need to be developed, tested and supported. You also need to have different kind of people in the team – while you can possibly save something by having good rather than exceptional DBA, you’d need to have very good architects and backend developers there. With the current trends when hardware prices are decreasing and development cost is rising, you’d need to consider that aspect very carefully from the beginning.

And of course, there is the question about legacy software. Neither of us want to touch old, often obsolete, code. In some cases we could be lucky and keep everything as is (again, legacy is accessing only operational data and/or we have multiple desktop shortcuts). In other cases we could face very painful and expensive process of refactoring.

Support and maintenance cost:
Well, you’ll have more servers to maintain. And everything depends on how much manual work you are doing. There is not much difference between supporting ten or one hundred servers when everything is automated but it would be completely different story when manual work is involved. PowerShell, policy-based management, data collectors and other tools and technologies – learn them – they are your friends.

Hardware and software cost:
At bare minimum you’ll need to buy OS and SQL Server licenses for each server. A lot of things depend on SQL Server Edition – in some cases scaling out would allow you to use Standard editions on the smaller servers rather than stuck with the Enterprise when you don’t scale out. You can be forced to use Enterprise edition even if you don’t need edition features simply because of 64GB RAM limitation Standard edition has. In some cases you can live with lower end storage too. So you need to carefully plan everything.

And there are the Clouds especially when we talk about SQL Server within VM (Infrastructure As A Service approach). With Clouds the picture changes dramatically. We are paying by the hour – for what we are using. There is no (or minimal) upfront cost for hardware and software. And there are physical limitations in VM configuration (RAM, # of Cores, etc.) as well as slow IO performance. Those limitations could force us to scale out regardless of the cost.

High Availability:
This one is interesting. On the one hand, with multiple servers we have to implement HA strategy multiple times (think about automation again). It introduces additional work but on the other hand gives us better (or, perhaps, different) type of protection. When we use horizontal partitioning, we obviously need to keep our main (operational period) server online if we want the system to be operational. But if the server with historical data is down, it would not affect operational activity. Same with data sharding. If we have one out of ten servers down, only ten percent of the customers would be affected. It still bad but trust me, ten percent of complaining customers are better than one hundred percent.

Of course, everything is in “It depends” category. There is no such thing as strict set of rules when scaling out is appropriate. We need to look at the systems on case by case basis and do our homework all the time.

Data Partitioning – Scaling-Out (Part 2: Data Sharding)

Last time we discussed how to scale-out our system with horizontal partitioning. Another approach that can be helpful is data sharding. With such approach we are creating multiple databases of the same (or very similar) structure and distribute (“shard”) our data between them. The typical use-case for this situation would be the system that collects data from the multiple customers where every customer works with his/her own subset of the data.

When customer logs in to the system, we redirect him/her to specific database. After that customer works within that database only. One of the key differences with horizontal partitioning is that customer-facing UI does not necessarily need to have application server. We still need to implement some sort of “redirector” – the part that knows where customer’s data is located and redirects the application to the right shard database. Although, technically speaking, as soon as it’s done, even regular 2-tier client/server system could work just fine. Don’t take me wrong – I’m not trying to suggest not implementing application server but in some cases legacy systems can work just fine with the minimum modifications.

What are the challenges there? From the customer-facing functional the biggest one is how to deal with the shared data. One of the examples is shared Article list in the point-of-sale system. We need to either replicate the data across all databases or create separate database to store that list. Obviously there are plenty of ways to replicate the data. Perhaps the simplest one would be to use snapshot replication (assuming we can have centralized publisher and articles are generally small). Although it still requires additional efforts to support it. Approach with the separate database introduces its own set of issues. It becomes single point of failure – when this database is down – all shards are down. And of course, there is the question how to access this database from the separate shards. Again, most likely linked servers will be in the game. While it’s not necessarily the big issue performance wise – those shared entities are usually small – it still introduces some overhead from development and
management standpoints.

Second interesting question is the legacy software. Well, again, it’s not necessarily a problem. The database schema remains the same across all shards so legacy software technically can work “As Is”. Assuming, of course, it can connect to and work within the single shard and we don’t mind to have large number of shortcuts on desktop.

Cross-shard database access is challenging. Similarly to horizontal partitioning, we need to design the application server that works across multiple shards. With all development challenges this architecture can be beneficial from performance standpoint – we can query shards in parallel. On the other hand cross-shard access is not always needed. Some of the use-cases (for example, CRM) could be done within the single shard. Others, such as analysis, reporting and accounting, could be implemented in the separate Data Warehouse type database with ETL processes that get the data from the shards. And we rarely need raw transaction data in the data warehouse so size should not be an issue.

One very important question is how to shard the data. In some cases we can have natural criteria – like geographic regions in the picture above. Although we need to be careful when criteria is artificial. Ideally we are looking for uniform data distribution though it could be complicated in some cases. It’s not the good idea to distribute customers data based on ID ranges when, for example, first shard stores IDs between 1 and 1,000; second one – between 1,001 and 2,000 and so on. Old customers (lowest IDs) tend to leave and we will end up with set of underutilized shards. In those cases it’s better to use modulus division: Shard_ID = ID mod Total_Number_Of_Shards.

Another thing to keep in mind is uneven amount of data across for different customers. Let’s think about GPS Tracking as the example when customers are tracking their assets. One customer can have just a few assets, another one hundreds or even thousands of them. Obviously amount of data for those customers would vary greatly. In case, if we have shard servers powerful enough and store large set of the customers per shard, we would be fine. At least should be from statistics standpoint. But it still makes sense to implement some sort of the monitoring solutions to avoid extreme situations.

Next: Scaling-out – Factors to consider

Data Partitioning – Scaling-Out (Part 1: Separating Operational and Historical data)

Before we start talking about Data Partitioning within the same database, I’d like us to discuss a couple methods of partitioning by scaling-out our data and servers. So, first of all, let’s think about the situation when we have all our data in the same database/table.

This is quite general multi-tier client architecture nowadays. Obviously we can have client software (especially legacy one) which is connecting to the database directly.

As we already discussed, almost every system stores two different kinds of data – operational and historical. So what if we do something like that:

In this schema, we keep all operational data and catalog entities in the main database and move historical data to the separate SQL Server(s). We define the linked servers which will allow us to join historical data with catalog entities as well as union the data from the different servers in some cases. Data placement is transparent to the clients which access the data through application server – “the must have” in such architecture. Application server knows how data is stored, queries the data from the multiple databases, merges the streams and returns the final dataset to the clients. Technically speaking, we can even have performance advantages by querying the data in parallel – each database connection could be done from the separate thread.

Legacy software that queries database directly is usually OK too – as long as software does not need access to historical data. Even if historical data is needed, it could be sometimes workarounded by creating distributed partitioned views.

The biggest drawback of this design is additional complexity and overhead introduced by developing the code that combines data from the multiple sources. E.g. application server and, sometimes, database code. Everything would depend on what do we need to do with the data. Some cases are very simple (think about code like that running in the single database/table solution):

With the new multiple servers we will need to open multiple database connections and run those statements against multiple servers:

And finally merge the data for the client. We can even have Customers list cached on application server somewhere and avoid cross-server joins. Simple enough. Although what if we look for something like that – find most expensive orders, page the results and return the second page to the client? (Small disclaimer – it’s not the best way of doing paging. Check this post for more information)

As you can imagine, there are no easy ways to implement it. We need to either write the query that collects the data from the multiple databases via linked servers and sort/page it after that. Alternatively, we can do it on application server side which could lead to the complex code that does not perform well enough. It’s also worth to mention that linked servers could introduce the own set of the issues from both, performance and development standpoints.

The biggest problem I see with such design is that requirements are rarely static – things tend to change. I have this architecture implemented in one of my systems and while things were relatively simple at the beginning, newer requirements led to the huge development overhead. But everything, of course, is system-specific and in “It depends” category. That design has its own purpose and use-cases. And we will talk about them later.

Next: Scaling-Out by Data Sharding

Data Partitioning – When to partition the data

Last week we were discussing why do we want to partition the data. I hope at this point we all agreed that data partitioning could help us with the large table. Which leads to the question – when the table becomes “big enough” for partitioning.

This is kind of theoretical discussion. I heard a lot of different opinions varied from hundreds MBs to TBs. We can spend a lot of time discussing that and will never end up with the right answer. Or, better say, all answers would be the right ones. Obviously your mileage may vary – a lot of things depend on your system, environment and hardware. But my personal take is very simple. If I expect my system to collect large amount of data, the right time (for me) to partition the data is:

Even if Data Partitioning introduces some overhead on development and testing there are two particular reasons why I’d suggest to do that. Most importantly – data partitioning changes execution plans. As result, when you finally implemented it you’d need to re-test entire system. And this could be very expensive and time consuming. Not even mention the possibility of another round of performance tuning. It’s much easier to deal with potential issues during initial development and testing especially when you don’t have a lot of data and impact of sub-optimal execution plans is smaller.

Second reason is that it’s very hard (if even possible) to partition the data keeping the system online. In most part of the cases we are going to deal with exclusive schema modification locks that prevent any access to the table even in read uncommitted transaction isolation level. And if you think how long physical data movement of 100s GBs or TBs of data could take.. Well, you get an idea.

Again, everything depends on the system and requirements. There is no need to spend time and efforts on data partitioning if you don’t expect to have a lot of data. But remember that situation and requirements can change. There is always the possibility that project would be more successful than we expect it to be 🙂

 Next: Scaling Out by separation operational and historical data

Data Partitioning – Why do we want to partition the data?

Even if we spoke about table partitioning a long time ago, I’d like to circle back to this question from the different, architectural standpoint. In the next a few posts we will discuss why we want to partition data and what options do we have to do that.

Data Partitioning is the complex and time consuming process. So as the first step I’d like to explain why we want to go through all the efforts to do that.

Reason #1 – Without partitioning everything is in the same place.

Quite obvious, is not it? And it’s not necessarily bad. One of advantages of when data is in the same place – it makes development and optimization very easy. Execution plans are predictable but that’s all about it. The biggest problem is that when everything is in the same table (or better say partition) it
stored on the same filegroup, same set of files and same disk array. Well, technically speaking we can separate our clustered and non-clustered indexes between different filegroups but at the end it introduces its own set of the issues especially in disaster recovery situations.

The problem here is that in every system we have data that belongs to the different “groups” / types – operational and historical. Duration of the operational period (which is basically customer facing OLTP portion of the system) varies – could be a day, week, month, maybe even year(s) but in the large system
there is always additional historical data and often there is a lot of historical data.

Obviously we care more about operational (customer facing) performance and would like to put operational data to the fastest disk array possible. But when everything is in the same place we don’t have a lot of choices – we either need to buy disk array big enough to store historical data, which could be
terribly expensive and generally speaking is waste of money, or compromise and go with slower disks due budget constraints.

Reason #2 – The same schema, indexes and compression

Operational and historical data usually have different access patterns. As I mentioned above, operational data is usually OLTP-related data. So we have a lot of short optimized queries. Historical data is often used for analysis and reporting – e.g. we are talking about Data Warehouse type access. Different
access patterns require different index structures. Again, when we have everything in the one table we don’t have a lot of choices. We either create additional indexes to support queries against historical data and hurt performance of OLTP part of the system or, alternatively, do not have those indexes and, as result, have bad performance of the queries against historical portion of the data.

Another thing is that in some cases we want to have slightly different schemas for those portions of the data. One of the examples – operational data can have some additional columns used for the processing. There is the good chance that historical data does not necessarily need them and when we have a lot of
records every byte counts.

Lastly, we cannot partially compress our data. Compression could help a lot with historical data which is relatively static – we are increasing performance by reducing number of pages and, as result, amount of IO operations. On the other hand, for operational data compression usually hurts performance because
data is changing quite often.

Reason #3 – Index maintenance

Same thing as above. Generally speaking we don’t need to rebuild or reorganize the indexes for historical data which is static. But there is no way we can do rebuild/reorg on the part of the data.

Reason #4 – Backup strategy

Same thing again. All data is in the same filegroup. We cannot exclude historical portion from the backups. As result, it increases backup time, size of backup file and introduces additional overhead during the process

Reason #5 – Disaster recovery

One of the biggest benefits of Enterprise Edition of SQL Server is piecemeal restore. It allows us to bring system partially online on filegroup by filegroup basis. If we had operational and historical data separated to the different filegroups we could restore operational part first and make system available
to the customers while we are working on historical part. With everything in the same non-partitioned table it’s impossible. We will need to restore everything first before system becomes available and online.

Reason #6 – Statistics

SQL Server needs to estimate number of rows on the every step of the execution plan to make it efficient. In order to do so SQL Server uses statistics and histogram in
particular. Histogram contains some values from the key as well as the information about data distributions in the intervals of the values. The problem is that histogram contains
at most 200 steps/values. So more data we have in the table, bigger intervals are. Approximations are done on each interval and as result our estimations are less accurate.

Another thing that’s worth to mention is that by default SQL Server updates statistics only after 20% of the key values are updated. So if we have 1 million rows in the table, we can insert 200,000 new rows, delete 200,000 rows or update them before statistics update is triggered. Obviously, more rows we have,
less often statistics would be updated.

That list is not completed by any means. But each of those reasons is big enough by itself to start thinking about data partitioning.

Next: When to partition the data

Sliding window scenario – when data spilled out to the right-most partition

After my last post I’ve received a couple emails asking how did I fix the problem. Just to remind – data spilled out to the right-most partition of the table and as result sliding window scenario did not work anymore. When process tried to split right-most, non-empty partition, it obtained long-time schema modification (SCH-M) lock. And then start to scan/move data while all access to the table was blocked. Today we are going to discuss how to fix that.

The most important question is if we can put the system offline during that operation. If this is not the case, well, our life becomes much more complicated. And our options for online recovery are rather limited. Let’s take a look. First, I would create partition function, scheme, table and populate it with some data with ID between 1 and 1250.

As you can see, I duplicated the case we had – we spilled out some data to the right most partition (with ID > 1100) which should be empty.

Let’s create another correct partition function and scheme:

For online recovery we can re-create our indexes online moving them to another partition. I’m doing it as part of transaction just to roll everything back to the initial “invalid” stage.

This operation would run with the minimal locking and achieve our goal. There are 3 things worth to mention though. First – online index rebuild uses row versioning and as result you will introduce quite heavy tempdb usage during that operation. Next, we are “rebuilding” much more data than actually needed – think about it – we have just one (right most) partition that needs to be recreated but we are rebuilding entire table. But more critical for us that this approach would not work if you have primary key constraint defined on the table. Unfortunately there is no way to recreate constraint moving it to another partition. Basically it means that you would not be able to resolve the issue without putting your system offline. Well, to be exact, if you have non-clustered primary key you can replace it with unique non-clustered index if this is acceptable. But still..

So what can we do with the primary key constraint? First, let’s prepare our table. As the side note – I’m disabling non-clustered index to prevent extra rebuild when we drop the clustered index.

As I mentioned before, we would not be able to do the operation without putting system offline. The goal though is to reduce offline time as well as to process/move as little data as possible (our right most partition only).

First of all, let’s create another table with indexes using new (corrected) partition scheme.

The problem here is that new table should have exactly the same physical row structure with the old one. And if you dropped or altered any column in the old table, your physical structure would not match table definition. If this is the case, you’ll need to create and alter the new table in the same way. Simplest way to compare is to run the following statement after new table is created.



I strongly suggest to test that you did not miss anything with the latest backup. Otherwise you could end up in the nasty situation when your data spread across multiple tables.

Anyway, assuming everything is fine our action plan would be:

  1. Copy all data from invalid partition(s) from Data to Data2. Let’s say that step starts at time T1.
  2. Apply all data changes that happened between T1 and now. Assuming that step starts at time T2. This step is optional but it would reduce downtime
  3. Put system offline
  4. Apply all changes from Data that happened between T2 and now. No new changes would be done in Data during this step because system is offline
  5. Switch all “normal” partitions from Data to Data2
  6. At this point Data would have only right-most incorrect partition, Drop (or rename) the table.
  7. Rename Data2 to Data (including constraints)
  8. Bring system online

Obviously one of the questions is how to find what is changing in the table while we are running steps 1 and 2. Well, it depends, of course. If you have identity or sequence keys, you can use them to track inserts. For updates – you can log time of update – perhaps by creating the new column and populating it with the trigger. For deletions – trigger again. Let’s do our prep work. I assume in that example that we are using sequences (or identity) for inserts. First – updates.

Now deletions. I’m using very simple approach here and going to apply all deletions at once during step 4.

Now we are ready to go. At the beginning of the step 1 we will log our ID and time to use it later.

This step would be quite time consuming and depend on amount of data we are copying. I’m also going to emulate some update activity in Data.

Now we need to apply changes in the original table that were made during step 1. Generally speaking, we can skip this step because we are going to do that again later when system is offline. But remember – our system is still online and this step would reduce downtime. In fact, if you have heavy activity you’d probably like to run this step again a few times.

As I mentioned before – I’m going to cut a corner here and don’t deal with deletions during that step. Again, we need to log ID and current time at the beginning of this step.

Again, I’m emulating some update activity in the system

Now it’s time to put system offline. And if your system is truly offline at this point, it’s also the good idea to create database snapshot to be able to “rollback” the changes.

So let’s apply all updates again including deletion activity.

Now we are almost ready to switch partitions. Only thing we need to do before is dropping our ModTime column and index we created.

And now the key part – switch and rename.

Now we can bring the system online and do the final clean up.

As you can see, the process is a bit complex. But at the end it minimizes the downtime. One thing to keep in mind – partition scheme and functions have been recreated under the new name so you need to take care of the code that references the old ones. SQL Search is your friend here.

While problem is fixed now the best way to deal with such kind of problems is don’t have them at all. As I mentioned before – all of that could be avoided if the size of partition has been re-evaluated from time to time.

The source code is available for download