Category Archives: SQL Server 2012

SQL Server Troubleshooting and Performance Tuning – Observer Effect

I hope you are OK during those turbulent times!

People often ask me about my most curious experience in SQL Server consulting. Well, as many of my colleagues, I have a lot of stories to tell (though many of them are still under NDA). One of my favorite ones is the story about the case when I was able to solve all performance problems on the server in just a few minutes. 

We, Database Professionals, love monitoring tools. They allow us to be more proactive and reduce incident detection and recovery times. Unfortunately, the benefits do not come for free – every type of monitoring adds the overhead to the system. In some cases, this overhead may be negligible and acceptable, in others – it may have significant impact to the server performance. 

I have seen many cases of inefficient monitoring. For example, I saw the tool that provided the information about index fragmentation by calling sys.dm_db_index_physical_stats function in DETAILED mode for every index in the database doing it every 4 hours. I also saw the tool, that constantly polled sys.dm_exec_requests view joining it with sys.dm_exec_query_plan function. It did not work very well on the system that handled hundreds of concurrent requests all the time. 

It is relatively easy with the tools – you can detect those inefficiencies just by looking at the expensive queries in the system. This is not always the case, however, with other technologies, like with monitoring done by Extended Events. They can stay almost invisible in the system, especially if targets are keeping up and don’t generate waits.

Let me show you an example and create xEvent session that captures queries that were executed in the system. This is very common one, is not it? 

create event session CaptureQueries on server 
add event sqlserver.rpc_completed
(
  set collect_statement=(1)
  action
  (
    sqlos.task_time,sqlserver.client_app_name
    ,sqlserver.client_hostname
    ,sqlserver.database_name
    ,sqlserver.nt_username
    ,sqlserver.sql_text
  )
),
add event sqlserver.sql_batch_completed(
  action
  (
    sqlos.task_time
    ,sqlserver.client_app_name
    ,sqlserver.client_hostname
    ,sqlserver.database_name
    ,sqlserver.nt_username
    ,sqlserver.sql_text
  )
),
add event sqlserver.sql_statement_completed
add target package0.event_file
(set filename=N'C:\PerfLogs\LongSql.xel',max_file_size=(200))
with
(
  max_memory=4096 KB
  ,event_retention_mode=allow_single_event_loss
  ,max_dispatch_latency=5 seconds
)

As the next step, let’s deploy it to the server that operates under heavy load with large number of concurrent requests. We will measure the throughput in the system with and without xEvent session running. Obviously, be careful and do not run it on production servers! 

Figure 1 below illustrates CPU load and number of batch requests per second in both scenarios. As you see, enabling xEvent session led to more than 20% decrease of throughput in the system.

Figure 1. System throughput with and without xEvent session

And coming back to my story – this is exactly what happened. The customer had very similar session running. Amazingly, but not surprisingly, nobody used collected data. The team was unaware that this session even existed. 

After some investigation, they found that they created this session several months ago during investigation of some performance issues. After investigation had been completed, they stopped the session. Unfortunately, they overlooked that the session was created with “Start the event session at server startup” flag. At some point the server was patched, rebooted and voila!

I did not make much money from that case – after all, I am billing by the hour. But the look at the customer’s face was priceless!   

The morale of this story – be careful with the monitoring. Know what you are doing and do not keep unnecessary data collectors running. 

This is the link to the video version of this blog post. 🙂

The danger of HADR_SYNC_COMMIT wait and synchronous replication in AlwaysOn Availability Groups

The life of Database Engineer is never boring. We are constantly pushing the boundaries and dealing with more complex systems every year. Our challenges are also changing over time. Our current problems did not even exist 5 years ago. New and powerful hardware creates bottlenecks in unexpected areas.

One of such examples is the blocking introduced by AlwaysOn Availability Groups and synchronous commit. This condition may occur with any transaction log-intensive operations, especially on the modern servers with powerful CPUs and fast flash-based storage.

As you know, Availability Groups consist of one primary and one or more secondary nodes/servers. All data modifications are done on the primary node, which sends the stream of t-log records to secondaries. Those log records are saved (hardened) in transaction logs there and asynchronously re-applied to the data files by the set of REDO threads.

The secondary nodes may be configured using asynchronous or synchronous commit. With asynchronous commit, transaction considered to be committed and all locks were released when COMMIT log record is hardened on the primary node. SQL Server sends COMMIT record to secondary node; however, it does not wait for the confirmation that the record had been hardened in the log there.

This behavior changes when you use synchronous commit as shown in Figure 1. In this mode, SQL Server does not consider transaction to be committed until it receives the confirmation that COMMIT log record is hardened in the log on the secondary node. The transaction on primary will remain active with all locks held in place until this confirmation is received. The session on primary is suspended with HADR_SYNC_COMMIT wait type.

Figure 1. Synchronous Commit in AlwaysOn Availability Groups

You can analyze impact of commit latency by looking at HADR_SYNC_COMMIT wait in wait statistics. It is not uncommon nowadays to see this wait to be at the top of the list in busy OLTP systems. Figure 2 below illustrates the output from one of my production servers.

Figure 2. Wait Statistics from one of Production Servers

While we may argue that the Average Wait Time of 3ms would not introduce noticeable overhead, the situation may quickly change when you run t-log intensive processes that generate large amount of transaction log records. Those processes may quickly saturate the Send queue and dramatically increase HADR_SYNC_COMMIT wait/latency. This will affect all processes and transactions in the system – all of them share the same Send queue.

Unfortunately, index maintenance is one of such processes. Index reorgs and rebuilds may create enormous amount of transaction log records, especially when LOB columns are involved. You can watch the short YouTube video that illustrates how index rebuild operation put the system down in no time.

There are a few things you can try to improve the situation:

  • You can throttle index maintenance performance by reducing MAXDOP for the operation. This will reduce log generation rate at cost of increased execution time. Unfortunately, even MAXDOP=1 may not be enough in some cases
  • Alternatively, in Enterprise Edition of SQL Server, you can throttle the process with Resource Governor, limiting IOPS and CPU bandwidth. This approach may require good amount of trial and errors to achieve optimal configuration
  • In SQL Server 2017 and above, you can utilize resumable index rebuild and build the process which will frequently pause the operation allowing Availability Group queues to catch up. I am not the big fan of this approach; however, it may work when properly implemented.

Unfortunately, even with some tricks, in many cases, you’d still need to switch to asynchronous commit mode during massive index operations. Well, it is the great example that everything in SQL Server fits into “It Depends” category. Even good hardware.

Queue Monitoring

As you can guess, it is essential to monitor Availability Group queues. You need to keep an eye on both, Send and Redo queues.

  • In case of synchronous replicas, large Send queue implies potential blocking. With asynchronous replicas, large Send queue dictates potential data loss – data from the queue has not been sent to secondaries.
  • Large Redo queue implies long database recovery time in case of failover. All records from the queue need to be applied to the database before it becomes available. Moreover, in case of readable secondaries, large Redo queue indicates that secondary replica may fall behind being out of date.

The script below provides you the information about Availability Group health and replication latency along with Send and Redo queues sizes. It is available for download (link is in the end of the post)

I usually run that script from SQL Server job every a few minutes and use it to send alert to the team when Send or Redo queue sizes exceed predefined threshold. Obviously, thresholds may vary based on workload the databases handle.

Addressing High HADR_SYNC_COMMIT waits

I consider HADR_SYNC_COMMIT as the dangerous wait and want to address it when it is noticeable in the wait statistics.

Generally, there are three main factors that contribute to this wait:

  • The time, log record waits in the send queue. You can analyze this with the code from Listing 1 using the data from [Send Queue Size(KB)] and [Send Rate KB/Sec] columns. It is worth noting that the queue management process is CPU intensive, which may lead to additional latency in the systems with high CPU load.
  • Network throughput. You can troubleshoot it with network-related performance counters along with SQL Server:Availability Replica > Bytes Sent performance counter. Obviously, make sure that your network infrastructure is configured in the optimal way
  • I/O latency on secondary nodes. Synchronous commit requires COMMIT log record to be hardened in transaction log before acknowledgement is sent back to the primary node. You can monitor write latency of transaction log file using sys.dm_db_virtual_file_stats view. You can download the script to do it through the link in the end of the blog post.

While network and I/O performance may, sometimes, be addressed by hardware upgrades; it is much harder to deal with the latency, introduced by large amount of log records in very busy OLTP systems. You can reduce the overhead of queue management by utilizing CPUs with higher clock speed; however, there are some limits on what you can achieve with the hardware.

There are several things you can do when you experienced this situation:

  • Make sure that SQL Server schedulers are evenly balanced across NUMA nodes. For example, if SQL Server is using 10 cores on 2 NUMA-node server with 8 cores per node, set affinity mask to use 5 cores per node. Unevenly balanced schedules may introduce various performance issues in the system and affect Availability Groups throughput.
  • Reduce amount of log records generated in the system. Some of the options are redesigning transaction strategy avoiding autocommitted transactions; removing unused and redundant indexes; fine-tuning index FILLFACTOR property reducing the page splits in the system.
  • Rearchitect data tier in the system. It is very common that different data in the system may have different RPO (recovery point objective) requirements and tolerance to the data loss. You may consider to move some data to another Availability Group that does not require synchronous commit and/or utilize NoSQL technologies for some entities.

Finally, if you are using SQL Server prior 2016, you should consider to upgrade to the latest version of the product. SQL Server 2016 has several internal optimizations, which dramatically increase Availability Groups throughput comparing to SQL Server 2012 and 2014. It may be the simplest solution in many cases.

Figure 3 below illustrates HADR_SYNC_COMMIT waits before and after SQL Server 2016 upgrade in one of the systems. As you can see, cumulative time for HADR_SYNC_COMMIT waits had been reduced for more than 3 times after upgrade. It is also worth noting that general system performance had been improved as well.

Figure 3. HADR_SYNC_COMMIT waits before and after SQL Server 2016 upgrade

While having zero data loss is the great thing, remember about synchronous commit overhead. Monitor Availability Groups Queues and HADR_SYNC_COMMIT in the system.

Several useful scripts.

Link to YouTube video with the demo.

PS. I am teaching full day SQL Server Internals class at SQL Saturday #862 in Cork, Ireland on June 28th, 2019. You can register to pre-con here.

The Hidden Danger of Readable Secondaries in SQL Server AlwaysOn Availability Groups

A long time ago in a galaxy far, far away, I had to troubleshoot interesting performance issue in SQL Server. Suddenly, the CPU load on the server started to climb up. Nothing changed in terms of workload. The system still processed the same amount of requests. The execution plans of the critical queries stayed the same. Nevertheless, the CPU usage grew up slowly and steadily by a few percent per hour.

Eventually, we nailed it down. The problem occured in very busy OLTP system with very volatile data. We noticed that system performed much more I/O (logical and physical) than before. It was very strange, because nothing should have changed that day. Finally, we found that we have large number of deleted rows in the database that had not been cleaned up by ghost cleanup task.

It was the most impactful for the few “queue” tables in the system that handled hundreds of inserts and deletes per second. The tables had millions of data pages despite that they stored just a handful of rows. The tables supposed to be small and there were bunch of queries that performed Clustered Index Scan over them burning a lot of CPU resources to perform mullions of logical and physical reads.

The output of sys.dm_db_index_physical_stats looked similar to Figure 1 below. Very little record_count with very large number of page_count and version_ghost_record_count. The latter one indicates how many ghosted rows cannot be cleaned up due to row-versioning transactions in the system.

01. Output of sys.dm_db_index_physical_stats

We looked at the active transaction and discovered, that one of our readable secondaries ran terrible SELECT that cross-join several hundred-million row tables. Unfortunately, SQL Server allowed it to execute and this select was running for more than 12 hours deferring ghost and version store clean-up on primary node. The bottom line – non-optimized workload on the readable secondary node can affect your primary. Heh.

You can read why it happens in my recent blog post at Apress.com and in my upcoming book on locking, blocking and concurrency (should be published in October). However, I also wanted to mention it here and provide you the demo scripts and video that demonstrates it.

Remember about this overhead. Secondary nodes are great to scale your read workload but they are by no means “set it and forget about it” kind of solution.

Blocking Monitoring Framework: Capture and Analyze SQL Server Blocking and Deadlock Information With Event Notifications

If you worked with SQL Server for a while, you should have noticed how landscape changed over the years. We are dealing with the different problems now. Five years ago, majority of the issues I saw were related to non-optimized queries. There were the huge scans with a lot of  physical I/O and bad performance.

You do not see them as often nowadays. It is very cheap to solve the problems by upgrading the server. Put a couple hundred GBs of RAM and cache all the data; add more CPUs and problems magically disappear. The root-cause has not been fixed but who cares?

Surprisingly, there is one category of the issues that did not went away – concurrency. It even becomes worse. Modern servers handle more users and problems that did not exist with 50 concurrent users may put the server to its knees with 5000 users. I’ve been constantly involved in the troubleshooting of various concurrency issues and, in fact, I see more and more of them overtime.

Troubleshooting of the blocking and concurrency issues is, in the nutshells, a simple process. You need to identify the processes involved in blocking conditions or deadlocks and analyze why those processes acquire the locks on the same resources. In majority of cases, you need to analyze queries and their execution plans identifying possible inefficiencies that led to excessive number of locks being acquired.

Collecting this information is not a trivial task. The information is exposed through DMVs (you can download the set of scripts here); however, it requires you to run the queries at time when blocking occurred. Fortunately, SQL Server allows you to capture blocking and deadlock conditions with the blocked process report and deadlock graph, analyzing them later.

There is the caveat though. Neither blocked process report nor deadlock graph provide you execution plans of the statements. Nor do they always include affected statements in the plain text. You may need to query plan cache and other DMVs to get this information and longer you wait lesser is the chance that the information is available. Moreover, SQL Server may generate enormous number of blocked process reports in cases of prolonged blocking and complex blocking chains, which complicates the analysis.

This analysis may become very time consuming, especially if you need to deal with the large number of servers. Over the years, I have created the set of routines, which simplify it for me. I have been thinking to publish my scripts for a while, but I’ve never had time to polish them enough for public consumption. Until now – and I am very happy to share my collection with all of you. So allow me introduce the Blocking Monitoring Framework, which I am using with majority of my servers!

This framework is using Event Notifications. It captures blocked process report and deadlock graphs and parses them at time when event occurred and all data is still available in the system. All information is persisted in the set of tables for the further analysis.

The first version is available for download. I also promise you that I am going to support and enhance it in the future publishing the new versions on the regular basis.

Please feel free to contact me in case of any questions. I would also appreciate if you provide me any blocked process reports and deadlock graphs that framework was unable to parse. I will address the issues as quickly as I could.

Blocking Monitoring Framework

Locking in Microsoft SQL Server (Part 21) – Intro Into Transaction Management and Error Handling

As strange as it sounds, I have never considered transaction management to be the topic that requires explanation. However, it seems that some aspects of it are confusing and may benefit from the separate, long overdue, blog post.

Transaction Types

There are three types of transactions in SQL Server – explicit, autocommitted and implicit.

Explicit transactions are explicitly controlled by the code. You can start them by using BEGIN TRAN statement. They will remain active until you explicitly call COMMIT or ROLLBACK in the code.

In case, when there are no active transactions present, SQL Server would use autocommitted transactions – starting transactions and committing them for each statement it executes. Autocommitted transactions work on per-statement rather than per-module level. For example, when a stored procedure consists of five statements; SQL Server would have five autocommitted transactions executed. Moreover, if this procedure failed in the middle of execution, SQL Server would not roll back previously committed autocommitted transactions. This behavior may lead to logical data inconsistency in the system.

For the logic that includes multiple data modification statements, autocommitted transactions are less efficient than explicit transactions due to the logging overhead they introduce. In this mode, every statement would generate transaction log records for implicit BEGIN TRAN and COMMIT operations, which leads to the large amount of transaction log activity and degrade performance of the system.

There is another potential performance hit caused by excessive number of autocommitted transactions. As you know, SQL Server implements Write-Ahead Logging to support transaction durability hardening log records on-disk synchronously with data modifications. Internally, however, SQL Server batches log write operations and caches log records in memory in small 60KB structures called log buffers. COMMIT log record forces SQL Server to flush log buffers to disk introducing synchronous I/O operation.

Figure 1 illustrates this condition. INSERT_1, UPDATE_1 and DELETE_1 operations run in autocommitted transactions generating additional log records and forcing log buffer to flush on each COMMIT. Alternatively, INSERT_2, UPDATE_2 and DELETE_2 operations run in implicit transaction, which leads to more efficient logging.

01. Transaction Logging with Autocommitted and Explicit Transactions

You can run the code below to see this overhead in action. It performs INSERT/UPDATE/DELETE sequence 10,000 times in the loop in autocommitted and explicit transactions respectively, measuring execution time and transaction log throughput with sys.dm_io_virtual_file_stats view.

create table dbo.TranOverhead
(
    Id int not null,
    Col char(50) null,
    constraint PK_TranOverhead
    primary key clustered(Id)
);

-- Autocommitted transactions
declare
    @Id int = 1,
    @StartTime datetime = getDate(),
    @num_of_writes bigint,
    @num_of_bytes_written bigint

select @num_of_writes = num_of_writes, @num_of_bytes_written = num_of_bytes_written
from sys.dm_io_virtual_file_stats(db_id(),2);

while @Id < 10000
begin
    insert into dbo.TranOverhead(Id, Col) values(@Id, 'A');

    update dbo.TranOverhead set Col = 'B' where Id = @Id;

    delete from dbo.TranOverhead where Id = @Id;

    set @Id += 1;
end;

select 
    datediff(millisecond, @StartTime, getDate()) as [Exec Time ms: Autocommitted Tran]
    ,s.num_of_writes - @num_of_writes as [Number of writes]
    ,(s.num_of_bytes_written - @num_of_bytes_written) / 1024 as [Bytes written (KB)]
from
    sys.dm_io_virtual_file_stats(db_id(),2) s;
go

-- Explicit Tran
declare
    @Id int = 1,
    @StartTime datetime = getDate(),
    @num_of_writes bigint,
    @num_of_bytes_written bigint

select @num_of_writes = num_of_writes, @num_of_bytes_written = num_of_bytes_written
from sys.dm_io_virtual_file_stats(db_id(),2);

while @Id < 10000
begin
    begin tran
        insert into dbo.TranOverhead(Id, Col) values(@Id, 'A');

        update dbo.TranOverhead set Col = 'B' where Id = @Id;

        delete from dbo.TranOverhead where Id = @Id;
    commit
    set @Id += 1;
end;

select 
    datediff(millisecond, @StartTime, getDate()) as [Exec Time ms: Explicit Tran]
    ,s.num_of_writes - @num_of_writes as [Number of writes]
    ,(s.num_of_bytes_written - @num_of_bytes_written) / 1024 as [Bytes written (KB)]
from
    sys.dm_io_virtual_file_stats(db_id(),2) s;

In my environment and this test, explicit transactions are about two times faster and generated three less log activity than autocommitted ones as shown in Figure 2.

02. Explicit and Autocommitted Transaction Performance

As the side note, SQL Server 2014 and above allows you to improve transaction log throughput by using delayed durability. In this mode, SQL Server does not flush log buffers when COMMIT log records are generated. This reduces the number of disk writes at cost of potential small data loss in case of disaster.

SQL Server also supports implicit transactions, which you can enable with SET IMPLICIT_TRANSACTION ON statement. When this option is enabled, SQL Server starts the new transaction when there is no active explicit transactions present. This transaction stays active until you explicitly issue COMMIT or ROLLBACK statement.

Implicit transactions may make transaction management more complicated and they are rarely used in production. However, there is the caveat – SET ANSI_DEFAULT ON option also automatically enables implicit transactions. This behavior may lead to unexpected concurrency issues in the system.

Error Handling

The error handling in SQL Server is the tricky subject especially with transactions involved. SQL Server would handle exceptions differently depending on error severity, active transaction context and several other factors.

Let’s look how exceptions affect control flow during execution. Listing below creates two tables- dbo.Customers and dbo.Orders – and populates them with the data. Note the existence of foreign key constraint defined in dbo.Orders table.

create table dbo.Customers
(
    CustomerId int not null,
    constraint PK_Customers
    primary key(CustomerId)
);

create table dbo.Orders
(
    OrderId int not null,
    CustomerId int not null,

    constraint FK_Orders_Customerss
    foreign key(CustomerId)
    references dbo.Customers(CustomerId)
);
go

create proc dbo.ResetData
as
begin
    begin tran
        delete from dbo.Orders;
        delete from dbo.Customers;
        insert into dbo.Customers(CustomerId) values(1),(2),(3);
        insert into dbo.Orders(OrderId, CustomerId) values(2,2);
    commit
end;

exec dbo.ResetData;

Let’s run three DELETE statements in one batch as shown below. The second statement will trigger a foreign key violation error. @@ERROR system variable provides the error number for the last T-SQL statement executed (0 means no errors).

delete from  dbo.Customers where CustomerId = 1; -- Success
select @@ERROR as [@@ERROR: CustomerId = 1];
delete from  dbo.Customers where CustomerId = 2; -- FK Violation
select @@ERROR as [@@ERROR: CustomerId = 2];
delete from  dbo.Customers where CustomerId = 3; -- Success
select @@ERROR as [@@ERROR: CustomerId = 3];
go

select * from dbo.Customers;

Figure 3 illustrates the output of the code. As you can see, SQL Server continues execution after non-critical foreign key violation error deleting a row with CustomerId=3 afterwards.

03. Running Three Autocommitted Transactions in a Batch

The situation would change when you use TRY..CATCH block as shown below.

exec dbo.ResetData;
go

begin try
    delete from  dbo.Customers where CustomerId = 1; -- Success
    delete from  dbo.Customers where CustomerId = 2; -- FK Violation
    delete from  dbo.Customers where CustomerId = 3; -- Not executed
end try
begin catch
    select 
        ERROR_NUMBER() as [Error Number]
        ,ERROR_LINE() as [Error Line]
        ,ERROR_MESSAGE() as [Error Message];
end catch
go

select * from dbo.Customers;

As you can see in Figure 4, the error was caught in the CATCH block and the third deletion statement has not been executed.

04. Running Three Autocommitted Transactions in TRY..CATCH block

Remember that non-critical exceptions do not automatically rollback explicit or implicit transactions regardless if TRY..CATCH block is present. You still need to commit or rollback transaction after the error.

Depending on severity of the error, transaction in which error occurred may be committable or become uncommittable and doomed. SQL Server would not allow you to commit uncommittable transactions and you must roll it back to complete it.

The XACT_STATE() function allows you to analyze the state of transaction and it returns one of three values:

  • 0 indicates that there is no active transactions present.
  • 1 indicates that there is an active and committable transaction present. You can perform any actions and data modifications committing transactions afterwards.
  • -1 indicates that there is an active uncommittable transaction present. You cannot commit such transaction.

There is very important SET option- XACT_ABORT– which allows you to control error-handling behavior in the code. When this option is set to ON, SQL Server treats every run-time error as severe, making transaction uncommittable. This prevents you from accidentally committing transactions when some data modifications failed with non-critical errors.

When XACT_ABORT is enabled, any error would terminate the batch when TRY..CATCH block is not present. For example, if you run the code from the second code sample above again using SET XACT_ABORT ON, the third DELETE statement would not be executed and only the row with CustomerId=1 will be deleted. Moreover, SQL Server would automatically rollback doomed uncommitted transaction after the batch completes.

The code below shows this behavior. The stored procedure dbo.GenerateError sets XACT_ABORT to ON and generates an error within the active transaction. @@TRANCOUNT variable returns the nested level of transaction (more on it later) and non-zero value indicate that transaction is active.

create proc dbo.GenerateError
as
begin
    set xact_abort on
    begin tran
        delete from dbo.Customers where CustomerId = 2; -- Error
        select 'This statement will never be executed';
end
go

exec dbo.GenerateError;
select 'This statement will never be executed';
go

-- Another batch
select XACT_STATE() as [XACT_STATE()], @@TRANCOUNT as [@@TRANCOUNT];
go

Figure 5 illustrates the output of the code. As you can see, batch execution has been terminated and transaction has been automatically rolled back at the end of the batch.

505. XACT_ABORT behavior (no TRY..CATCH block)

TRY..CATCH block, however, will allow you to capture the error even with XACT_ABORT set to ON.

begin try
    exec dbo.GenerateError;
    select 'This statement will never be executed';
end try
begin catch
    select 
        ERROR_NUMBER() as [Error Number]
        ,ERROR_PROCEDURE() as [Procedure]
        ,ERROR_LINE() as [Error Line]
        ,ERROR_MESSAGE() as [Error Message];

    select 
        XACT_STATE() as [XACT_STATE()]
        ,@@TRANCOUNT as [@@TRANCOUNT];

    if @@TRANCOUNT > 0
        rollback;
end catch

As you can see in Figure 6, exception has been trapped in the CATCH block with transaction still remain active there.

06. XACT_ABORT Behavior (with TRY..CATCH block)

Consistent error handling and transaction management strategies are extremely important and allow to avoid data consistency errors and improve data quality in the system. I would recommend the following approach as the best practice:

  • Always use explicit transactions in the code during data modifications. This would guarantee data consistency in transactions that consists of multiple operations. It is also more efficient comparing to individual autocommitted transactions.
  • Set XACT_ABORT to ON before data modifications. This would guarantee “all-or-nothing” behavior of the transaction preventing SQL Server from ignoring non-severe errors and committing partially completed transactions.
  • Use proper error handling with TRY..CATCH blocks and explicitly rollback transactions in case of exceptions. This helps to avoid unforeseen side effects in case of the errors.

The choice between client-side and server-side transaction management depends on application architecture. Client-side management is required when data modifications are done in the application code, for example changes are generated by ORM frameworks. On the other hand, stored procedure-based data access tier may benefit from server-side transaction management.

The code below provides the example of the stored procedure that implements server-side transaction management.

create proc dbo.PerformDataModifications
as
begin
    set xact_abort on
    begin try
        begin tran
            /* Perform required data modifications */
        commit
    end try
    begin catch
        if @@TRANCOUNT > 0 -- Transaction is active
            rollback;
        /* Addional error-handling code */
        throw;  -- Re-throw error. Alternatively, SP may return the error code
    end catch;
end;

Nested Transactions

SQL Server technically supports nested transactions; however, they are primarily intended to simplify transaction management during nested stored procedure calls. In practice, it means that the code needs to explicitly commit all nested transactions and the number of COMMIT calls should match the number of BEGIN TRAN calls. The ROLLBACK statement, however, rolls back entire transaction regardless of the current nested level.

The code below demonstrates this behavior. As I already mentioned, system variable @@TRANCOUNT returns the nested level of the transaction.

select @@TRANCOUNT as [Original @@TRANCOUNT];
begin tran
    select @@TRANCOUNT as [@@TRANCOUNT after the first BEGIN TRAN];
    begin tran
        select @@TRANCOUNT as [@@TRANCOUNT after the second BEGIN TRAN];
    commit
    select @@TRANCOUNT as [@@TRANCOUNT after nested COMMIT];    
    begin tran
        select @@TRANCOUNT as [@@TRANCOUNT after the third BEGIN TRAN];
    rollback
select @@TRANCOUNT as [@@TRANCOUNT after ROLLBACK];
rollback; -- This ROLLBACK generates the error

You can see the output of the code in Figure 7 below.

07. Nested Transactions

You can save the state of transaction and create a savepoint by using SAVE TRANSACTION statement. This will allow you to partially rollback a transaction returning to the most recent savepoint. The transaction will remain active and needs to be completed with explicit COMMIT or ROLLBACK statement later.

It is worth noting that uncommittable transactions with XACT_STATE() = -1 cannot be rolled back to savepoint. In practice, it means that you cannot rollback to savepoint after an error if XACT_ABORT is set to ON.

The code below illustrates savepoints in action. The stored procedure creates the savepoint when it runs in active transaction and rolls back to this savepoint in case of committable error.

create proc dbo.TryDeleteCustomer
(
    @CustomerId int 
)
as
begin
    -- Setting XACT_ABORT to OFF for rollback to savepoint to work
    set xact_abort off

    declare
        @ActiveTran bit

    -- Check if SP is calling in context of active transaction
    set @ActiveTran = IIF(@@TranCount > 0, 1, 0);

    if @ActiveTran = 0
        begin tran;
    else 
        save transaction TryDeleteCustomer;

    begin try
        delete dbo.Customers where CustomerId = @CustomerId;

        if @ActiveTran = 0
            commit;
        return 0;
    end try
    begin catch
        if @ActiveTran = 0 or XACT_STATE() = -1
        begin
            -- Rollback entire transaction
            rollback tran; 
            return -1; 
        end
        else begin
                -- Rollback to savepoint 
            rollback tran TryDeleteCustomer; 
            return 1; 
        end
    end catch;
end;
go

-- Test
declare
    @ReturnCode int

exec dbo.ResetData; 

begin tran
    exec @ReturnCode = TryDeleteCustomer @CustomerId = 1;
    select 
        1 as [CustomerId]
        ,@ReturnCode as [@ReturnCode]
        ,XACT_STATE() as [XACT_STATE()];
    
    if @ReturnCode >= 0
    begin
        exec @ReturnCode = TryDeleteCustomer @CustomerId = 2;
        select 
            2 as [CustomerId]
            ,@ReturnCode as [@ReturnCode]
            ,XACT_STATE() as [XACT_STATE()];    
    end
if @ReturnCode >= 0
    commit;
else 
    if @@TRANCOUNT > 0
        rollback;
go

select * from dbo.Customers;

The test triggered foreign key violation during the second dbo.TryDeleteCustomer call. This is non-critical error and, therefore, the code is able to commit after it as shown in Figure 8.

08. Transaction Has Been Committed After Rollback to Savepoint

It is worth noting that this example is shown for demonstration purposes only. From efficiency standpoint, it would be better to validate referential integrity and existence of the orders before deletion occurred rather than catching exception and rolling back to savepoint in case of an error.

I hope that those examples provided you the good overview of transaction management and error handling strategies in the system. If you want to dive deeper, I would strongly recommend you to read the great article by Erland Sommarskog, which provides you much more details on the subject.

Source code is available for download.

Table of Context

Splitting and Merging Non-Empty Partitions in Columnstore Indexes

Last week we discussed how to implement Sliding Window pattern in the tables with columnstore indexes. As I promised, today we are going to focus on partition management in such tables.

As you will remember, SQL Server does not allow you to split and merge non-empty columnstore partitions. I personally do not consider it as the huge limitation – perhaps because I rarely have reasons to do it in columnstore tables. However, there are still some cases when it may be required.

As one of examples, think about a table, partitioned by date, which may be populated with incorrect data from the future. Perhaps due to some issues in ETL processes. Some rows may be placed to incorrect (future) partition, which would prevent its split.

The common, by the book approach recommends dropping columnstore index, splitting or merging partitions and recreating the index afterwards. As you can imagine, it would lead to extremely inefficient process with huge amount of unnecessary overhead on large tables. After all, you have to drop and recreate columnstore index, converting table to Heap, while just subset of the partitions needs to be rebuilt. Fortunately, you can minimize the overhead with simple workaround:

  1. Switch partition(s) to split or merge to the separate staging table
  2. Split or merge partition(s) in the main table. You can do that because partitions will be empty after the previous step
  3. Drop columnstore index in the staging table, split/merge partition(s) there and recreate the index afterwards
  4. Switch partition(s) back from staging to the main table.

Let’s look at the process in details. I am going to recreate the tables I used last week with the script below.

create partition function pfOrders(datetime2(0))
as range right for values
('2016-11-01','2016-12-01','2017-01-01','2017-02-01'
,'2017-03-01','2017-04-01','2017-05-01','2017-06-01'
,'2017-07-01','2017-08-01','2017-09-01','2017-10-01'
,'2017-11-01','2017-12-01','2018-01-01');

create partition scheme psOrders 
as partition pfOrders
all to ([Primary]);

create table dbo.Orders
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        Placeholder char(100),
)
on psOrders(OrderDate);

create table dbo.OrderLineItems
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        OrderLineItemId int not null,
        Placeholder char(100),
)
on psOrders(OrderDate);
go

-- Left-most and right-most are empty
;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as t1 cross join N1 as t2) -- 4 rows
,N3(C) as (select 0 from N2 as t1 cross join N2 as t2) -- 16 rows
,N4(C) as (select 0 from N3 as t1 cross join N3 as t2) -- 256 rows
,N5(C) as (select 0 from N4 as t1 cross join N4 as t2) -- 65,536 rows
,Ids(Id) as (select row_number() over (order by (select null)) from N5)
insert into dbo.Orders(OrderDate, OrderId)
        select dateadd(day,Id % 390,'2016-11-01'), ID
        from Ids;

insert into dbo.OrderLineItems(OrderDate,OrderId, OrderLineItemId)
        select OrderDate, OrderId, OrderId
        from dbo.Orders;
go

create clustered columnstore index CCI_Orders on dbo.Orders
on psOrders(OrderDate);

create clustered columnstore index CCI_OrderLineItems on dbo.OrderLineItems
on psOrders(OrderDate);
go

select 
        t.object_id, i.index_id, i.name as [Index], p.partition_number,
        p.[Rows], p.data_compression_desc, 
        fg.name as [Filegroup],
        sum(a.total_pages) as TotalPages, 
        sum(a.used_pages) as UsedPages, 
        sum(a.data_pages) as DataPages,
        sum(a.total_pages) * 8 as TotalSpaceKB, 
        sum(a.used_pages) * 8 as UsedSpaceKB, 
        sum(a.data_pages) * 8 as DataSpaceKB
from 
        sys.tables t with (nolock) join sys.indexes i with (nolock) on
                t.object_id = i.object_id
        join sys.partitions p with (nolock) on 
                i.object_id = p.object_id AND i.index_id = p.index_id
        join sys.allocation_units a with (nolock) on 
                p.partition_id = a.container_id
        join sys.filegroups fg with (nolock) on 
                a.data_space_id = fg.data_space_id
where
        t.object_id = object_id(N'dbo.Orders')
group by
        t.object_id, i.index_id, i.name, p.partition_number,
        p.[Rows], p.data_compression_desc, fg.name
order by 
        i.index_id, p.partition_number;

Figure 1 demonstrates data distribution across partitions in the dbo.Orders table. As you can see, we have 13 non-empty partitions started from November 2016 up to end of November 2017.

01.Initial Data Distribution

Let’s assume that we would like to merge November and December 2016 together, which are partitions 2 and 3 in the tables.

As the first step, we will create the staging tables. They need to be partitioned in the same way as the main tables. It is better to use separate partition function to decouple merge and split operations in main and staging tables from each other. Remember that SQL Server would hold schema modification (Sch-M) lock on all tables that use partition function for the duration of merge and split operations.

Partition function on the staging tables does not need to include all partitions from the main tables. It just need to include partition(s) you are about it split or merge and two adjacent partitions on the left and right sides. This will guarantee that partitions in the main and staging tables are aligned before and after the operation. It would also support partition switch without requirement of creating extra CHECK constraints on the staging tables. As you can see in the code below, pfOrdersStaging function consists of four partitions:

  • Everything prior November 2016. Will be empty.
  • November 2016
  • December 2016
  • Everything on or after January 1st 2017. Will be empty
create partition function pfOrdersStaging(datetime2(0))
as range right for values
('2016-11-01','2016-12-01','2017-01-01');

create partition scheme psOrdersStaging
as partition pfOrdersStaging
all to ([Primary]);

create table dbo.OrdersStaging
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        Placeholder char(100),
)
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrdersStaging on dbo.OrdersStaging
on psOrdersStaging(OrderDate);

create table dbo.OrderLineItemsStaging
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        OrderLineItemId int not null,
        Placeholder char(100),
)
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging
on psOrdersStaging(OrderDate);

After the staging tables are created, we can switch two partitions from the main tables there.

alter table dbo.Orders switch partition 2 
to dbo.OrdersStaging partition 2;

alter table dbo.OrderLineItems switch partition 2 
to dbo.OrderLineItemsStaging partition 2;

alter table dbo.Orders switch partition 3
to dbo.OrdersStaging partition 3;

alter table dbo.OrderLineItems switch partition 3
to dbo.OrderLineItemsStaging partition 3;

Now we can merge the partitions in the main tables. SQL Server will allow us to proceed because they are empty after the switch.

alter partition function pfOrders() merge range ('2016-12-01');

Finally, we can merge partitions in our staging tables and switch them back to the main tables as shown below.

drop index CCI_OrdersStaging on dbo.OrdersStaging;
drop index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging;
go

-- Offline with Sch-M
alter partition function pfOrdersStaging()
merge range ('2016-12-01');
go

create clustered columnstore index CCI_OrdersStaging on dbo.OrdersStaging
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging
on psOrdersStaging(OrderDate);
go

-- Switching partitions back
alter table dbo.OrdersStaging switch partition 2 
to dbo.Orders partition 2;

alter table dbo.OrderLineItemsStaging switch partition 2 
to dbo.OrderLineItems  partition 2;

As you can see in Figure 2, now partition 2 stores the data for both November and December  of 2016.

02.Data Distribution After Merge

We can use the same approach for partition split. Code below demonstrates how to split November 2017 partition using 2017-11-15 as the new range value.

-- Spliting November 2017 partition (#13)
-- Recreating Staging objects first
drop table if exists dbo.OrdersStaging;
drop table if exists dbo.OrderLineItemsStaging;
if exists(select * from sys.partition_schemes where name = 'psOrdersStaging') drop partition scheme psOrdersStaging;
if exists(select * from sys.partition_functions where name = 'pfOrdersStaging') drop partition function pfOrdersStaging;
go

create partition function pfOrdersStaging(datetime2(0))
as range right for values
('2017-11-01','2017-12-01');

create partition scheme psOrdersStaging
as partition pfOrdersStaging
all to ([Primary]);

create table dbo.OrdersStaging
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        Placeholder char(100),
)
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrdersStaging on dbo.OrdersStaging
on psOrdersStaging(OrderDate);

create table dbo.OrderLineItemsStaging
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        OrderLineItemId int not null,
        Placeholder char(100),
)
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging
on psOrdersStaging(OrderDate);

-- Switching partitions to the staging table
alter table dbo.Orders switch partition 13
to dbo.OrdersStaging partition 2;

alter table dbo.OrderLineItems switch partition 13
to dbo.OrderLineItemsStaging partition 2;
go

-- Splitting partition in the main table 
alter partition scheme psOrders 
next used [PRIMARY];

alter partition function pfOrders()
split range ('2017-11-15');
go

-- Now spllitting partition in the Staging table
drop index CCI_OrdersStaging on dbo.OrdersStaging;
drop index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging;
go

-- Offline with Sch-M
alter partition scheme psOrdersStaging 
next used [PRIMARY];

alter partition function pfOrdersStaging()
split range ('2017-11-15');
go

create clustered columnstore index CCI_OrdersStaging on dbo.OrdersStaging
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging
on psOrdersStaging(OrderDate);

-- Switching partitions back
alter table dbo.OrdersStaging switch partition 2 
to dbo.Orders partition 13;

alter table dbo.OrdersStaging switch partition 3 
to dbo.Orders partition 14;

alter table dbo.OrderLineItemsStaging switch partition 2 
to dbo.OrderLineItems  partition 13;

alter table dbo.OrderLineItemsStaging switch partition 3
to dbo.OrderLineItems  partition 14;

As you can see, with just a few extra operators, we were able to remove the overhead of rebuilding columnstore index on entire table.

Unfortunately, this is offline approach. The data in the main tables would become inconsistent as soon as you switch partitions to the staging tables. Fortunately, if data is static, you can address it with some coding.

Instead of switching partitions to the staging tables, you can copy the data there. After split or merge is done in the staging tables, you can empty source partitions by switching them to another, temporary tables, and switch data back from the staging tables.

The code below illustrates this approach.

-- If data in the table is static 
-- Let's merge November-December 2016 and January 2017 
drop table if exists dbo.OrdersStaging;
drop table if exists dbo.OrderLineItemsStaging;
if exists(select * from sys.partition_schemes where name = 'psOrdersStaging') drop partition scheme psOrdersStaging;
if exists(select * from sys.partition_functions where name = 'pfOrdersStaging') drop partition function pfOrdersStaging;
go

create partition function pfOrdersStaging(datetime2(0))
as range right for values
('2016-11-01','2017-01-01','2017-02-01');

create partition scheme psOrdersStaging
as partition pfOrdersStaging
all to ([Primary]);

create table dbo.OrdersStaging
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        Placeholder char(100),
)
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrdersStaging on dbo.OrdersStaging
on psOrdersStaging(OrderDate);

create table dbo.OrderLineItemsStaging
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        OrderLineItemId int not null,
        Placeholder char(100),
)
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging
on psOrdersStaging(OrderDate);

-- Copying data from Main to Staging tables
insert into dbo.OrdersStaging(OrderDate,OrderId,Placeholder)
        select OrderDate, OrderId, Placeholder
        from dbo.Orders
        where $Partition.pfOrders(OrderDate) in (2,3);

insert into dbo.OrderLineItemsStaging(OrderDate,OrderId,OrderLineItemId,Placeholder)
        select OrderDate, OrderId, OrderLineItemId, Placeholder
        from dbo.OrderLineItems
        where $Partition.pfOrders(OrderDate) in (2,3);
go

-- Merging partitions in Staging tables
drop index CCI_OrdersStaging on dbo.OrdersStaging;
drop index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging;
go

alter partition function pfOrdersStaging()
merge range ('2017-01-01');
go

create clustered columnstore index CCI_OrdersStaging on dbo.OrdersStaging
on psOrdersStaging(OrderDate);

create clustered columnstore index CCI_OrderLineItemsStaging on dbo.OrderLineItemsStaging
on psOrdersStaging(OrderDate);
go

-- Creating temporary table to switch original
-- partitions from the main table
create partition function pfOrdersTmp(datetime2(0))
as range right for values
('2016-11-01','2017-01-01','2017-02-01');

create partition scheme psOrdersTmp
as partition pfOrdersTmp
all to ([Primary]);

create table dbo.OrdersTmp
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        Placeholder char(100),
)
on psOrdersTmp(OrderDate);

create clustered columnstore index CCI_OrdersTmp on dbo.OrdersTmp
on psOrdersTmp(OrderDate);

create table dbo.OrderLineItemsTmp
(
        OrderDate datetime2(0) not null,
        OrderId int not null,
        OrderLineItemId int not null,
        Placeholder char(100),
)
on psOrdersTmp(OrderDate);

create clustered columnstore index CCI_OrderLineItemsTmp on dbo.OrderLineItemsTmp
on psOrdersTmp(OrderDate);

-- Final steps. Let's do it in transaction
-- All operations are on metadata level
set xact_abort on
begin tran
        -- Switching original partitions out
        alter table dbo.Orders switch partition 2 
        to dbo.OrdersTmp partition 2;

        alter table dbo.OrderLineItems switch partition 2 
        to dbo.OrderLineItemsTmp partition 2;

        alter table dbo.Orders switch partition 3
        to dbo.OrdersTmp partition 3;

        alter table dbo.OrderLineItems switch partition 3
        to dbo.OrderLineItemsTmp partition 3;

        -- Merge
        alter partition function pfOrders()
        merge range ('2017-01-01');

        -- Switching partitions from the staging table
        alter table dbo.OrdersStaging switch partition 2 
        to dbo.Orders partition 2;

        alter table dbo.OrderLineItemsStaging switch partition 2 
        to dbo.OrderLineItems  partition 2;
commit;
go

-- Dropping temporary tables
drop table dbo.OrderLineItemsTmp;
drop table dbo.OrdersTmp;
drop partition scheme psOrdersTmp;
drop partition function pfOrdersTmp;

As you can see, this implementation is relatively simple and considering that data in columnstore indexes is often static, you may be able to utilize it to perform partition maintenance transparently to the users.

Source code is available for download.

Locking in Microsoft SQL Server (Part 20) – Range lock (RangeS-U) deadlock due to IGNORE_DUP_KEY index option

As you know, SQL Server uses range locks to protect the range of the index keys. This usually happens in SERIALIZABLE isolation level. This level prevents phantom and non-repeatable reads phenomena and it guarantees that queries executed in transaction will always obtain the same set of data every time they were executed.

As the simple example, consider the table with ID column and two rows with ID = 1 and ID = 10. In SERIALIZABLE isolation level, the query that selects data from the table should always obtain those and only those two rows if you run it multiple times. SQL Server uses shared range lock protecting ID key range interval of (1..10), which guarantees that other transactions would not be able to update or delete existing rows nor insert any new rows into the interval.

In contrast, REPEATABLE READ isolation level uses row locks rather than range locks. They would prevent modifications of existing rows- 1 and 10- but would not prevent other transactions from inserting rows into the interval and introducing phantom read phenomena.

The range locks are usually acquired only in SERIALIZABLE isolation level; however, there is another, pretty much undocumented case, when SQL Server can use those locks. It happens even in READ UNCOMMITTED and READ COMMITTED SNAPSHOT modes when you have nonclustered indexes that have IGNORE_DUP_KEY=ON option. In that case rows with the duplicated index keys would not raise an error but rather being ignored. SQL Server would not insert then into the table.

This behavior leads to very hard to explain cases of blocking and even deadlocks in the system. Let’s look at the example and create the table with a few rows as shown below. As you see, nonclustered index on the table has IGNORE_DUP_KEY option enabled.

create table dbo.IgnoreDupKeysDeadlock
(
    CICol int not null,
    NCICol int not null
);

create unique clustered index IDX_IgnoreDupKeysDeadlock_CICol
on dbo.IgnoreDupKeysDeadlock(CICol);

create unique nonclustered index IDX_IgnoreDupKeysDeadlock_NCICol
on dbo.IgnoreDupKeysDeadlock(NCICol)
with (ignore_dup_key = on);

insert into dbo.IgnoreDupKeysDeadlock(CICol, NCICol)
values(0,0),(5,5),(10,10),(20,20);

Now let’s start transaction in READ UNCOMMITTED mode and insert the row into the table checking the locks session acquired.

set transaction isolation level read uncommitted
begin tran
    insert into dbo.IgnoreDupKeysDeadlock(CICol,NCICol) values(1,1);

    select request_session_id, resource_type, resource_description
        ,resource_associated_entity_id, request_mode, request_type, request_status
    from sys.dm_tran_locks
    where request_session_id = @SPID;

As you can see in Figure 1, INSERT statement acquired and held two exclusive (X) locks on the rows inserted into clustered and nonclustered indexes. It also obtained Range (RangeS-U) lock on nonclustered index. RangeS-U means that the key range is protected with the shared (S) lock and SQL Server uses update (U) scan within the range.

01. Locks Held by the Session

01. Locks Held by the Session

You may ask the obvious question – why the range lock is required? The reason is the way how SQL Server handles modifications of the data. The data is always inserted into or updated in the clustered index first followed by nonclustered index updates. With IGNORE_DUP_KEY=ON, SQL Server should prevent the situation when the duplicated keys were inserted into nonclustered index simultaneously after clustered index insert was done and, therefore, clustered index insert needs to be rolled back. Thus, it locks nonclustered index key range before any data modifications preventing other sessions from inserting any rows there.

You can confirm it by looking at lock_acquired xEvent events as it is show in Figure 2. As you can see, the RangeS-U lock is acquired before exclusive (X) locks on the resources.

02. lock_acquired Events During Insert

02. lock_acquired Events During Insert

The key problem here, however, is that RangeS-U lock behaves the same way as in SERIALIZABLE isolation level and it is held until the end of transaction. Moreover,  RangeS-U locks are incompatible with each other.  That can lead to very unpleasant and hard to understand deadlocks.

Let’s run the code shown below in another session. The first INSERT would succeed (it is in the different key range in the index). The second, however, would be blocked due to RangeS-U/RangeS-U lock incompatibility.

set transaction isolation level read uncommitted
begin tran
    -- Succeed
    insert into dbo.IgnoreDupKeysDeadlock(CICol,NCICol)	values(12,12);
    -- Blocked
    insert into dbo.IgnoreDupKeysDeadlock(CICol,NCICol) values(2,2);
commit

If we checked the locks held by the both sessions now, we would see the picture shown in Figure 3. You can see that session 2 successfully acquired the first range lock but the second range lock request is blocked due to incompatible range lock on the same key interval held by the session 1.

03. Locks Held by Both Sessions

03. Locks Held by Both Sessions

Finally, if we run another INSERT in the session 1 into the range locked by the session 2, it would be also blocked with the typical deadlock condition.

    insert into dbo.IgnoreDupKeysDeadlock(CICol,NCICol) values(11,11);
commit

Figure 4 shows you the deadlock graph.

04. Deadlock Graph

04. Deadlock Graph

Unfortunately, there is very little you can do about that. The only way to address the problem is removing IGNORE_DUP_KEY option from the index handling duplicates in the different ways. It may or may not work for you.

Finally, it is worth mentioning, that SQL Server does not use range locks in case of clustered indexes with IGNORE_DUP_KEY=ON option. Clustered indexes are modified first and SQL Server could detect duplicated keys at this stage without any extra range locking required.

Source code is available for download.

Table of Content.

How To Move Data and Log Files To Different Drives Keeping the Database Online in Non-Enterprise Editions of SQL Server

We, SQL Server professionals, like Enterprise Edition. It has many bells and whistles that make our life easier and less stressful. We wish to have Enterprise Edition installed on every server. Unfortunately, customers do not always share our opinions – they want to save money. More often than not, they choose to go with the Standard Edition, which is significantly less expensive.

From performance standpoint, Standard Edition would suffice in many cases. Even though it lacks several nice features, it would work just fine even in large and busy systems. I dealt with many multi-TB installations that handled thousands transactions per second using Standard Edition of SQL Server.

Nevertheless, Standard edition lacks many of availability features offered in Enterprise Edition. Most important is index management. You cannot rebuild indexes keeping the table online. There are some tricks that can help reducing index rebuild time; however, it would not help much with the large tables.

This limitation has another interesting implication. In Standard Edition you cannot rebuild the indexes moving data to another filegroup transparently to the users. One of the cases when such an ability is very important is changing the database disk layout when you are upgrading disk subsystem. Obviously, it is very easy to do offline – this is just the matter of copying database files. However, even with the fast disk subsystem, that can take hours in multi-TB databases, which could violate your availability SLA.

This is especially critical with the Cloud installations where I/O subsystem is usually the biggest bottleneck due to the bad I/O performance. The situation, however, is starting to change. Both, Microsoft Azure and Amazon AWS now offer fast SSD-based I/O solutions for very reasonable price. Unfortunately, the old installations were usually deployed to the old and slow disks and upgrading to the new drives will often lead to the hours of the downtime.

Fortunately, you can move data to the different disk arrays almost transparently to the users even in non-Enterprise Editions of SQL Servers. There are two ways how to accomplish it. The first one is very simple and can be done if system uses database mirroring. It requires failovers and secondary server downtime, which could lead to the data loss in case of disaster.

The second approach works without the mirroring. It is slow, it generates large amount of transaction log records, it introduces huge index fragmentation; however, it keeps database online most of the time. There is still the downtime involved; although, it could be limited to just a few minutes. It will work in any SQL Server version and edition – well, to be frank, I have not tried it in SQL Server 2000 yet.

Let’s look at both of those approaches in details.

Moving database files with mirroring Involved

Database mirroring and, as matter of fact, Always On Availability Groups rely on the stream of transaction log records. Secondary servers apply the changes in the data files using file and page IDs as the reference. With exception of database file-related operations, for example file creation, primary and secondary servers do not need to store database files in the same location – it is possible to use different disk and folder structure on the servers.

You can rely on this behavior if you need to move database files to the different drives. You can run ALTER DATABASE MODIFY FILE(FILENAME=..) command on the secondary server, which will allow you to update data and log files paths in the system catalogs. Everything will continue run normally – those changes would not take place until the next database restart.

Unfortunately, you cannot take database that participate in the mirroring session offline and you need to shut down entire instance of SQL Server. After that, you can physically move database files to the new location. On the primary server, the database mirroring will switch to the DISCONNECTED state. The database will continue to be available to the clients; however, it remains unprotected – all changes will be lost in case of disaster. You need to remember that file copy operation can take hours and you need to evaluate if you can take such a risk. It is also worth to mention that transaction log on the primary would not truncate and continue to grow even after log backups – SQL Server needs to retain the log records until they sent to the secondary server.

After the file copy operation is completed, you can start the instance (the primary database will switch to SYNCHRONIZING state) and wait until all log records have been transmitted to the secondary (SYNCHRONIZED state). Then, you can failover and wash, rinse and repeat the process on the former primary server.

To summarize, this process is very simple and transparent to the client applications. It is the good choice as long as you can afford the instance downtime and possibility of  data loss in case of disaster. If this is not the case, you will have to use much more complicated approach.

When mirroring is not an option..

.. our life is much more complicated and the process will require multiple steps to complete.

  1. We need to create the new data files in the secondary filegroups and shrink existing files by using DBCC SHRINKFILE(EMPTYFILE) command. This will move data from old to the new data files.
  2. Next, we need to repeat the same process with the primary filegroup. You cannot remove primary MDF file from the database; although, you can make it very small and move all data from there.
  3. Next, we need to shrink the transaction log .
  4. Finally, we need to copy MDF and LDF files to the new location. This is offline operation; however, both, MDF and LDF data files are small at this point and downtime is minimal.

Let’s look at the process in details. As the first step, let’s create the test database with two filegroups and populate it with some data. For the demo purposes, I am assuming that C:\OldDrive folder represents old and C:\NewDrive – new disk arrays respectively.

create database DataMovementDemo
on primary
( name = N'DataMovementDemo', filename = N'C:\OldDrive\DataMovementDemo.mdf', size = 100MB, filegrowth = 50MB),
filegroup [Secondary]
( name = N'DataMovementDemo_Secondary1', filename = N'C:\OldDrive\DataMovementDemo_Secondary1.ndf', size = 100MB, filegrowth = 50MB),
( name = N'DataMovementDemo_Secondary2', filename = N'C:\OldDrive\DataMovementDemo_Secondary2.ndf', size = 100MB, filegrowth = 50MB)
log on
( name = N'DataMovementDemo_log', filename = N'C:\OldDrive\DataMovementDemo_log.ldf', size = 500MB, filegrowth = 500MB)
Go

alter database DataMovementDemo set recovery full
go

use DataMovementDemo
go

create table dbo.DataOnPrimary
(
    ID int not null,
    Placeholder char(8000),
    constraint PK_DataOnPrimary
    primary key clustered(ID)
    on [Primary]
);

create table dbo.DataOnSecondary
(
    ID int not null,
    Placeholder char(8000),
    constraint PK_DataOnSecondary
    primary key clustered(ID)
    on [Secondary]
);

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2 ) -- 65,536 rows
,Nums(Num) as (select row_number() over (order by (select null)) from N5)
    insert into dbo.DataOnPrimary(ID)
        select Num from Nums;

insert into dbo.DataOnSecondary(ID)
    select ID from dbo.DataOnPrimary;

We can check the size of the data and log files along with their free space with the code below.

select 
    f.name as [FileName]
    ,fg.name as [FileGroup]
    ,f.physical_name as [Path]
    ,f.size / 128.0 as [CurrentSizeMB]
    ,convert(int,fileproperty(f.name,'SpaceUsed')) / 
        128.0 as [UsedSpaceMB]
    ,f.size / 128.0 - convert(int,fileproperty(f.name,'SpaceUsed')) / 
        128.0 as [FreeSpaceMb]
from 
    sys.database_files f left join sys.filegroups fg on 
        f.data_space_id = fg.data_space_id;

Figure 1 shows the output of the statement.

01. Database file stats after database creation

Moving data files from secondary filegroups

As the first step, you need to create new data files on the target drive. You can keep the same number of files as before, or use this as the opportunity to change the filegroup layout. In general, the number of files in the fielgroup greatly depends on the volatility of the data. Every data file has its own set of allocation map pages, which reduces the contention during page and extent allocations. It is impossible to give the general advice on how many files to create – I usually start with four files per filegroup unless the data is extremely volatile and the filegroup handles hundreds or even thousands of inserts per second. You can monitor and analyze PAGELATCH waits to see if there is the contention and adjust the number of the files accordingly.

In our example, let’s create two data files on C:\NewDrive folder as shown below. Make sure that both files have exactly the same initial size and autogrowth parameters specified in MB. This will help SQL Server to evenly distribute data between them.

alter database DataMovementDemo add file 
( name = N'DataMovementDemo_NewSecondary1', filename = N'C:\NewDrive\DataMovementDemo_NewSecondary1.ndf', size = 250MB, filegrowth = 50MB )
to filegroup [Secondary];

alter database DataMovementDemo add file 
( name = N'DataMovementDemo_NewSecondary2', filename = N'C:\NewDrive\DataMovementDemo_NewSecondary2.ndf', size = 250MB, filegrowth = 50MB )
to filegroup [Secondary];

Figure 2 shows the output from the Listing 2 after new files were created.

02. File stats after new files has been created

The DBCC SHIRINKFILE command works in a very simple manner. It reads allocated extents from the end of the file and moves them to the other files in the filegroup. In case, if  filegroup has multiple files, SQL Server uses proportional fill algorithm choosing to which file those extents need to be moved. The choice depends on amount of free space in the file – more space file has, more data would be copied there.

In case, when the filegroup originally has more than one file, you would like to avoid an overhead of moving data to the file, which yet to be moved. In our case, you want the data from DataMovementDemo_Secondary1 file to be distributed only between DataMovementDemo_NewSecondary1 and DataMovementDemo_NewSecondary2 files avoiding any inserts into DataMovementDemo_Secondary2 file.

Usually, data files in production databases do not have excessive amount of free space. When this is the case, you can simply prevent unnecessary data movements by restricting auto-growth of the old files. However, if those files have large amount of free space, you can also consider to shrink them and release this space first. There is the catch though. If free space is located in the beginning of the data file, shrink operation would start data movement and introduce the overhead. You need to make decision how to proceed on case by case basis.

The next listing shows how you can restrict the auto-growth for the file.

declare
    @MaxFileSizeMB int
    ,@SQL nvarchar(max)
	
-- Obtaining current file size 	
select @MaxFileSizeMB = size / 128 + 1
from sys.database_files
where name = 'DataMovementDemo_Secondary2';

set @SQL = N'alter database DataMovementDemo 
modify file(name=N''DataMovementDemo_Secondary2'',maxsize=' + 
    convert(nvarchar(32),@MaxFileSizeMB) + N'MB);';

exec sp_executesql @SQL;

Now we are ready to process the first data file. Listing below shows the code that performs data movement and removes an empty file from the filegroup afterwards. Both operations are transparent to the users and client applications. It is worth mentioning that you can use the code from the second listing above to monitor the progress of the operation. You can also look at percent_complete column in sys.dm_exec_requests view.

dbcc shrinkfile(DataMovementDemo_Secondary1, emptyfile);
alter database DataMovementDemo remove file DataMovementDemo_Secondary1;

If you checked the status of the files after operation is completed, you would see the results as shown in Figure 3. As you see, the data from the data file has been distributed between other files in the filegroup.

03. File stats after the first file has been processed

You can use exactly the same approach to move data from the DataMovementDemo_Secondary2 file. Listing shows the code and Figure 4 shows the database files after the process is completed. As you see, the secondary filegroup now resides on the new drive.

dbcc shrinkfile(DataMovementDemo_Secondary2, emptyfile);
alter database DataMovementDemo remove file DataMovementDemo_Secondary2;

04. File stats after the second file movement

The word of caution. As I already mentioned, DBCC SHRINKFILE  generates enormous amount of transaction log records. Make sure that transaction log is truncating especially if the database uses FULL recovery model.

Moving primary data file

Even though many of us know about the best practice of keeping PRIMARY filegroup empty, it rarely followed. Majority of production databases keep the data in PRIMARY filegroup, which usually consist of the single MDF file.

Unfortunately, you cannot remove nor change the primary data file in the database. Moreover, you cannot shrink the file below the size of the data currently stored in the file, even if a filegroup has the other data files.

Fortunately, you can still use DBCC SHRINKFILE(EMPTYFILE) command on MDF data file. It would move data to the other files in the filegroup and failing on the final stage of the execution with the error message shown in Figure 5. Nevertheless, the majority of the data from the MDF data file would be moved to the other files.

05. DBCC SHRINKFILE(EMPTYFILE) error on the primary data file

Listing below shows the code that performs this action. It creates the second data file in PRIMARY filegroup and moves the data from MDF file there. Figure 6 shows the file stats after it is completed – after DBCC SHRINKFILE(EMPTYFILE) error.

alter database DataMovementDemo add file 
( name = N'DataMovementDemo_Primary2', filename = N'C:\NewDrive\DataMovementDemo_Primary2.ndf', size = 550MB, filegrowth = 50MB )
to filegroup [Primary];
go

-- It will error in the end
dbcc shrinkfile(DataMovementDemo, emptyfile);

06.File stats after DBCC SHRINKFILE(EMPTYFILE) error

As you see, MDF data file is pretty much empty. You can release the empty space from the file using DBCC SHRINKFILE(DataMovementDemo, 10) command. Figure 7 illustrates the situation after it is completed and MDF file became very small.

07. File stats after removal free space from MDF file

Unfortunately, this approach introduces two or more unevenly sized data files in the PRIMARY filegroup, which makes proportional fill algorithm less efficient. It may or may not be a problem in your system, depending on how volatile is the data. You can address it after you move MDF file to the new drive by creating other data files in PRIMARY filegroup and shrinking and emptying the file you just created. This will distribute the data in all files in the filegroup evenly.

Finally, it is worth mentioning that in some cases, especially when MDF file is very large, DBCC SHRINKFILE(EMPTYFILE) command can error in the middle of the execution stating that it cannot move some of the data pages that belong to the system objects. You can address it by re-running DBCC SHRINKFILE using the current data size as the target (e.g. releasing the empty space from the file). This will move those data pages within the file and you can re-run DBCC SHRINKFILE(EMPTYFILE) command afterwards.

Shrinking transaction log

The decision how to handle transaction log depends on its size, and backup and high availability strategies you have in place. Transaction log size affects time, which file copy operation will require and, therefore, the system downtime. Obviously, the simplest solution is avoid shrinking transaction log if the log file is not very large and downtime is acceptable.

In case, if you need to reduce the downtime, there are no options but shrinking the log file. It is usually not a problem in case if database uses the SIMPLE recovery model. However, with FULL recovery model situation is a bit more complicated.

As the first step in this process, you need to truncate the log by performing the log backup. This operation does not decrease the size of the log file but rather reduce the size of the active/used portion of the log. Keep in mind that open transactions, backlogs in high availability log record queues and a few others factors can prevent transaction log from being truncated.

Next, you can shrink the log file using DBCC SHRINKFILE command with the very small size- 50MB, for example- as the target. Your results may vary. Internally, SQL Server splits the log to the multiple blocks called Virtual Log Files and re-uses them in the circular matter. Shrink operation would release the empty space from the tail of the log; however, the resulting file size depends on the active VLF offsets in the file.  It is entirely possible that shrink command would not reduce the file size if active VLFs are close to the end of the file.

Let’s illustrate that with the example and run DBCC LOGINFO command, which shows the status of the VLFs in the log file. Figure 8 illustrates the partial output from our test database. Status value of 2 indicates that VLF is active and cannot be truncated. As you can see, it is in the middle of the file.

08. DBCC LOGINFO output

As you saw in Figure 7, the log file is using just 61MB out of 1.5GB in the file. However, if you run DBCC SHRINKFILE(DataMovementDemo_Log, 65) command, you’d see that it did not shrink beyond 1000MB as shown in Figure 9.

09. File stats after shrinking transaction log

If you run DBCC LOGINFO again, you would see that SQL Server removed the empty VLFs from the end of the file and stopped when it reached the active VLF there. Figure 10 illustrates that. It is also worth mentioning that the first VLF in the file also became active during the shrink.

10. DBCC LOGINFO output after shrinking transaction log

At this point you have the two options, assuming that size of the log file is still unacceptable. You can wait until SQL Server truncates the log making last VLF inactive and repeat the shrink operation afterwards. This will eventually happen with the regular workflow. You can even force this by generating transaction log records by creating the table with one CHAR(8000) column and inserting multiple rows there in the separate transactions and batches. Do not forget to force log truncation with BACKUP LOG operations and use DBCC LOGINFO to monitor the progress.

Alternative option is switching database to the SIMPLE recovery model using ALTER DATABASE SET RECOVERY SIMPLE command. This will perform log truncation and will allow you to shrink the log to the minimal size immediately. Obviously, this approach will require you to disable transaction log-based high availability technologies and recreate backup chain afterwards.

While, on the surface, switching database to SIMPLE recovery model introduces unnecessary complications, it could be the good opportunity to rebuild transaction log file. Large number of VLFs negatively affect system performance and can slow down database recovery time. Unfortunately, default settings in New Database dialog in Management Studio leads to that situation. At least in SQL Server prior 2016.

You can rebuild transaction log after you moved the file to the new drive by manually growing it in 4000MB chunks – do not use 4GB chunks due to the bug in some of SQL Server versions. Every chunk will generate 16 250MB VLF files, which works well for the most configurations.  After that, change log auto-growth to be in MB – I found that 1000MB chunks are good for majority of the cases.

Moving MDF and LOG files to the new drive

Finally, it is the time to move MDF and LDF files to the new drive. Unfortunately, it is offline operation. Fortunately, at this point, both files should be very small and downtime should be minimal.

As the first step, you need to change location of the files using ALTER DATABASE MODIFY FILE command. This will change location of the files in the system catalogs, and will take an effect after the database restart.

Next, you can take database offline using ALTER DATABASE .. SET OFFLINE WITH ROLLBACK IMMEDIATE command. This will disconnect all users from the database rolling back the active transactions. You can copy the files and take database back online using ALTER DATABASE .. SET ONLINE command as shown below.

use master
go

alter database DataMovementDemo modify file
( name = N'DataMovementDemo', filename = N'C:\NewDrive\DataMovementDemo.mdf');
alter database DataMovementDemo modify file
( name = N'DataMovementDemo_Log', filename = N'C:\NewDrive\DataMovementDemo_Log.ldf');
go

alter database DataMovementDemo set offline with rollback immediate
go

-- COPY FILES 

alter database DataMovementDemo set online;

As you see in Figure 11 – our database is now residing on the new drive.

11. Final file layout

Almost done! Again, do not forget to rebuild the log file and switch database to FULL recovery model if needed.

Dealing with index fragmentation

There is one final step though. As you already know, DBCC SHRINKFILE command works on the extent level. It moves allocated extents from the end of the file to the new place without any considerations to which objects those extents belong. As you can guess, this leads to the huge index fragmentation, which you need to address at the final stage of the process.

Obviously, you do not want to acquire Schema Modification (SCH-M) locks blocking access to the tables during index rebuild operations. It makes index reorg the better choice for this scenario – it is online in any edition of SQL Server.

Listing below shows the script you can use to generate ALTER INDEX REORG commands for every clustered and nonclustered indexes from specific filegroup. It returns result as XML string – you can simply cut and paste it and run as another batch.

;with FGObjects(SchemaName, TableName, IndexName, RowNum, Cnt)
as
(
    select 
        s.Name, t.Name, i.Name
        ,ROW_NUMBER() over(order by t.object_id, i.index_id) as RowNum
        ,COUNT(*) over() as Cnt
    from
        sys.indexes i join sys.filegroups f on
            i.data_space_id = f.data_space_id
        join sys.all_objects t on
            i.object_id = t.object_id
        join sys.schemas s on
            t.schema_id = s.schema_id
    where
        i.index_id >= 1 and
        t.type = 'U' and -- User Created Tables
        i.data_space_id = f.data_space_id and
        f.name = 'PRIMARY' -- Filegroup
)
select 
    'alter index ' as [text()]
    ,[IndexName] as [text()]
    ,' on ' + SchemaName + '.' as [text()]
    ,[TableName] as [text()]
    ,' reorganize;' + CHAR(13) + CHAR(10) as [text()]
    ,'raiserror(''' as [text()]
    ,RowNum as [text()]
    ,'/' as [text()]
    ,Cnt as [text()]
    ,' is done'',0,1) with nowait;' + CHAR(13) + CHAR(10) as [text()]
    ,'go' + CHAR(13) + CHAR(10) as [text()]
from FGObjects
for xml path('');

As you see, the process of moving database files between different drives could lead to significant amount of work if you want to minimize the downtime. However, it is often the only choice, especially in the Cloud environment where you can get significant performance benefits by utilizing new SSD-based drives. Go for it! 🙂

Source code is available for download.

Compressing LOB (XML) Data in the Database

Some time ago we have discussed the several techniques that can help reducing the database size. I have received quite a few responses on that post and several people asked if I could provide more details and examples. Today I will try to follow up on one of the methods, such as reducing the size of LOB data (in particular XML) in the database.

As you know, SQL Server stores the data in regular B-Tree indexes in three different sets of the data pages called allocation units. The main data row structure and fixed-length data are stored in IN-ROW data pages. Variable-length data greater than 8,000 bytes in size is stored in LOB (large object) pages. Such data includes (max) columns, XML, CLR UDT and a few other data types. Finally, variable-length data, which does not exceed 8,000 bytes, is stored either in IN-ROW data pages when it fits into the page, or in ROW-OVERFLOW data pages. You can read more about it in the previous post and here.

Enterprise Edition of SQL Server allows you to reduce the size of the data by implementing data compression. However, data compression is applied to IN-ROW data only and it does not compress ROW-OVERFLOW and LOB data. Any large objects that do not fit into IN-ROW data pages remain uncompressed.

Unfortunately, there is very little we can do to reduce the size of the LOB data. SQL Server does not provide any build-in functional to address it. The only remaining option is compressing it manually – either on the client or with CLR routines. Obviously, compression and decompression adds overhead and, from SQL Server load standpoint, it is preferable to do it on the client side. However, in the large number of cases, you will need to access compressed data from T-SQL, and CLR integration is the only choice. Ideal implementation in that case would combine compression and decompression code in both tiers and use CLR only when it is necessary.

Fortunately, .Net implementation of the compression code is very simple and can be done with DeflateStream or GZipStream classes. Below you can see the code of CLR functions that perform compression and decompression. You can also download entire project with the link at the end of the post.

/// <summary>
/// Compressing the data
/// </summary>
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true,
        DataAccess = DataAccessKind.None)]
public static SqlBytes BinaryCompress(SqlBytes input)
{
    if (input.IsNull)
        return SqlBytes.Null;

    using (MemoryStream result = new MemoryStream())
    {
        using (DeflateStream deflateStream = 
            new DeflateStream(result, CompressionMode.Compress, true))
        {
            deflateStream.Write(input.Buffer, 0, input.Buffer.Length);
            deflateStream.Flush();
            deflateStream.Close();
        }
        return new SqlBytes(result.ToArray());
    } 
}

/// <summary>
/// Decompressing the data
/// </summary>
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true,
        DataAccess = DataAccessKind.None)]
public static SqlBytes BinaryDecompress(SqlBytes input)
{
    if (input.IsNull)
        return SqlBytes.Null;
    
    int batchSize = 32768;
    byte[] buf = new byte[batchSize];

    using (MemoryStream result = new MemoryStream())
    {
        using (DeflateStream deflateStream = 
            new DeflateStream(input.Stream, CompressionMode.Decompress, true))
        {
            int bytesRead;
            while ((bytesRead = deflateStream.Read(buf, 0, batchSize)) > 0)
                result.Write(buf, 0, bytesRead);
        }
        return new SqlBytes(result.ToArray());
    } 
}

You can define the functions in the database with the following code (you either need to get byte sequence of the compiled assembly from the demo script or compile CLR project).

create assembly LOBCompress
authorization dbo
from /*..*/
go

create function dbo.BinaryCompress(@input varbinary (max))
returns varbinary (max)
as external name [LOBCompress].[Compress].[BinaryCompress];
go

create function dbo.BinaryDecompress(@input varbinary (max))
returns varbinary (max)
as external name [LOBCompress].[Compress].[BinaryDecompress];
go

Now let’s see the process in action, create the test table and populate it with some data.

create table dbo.DataWithXML
(
    ID int not null,
    Data xml not null,

    constraint PK_DataWithXML
    primary key clustered(ID)
)
go

declare
    	@X xml

select @X = 
    (
        select *
        from master.sys.objects
        for xml raw, root('Data')
    )

;with n1(c) as (select 0 union all select 0) -- 2 rows
,n2(c) as (select 0 from n1 as t1 cross join n1 as t2) -- 4 rows
,n3(c) as (select 0 from n2 as t1 cross join n2 as t2) -- 16 rows
,n4(c) as (select 0 from n3 as t1 cross join n3 as t2) -- 256 rows
,n5(c) as (select 0 from n4 as t1 cross join n3 as t2) -- 4,096 rows
,ids(id) as (select row_number() over (order by (select null)) from n5)
insert into dbo.DataWithXML(ID,Data)
    select id, @X
    from Ids;

update dbo.DataWithXML
set Data.modify('replace value of (/Data/row/@object_id)[1]
with sql:column("ID")');

Let’s examine the size of the table and average size of XML there using the following queries:

select 
    index_id, partition_number, alloc_unit_type_desc
    ,index_level
    ,page_count
    ,page_count * 8 / 1024 as [Size MB]
from 
    sys.dm_db_index_physical_stats
    (
        db_id() /*Database */
        ,object_id(N'dbo.DataWithXML') /* Table (Object_ID) */
        ,1 /* Index ID */
        ,null /* Partition ID – NULL – all partitions */
        ,'detailed' /* Mode */
    )
go

select avg(datalength(Data)) as [Avg XML Size]
from dbo.DataWithXML;

As you can see in Figure 1, LOB allocation units are using 96MB of space to store the data. The average size of the XML data is 20,897 bytes per row. It is also worth mentioning that actual storage size for XML is 3 data pages per row, which is 24,576 bytes.

1. Uncompressed data: Storage space and XML Size

Let’s see what we can achieve by using compression. The code below creates another table and copies the data there, compressing it on the fly. As you see, the data is stored in varbinary(max) column.

create table dbo.DataWithCompressedXML
(
    ID int not null,
    Data varbinary(max) not null,

    constraint PK_DataWithCompressedXML
    primary key clustered(ID)
)
go

insert into dbo.DataWithCompressedXML(ID,Data)
    select ID, dbo.BinaryCompress(convert(varbinary(max),Data))
    from dbo.DataWithXML

Let’s check the size of the data in compressed table:

select avg(datalength(Data)) as [Uncompressed] 
from dbo.DataWithXML;

select avg(datalength(Data)) as [Compressed] 
from dbo.DataWithCompressedXML
go

select 
    index_id, partition_number, alloc_unit_type_desc
    ,index_level
	,page_count
	,page_count * 8 / 1024 as [Size MB]
from 
    sys.dm_db_index_physical_stats
    (
        db_id() /*Database */
        ,object_id(N'dbo.DataWithCompressedXML') /* Table (Object_ID) */
        ,1 /* Index ID */
        ,null /* Partition ID – NULL – all partitions */
        ,'detailed' /* Mode */
    )

As you can see in Figure 2, we were able to reduce the size of the table from 96 to 10MB decreasing the size of the XML in every row from 20,897 to 2,674 bytes.

2. Compressed data: Storage size and compressed XML size

It is also worth noting that in our case, the size of compressed data is less than 8,000 bytes and SQL Server was able to accommodate all the data using IN-ROW allocation units. Even though the new table is almost 10 times smaller than the old one, it has significantly more IN-ROW data pages in the index. This could introduce some performance side effects in some cases. For example, when system has poorly optimized queries that perform clustered index scans. Again, it could become the issue only if compressed data is less than 8,000 bytes. Otherwise, SQL Server will still store it using the LOB data pages.

Obviously, this implementation requires schema and code changes. We can mitigate it a little bit by abstracting it with the views as it is shown below.

create view dbo.vDataWithXML(ID, Data)
as
    select ID, convert(xml,dbo.BinaryDecompress(Data))
    from dbo.DataWithCompressedXML

We can even create INSTEAD OF trigger (yuck!) on the view to minimize the changes if absolutely needed.

Speaking of the overhead, compressing and decompressing are CPU intensive and there is the performance penalty of calling CLR functions. It is not that noticeable when you need to decompress the single or very few rows; however, it could be very significant on the large data sets. For example, when you need to decompress and shred XML and use some of its elements in a where clause of the query. The code below shows such an example.

set statistics time on
select count(*) 
from dbo.DataWithXML 
where Data.value('(/Data/row/@object_id)[1]','int') = 3;

select count(*) 
from dbo.vDataWithXML 
where Data.value('(/Data/row/@object_id)[1]','int') = 3;
set statistics time off

The second SQL that accesses the compressed table has to decompress XML for every row in the table. The execution times of the statements on my laptop are 116ms and 6,899ms respectively. As you see, the compression definitely comes at cost.

One of the ways to reduce such an overhead, is storing XML elements that are used in the queries in the separate table columns. Unfortunately, by-the-book approach with persisted calculated columns would not always works. It is possible to create and persist such a column using user-defined functions; however, SQL Server would ignore it in some cases. The code below shows the example that creates calculated column that contains the attribute we are using in our queries.

create function dbo.fnGetCompressedObjectId(@Compressed varbinary(max))
returns int
with schemabinding
as
begin
    return (convert(xml,dbo.BinaryDecompress(@Compressed))
                .value('(/Data/row/@object_id)[1]','int'))
end
go	 

alter table dbo.DataWithCompressedXML
add
    ObjId as dbo.fnGetCompressedObjectId(Data)
        persisted
go

-- It is a good practice to rebuild index after alteration
alter index PK_DataWithCompressedXML 
on dbo.DataWithCompressedXML rebuild
go

alter view dbo.vDataWithXML(ID, Data, ObjId)
as
    select ID, convert(xml,dbo.BinaryDecompress(Data)), ObjId
    from dbo.DataWithCompressedXML
go

Unfortunately, if you ran the following query: select count(*) from dbo.vDataWithXML where ObjId = 3, you’d notice that SQL Server recalculates the value of the calculated column even though it is persisted and functions are defined as deterministic and precise. This is just the limitation of the Query Optimizer. Figure 3 shows that ObjId is recalculated.

3. Execution plan of the query

There is still the possibility of using persisted calculated columns. For example, you can define CLR function, which decompress and parse XML and return ObjID as the integer. Something like that:

.Net CLR:
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true,
            DataAccess = DataAccessKind.None)]
public static SqlInt32 GetObjId(SqlBytes input)
{
    if (input.IsNull)
        return SqlInt32.Null;
    /* Parsing XML with XmlReader and return ObjId attribute */
}

SQL: 
create function dbo.GetObjId(@input varbinary (max))
returns int
as external name [LOBCompress].[Compress].[GetObjId]
go

alter table dbo.DataWithCompressedXML drop column ObjId
go

alter table dbo.DataWithCompressedXML
add
    ObjId as dbo.GetObjId(Data)
        persisted
go

alter index PK_DataWithCompressedXML 
on dbo.DataWithCompressedXML rebuild
go

Now, if you run the previous query: select count(*) from dbo.vDataWithXML where ObjId = 3, it would work just fine and does not recalculate the column value. Figure 4 shows the execution plan in this case.

4. Execution plan that utilizes calculated column

It is worth mentioning that execution time of this query on my laptop is just 3 milliseconds comparing to 116 milliseconds of the query against dbo.DataWithXML table. The query is significantly faster because it does not need to shred XML to obtain ObjID value.

Obviously, creating separate CLR methods for each calculated column can lead to some coding overhead. You can consider using regular columns instead and populate them in the code or even in the triggers. Each approach has the own set of benefits and downsides based on the use-cases implemented in the system.

Compressing LOB data in the database could help you to significantly reduce the database size in the large number of cases. However, it adds an overhead of compressing and decompressing data. In some cases, such overhead would be easily offset by the smaller data size, less I/O and buffer pool usage but in any case, you should be careful and take all other factors into the consideration.

Source code is available for download.

Optimizing Substring Search Performance in SQL Server

The requirement of searching data by part of the value is very common in business applications. All of us are familiar with it – users want to be able to search by entering just a few letters from the client or article name; locate postal address by typing just a part of the street; or do something similar in dozens of the other cases.

Obviously, there are many ways to skin the cat and implement such a search. In some complex and performance-critical cases we can use external to SQL Server solutions, for example Apache Lucene. In others, we can use Full-Text Search or even do the brute force approach with LIKE operator. Today, I would like to talk about the latter one. After all, even though LIKE is not necessarily the fastest solution, its performance could often be acceptable especially with relatively small tables. Last but not least, it comes with very little implementation cost.

Unfortunately, LIKE operator cannot use Index Seek unless you are performing the prefix search. In that case, when you are searching by the beginning of the string – for example, LastName LIKE ‘Smit%’ condition – SQL Server is able to locate subset of the data where predicate needs to be evaluated. In our example, the predicate, in the nutshell, is the following condition: LastName >= ‘Smit’ and LastName < ‘Smiu’, which is perfectly SARGable and suitable for the fast Index Seek.

This is not the case, however, when LIKE expression requires SQL Server to find patterns in the middle of the string – for example, in LastName LIKE ‘%Smit%’ situation. The only option for SQL Server is evaluating expression against every row from the index, which leads to the Index Scan.

As strange as it sounds, you can often improve search performance by challenging business requirements. Even though customers want to be able to search by substring, in the very large number of cases prefix search would do. For example, when you are calling customer service somewhere and asking them to look up your account, you’d usually provide them the first few letters of your name rather than some letters from the middle of it.

Unfortunately, changing the business requirements is not always possible. In some cases, we do not have any choices but implementing substring search. In those occasions, there are two ways to improve the performance – reduce the number of rows where LIKE predicate must be evaluated and reduce predicate evaluation time.

Reducing the number of rows for predicate evaluation greatly depends on the indexes. While you cannot do much when LIKE is the only predicate in the query, such condition is usually an exception rather than the rule. In case when query has multiple predicates, the right composite indexes would help. The key here is adding evaluation column as the key or included column of the index and avoiding post-Key Lookup predicate evaluation.

For example, consider multi-tenant shopping cart system and the query that need to return the list of the articles that belong to particular tenant. The query could be implemented as follows:

select ..
from dbo.Articles
where TenantID = @TenantID and Name LIKE '%' + @paramName + '%'
order by Name

Such query would benefit from the following index, which will limit LIKE predicate evaluation to the single tenant scope. As the side note, adding Name as the key rather than included column would help to avoid SORT operator in the execution plan – data in the index would be sorted according to order by clause of the query.

create index IDX_Articles_TenantID_Name
on dbo.Articles(TenantID,Name)

Reducing the predicate evaluation time is the trickier subject. Fortunately, you can often achieve significant performance increase by utilizing binary collations during such an evaluation.

Let’s take a look at the example. As the first step, we will create a table and populate it with some random data. Col1, Col2 and Col3 columns are populated with randomly generated GUIDs and VarCol and NVarCol store concatenated values from them using SQL_Latin1_General_CP1_CI_AS collations. Finally, I am creating nonclustered indexes on VarCol and NVarCol columns to minimize amount of data pages SQL Server need to read during our tests and make those tests consistent.

create table dbo.Data
(
    ID int not null,
    Col1 uniqueidentifier not null
        default NEWID(),
    Col2 uniqueidentifier not null
        default NEWID(),
    Col3 uniqueidentifier not null
        default NEWID(),
    VarCol varchar(108) null,
    NVarCol nvarchar(108) null,
    
    constraint PK_Data
    primary key clustered(ID)
)
go

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select NULL)) from N5)
insert into dbo.Data(ID)
    select ID from IDs;

update dbo.Data
set
    VarCol =
        convert(varchar(36),Col1) +
        convert(varchar(36),Col2) +
        convert(varchar(36),Col3)
    ,NVarCol =
        convert(nvarchar(36),Col1) +
        convert(nvarchar(36),Col2) +
        convert(nvarchar(36),Col3)
go

create nonclustered index IDX_Data_VarCol
on dbo.Data(VarCol);

create nonclustered index IDX_Data_NVarCol
on dbo.Data(NVarCol);

Next, let’s randomly choose substring to search using one of the rows from the table. You would obviously have different data in your case/

select * from dbo.Data where ID = 10000

01. Choosing test substring for the search.

Now, let’s run SELECT statements that perform substring search against both columns and measure the execution time of the selects. I am disabling parallelism with MAXDOP 1 hint to avoid any parallelism overhead during queries execution.

declare
    @V varchar(32) = '9D81AB12'
    ,@NV nvarchar(32) = N'9D81AB12'

set statistics time on

select count(*)
from dbo.Data
where VarCol like '%' + @V + '%'
option (maxdop 1);

select count(*)
from dbo.Data
where NVarCol like '%' + @NV + '%'
option (maxdop 1);

set statistics time off

In my environment, CPU times of the first and second statements are 203 and 844 milliseconds respectively. Obviously, you would get the different results in your system and performance would greatly depend on the data.

It is also worth mentioning, that index on NVarCol is about two times larger than index on VarCol column due to the fact, that unicode data uses 2-bytes per character to store the data as the opposite to 1-byte per character with non-unicode varchars. However, the overhead of the extra logical reads is minimal.

Now let’s measure execution time using binary collations. First, we will alter the table adding two calculated columns that represent our strings in binary collation and creating nonclustered indexes afterwards.

alter table dbo.Data
add VarColBin as upper(VarCol) collate Latin1_General_100_Bin2
persisted;

alter table dbo.Data
add NVarColBin as upper(NVarCol) collate Latin1_General_100_Bin2
persisted;

create nonclustered index IDX_Data_VarColBin
on dbo.Data(VarColBin);

create nonclustered index IDX_Data_NVarColBin
on dbo.Data(NVarColBin);

PERSISTED keyword tells SQL Server to materialize calculated columns and store them in the data row. Technically speaking, you do not need to persist calculated columns in our case – you can index them even when they are not persisted, which helps to avoid clustered index row size increase. However, you need to be careful making sure that SQL Server always uses nonclustered index for the search. Otherwise, search performance could be even slower than with nonbinary collation – SQL Server will need to calculate column values on the fly before evaluating LIKE predicate.

Another very important factor to remember is case sensitivity of the binary collation. You need to convert your data to upper or lower case if your system uses case-insensitive collation. Obviously, you need to use the same conversion rule for the search predicate.

declare
    @V varchar(32) = '9D81AB12'
    ,@NV nvarchar(32) = N'9D81AB12'

set statistics time on

select count(*)
from dbo.Data
where VarColBin like '%' + Upper(@V) + '%' collate Latin1_General_100_Bin2
option (maxdop 1);

select count(*)
from dbo.Data
where NVarColBin like '%' + Upper(@NV) + '%' collate Latin1_General_100_Bin2
option (maxdop 1);

set statistics time off

The execution times in my environment are 125 and 62 milliseconds respectively. You can see all the results in Figure 2 below.

02. Test results

As you can see, we got more than 13 times performance improvement in case of the unicode data. Performance improvement with non-unicode strings are less drastic; however, it still ran about 40% faster than before. It is also worth mentioning that with the binary collation, predicate evaluation against unicode data is faster than with varchar data. At least with my test data.

Lastly, the word of caution. While that technique can be help to improve performance of substring search and reduce CPU load in the system, you should not treat it as the replacement of the proper indexing. After all, you can get much better ROI by investing your time into query optimization. However, it is the great technique to use in conjunction with query optimization and index tuning when you need to get the most from your queries.