Category Archives: SQL Server 2014

Plan Cache: Parameter Sniffing

There are several reasons why it is beneficial to use parameters in queries. Parameterization mitigates security risks by protecting against SQL Injection, and it also helps performance. Query optimization is an expensive, resource-intensive process, and SQL Server reduces that cost by keeping execution plans in a special part of memory called the plan cache and reusing them on subsequent calls.

Even though this approach benefits the system by reducing the number of query compilations, it can also introduce issues. Cached plans can become inefficient when the data distribution and/or typical parameter values change. SQL Server tracks data changes by counting modifications of the statistics (index) columns, invalidating statistics and recompiling queries when the number of changes reaches roughly 20% of the number of rows in the table (trace flag 2371 can reduce that threshold). Tracking changes in parameter values, however, is more complicated.
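
If you want to see how close a table is to that recompilation threshold, you can look at the modification counters exposed by the sys.dm_db_stats_properties function (available in recent service packs of SQL Server 2008 R2/2012 and later). This is just a quick sketch; it references the dbo.Employees demo table that we create below.

select s.name, sp.last_updated, sp.rows, sp.modification_counter
from sys.stats s
	cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
where s.object_id = object_id(N'dbo.Employees');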

When SQL Server compiles a parameterized query or stored procedure, it analyzes the parameter values and generates an execution plan that is optimal for those specific values. This process is called parameter sniffing. The generated plan is cached and reused regardless of the parameter values at the time of reuse. That behavior can lead to situations when a query is compiled and the plan is generated based on an atypical parameter set. The cached plan, in this case, would be suboptimal for the calls that reuse it.

All of us have experienced situations when a query or stored procedure suddenly becomes very slow even though it ran fine before. Parameter sniffing could be the reason: SQL Server decided to recompile the query due to a statistics update, and the call that triggered the recompilation had atypical parameter values. Even though the generated execution plan was efficient for the parameters at the time of compilation, the cached plan is not necessarily efficient for the other calls that reuse it.

Historically, parameter sniffing has had a negative connotation in the SQL Server community. It is used to explain why queries have suboptimal execution plans, as I just did in the previous paragraph. That reputation is undeserved, however. Parameter sniffing is simply the name for the SQL Server behavior of analyzing parameter values during query compilation.

Let’s look at a few examples of parameter sniffing and discuss how to address the issues it can introduce. In this post, I am using stored procedures; however, everything works the same way with parameterized queries.

As the first step, let’s create a table, dbo.Employees, and populate it so that most rows have a Country value of ‘USA’ and only a few have a Country value of ‘Canada’. The listing below shows the code.

create table dbo.Employees
(
	ID int not null,
	Number varchar(32) not null,
	Name varchar(100) not null,
	Salary money not null,
	Country varchar(64) not null,

	constraint PK_Employees
	primary key clustered(ID)
);

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2 ) -- 65,536 rows
,Nums(Num) as (select row_number() over (order by (select null)) from N5)
insert into dbo.Employees(ID, Number, Name, Salary, Country)
	select 
		Num, 
		convert(varchar(5),Num), 
		'USA Employee: ' + convert(varchar(5),Num), 
		40000,
		'USA'
	from Nums;

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,Nums(Num) as (select row_number() over (order by (select null)) from N3)
insert into dbo.Employees(ID, Number, Name, Salary, Country)
	select 
		65536 + Num, 
		convert(varchar(5),65536 + Num), 
		'Canada Employee: ' + convert(varchar(5),Num), 
		40000,
		'Canada'
	from Nums;

create nonclustered index IDX_Employees_Country
on dbo.Employees(Country);

As the next step, let’s create a stored procedure that calculates the average salary of employees in a specific country.

create proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @Country
end

As you can guess, when we call this stored procedure with @Country=’USA’, it needs to touch almost every row in the table, and the optimal execution plan is a Clustered Index Scan. However, when we call it with @Country=’Canada’, the most efficient execution plan uses Nonclustered Index Seek and Key Lookup operators.

Let’s call the stored procedure twice: the first time with @Country=’USA’ and the second time with @Country=’Canada’, as shown below.

exec dbo.GetAverageSalary @Country='USA';
exec dbo.GetAverageSalary @Country='Canada';

As you can see in Figure 1, SQL Server compiles the stored procedure and caches the plan with the first call and reuses it later. Even though such a plan is less efficient with the @Country=’Canada’ parameter value, it may be acceptable when those calls are rare, which is expected with such a data distribution.

01. Execution plans when the stored procedure is compiled with @Country=’USA’
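
If you want to confirm that the plan is, in fact, cached and reused, you can look into the plan cache. The query below is a simple sketch based on the sys.dm_exec_cached_plans DMV; the usecounts column shows how many times the cached plan has been reused.

select p.usecounts, p.cacheobjtype, p.objtype, t.text
from sys.dm_exec_cached_plans p
	cross apply sys.dm_exec_sql_text(p.plan_handle) t
where t.text like '%GetAverageSalary%';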

Now let’s take a look at what happens if we swap those calls when the plan is not cached. We will use the DBCC FREEPROCCACHE command, which clears the plan cache (DO NOT RUN IT ON A PRODUCTION SERVER). As I already mentioned, the same thing happens when a statistics update forces the query to recompile.

dbcc freeproccache
go

exec dbo.GetAverageSalary @Country='Canada';
exec dbo.GetAverageSalary @Country='USA';

As you can see in Figure 2, SQL Server now caches the plan based on the @Country=’Canada’ parameter value. Even though this plan is more efficient when the stored procedure is called with @Country=’Canada’, it is highly inefficient for @Country=’USA’ calls.

02. Execution plans when the stored procedure is compiled with @Country=’Canada’

There are a few ways to address the issue. You can force recompilation of the entire stored procedure with EXECUTE ... WITH RECOMPILE, or recompile an individual statement with the OPTION (RECOMPILE) hint. Obviously, a statement-level recompile is preferable because it performs the recompilation on a smaller scope. SQL Server sniffs the parameter values at the time of the recompilation and generates the optimal execution plan for each set of parameter values. It is also worth mentioning that a statement-level recompile allows SQL Server to obtain the number of rows stored in table variables, which can help generate more efficient execution plans in some cases.
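
For completeness, a procedure-level recompile does not require altering the procedure at all; you can request it at call time, as shown below. The plan generated this way is not cached.

exec dbo.GetAverageSalary @Country='Canada' with recompile;

The listing below switches the stored procedure to a statement-level recompile instead.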

alter proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @Country
	option (recompile)
end
go

exec dbo.GetAverageSalary @Country='Canada';
exec dbo.GetAverageSalary @Country='USA';

As you can see in Figure 3, SQL Server recompiles the statement on every call, and it generates the most efficient execution plan for every parameter value. It is also worth noting that SQL Server does not cache plans for the statements when statement-level recompile is used.

03. Execution plans with OPTION (RECOMPILE) hint

A statement-level recompile may be a good option when queries do not execute very often, or for complex queries where compilation time is just a fraction of the total execution time. However, because of the extra CPU load the recompilation introduces, it is hardly the best approach for OLTP queries that run constantly in the system.

Another option is the OPTIMIZE FOR hint, which forces SQL Server to optimize a query for the specific parameter values provided in the hint. The listing below illustrates this approach.

alter proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @Country
	option (optimize for(@Country='USA'))
end
go

exec dbo.GetAverageSalary @Country='Canada';
exec dbo.GetAverageSalary @Country='USA';

As you can see in Figure 4, SQL Server ignores the parameter value during the compilation and optimizes the query for the @Country=’USA’ value.

04. Execution plans with OPTIMIZE FOR hint

Unfortunately, the OPTIMIZE FOR hint introduces supportability issues, and it can lead to suboptimal execution plans in cases where the data distribution has changed. Let’s consider the situation, albeit unrealistic, when a company and all of its employees moved from the United States to Germany.

update dbo.Employees set Country='Germany' where Country='USA';

exec dbo.GetAverageSalary @Country='Germany';

The update outdates the statistics, which forces SQL Server to recompile the statement in the stored procedure. At this point, there are no rows in the table with Country=’USA’, and the recompilation produces a suboptimal execution plan, as shown in Figure 5. As a side note, the query uses more reads than before due to the index fragmentation introduced by the update.

05. Inefficient execution plan after data distribution change

SQL Server 2008 introduced another optimization hint, OPTIMIZE FOR UNKNOWN, which helps to address such situations. With this hint, SQL Server optimizes the query based on the average data distribution in the table (the statistics density information) rather than on a specific parameter value. The listing below shows the code.

alter proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @Country
	option (optimize for(@Country UNKNOWN))
end
go

exec dbo.GetAverageSalary @Country='Canada';

Figure 6 illustrates the execution plan. Country=’Germany’ now accounts for the vast majority of the rows in the table, and the average-distribution estimate leads SQL Server to generate an execution plan that is optimal for such a parameter value.

06. Execution plan with OPTIMIZE FOR UNKNOWN hint

You can achieve the same result as the OPTIMIZE FOR UNKNOWN hint by using local variables instead of parameters. This method also works in SQL Server 2005, where the OPTIMIZE FOR UNKNOWN hint is not supported. The listing below illustrates that approach; it produces the same execution plan with a Clustered Index Scan, as shown in Figure 6 above.

alter proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	declare
		@CountryTmp varchar(64)
	set @CountryTmp = @Country

	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @CountryTmp
end

Finally, when you are dealing with third-party systems and cannot add optimization hints to the code, you can use plan guides to achieve the same goal. We will discuss plan guides in one of the future blog posts.
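
As a teaser, a plan guide that attaches the OPTIMIZE FOR UNKNOWN hint to the statement in our stored procedure could look like the sketch below. Keep in mind that the @stmt text must match the statement in the module exactly, so treat this as an illustration rather than a copy-and-paste solution.

-- A sketch only: @stmt must match the statement text inside dbo.GetAverageSalary exactly
exec sp_create_plan_guide
	@name = N'PG_GetAverageSalary',
	@stmt = N'select Avg(Salary) as [Avg Salary] from dbo.Employees where Country = @Country',
	@type = N'OBJECT',
	@module_or_batch = N'dbo.GetAverageSalary',
	@params = NULL,
	@hints = N'OPTION (OPTIMIZE FOR UNKNOWN)';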

Source code is available for download.

Partial Database Backup and Piecemeal Restore in Microsoft SQL Server

Unfortunately, disasters happen. Regardless of how good your High Availability strategy is, there is always a chance that one day you will need to move the database to another server and bring the system online quickly. And chances are, you will have to do it under stress, with your phone ringing every few minutes and your manager asking about the ETA. It does not help that the process can be very time-consuming for large databases.

The Enterprise Edition of SQL Server supports piecemeal restore, which allows you to restore a database on a filegroup-by-filegroup basis while keeping the database online during the process. Queries that access data only from the online filegroups work just fine. This is a great technique that can dramatically reduce system downtime.

In a large number of cases, the operational data required to keep the system functional is relatively small. Historical data, on the other hand, is often kept for regulation/compliance reasons and is rarely accessed by the applications. This allows you to quickly bring the part of the database that supports operational activity online and restore the remaining historical data afterwards, while customers can already connect and use the system.

Obviously, you should design the data placement in a way that supports piecemeal restore. In a nutshell, it means separating operational and historical data across different filegroups. As an example, let’s consider an Order Entry/Shopping Cart system that stores data for several years. One possible data layout is the following:

  • Empty PRIMARY filegroup. The primary filegroup must be online in order for the database to be online. It is a good idea to keep it empty and not place any objects there.
  • Entities filegroup. This filegroup stores catalog tables, such as Customers, Articles, and others.
  • One or more filegroups for the operational data. For example, if the operational period is the current year, this filegroup can store current-year data for Orders, OrderLineItems, and related entities.
  • One or more filegroups for the historical data. Those filegroups store data that is not required to support operational activity in the system.

A piecemeal restore strategy requires you to bring the Primary, Entities, and operational data filegroups online first; the system becomes available to customers at that point. After that, you can work on restoring the historical data filegroups, which in most cases are significantly larger than the operational data and, therefore, take longer to restore.

It is also worth noting that, with SQL Server 2014 In-Memory OLTP, the memory-optimized data (Hekaton) filegroup must also be online before the database becomes available to users. Usually, in-memory tables keep operational data anyway, so in most cases this is not a problem.
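
For reference, a memory-optimized filegroup is added with the CONTAINS MEMORY_OPTIMIZED_DATA clause. The sketch below shows how it could look for the demo database we are about to create; the container name and path are made up.

alter database MyBigOrderDb
add filegroup HKData contains memory_optimized_data;

alter database MyBigOrderDb
add file (name = N'MyBigOrderDb_HKData', filename = N'c:\db\HKData')
to filegroup HKData;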

Let’s look at an example and create a database with the structure outlined above. For simplicity’s sake, every filegroup has only one data file. In real life, however, you should consider creating multiple files to reduce allocation contention in the filegroups with volatile data.

create database [MyBigOrderDb]
on primary
(name = N'MyBigOrderDb', filename = N'c:\db\MyBigOrderDb.mdf'), 
filegroup [Entities] 
(name = N'MyBigOrderDB_Entities', filename = N'c:\db\MyBigOrderDB_Entities.ndf'), 
filegroup [FG2013] 
(name = N'MyBigOrderDB_FG2013', filename = N'c:\db\MyBigOrderDB_FG2013.ndf'), 
filegroup [FG2014] 
(name = N'MyBigOrderDB_FG2014', filename = N'c:\db\MyBigOrderDB_FG2014.ndf')
log on
(name = N'MyBigOrderDb_log', filename = N'c:\db\MyBigOrderDb_log.ldf')

As the next step, let’s create a few tables, including the partitioned table Orders.

create table dbo.Customers
(
    CustomerId int not null,
    CustomerName nvarchar(64) not null
)
on [Entities];

create table dbo.Articles
(
    ArticleId int not null,
    ArticleName nvarchar(64) not null
)
on [Entities];

create partition function pfOrders(smalldatetime)
as range right
for values('2014-01-01');

create partition scheme psOrders
as partition pfOrders
to (FG2013,FG2014)
go

create table dbo.Orders
(
    OrderId int not null,
    OrderDate smalldatetime not null,
    OrderNum varchar(32) not null,
    constraint PK_Orders
    primary key clustered(OrderDate, OrderId)
    on psOrders(OrderDate)
)
go

insert into dbo.Customers(CustomerId, CustomerName) values(1,'Customer 1');
insert into dbo.Orders(OrderDate, OrderId, OrderNum)
values
    ('2013-01-01',1,'Order 1'),
    ('2013-02-02',2,'Order 2'),
    ('2014-01-01',3,'Order 3'),
    ('2014-02-02',4,'Order 4')

Next, let’s create the backup chain by performing FULL, DIFFERENTIAL, and LOG backups.

-- Full backup
backup database [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDb_Full.bak' 
with noformat, init, name = N'MyBigOrderDb-Full Database Backup', 
    compression, stats = 2
go

-- Differential backup
backup database [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDb_Diff.bak' 
with differential, noformat, init, 
    name = N'MyBigOrderDb-Differential Database Backup', 
    compression, stats = 2
go

-- Transaction log
backup log [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDb_Log.trn' 
with noformat, init, name = N'MyBigOrderDb-Tran Log', 
    compression, stats = 2
go

At this point, let’s assume that disaster strikes and we need to move the database to another server. In this example, I assume that we still have access to the transaction log of the original database, so we can take a tail-log backup to avoid any data loss.

backup log [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDb_TailLog.trn' 
with no_truncate, noformat, init, name = N'MyBigOrderDb-Tail Log', 
 compression, norecovery, stats = 2

Next, we need to copy all files from the backup chain to the new server and start the restore process. We will perform a piecemeal restore of the Primary, Entities, and FG2014 filegroups to support the operational activity of the system without bringing the historical FG2013 data online. The first operation is restoring the FULL database backup, specifying just the filegroups we need to restore.

-- Restoring on another server (Same folder structure for demo sake)
-- Full Backup
restore database [MyBigOrderDb] 
FILEGROUP = 'primary', FILEGROUP = 'Entities', FILEGROUP = 'FG2014'
from disk = N'C:\DB\MyBigOrderDb_Full.bak' with file = 1,
move N'MyBigOrderDB' to N'c:\db\MyBigOrderDb.mdf', 
move N'MyBigOrderDB_Entities' to N'c:\db\MyBigOrderDb_Entities.ndf', 
move N'MyBigOrderDB_FG2014' to N'c:\db\MyBigOrderDb_2014.ndf', 
move N'MyBigOrderDb_log' to N'c:\db\MyBigOrderDb.ldf', 
NORECOVERY, partial, stats = 2;

It is worth mentioning that I am using the WITH NORECOVERY clause in all RESTORE commands, including the tail-log backup restore, and recovering the database with a separate RESTORE statement at the end. This is simply a good practice and a safety measure. By default, Management Studio uses the WITH RECOVERY option, which can lead to a situation where you accidentally recover the database at an intermediate restore stage. You would not be able to restore further backups after that and would be forced to start from scratch.

As the next steps, we need to restore the DIFFERENTIAL, LOG, and tail-log backups, as shown below. You no longer need to specify filegroups as part of the restore statements – SQL Server restores only the filegroups we are working with, which are in the RESTORING state.

-- Diff Backup
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_Diff.bak' with file = 1,
NORECOVERY, stats = 2;

-- Tran Log
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_Log.trn' with file = 1,
NORECOVERY, stats = 2;

-- Tail-log
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_TailLog.trn' with file = 1,
NORECOVERY, stats = 2;

-- Recovery
restore database [MyBigOrderDb] with RECOVERY;

At this point, our database is partially online, and we can query the operational data, as shown in Figure 1 below.

select * from MyBigOrderDb.dbo.Customers
select * from MyBigOrderDb.dbo.Orders where OrderDate >= '2014-01-01'

01. Querying Data from Operational Filegroups

However, if we try to query the historical data, we will get an error, as shown in Figure 2.

select * from MyBigOrderDb.dbo.Orders where OrderDate < '2014-01-01'

02. Querying Data from Historical Filegroup

You can check the status of the database files, and therefore of the filegroups, with the following query. Figure 3 shows that three filegroups are online while the FG2013 filegroup is still in the RECOVERY_PENDING state.

select file_id, name, state_desc, physical_name
from MyBigOrderDb.sys.database_files

03. Filegroup Status Before Historical Filegroup Restore

As you can see, customers can use the system while we are restoring the FG2013 filegroup, which we can do with the following script:

-- Full Backup (restoring individual filegroup)
restore database [MyBigOrderDb] 
FILEGROUP = 'FG2013'
from disk = N'C:\DB\MyBigOrderDb_Full.bak' with file = 1,
move N'MyBigOrderDB_FG2013' to N'c:\db\MyBigOrderDb_2013.ndf',  
stats = 2;

-- Diff Backup
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_Diff.bak' with file = 1,
stats = 2;

-- Tran Log
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_Log.trn' with file = 1,
stats = 2;

-- Tail-log
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_TailLog.trn' with file = 1,
stats = 2;

Now the entire database is online, as shown in Figure 4.

04. Filegroups Status After Historical Filegroup Restore

Piecemeal restore can significantly decrease database restore time. However, there is still a time-consuming part in the process: regardless of how many filegroups you are restoring, you have to copy the backup file that contains the initial FULL database backup to the new server. The duration of that operation depends on network throughput, and a slow network and/or large backup files can lead to very long delays.

Obviously, you can take several approaches. You can always “be prepared” and copy the files to a standby server after each backup. With such a strategy, the backup files are already in place if/when disaster occurs and they need to be restored.

However, if the historical data is read-only, you can use another approach called “partial database backup”. Even though Microsoft states that it is designed for the SIMPLE recovery model, it works just fine with the FULL recovery model, as long as you implement it correctly.

Let’s look at how it works. As the first step, let’s mark our FG2013 filegroup as read-only:

alter database MyBigOrderDb modify filegroup FG2013 readonly

At this point, you should start a new backup chain, which, in a nutshell, consists of two different sets of backup files. One set includes backups of the read-write filegroups (FULL, DIFFERENTIAL, and LOG backups using the READ_WRITE_FILEGROUPS option). In addition, you need to take a backup of the read-only filegroup. You can do it with the script shown below:

-- Backing Up Read-Only Filegroup
backup database [MyBigOrderDb] 
FILEGROUP = N'FG2013'
to disk = N'c:\db\MyBigOrderFG2013.bak' 
with noformat, init, 
	name = N'MyBigOrderDb-FG2013 FG backup', 
	compression, stats = 2
go

-- Full backup of read_write filegroups
backup database [MyBigOrderDb] READ_WRITE_FILEGROUPS
to disk = N'c:\db\MyBigOrderDbRW_Full.bak' 
with noformat, init, name = N'MyBigOrderDb-Full Database Backup (R/W FG)', 
	compression, stats = 2
go

-- Differential backup
backup database [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDbRW_Diff.bak' 
with differential, noformat, init, 
	name = N'MyBigOrderDb-Differential Database Backup (R/W FG)', 
	compression, stats = 2
go

-- Transaction log
backup log [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDbRW_Log.trn' 
with noformat, init, name = N'MyBigOrderDb-Tran Log', 
	compression, stats = 2

The beauty of this approach is that the data backups are now separated: relatively small backup chain files for the operational data and a large, static backup file for the read-only historical data. This dramatically reduces the time required to move the operational backup files over the network. It also helps with the day-to-day backup strategy by reducing backup time, server and network load, and the storage space required for the files. You can create new backup chains for the operational data without taking a new backup of the historical data, as long as those filegroups stay read-only.

If disaster occurs and you need to move the database to another server, you can start with the operational data backup chain, as shown below. For simplicity’s sake, I am omitting the tail-log backup; in real life, however, you should always take it to avoid data loss.

-- Restoring on another server (Same folder structure for demo sake)
-- Full Backup (R/W filegroups only)
restore database [MyBigOrderDb] 
filegroup = 'primary', filegroup = 'Entities', filegroup = 'FG2014'
from disk = N'C:\DB\MyBigOrderDbRW_Full.bak' with file = 1,
move N'MyBigOrderDB' to N'c:\db\MyBigOrderDb.mdf',  
move N'MyBigOrderDB_Entities' to N'c:\db\MyBigOrderDb_Entities.ndf',  
move N'MyBigOrderDB_FG2014' to N'c:\db\MyBigOrderDb_2014.ndf',  
move N'MyBigOrderDb_log' to N'c:\db\MyBigOrderDb.ldf',  
norecovery, partial, stats = 2;

-- Diff Backup
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDbRW_Diff.bak' with file = 1,
norecovery, stats = 2;

-- Tran Log
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDbRW_Log.trn' with file = 1,
norecovery, stats = 2;

-- Recovery
restore database [MyBigOrderDb] with recovery;

If you query the status of the database files after the restore, you will see a very similar picture – all operational filegroups are online while the historical filegroup is in the RECOVERY_PENDING state. Figure 5 illustrates that.

05. Filegroup Status During Partial Restore

Again, at this point the system is available to users, and you can continue restoring the historical data with the following script while the system is operational.

-- Restoring Read/Only Filegroup
restore database [MyBigOrderDb] 
filegroup = 'FG2013'
from disk = N'C:\DB\MyBigOrderFG2013.bak' with file = 1,
move N'MyBigOrderDB_FG2013' to N'c:\db\MyBigOrderDb_2013.ndf',  
recovery, stats = 2;

Be careful when you change the read-only status of filegroups while partial backups are in use. One example of such a scenario is when the operational period changes and you need to move former-operational data that has become historical to another filegroup. One approach is to make the historical filegroup read-write, copy the data there, and make it read-only again, as sketched below. You can still restore the database using the old read-only filegroup backup as long as you have a backup chain with LOG backups that covers the data movement operations. However, you would not be able to recover the historical data if you started a new backup chain for the updateable filegroups after the fact and did not take a new backup of the read-only filegroup.
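
A rough sketch of that workflow could look like the code below; the data movement step and the backup file names are placeholders.

-- Temporarily make the historical filegroup updateable and move the data
alter database MyBigOrderDb modify filegroup FG2013 read_write;
/* ... move the former-operational data to tables/partitions stored on FG2013 ... */
alter database MyBigOrderDb modify filegroup FG2013 readonly;

-- Take a new backup of the read-only filegroup and start a new backup chain
-- for the read-write filegroups
backup database [MyBigOrderDb]
FILEGROUP = N'FG2013'
to disk = N'c:\db\MyBigOrderFG2013_v2.bak'
with init, name = N'MyBigOrderDb-FG2013 FG backup (v2)', compression, stats = 2;

backup database [MyBigOrderDb] READ_WRITE_FILEGROUPS
to disk = N'c:\db\MyBigOrderDbRW_Full_v2.bak'
with init, name = N'MyBigOrderDb-Full Database Backup (R/W FG, v2)', compression, stats = 2;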

As a general recommendation, it is safer to start a new backup chain, together with a new backup of the historical filegroup, after you make the historical filegroup read-only again. And, most importantly, regardless of the solution you are using, test your backup and restore strategies. That will help you avoid unpleasant surprises when things go south.

Clustered Columnstore Indexes: Performance Considerations

Last time, we looked at the internal structure of the delta store and delete bitmap in clustered columnstore indexes. Today, I would like to discuss a few practical aspects that affect the performance of ETL processes and of queries against tables with clustered columnstore indexes.

There are two different ways to import data into a table with a clustered columnstore index. The first is bulk insert, which can be done with the bcp utility, the BULK INSERT command, and other applications that utilize the bulk insert API. The second, called trickle inserts, consists of regular INSERT operations that do not use the bulk insert API.

Bulk insert operations provide the number of rows in the batch as part of the API call. If that size exceeds a threshold of a little over 100,000 rows (102,400 rows), SQL Server inserts the data directly into newly created row groups. Depending on the batch size, one or more row groups can be created, and some rows may end up in the delta store.

Figure 1 below illustrates how data from different batches is distributed between row groups and delta stores based on the batch size.

01. Batch size and data distribution during bulk insert

Let’s run some tests now and see how performance is affected by the batch size and, therefore, by the number of row groups in the table. In those tests, I created a set of tables with a structure similar to what is shown below.

create table dbo.FactSalesBig 
( 
        ProductKey int not null, 
        OrderDateKey int not null, 
        DueDateKey int not null, 
        ShipDateKey int not null, 
        CustomerKey int not null, 
        PromotionKey int not null, 
        CurrencyKey int not null, 
        SalesTerritoryKey int not null, 
        SalesOrderNumber nvarchar(20) not null, 
        SalesOrderLineNumber tinyint not null, 
        RevisionNumber tinyint not null, 
        OrderQuantity smallint not null, 
        UnitPrice money not null, 
        ExtendedAmount money not null, 
        UnitPriceDiscountPct float not null, 
        DiscountAmount float not null, 
        ProductStandardCost money not null, 
        TotalProductCost money not null, 
        SalesAmount money not null, 
        TaxAmt money not null, 
        Freight money not null, 
        CarrierTrackingNumber nvarchar(25) null, 
        CustomerPONumber nvarchar(25) null, 
        OrderDate datetime null, 
        DueDate datetime null, 
        ShipDate datetime null 
)

As the first step, I created a CSV file with about 62 million rows generated from the dbo.FactResellerSales table in the AdventureWorksDW2012 database and measured the performance of a bulk import with the bcp utility, using 1,000,000-row and 102,500-row batches, in a 4-CPU virtual machine with 8GB of RAM allocated.
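
For reference, the bcp command line for such an import could look like the following; the server, database, file name, and delimiter are assumptions, and the -b switch controls the batch size.

bcp MyDataWarehouse.dbo.FactSalesBig in FactSalesBig.csv -S MyServer -T -c -t"," -b 1000000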

You can see the row group statistics after the imports in Figure 2 below. The first import generated 62 row groups of 1,000,000 rows each, while the second ended up with 604 row groups of 102,500 rows each.

02. Row groups after insert
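
The row group statistics shown in the figure can be obtained with a query against the sys.column_store_row_groups view, along the lines of this sketch:

select partition_number, row_group_id, state_description, total_rows, deleted_rows
from sys.column_store_row_groups
where object_id = object_id(N'dbo.FactSalesBig')
order by row_group_id;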

The performance of the import operation was affected by the batch size. The bcp utility was able to process about 103,500 rows per second with 1,000,000-row batches. With 102,500-row batches, the throughput was about 94,300 rows per second, which is about 9% slower.

It is also worth noting that with smaller batches, SQL Server imports data into delta stores and converts them to fully populated row groups later. While this eventually produces efficient row groups, it significantly degrades the performance of the insert process. For example, with 99,999-row batches, the throughput in my environment was only 37,500 rows per second.

As the next test, I checked how partially populated row groups affect query performance, using the query shown below. That query performs a MAX() aggregation on 20 columns of the table. The result of the query is meaningless; however, it forces SQL Server to read data from 20 different column segments in each row group in the table.

select  
        max(ProductKey),max(OrderDateKey),max(DueDateKey) 
        ,max(ShipDateKey),max(CustomerKey),max(PromotionKey) 
        ,max(CurrencyKey),max(SalesTerritoryKey),max(SalesOrderLineNumber) 
        ,max(RevisionNumber),max(OrderQuantity),max(UnitPrice) 
        ,max(ExtendedAmount),max(UnitPriceDiscountPct),max(DiscountAmount) 
        ,max(ProductStandardCost),max(TotalProductCost),max(SalesAmount) 
        ,max(TaxAmt),max(Freight)  
from dbo.FactSalesBig

Figure 3 illustrates execution statistics of the query against tables with fully and partially populated row groups (shown in Figure 2). As you can see, the query against a table with partially populated row groups took a considerably longer time to execute.

03. Execution Statistics in case of fully and partially populated row groups

As the next step, let’s check how a large delta store affects query performance. For that test, I inserted one million rows into the table using small batches and ran the test query. After that, I rebuilt the columnstore index and compared the execution times of the test query before and after the rebuild.

The index rebuild process moved all data from the delta store to row groups. You can see the status of row groups and the delta store before (on the left side) and after (on the right side) the index rebuild in Figure 4.

04. Row groups and delta store after insertion of 1,000,000 rows

Figure 5 illustrates the execution times of the test query in both scenarios, and it shows the overhead introduced by the large delta store scan during query execution.

05. Execution time and delta store size

Finally, let’s see how delete bitmaps affect query performance. For that test, I deleted almost 30,000,000 rows from a table (the one where I just rebuilt the index). You can see row groups’ information in Figure 6.

06.Row groups after deletion of 30,000,000 rows

The test query needs to validate that rows have not been deleted during query execution. Similar to the previous test, this adds considerable overhead. Figure 7 shows the execution time of the test query, comparing it to the execution time of the query before the data deletion.

07. Execution time and delete bitmap

The bottom line: partially populated row groups, large delta stores, and large delete bitmaps all negatively affect the performance of systems that use clustered columnstore indexes. You can address these issues by rebuilding the columnstore index, which you can trigger with the ALTER INDEX REBUILD command. The rebuild forces SQL Server to physically remove deleted rows from the index and to merge the delta store and row group data. All column segments are recreated with fully populated row groups.

Similar to index creation, the index rebuild process is very resource-intensive. Moreover, it prevents any data modifications in the table by holding a shared (S) table lock. Other sessions can still read data from the table while the rebuild is running, however.

One of the methods you can use to mitigate the overhead of an index rebuild is table/index partitioning. You can rebuild indexes on a partition basis and only do it for partitions that have volatile data. Old fact table data in most Data Warehouse solutions is relatively static, and ETL processes usually load new data only. Partitioning by date in this scenario localizes modifications within the scope of one or very few partitions, which can dramatically reduce the overhead of an index rebuild.
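
For illustration, the commands could look like the sketch below; the index name is an assumption, and the partition number is just an example.

-- Rebuild the entire clustered columnstore index
alter index CCI_FactSalesBig on dbo.FactSalesBig rebuild;

-- Rebuild only the partition that holds volatile data
alter index CCI_FactSalesBig on dbo.FactSalesBig rebuild partition = 12;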

A columnstore index maintenance strategy should depend on the volatility of the data and the ETL processes implemented in the system. You should rebuild indexes when a table has a considerable amount of deleted rows and/or a large number of partially populated row groups.

To summarize:

  1. You should design ETL processes so that data is bulk imported in batches as close to 1,048,576 rows as possible. This guarantees that every batch becomes a separate, fully populated row group. Do not exceed this size; otherwise, the batch spills across multiple row groups.
  2. Even though clustered columnstore indexes are updateable, you should minimize such updates. Large delta stores and/or delete bitmaps negatively affect query performance. You should monitor their sizes and design your index maintenance strategy so that it keeps them as small as possible.
  3. A columnstore index rebuild is very resource-intensive. Table partitioning helps mitigate the performance impact by allowing index rebuilds in the scope of one or very few partitions. Design your partitioning strategy so that data modifications and/or imports are limited to a small subset of partitions, which you can rebuild afterwards.

Clustered Columnstore Indexes: Exploring Delta Store and Delete Bitmap

I am an OLTP guy. I could not grasp the concept of columnstore indexes – indexes that do not care about the order of columns in their definition. It was the reason why Data Warehouses and I lived happily ever after. Just apart from each other.

It was not hard, after all. Even though you can find some use cases for nonclustered columnstore indexes in an OLTP environment, the inability to modify data after the index is created makes those use cases quite rare. Technically, you can use data partitioning and partitioned views to work around that limitation in some scenarios; however, such an implementation requires a large amount of work.

The situation changed with the SQL Server 2014 release. Now you can define clustered columnstore indexes, which are updateable. That does not make them suited for an OLTP environment – remember that they are optimized for large SCAN workloads. Moreover, in contrast to nonclustered columnstore indexes, they are the only index you can define on a table. It is impossible to define B-Tree indexes on the same table and support both types of workload.

Anyway, I decided to spend some time exploring how clustered columnstore indexes handle data modifications. It was very easy to find references online; however, none of them discusses the internal implementation of the indexes. Today, we will try to close that gap.

As the first step, let’s talk about the high-level structure of clustered columnstore indexes. They use the same storage format as nonclustered columnstore indexes, storing columnstore data in row groups. Each row group stores data for up to 1,048,576 rows in column-based format, with the data from each column stored separately in a highly compressed fashion.

Clustered columnstore indexes have two additional elements to support data modifications. The first is the delete bitmap, which indicates what rows were deleted from a table. The second is the delta store, which holds newly inserted rows. Both the delta store and the delete bitmap use the B-Tree format to store data.

SQL Server works with the delete bitmap and delta stores transparently to users, which makes the terminology confusing. You can often see delta stores referenced as just another row group in the documentation and technical articles. Moreover, the delete bitmap is often considered part of the delta store and/or row groups. To avoid confusion, I will use the following terminology today: the term row group references data stored in column-based storage format, and I will explicitly reference delta stores and the delete bitmap as two separate sets of internal objects whenever needed.

Figure 1 below shows an example of the structure of a clustered columnstore index in a table that has two partitions. Each partition can have a single delete bitmap and multiple delta stores. It is worth mentioning that the delete bitmap and delta stores are created on demand; for example, the delete bitmap is not created unless some of the rows in the row groups have been deleted.

01. Clustered Columnstore Index Structure

Every time you delete a row that is stored in a row group (not in a delta store), SQL Server adds information about the deleted row to the delete bitmap. Nothing happens to the original row; it is still stored in the row group, but SQL Server checks the delete bitmap during query execution and excludes deleted rows from processing. When you insert data into a columnstore index, it goes into a delta store. Updating a row that is stored in a row group does not change the row data either: such an update triggers a deletion of the row (which is, in fact, an insertion into the delete bitmap) and an insertion of the new version of the row into a delta store. However, data modifications of rows in a delta store are done the same way as in regular B-Tree indexes, by updating and deleting the actual rows there. You will see one such example later.

Each delta store can be in either an open or a closed state. Open delta stores accept new rows and allow modifications and deletions of the data. SQL Server closes a delta store when it reaches 1,048,576 rows, which is the maximum number of rows that can be stored in a row group. Another SQL Server process, called the tuple mover, runs every five minutes and converts closed delta stores into row groups that store data in column-based storage format.
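
If you do not want to wait for the tuple mover, my understanding is that in SQL Server 2014 an ALTER INDEX REORGANIZE on the clustered columnstore index processes the closed delta stores immediately. A sketch with placeholder names:

-- Move CLOSED delta stores into compressed row groups without waiting for the tuple mover
-- (index and table names are placeholders)
alter index CCI_MyTable on dbo.MyTable reorganize;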

You can examine the state of row groups and delta stores with the sys.column_store_row_groups view. Figure 2 illustrates the output of this view, which returns combined information about all columnstore index objects. Rows in the OPEN or CLOSED state correspond to delta stores, while rows in the COMPRESSED state correspond to row groups with data in column-based storage format. Finally, the deleted_rows column provides statistics about the deleted rows stored in the delete bitmap.

02. Row Groups and Delta Stores

As you can see, the second row in the view output shows a closed delta store that has yet to be picked up by the tuple mover. The situation changes after the tuple mover converts the closed delta store into a row group. Figure 3 illustrates the output of the view after that has happened. It is worth mentioning that the row_group_id of the converted row group changed: the tuple mover created a new row group and dropped the closed delta store afterwards.

03. Row Groups and Delta Store After Tuple Mover Process

Let’s look at the structure of delta store and delete bitmap rows. The listing below creates a table, populates it with data, and creates a clustered columnstore index afterwards. I am using the MAXDOP=1 option to reduce the number of partially populated row groups.

create table dbo.CCI
(
    Col1 int  not null,
    Col2 varchar(4000) not null
);

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,N6(C) as -- 1,048,592 rows
(
    select 0 from N5 as T1 cross join N3 as T2
    union all
    select 0 from N3
)
,IDs(ID) as (select ROW_NUMBER() over (order by (select NULL)) from N6)
insert into dbo.CCI(Col1,Col2)
    select ID, 'aaa'
    from IDS
go

create clustered columnstore index IDX_CS_CLUST on dbo.CCI
with (maxdop=1)
go

select g.state_description, g.row_group_id, s.column_id
    ,s.row_count, s.min_data_id, s.max_data_id, g.deleted_rows
from
    sys.column_store_segments s join sys.partitions p on
        s.partition_id = p.partition_id
    join sys.column_store_row_groups g on
        p.object_id = g.object_id and
        s.segment_id = g.row_group_id
where
    p.object_id = object_id(N'dbo.CCI')
order by
    g.row_group_id, s.column_id;

As you can see in Figure 4, the columnstore index has two row groups and has neither a delta store nor a delete bitmap. You can see the range of Col1 values stored in each row group in the min_data_id and max_data_id columns of the rows that have column_id=1.

04. Row Groups after Clustered Columnstore Index Creation

As the next step, let’s perform some data modifications in the table. The first statement inserts two new rows. The second statement deletes three rows, including one of the rows we just inserted. Finally, we update the other newly inserted row.

insert into dbo.CCI(Col1,Col2) 
values
    (2000000,REPLICATE('c',4000)),
    (2000001,REPLICATE('d',4000));

delete from dbo.CCI 
where Col1 in 
    (
    100  		-- Row group 0
    ,16150  		-- Row group 1
    ,2000000	  -- Newly inserted row (Delta Store)
    );

update dbo.CCI 
set Col2 = REPLICATE('z',4000) 
where Col1 = 2000001; -- Newly inserted row (Delta Store)

Now it is time to find the data pages used by the delta store and delete bitmap. We will use the undocumented sys.dm_db_database_page_allocations system function, as shown below.

select object_id, index_id, partition_id
    ,allocation_unit_type_desc as [Type]
    ,is_allocated,is_iam_page,page_type,page_type_desc
    ,allocated_page_file_id as [FileId]
    ,allocated_page_page_id as [PageId]
from sys.dm_db_database_page_allocations
    (db_id(), object_id('dbo.CCI'),NULL, NULL, 'DETAILED')

You can see the output of the query in Figure 5. SQL Server stores column segments in LOB_DATA allocation units, while the delta store and delete bitmap use IN_ROW_DATA allocation units.

05. Table Allocation Units

Let’s look at the data pages using another undocumented command, DBCC PAGE, with the code shown below. Obviously, in your environment, the database, file, and page IDs will be different.

-- Redirecting output to console
dbcc traceon(3604)

-- Analyzing content of a page
dbcc page
(
	9	-- Database Id
	,1	-- FileId
	,306	-- PageId
	,3	-- Output style
)

Figure 6 shows the partial content of a delta store data page. As you can see, SQL Server stores the data in regular row-based storage. There is one internal column, CSILOCATOR, in addition to the two table columns; CSILOCATOR serves as the internal unique identifier of the row in the delta store. Finally, it is worth mentioning that the row with Col1=2000000, which we inserted and then deleted after the clustered columnstore index was created, is not present in the delta store. SQL Server deletes (and updates) rows in the B-Tree delta store the same way it does in regular B-Tree tables.

06. Delta Store Data Page

You can use the same approach to examine the content of a delete bitmap data page. In my case, the page id is 308. Figure 7 shows the partial output of the DBCC PAGE command. As you can see, the delete bitmap includes two columns that uniquely identify a row: the first is the row group id, and the second is the offset of the row in the segment. Do not be confused by the fact that the column names match the table columns – DBCC PAGE uses table metadata to prepare the output.

07. Delete Bitmap Data Page

As you can see, both the delta store and delete bitmap pages were using row compression in our example, which means the delta store and delete bitmap are either row- or page-compressed. As you know, with page compression SQL Server compresses a page only when it is full and keeps it compressed only if that provides significant space savings; otherwise, the data is kept in row-compressed format even when the index is defined with page compression.

Let’s run a test that inserts a large batch of rows that can benefit from page compression, using the code shown in the listing below.

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select ROW_NUMBER() over (order by (select NULL)) from N5)
insert into dbo.CCI(Col1,Col2) 
	select ID, REPLICATE('a',255)
	from IDS

Figure 8 illustrates the content of a delta store data page after the insert. The presence of the compression info record indicates that the delta store is using page compression.

08. Delta Store Data Page (with Page Compression)

Now let’s examine what happens with the delete bitmap by deleting all rows from the compressed row groups with the code shown in the listing below.

delete from dbo.CCI

As you can see in Figure 9 below, the page still uses row compression even though it is now fully populated. Obviously, we cannot guarantee that the delete bitmap is not defined with page compression – after all, it is not documented – however, it would be logical to use row compression here, where each row consists of two small integer values. Row compression performs perfectly well in that case.

09. Delete Bitmap Data Page (Full with Row Compression)

I hope this information sheds some light on the internal structure of clustered columnstore indexes.

Next: Clustered Columnstore Indexes: Performance Considerations

Locking in Microsoft SQL Server (Part 19) – Concurrency model in in-memory OLTP (Hekaton)

It is impossible to resist the urge to explore the In-Memory OLTP engine (code name Hekaton) released as part of SQL Server 2014. This technology can provide a huge performance boost, assuming, of course, that you can live within its surface-area limitations. The internal implementation of In-Memory OLTP is fascinating: almost everything is done differently from what you are used to with the SQL Server Storage Engine. To put things into perspective, I seriously considered naming this post “Concurrency – upside down”. 🙂

Today, I want to focus on one particular aspect of In-Memory OLTP: its concurrency model. While the implementation of SNAPSHOT isolation is more or less obvious, I was intrigued by how the higher isolation levels, REPEATABLE READ and SERIALIZABLE, would work in a latch- and lock-free environment.

I assume that you have a basic understanding of the key principles used in In-Memory OLTP. Otherwise, consider reading the MSDN documentation and Kalen Delaney’s whitepaper first.

Even though I am not going to focus much on In-Memory OLTP indexes and access methods, I would like to reiterate how Hekaton works with the data. It uses a completely different mechanism compared to regular on-disk tables. The data rows live in memory and are linked to each other in singly-linked lists of pointers – one pointer chain per index.

The concurrency model in In-Memory OLTP is version-based, supporting multiple versions of rows with different lifetimes. SQL Server maintains two different unique values:

  • The Global Transaction Timestamp is an auto-incremented value that uniquely identifies every transaction in the system. SQL Server increments this value at the transaction pre-commit stage.
  • TransactionId is another identifier (timestamp) that also uniquely identifies a transaction. SQL Server obtains and increments its value at the moment the transaction starts.

Every row has BeginTs and EndTs timestamps, which correspond to the Global Transaction Timestamps of the transactions that created and deleted this version of the row. A special timestamp value, called Infinity, is used to indicate rows that have not been deleted (EndTs=Infinity). SQL Server never updates rows in place. When a row needs to be modified, it deletes the original row version (by updating its EndTs) and creates a new row version with a new BeginTs and an EndTs of Infinity.

A transaction can only see rows that existed at the time the transaction started, which is similar to the SNAPSHOT isolation level for on-disk tables. For in-memory data, however, that behavior does not change with the isolation level: REPEATABLE READ and SERIALIZABLE transactions follow exactly the same visibility rules.

Figure 1 illustrates an example of data access and visibility. It shows a hash index on Name (on the left side) and multiple data rows linked into that index’s pointer chains. Again, if you do not know what a hash index is, consider reading about it in the documentation. For simplicity’s sake, let’s assume that the hash function is based on the first letter of the Name.

01. Hash index and data rows

Let’s assume that we need to run a query that selects all rows with Name=’Ann’ in a transaction that started when the Global Transaction Timestamp was 65. SQL Server calculates the hash value for ‘Ann’, which is ‘A’, and finds the corresponding bucket in the hash index. It follows the pointer from that bucket, which references a row with Name=’Adam’. This row has a BeginTs of 10 and an EndTs of Infinity; therefore, it is visible to the transaction. However, the Name value does not match the predicate, and the row is ignored.

As the next step, SQL Server follows the pointer from the Adam row’s index pointer array, which references the first Ann row. This row has a BeginTs of 50 and an EndTs of Infinity; therefore, it is visible to the transaction and is selected.

As the final step, SQL Server follows the next pointer in the chain. Even though the last row also has Name=’Ann’, it has an EndTs of 50, which indicates that the row was deleted before the transaction started, so it is invisible to the transaction.

I hope that gives you a very basic example of the access methods and data visibility rules used in In-Memory OLTP. Before we dive deeper into the internal implementation of the concurrency model in Hekaton, however, let’s recall the logical data consistency rules provided by the different transaction isolation levels.

Every transaction isolation level resolves write/write conflicts: multiple transactions cannot update the same row simultaneously. Different outcomes are possible, though. In some cases, SQL Server uses blocking, preventing transactions from accessing uncommitted changes until the transaction that made those changes commits. In other cases, SQL Server rolls back one of the transactions due to an update conflict. In-Memory OLTP uses the latter method to resolve write/write conflicts, aborting the transaction. We will discuss this situation later; for now, let’s focus on read data consistency.

There are three major data inconsistency issues possible in multi-user environments:

Dirty Reads: A transaction reads uncommitted (dirty) data from other, uncommitted transactions.

Non-Repeatable Reads: Subsequent attempts to read the same data within the same transaction return different results. This inconsistency arises when other transactions modify, or even delete, data between the reads done by the affected transaction.

Phantom Reads: This phenomenon occurs when subsequent reads within the same transaction return new rows (ones the transaction did not read before). It happens when another transaction inserts new data in between the reads done by the affected transaction.

Figure 2 below shows data inconsistency issues that are possible for different transaction isolation levels.

02. Transaction isolation levels and data consistency

With the exception of the SNAPSHOT isolation level, SQL Server uses locking to address data inconsistency issues with on-disk tables. It blocks sessions from reading or modifying data to prevent inconsistency. Such behavior also means that, in the case of a write/write conflict, the last modification wins. For example, when two transactions try to modify the same row, SQL Server blocks one of them until the other commits, allowing the blocked transaction to modify the data afterwards. No errors or exceptions are raised; however, the changes from the first transaction are lost.

The SNAPSHOT isolation level uses a row-versioning model in which data modifications done by other transactions are invisible to the transaction. It is implemented differently for on-disk and memory-optimized tables; logically, however, it behaves the same. Write/write conflicts in that model are resolved by aborting and rolling back one of the transactions.

It is also worth mentioning that even though the SERIALIZABLE and SNAPSHOT isolation levels provide the same level of protection against data inconsistency issues, there is a subtle difference in their behavior. A SNAPSHOT transaction sees data as of the beginning of the transaction. A SERIALIZABLE transaction sees data as of the time the data was accessed for the first time. Consider a session that reads data from a table in the middle of a transaction. If another session changed data in that table after the transaction started but before the data was read, a SERIALIZABLE transaction would see the changes while a SNAPSHOT transaction would not.

As I already mentioned, In-Memory OLTP supports three transaction isolation levels – SNAPSHOT, REPEATABLE READ, and SERIALIZABLE. However, In-Memory OLTP uses a completely different approach to enforce data consistency rules compared to on-disk tables. Rather than blocking or being blocked by other sessions, In-Memory OLTP validates data consistency at transaction commit time, throwing an exception and rolling back the transaction if the rules were violated. This behavior is quite confusing compared to on-disk tables – the transaction continues to work without being blocked and returns data to clients, yet it fails to commit in the end.

Let’s look at a few examples that demonstrate this behavior. As the first step, let’s create a memory-optimized table and insert a few rows into it.

create table dbo.HKData
(
     ID int not null,
     Col int not null,
     constraint PK_HKData
     primary key nonclustered hash(ID)
     with (bucket_count=64),
)
with (memory_optimized=on, durability=schema_only);

insert into dbo.HKData(ID, Col)
values(1,1),(2,2),(3,3),(4,4),(5,5);

Figure 3 shows two examples of how REPEATABLE READ transactions handle non-repeatable and phantom reads. The Session 1 transaction starts when the first SELECT operator executes. Remember that SQL Server starts the transaction at the moment of first data access rather than at the time of the BEGIN TRAN statement.

03. REPEATABLE READ behavior
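
Since the figure is an image, here is a rough sketch of the kind of two-session test it shows (my reconstruction, not the exact script from the figure). The isolation level for the memory-optimized table is requested with a table hint.

-- Session 1
begin tran
	select ID, Col from dbo.HKData with (repeatableread);

	/* While the transaction is active, Session 2 modifies one of the rows:
	   update dbo.HKData with (snapshot) set Col = -2 where ID = 2; */

	select ID, Col from dbo.HKData with (repeatableread);
commit; -- typically fails with a repeatable read validation error (Msg 41305)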

As you can see, with memory-optimized tables, other sessions were able to modify data that had been read by the active REPEATABLE READ transaction, which led to a transaction abort at commit time. This is completely different from on-disk tables, where the other sessions would be blocked until the REPEATABLE READ transaction successfully commits.

It is also worth mentioning that for memory-optimized tables, the REPEATABLE READ isolation level protects you from the Phantom Read phenomenon, which is not the case with on-disk tables.

As the next step, let’s repeat our tests at the SERIALIZABLE isolation level. You can see the code and the results of the execution in Figure 4.

04. SERIALIZABLE behavior

As you can see, the SERIALIZABLE isolation level prevents the session from committing the transaction when another session has inserted a new row, violating the serializable validation. Similar to the REPEATABLE READ isolation level, this behavior is different from on-disk tables, where the SERIALIZABLE transaction would successfully commit, blocking the other sessions until it is done.

Finally, let’s repeat our tests in SNAPSHOT isolation level. The code and results are shown in Figure 5.

05. SNAPSHOT behavior

The SNAPSHOT isolation level works similarly to on-disk tables and protects against the Non-Repeatable Read and Phantom Read phenomena. As you can guess, it does not need to perform repeatable read and serializable validations at the commit stage and, therefore, reduces the load on SQL Server.

Write/write conflicts work the same way regardless of the transaction isolation level in In-Memory OLTP: SQL Server does not allow a transaction to modify a row that has been modified by another uncommitted transaction. Figures 6 and 7 illustrate this behavior. The example uses the SNAPSHOT isolation level; however, the behavior does not change with other isolation levels.

06. Write/write conflict (1)

07. Write/write conflict (2)

Now let’s dive deeper and look at what happens under the hood. Figure 8 illustrates the lifetime of an In-Memory OLTP transaction.

08. In-memory OLTP transaction lifetime

When a new transaction starts, it generates a new TransactionId and obtains the current Global Transaction Timestamp value. The Global Transaction Timestamp dictates which versions of the rows are visible to the transaction: the timestamp value must be between BeginTs and EndTs for a row to be visible. During data modifications, however, the transaction checks whether there are any uncommitted versions of the rows, which prevents write/write conflicts when multiple sessions modify the same data.

When a transaction needs to delete a row, it updates the EndTs timestamp with the TransactionId value, along with a flag indicating that the timestamp contains a TransactionId rather than a Global Transaction Timestamp. An insert operation creates a new row with a BeginTs of TransactionId and an EndTs of Infinity. Finally, an update operation internally consists of a delete and an insert.

Figure 9 shows the data rows after we created and populated the dbo.HKData table. I am omitting the hash index structure for simplicity’s sake.

09. Data rows after table creation

Let’s assume that we have a transaction that started when the Global Transaction Timestamp value was 10, with a TransactionId of -5. I am using a negative value for the TransactionId to illustrate the difference between the two values in the figures below.

Let’s assume that the transaction performs a few data modification operations, as shown below.

insert into dbo.HKData with (snapshot)
(ID, Col)
values(10,10);

update dbo.HKData with (snapshot)
set Col = -2
where ID = 2;

delete from dbo.HKData with (snapshot)
where ID = 4;

Figure 10 illustrates the state of the data after these modifications. The INSERT statement created a new row, the DELETE statement updated the EndTs value of the row with ID=4, and the UPDATE statement changed the EndTs value of the row with ID=2 and created a new version of the row with the same ID.

It is important to mention that a transaction maintains a write set – pointers to the rows that have been inserted and deleted by the transaction. Moreover, at the SERIALIZABLE and REPEATABLE READ isolation levels, a transaction also maintains a read set of the rows it has read. The write set is used to generate transaction log records, while the read set is used to perform the REPEATABLE READ and SERIALIZABLE rules validation.

10. Data Rows after update (transaction is active)

When the COMMIT request is issued, the transaction starts the validation phase. First, it generates a new Global Transaction Timestamp value and replaces the TransactionId with this value in all BeginTs and EndTs timestamps of the rows it modified. Figure 11 illustrates that, assuming the new Global Transaction Timestamp value is 11.

11. Committing transaction (pre-commit stage)

At this moment, the rows modified by the transaction become visible to other transactions in the system even though the transaction has yet to commit. Other transactions can see the uncommitted rows, which leads to a situation called commit dependency. Those transactions are not blocked when they access the rows; however, they do not return data to clients nor commit until the original transaction they have a commit dependency on commits itself. If, for some reason, that transaction fails to commit, the dependent transactions are rolled back and an error is generated.

A commit dependency is technically a case of blocking in In-Memory OLTP. However, the validation and commit phases of a transaction are relatively short, so that blocking should not be excessive.

After the timestamps in the rows are replaced, the transaction validates the REPEATABLE READ and SERIALIZABLE rules and waits for its commit dependencies to clear. When that is done, the transaction moves to the commit phase, generates one or more log records, saves them to the transaction log, and completes.

Obviously, the validation phase of REPEATABLE READ and SERIALIZABLE transactions is longer than that of SNAPSHOT transactions due to the rules validation. Do not use them unless you have a legitimate use case for that level of data consistency. To be frank, I do not see many use cases for them besides importing and exporting data to/from in-memory tables.