Off-Topic: The Author’s Dilemma

The journey of a thousand miles begins with a single step. The journey of writing, on the other hand, begins with planning. There are hundreds of ways to structure the information in a book. And, unfortunately, the author’s choice is not always the best one – it is easy to forget that readers can have different preferences.

Today I need your feedback. I would like to ask all of you how you, as readers, prefer books or large articles to be structured. Let me elaborate a little bit.

Usually, every large article or book consists of several interrelated topics. Moreover, each topic can be split into basic (introductory) and advanced content. Let’s take Clustered Columnstore Indexes as an example. If I ever wrote a book on them, I would have to cover their internal structure, explain how they handle data modifications and how to design ETL processes in the most efficient way and, perhaps, explain what batch-mode execution is and how to utilize it. Please do not consider this to be full coverage of the topic – it is just an example.

The conventional approach for such a book would produce a table of contents similar to the one below. Please do not judge its quality – it is something I just invented on the fly to illustrate the point.

  • Overview and Introduction
  • Internal Structure
    • Row Groups, Delta Store, Delete Bitmaps
    • Internals of data modifications
  • ETL and Design Considerations
    • Best practices for ETL processes
    • CCI and Data Partitioning
    • ETL and Index Maintenance
  • Batch-Mode Execution
    • Batch-Mode Execution – overview and implementation
    • Query patterns that lead to Batch-Mode Execution

There is another approach, however. We can split the article or book based on the level of the content, such as basic and advanced. With such an approach, you would be able to read half of the book, get some ideas on the topic and start using it. Later, you can hone your skills with the advanced content.

Obviously, there are some downsides. First, the basic content would have plenty of references to the advanced chapters. For example, it could say “using batches of 1M rows for import would provide the best results. We will cover why that happens in Chapter X”. Second, the same topics would be covered in two different chapters, albeit at different levels. And the book would look poorly structured, especially if you are already familiar with the topic.

The table of contents for such an approach could look like this:

  • CCI: Basic Concepts
    • Overview and Introduction
    • CCI components overview
    • Best practices during data modification and ETL (*mainly tips)
    • Batch-Mode execution overview and tips and tricks how to utilize it
  • CCI: Advanced Concepts
    • Internal Structure of CCI components
    • Internals of Data Modifications
    • ETL processes – design and performance considerations
    • CCI maintenance
    • Deep dive into Batch-Mode execution

So, what would be your choice? What would you prefer to read?

Could you please answer the one-question survey or leave a reply to this post?

I really appreciate all your help! Trust me, it matters!

Size does matter: 10 ways to reduce the database size and improve performance in SQL Server

As a database professional, I like multi-terabyte databases. They are fun to deal with; they give you priceless experience and look cool on your resume. Last but not least, customers with multi-terabyte databases do not have problems with multi-kilodollar invoices. Such customers usually understand the amount of work involved in these projects.

The problem, however, is that large databases are not good for the customers. They are more expensive to support and maintain. You need highly skilled professionals on the team who can design nontrivial solutions in all areas: architecture, availability, maintenance and performance tuning, to name just a few.

Large databases usually require powerful hardware to run. There is, of course, a difference between the size of the database and the size of the active (hot) data. It is entirely possible that applications deal with only a fraction of the data stored in the database and, therefore, even a mediocre server can handle the load. However, at a bare minimum, there is always the storage cost.

Projects in which you have to reduce the size of a database are very common. Even though size reduction is rarely the primary objective of such projects, reducing the size often helps to achieve other goals. Think about designing a Disaster Recovery (DR) strategy. Plenty of things can help to meet strict RTO requirements, and a smaller database size definitely helps.

Today, I am going to discuss several methods that can help to reduce database size. Some of them are fully transparent to the client applications; others require regression testing and/or code refactoring. I will focus on the data files only – troubleshooting and reducing transaction log size is a different topic.

0. Find the worst offenders

In a nutshell, the database files are just containers for the data. Some space in the data files is allocated to and used by the database objects; however, there is usually some unallocated space. For example, if you created a database with one 1GB data file, you would have a 1GB file on disk. However, immediately after creation, it would have just a handful of data pages allocated inside the file, and most of the file would be free (unallocated).

It is completely normal to have free space in the data files, especially if the amount of data is constantly growing. However, an excessive amount of free space consumes disk space unnecessarily. The script below helps you to analyze the amount of allocated and unallocated space on a per-database-file basis.

select 
    f.type_desc as [Type]
    ,f.name as [FileName]
    ,fg.name as [FileGroup]
    ,f.physical_name as [Path]
    ,f.size / 128.0 as [CurrentSizeMB]
    ,f.size / 128.0 - convert(int,fileproperty(f.name,'SpaceUsed')) / 
        128.0 as [FreeSpaceMb]
from 
    sys.database_files f with (nolock) left outer join 
        sys.filegroups fg with (nolock) on
            f.data_space_id = fg.data_space_id
option (recompile)

Figure 1 illustrates the output of the script from one of the production servers. Even though the output shows a significant amount of free space in the data files, it may or may not be a problem. The decision whether that space needs to be deallocated depends on other factors, such as data growth expectations, disk system configuration and a few others.

01. Allocated and Free space in the database files

We will talk later about how to handle the situations when free space needs to be deallocated. For now, let’s focus on the data and discuss what we can do to reduce its size.

All of us are familiar with the Pareto principle, also known as the 80/20 rule. To simplify, in most projects you can achieve 80% of the improvements by spending 20% of the time or resources. That ratio is even more skewed when we search for the most space-consuming objects in the database. Even in very large databases, most of the space is usually consumed by just a handful of tables. Obviously, we would like to focus on them – at least at the beginning of the process.

Let’s take one step back, however, and remember how SQL Server stores the data. Every on-disk table has the main copy of the data, which is stored either in the clustered index or in a heap. In addition, every table can have a set of nonclustered indexes that store a copy of the data for some of the columns and reference the main copy of the rows (in the clustered index or heap). For the purpose of this discussion, let’s talk about generic indexes without any further differentiation between their types.

Every index consists of data pages that can belong to different allocation units. The IN_ROW allocation unit contains the main data rows, which include internal row structures, fixed-length column data and variable-length column data that fits on the IN_ROW data page. The LOB allocation unit contains variable-length column data (including data types such as XML, CLR UDT, etc.) that is greater than 8,000 bytes in size. Finally, the ROW_OVERFLOW allocation unit contains variable-length column data that does not exceed 8,000 bytes but does not fit IN_ROW.

For example, if you created a table with the following structure and inserted one row as shown below, you would have the data for column C5 stored in the LOB allocation unit, the data for either C3 or C4 stored in the ROW_OVERFLOW allocation unit and the data for all other columns stored in the IN_ROW allocation unit. It is also worth mentioning that the main row in the IN_ROW data would contain pointers to the data stored in the other allocation units.

create table dbo.T1
(
    C1 int,
    C2 datetime,
    C3 varchar(5000),
    C4 varchar(5000),
    C5 varchar(max)
);

insert into dbo.T1(C1, C2, C3, C4, C5)
values
(
    1 /* C1 */
    ,GetUtcDate() /* C2 */
    ,Replicate('A',5000) /* C3 */
    ,Replicate('B',5000) /* C4 */
    ,Replicate(convert(varchar(max),'A'),10000) /* C5 */
)

You can read more about row structure and allocation units here: http://aboutsqlserver.com/storage-engine/
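
To see where the data from the example ended up, you can look at the allocation units of dbo.T1 directly. The query below is a minimal sketch that assumes the table and the row from the example above exist in the current database. The join condition follows the documented mapping of container_id: hobt_id for in-row and row-overflow data, partition_id for LOB data.

```sql
select
    au.type_desc as [AllocationUnit]
    ,au.total_pages as [TotalPages]
    ,au.used_pages as [UsedPages]
    ,au.data_pages as [DataPages]
from
    sys.partitions p with (nolock) join
        sys.allocation_units au with (nolock) on
            au.container_id =
                case
                    /* 1 = IN_ROW_DATA, 3 = ROW_OVERFLOW_DATA */
                    when au.type in (1,3) then p.hobt_id
                    /* 2 = LOB_DATA */
                    else p.partition_id
                end
where
    p.object_id = object_id(N'dbo.T1')
order by
    au.type
option (recompile)
```

For the row inserted above, you should expect pages in the IN_ROW_DATA, LOB_DATA and ROW_OVERFLOW_DATA allocation units.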

The script below provides information about space allocation on a per-index basis in the database. You can easily modify it to aggregate on a per-table basis or to drill down on a per-allocation-unit basis; however, at this stage I usually prefer to work at the index level.

;with SpaceInfo(ObjectId, IndexId, TableName, IndexName
    ,Rows, TotalSpaceMB, UsedSpaceMB)
as
( 
    select  
        t.object_id as [ObjectId]
        ,i.index_id as [IndexId]
        ,s.name + '.' + t.Name as [TableName]
        ,i.name as [Index Name]
        ,sum(p.[Rows]) as [Rows]
        ,sum(au.total_pages) * 8 / 1024 as [Total Space MB]
        ,sum(au.used_pages) * 8 / 1024 as [Used Space MB]
    from    
        sys.tables t with (nolock) join 
            sys.schemas s with (nolock) on 
                s.schema_id = t.schema_id
            join sys.indexes i with (nolock) on 
                t.object_id = i.object_id
            join sys.partitions p with (nolock) on 
                i.object_id = p.object_id and 
                i.index_id = p.index_id
            cross apply
            (
                select 
                    sum(a.total_pages) as total_pages
                    ,sum(a.used_pages) as used_pages
                from sys.allocation_units a with (nolock)
                where p.partition_id = a.container_id 
            ) au
    where   
        i.object_id > 255
    group by
        t.object_id, i.index_id, s.name, t.name, i.name
)
select 
    ObjectId, IndexId, TableName, IndexName
    ,Rows, TotalSpaceMB, UsedSpaceMB
    ,TotalSpaceMB - UsedSpaceMB as [ReservedSpaceMB]
from 
    SpaceInfo		
order by
    TotalSpaceMB desc
option (recompile)

Figure 2 shows the output of this script (table and index names are obscured). As you can see, it quickly pinpoints the indexes that consume most of the space in the database.

02. Indexes that consume the most space in the database

Now, let’s see what we can do to reduce their size.

1. Reducing Index Fragmentation

As you know, SQL Server stores on-disk table data on 8KB data pages. Each data page contains data for one or multiple rows. With the exception of index creation or rebuild, SQL Server tries to populate pages in full during normal data modification operations. When data does not fit, for example, when the data page does not have enough space to accommodate a new row, SQL Server performs a page split operation. In a nutshell, SQL Server allocates another data page and moves about half of the data from the original page to the new one, which frees up some space to accommodate the new row on the original data page.

Page split operations lead to index fragmentation, which comes in two kinds. External fragmentation means that the logical order of the pages does not match their physical order, and/or logically subsequent pages are not located in the same or adjacent extents (an extent is a group of 8 pages). Such fragmentation forces SQL Server to jump around while reading the data from disk, which makes read-ahead less efficient and increases the number of physical reads required. Moreover, it increases random disk I/O, which is far less efficient than sequential I/O in the case of magnetic hard drives.

Internal fragmentation, on the other hand, means that data pages in the index have free space. As a result, the index uses more data pages to store the data. It also increases the number of reads during query execution and the amount of buffer pool memory required to cache the index pages.

A small degree of internal fragmentation is not necessarily bad. It reduces page splits during insert and update operations when data is inserted into or updated in different parts of the index. Nonetheless, a large degree of internal fragmentation wastes index space and reduces the performance of the system. Moreover, for indexes with ever-increasing keys, for example on identity columns, internal fragmentation is not desirable because the data is always inserted at the end of the index.

You can monitor both internal and external fragmentation with the sys.dm_db_index_physical_stats dynamic management function. Internal fragmentation can be monitored with the avg_page_space_used_in_percent column; a lower value in the column indicates a higher degree of internal fragmentation.

Let’s take a look at an example and analyze the internal fragmentation of one of the indexes with the script below. For simplicity’s sake, I am using a relatively small table; however, you would obviously want to focus on the largest indexes during the tuning process.

select 
    index_id, partition_number, alloc_unit_type_desc
    ,index_level, page_count, avg_page_space_used_in_percent
from 
    sys.dm_db_index_physical_stats
    (
        db_id() /*Database */
        ,object_id(N'dbo.MyTable') /* Table (Object_ID) */
        ,1 /* Index ID */
        ,null /* Partition ID – NULL – all partitions */
        ,'detailed' /* Mode */
    )

Figure 3 illustrates the partial output of the script. The table is partitioned and, as a result, you will see separate rows in the output – one per partition per allocation unit.

03. Internal Fragmentation in the Index

You can remove internal fragmentation by rebuilding the index. Figure 4 illustrates the output of sys.dm_db_index_physical_stats after the index rebuild with FILLFACTOR=100 (more on it later).

04. Internal Fragmentation after Index Rebuild
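
A rebuild of this kind can be sketched as follows. IDX_MyTable_CI is a hypothetical name for the clustered index on dbo.MyTable; it does not appear in the scripts above, so substitute the name of your own index.

```sql
/* IDX_MyTable_CI is a hypothetical index name used for illustration */
alter index IDX_MyTable_CI on dbo.MyTable
rebuild with (fillfactor = 100);
```

Running the sys.dm_db_index_physical_stats query again after the rebuild lets you compare avg_page_space_used_in_percent before and after.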

Figure 5 illustrates the amount of space the index used before (on the left side) and after (on the right side) the rebuild. As you can see, the index rebuild freed up more than 40% of the space the index consumed before the rebuild.

05. Index Size before and after rebuild

The FILLFACTOR parameter controls the amount of free space SQL Server reserves on the pages during index creation and rebuild operations. For example, FILLFACTOR=80 forces SQL Server to make data pages about 80% full, keeping 20% of the page space reserved. This could reduce the number of page splits and internal fragmentation when new rows are inserted into the middle of the index or updated in a way that increases their size. It is very important to remember that FILLFACTOR is applied only during index creation or rebuild. After the index is created, SQL Server continues to populate pages in full, up to 100%, performing page splits as needed.

As you can guess, the optimal solution requires fine-tuning FILLFACTOR and designing an index maintenance strategy that keeps internal fragmentation at a minimum most of the time. Unfortunately, there is no “one size fits all” advice in terms of FILLFACTOR. You should try to figure out the optimal value by using different FILLFACTOR values and monitoring how they affect fragmentation. You can start with a FILLFACTOR value close to 100 and gradually decrease it by 5 until you find the sweet spot. Alternatively, you can monitor page splits in real time using the transaction_log extended event to track the LOP_DELETE_SPLIT operation, adjusting the value based on the number of splits (you can see more on it at Jonathan Kehayias’ blog).
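
One way to follow the effect of a chosen FILLFACTOR over time is to check page density and fragmentation periodically between rebuilds. The query below is a sketch, assuming the dbo.MyTable table from the earlier example. It uses SAMPLED mode to limit the scanning overhead, because avg_page_space_used_in_percent is not populated in LIMITED mode.

```sql
select
    i.name as [Index]
    ,i.fill_factor as [FillFactor] /* 0 means the server default */
    ,ps.partition_number as [Partition]
    ,ps.page_count as [Pages]
    ,ps.avg_fragmentation_in_percent as [ExternalFragmentation]
    ,ps.avg_page_space_used_in_percent as [PageDensity]
from
    sys.dm_db_index_physical_stats
    (
        db_id() /* Database */
        ,object_id(N'dbo.MyTable') /* Table (Object_ID) */
        ,null /* Index ID: NULL means all indexes */
        ,null /* Partition ID: NULL means all partitions */
        ,'sampled' /* Mode */
    ) ps join sys.indexes i on
        i.object_id = ps.object_id and
        i.index_id = ps.index_id
where
    ps.alloc_unit_type_desc = 'IN_ROW_DATA' and
    ps.index_level = 0 /* Leaf level only */
order by
    i.index_id, ps.partition_number
```

If page density drops quickly after a rebuild, the index is a candidate for a lower FILLFACTOR; if it stays close to the FILLFACTOR value, the reserved space is probably wasted.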

Lastly, you should remember that an index rebuild creates another copy of the index during the process. In fact, it could increase the size of the data file on disk during the operation. Moreover, it generates a large amount of transaction log records, which could also affect transaction log size, network load and the size of the send and redo queues if database mirroring or AlwaysOn AG are in use.

2. Implementing Data Compression

If you are lucky enough to have the Enterprise Edition of SQL Server, you can reduce the size of the data by implementing Data Compression. There are two types of Data Compression supported in SQL Server – Row and Page.

Row compression addresses the storage inefficiency introduced by fixed-length data types. By default, in a non-compressed row, the size of fixed-length data is based on the data type size. For example, an INT column would always use 4 bytes, regardless of the value – even when it is NULL. Row compression removes that overhead. For example, an INT value of 255 would use just 1 byte rather than 4 bytes.

Page compression goes one step further and implements dictionary-based compression, removing repetitive sequences of bytes on the page. I am not going to dive deep into the storage format of compressed rows here and will follow up with an additional blog post at some point. Alternatively, you can read about it in my book.

The actual results greatly depend on the data and the schema. As you can guess, row compression is beneficial when the table has fixed-length columns: the more fixed-length columns you have, the better the space savings are. The results of page compression, on the other hand, depend on how repetitive the data on the page is. You can use the sp_estimate_data_compression_savings system stored procedure to estimate compression results for your data. That procedure works by copying a sample of your data to tempdb, compressing it and measuring the results.
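
As an illustration, a call that estimates PAGE compression savings for a clustered index could look like the sketch below. The dbo.MyTable name is reused from the earlier fragmentation example and is only a placeholder for your own table.

```sql
exec sp_estimate_data_compression_savings
    @schema_name = N'dbo'
    ,@object_name = N'MyTable'
    ,@index_id = 1 /* Clustered index; NULL estimates all indexes */
    ,@partition_number = null /* NULL means all partitions */
    ,@data_compression = N'PAGE'; /* Or N'ROW' / N'NONE' */
```

The result set reports the current and the estimated size in KB for each index and partition, so running it once with ROW and once with PAGE gives you a quick comparison of the two options.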

It is also very important to remember that data compression works with IN_ROW allocation units only. It does not compress LOB or ROW_OVERFLOW data.

Obviously, there is an overhead. Compression and decompression add CPU load to the system. That overhead is relatively small for ROW compression, especially when you read the data; however, for PAGE compression it is more significant. There is a catch, though. While compression adds load to the CPU, it reduces I/O load in the system – SQL Server needs to issue fewer I/O operations due to the smaller data size. In the end, queries could execute even faster, especially on systems that are not heavily CPU bound.

There is also an overhead during batch operations. Batch inserts and updates could take more time when data is compressed. The same applies to index maintenance. Just to give you some numbers, I ran a few tests at the time when I worked on the book. I was using the data from one of the production tables with a decent number of fixed- and variable-length columns. Obviously, a different table schema and data distribution will lead to slightly different results. However, in most cases, you would see similar patterns.

At the beginning of the tests, I created three different heap tables and inserted one million rows into each of them. After that, I created clustered indexes with different compression settings and FILLFACTOR=100. This workflow led to zero index fragmentation and fully populated data pages.

During the first test, I ran SELECT statements to scan all of the clustered indexes accessing some row data. The second test updated every row in the tables, changing the value of the fixed-length column in a way that did not increase the row size. The third test inserted another batch of a million rows into the tables. Finally, I rebuilt all of the clustered indexes. You can see the execution statistics in Figure 6 below. All tests ran with a warm cache, with the data pages cached in the buffer pool. A cold cache could reduce the difference in execution times for the queries against compressed and non-compressed data, because queries against compressed data perform less physical I/O.

All statements were forced to run on a single CPU by using a MAXDOP 1 query hint. Using parallelism would decrease the query execution times; however, it would add the overhead of parallelism management during query execution, which I wanted to avoid during the tests.

Figure 6 demonstrates the results of the tests.

06. Data Compression – Storage Size and Performance

Obviously, it is impossible to provide generic advice on how and when to compress the data. However, if the system is not heavily CPU bound, I would usually suggest implementing ROW compression on the indexes with volatile data. PAGE compression, on the other hand, could be a good choice for old static data, especially when that data is accessed infrequently.

It is also impossible to avoid mentioning data partitioning when we are discussing compression. It is very common to have a situation when just a small subset of the data in the table is volatile. Unfortunately, you cannot apply different compression methods to hot and cold data unless the data is partitioned (either with table partitioning or with multiple tables utilizing partitioned views). Such partitioning lets you apply different compression settings to different table partitions (or tables) and allows you to reduce index maintenance overhead by rebuilding the index at the partition scope.
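
With table partitioning, per-partition compression can be sketched as shown below. The example assumes that dbo.MyTable is partitioned into at least four partitions, that partitions 1 to 3 hold old static data and that partition 4 is the hot one; the table name and partition numbers are illustrative only.

```sql
/* Compress old partitions with PAGE and the hot partition with ROW */
alter table dbo.MyTable
rebuild partition = all
with
(
    data_compression = page on partitions(1 to 3)
    ,data_compression = row on partitions(4)
);

/* Later, rebuild a single partition once its data becomes static */
alter table dbo.MyTable
rebuild partition = 4
with (data_compression = page);

/* Check the current compression setting of every partition */
select
    index_id, partition_number, data_compression_desc, rows
from
    sys.partitions
where
    object_id = object_id(N'dbo.MyTable')
order by
    index_id, partition_number;
```

ALTER TABLE REBUILD affects only the heap or the clustered index. Nonclustered indexes keep their own compression settings and need to be rebuilt separately with ALTER INDEX.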

A word of caution, however. Partitioning is a great tool that can help you to address multiple challenges, especially in the database administration area. Even though table partitioning can be implemented transparently to the client applications, it could introduce plan regressions. One such example is described here: http://aboutsqlserver.com/2012/07/10/cautionary-tale-about-triggers-version-store-and-fragmentation/. Make sure to carefully test your application if you decide to implement table partitioning.

Finally, if you are interested in data partitioning, I would like to reference my book again. The chapter on data partitioning is the largest one there and it discusses various examples and approaches of partitioning in various editions of SQL Server.

3. Removing unused indexes

It is often possible to reduce the size of the data during the index tuning process. The main goal of index tuning is creating the right set of indexes, which also requires you to drop existing unused and redundant indexes. Keep in mind that you always need to carefully test your system when you change the indexes, making sure that there are no plan regressions after the tuning.

There are two dynamic management objects that can help you to detect inefficient indexes. The first one, sys.dm_db_index_usage_stats, shows you the statistics on the various index operations, such as index seeks and scans, index updates and a few others, along with the time of the last operation. The second one, sys.dm_db_index_operational_stats, dives deeper and provides information on I/O, access methods and locking statistics for the index.

The key difference between the two DMOs is how they collect data. sys.dm_db_index_usage_stats tracks how many times an operation appeared in the execution plan. sys.dm_db_index_operational_stats, in contrast, tracks operations at the row level. For example, if a query execution plan includes a Key Lookup operation and SQL Server ran it twice during query execution, sys.dm_db_index_usage_stats would count a single lookup operation, while sys.dm_db_index_operational_stats would count two of them.

You can obtain the information about index usage with sys.dm_db_index_usage_stats by running the statement below. You can use a similar approach with sys.dm_db_index_operational_stats if you need more detailed analysis.

select 
    s.Name + N'.' + t.name as [Table]
    ,i.name as [Index] 
    ,i.is_unique as [IsUnique]
    ,ius.user_seeks as [Seeks], ius.user_scans as [Scans]
    ,ius.user_lookups as [Lookups]
    ,ius.user_seeks + ius.user_scans + ius.user_lookups as [Reads]
    ,ius.user_updates as [Updates], ius.last_user_seek as [Last Seek]
    ,ius.last_user_scan as [Last Scan], ius.last_user_lookup as [Last Lookup]
    ,ius.last_user_update as [Last Update]
from 
    sys.tables t with (nolock) join sys.indexes i with (nolock) on
        t.object_id = i.object_id
    join sys.schemas s with (nolock) on 
        t.schema_id = s.schema_id
    left outer join sys.dm_db_index_usage_stats ius on
        ius.database_id = db_id() and
        ius.object_id = i.object_id and 
        ius.index_id = i.index_id
order by
    s.name, t.name, i.index_id
option (recompile)

As you can see in the output in Figure 7, you can easily pinpoint the problematic indexes. Indexes in red ovals were not used in any queries for the duration of statistics collection. Those indexes consume space in the database. Moreover, they introduce update overhead in the system. As a side note, it is also beneficial to look at the indexes where the update overhead exceeds their usefulness – for example, the index in the green oval is used only for scans even though it is constantly updated.

07. Index Usage Statistics
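
For the more detailed, row-level view, a comparable query against sys.dm_db_index_operational_stats could look like the sketch below. It is an illustration rather than a complete analysis script: it picks a handful of the available counters and aggregates them across partitions.

```sql
select
    s.name + N'.' + t.name as [Table]
    ,i.name as [Index]
    ,sum(ios.range_scan_count) as [RangeScans]
    ,sum(ios.singleton_lookup_count) as [SingletonLookups]
    ,sum(ios.leaf_insert_count) as [LeafInserts]
    ,sum(ios.leaf_update_count) as [LeafUpdates]
    ,sum(ios.leaf_delete_count) as [LeafDeletes]
    ,sum(ios.row_lock_wait_count) as [RowLockWaits]
    ,sum(ios.page_io_latch_wait_count) as [PageIOLatchWaits]
from
    /* NULL parameters: all objects, indexes and partitions */
    sys.dm_db_index_operational_stats(db_id(), null, null, null) ios
    join sys.indexes i with (nolock) on
        ios.object_id = i.object_id and
        ios.index_id = i.index_id
    join sys.tables t with (nolock) on
        t.object_id = i.object_id
    join sys.schemas s with (nolock) on
        t.schema_id = s.schema_id
group by
    s.name, t.name, i.name, i.index_id
order by
    s.name, t.name, i.index_id
option (recompile)
```

The function returns data only for the objects whose metadata is currently cached, so an index missing from the output is not necessarily unused.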

Obviously, you need to be careful and make sure that you captured valid usage statistics. SQL Server does not persist index usage statistics, and they are reset on restart. The statistics are also cleared whenever the database is detached or shut down when the AUTO_CLOSE database property is on. Moreover, in SQL Server 2012 and 2014, the statistics are reset when the index is rebuilt.

You must keep this behavior in mind during index analysis. It is not uncommon to have indexes that support queries executing on a given schedule. As an example, think about an index that supports a payroll process running on a bi-weekly or monthly basis. Index usage statistics could indicate that the index has not been used for reads if SQL Server was recently restarted or, in the case of SQL Server 2012 and 2014, if the index was recently rebuilt.

One of the ways to address the statistics reset is to collect usage statistics on a schedule and persist the results in a table in the database. This could help to catch the situations when an index is required to support rarely executed processes. As a side note, you can consider creating such indexes as part of the process and dropping them when the process is completed.
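
A minimal sketch of such a collection process is shown below. The dbo.IndexUsageHistory table and its columns are hypothetical names introduced for this illustration; the INSERT statement would typically run from a scheduled job.

```sql
/* Hypothetical table that stores periodic usage snapshots */
create table dbo.IndexUsageHistory
(
    CollectedOn datetime2(0) not null,
    ObjectId int not null,
    IndexId int not null,
    UserSeeks bigint not null,
    UserScans bigint not null,
    UserLookups bigint not null,
    UserUpdates bigint not null,
    LastUserSeek datetime null,
    LastUserScan datetime null,
    LastUserLookup datetime null,

    constraint PK_IndexUsageHistory
    primary key clustered(CollectedOn, ObjectId, IndexId)
);

/* Run on a schedule: captures the current counters for this database */
insert into dbo.IndexUsageHistory
(
    CollectedOn, ObjectId, IndexId, UserSeeks, UserScans
    ,UserLookups, UserUpdates, LastUserSeek, LastUserScan, LastUserLookup
)
select
    sysutcdatetime(), ius.object_id, ius.index_id, ius.user_seeks
    ,ius.user_scans, ius.user_lookups, ius.user_updates
    ,ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup
from
    sys.dm_db_index_usage_stats ius
where
    ius.database_id = db_id();
```

The counters are cumulative since the last reset, so a drop in the values between two snapshots indicates that a reset happened in between.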

You should also be careful with unique indexes. It is entirely possible that such indexes are created to support uniqueness constraints and removal of such indexes would violate business requirements for the system.
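
Before dropping a unique index, it may help to check whether it backs a constraint or is referenced by foreign keys. The query below is one possible sketch of such a check.

```sql
select
    s.name + N'.' + t.name as [Table]
    ,i.name as [Index]
    ,i.is_primary_key as [IsPrimaryKey]
    ,i.is_unique_constraint as [IsUniqueConstraint]
    ,(
        /* Foreign keys that rely on this index */
        select count(*)
        from sys.foreign_keys fk
        where
            fk.referenced_object_id = i.object_id and
            fk.key_index_id = i.index_id
    ) as [ReferencingForeignKeys]
from
    sys.tables t join sys.indexes i on
        t.object_id = i.object_id
    join sys.schemas s on
        t.schema_id = s.schema_id
where
    i.is_unique = 1
order by
    s.name, t.name, i.index_id
```

A unique index that is neither a constraint nor referenced by a foreign key can still enforce a business rule, so the output is a starting point for the review and not a list of indexes that are safe to drop.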

4. Removing Redundant Indexes

As you know, SQL Server can use a composite index for an Index Seek operation as long as a query has a SARGable predicate on the leftmost index column. OK, I know, it is confusing, so let’s look at an example and create a table with a clustered and two nonclustered indexes.

create table dbo.Employee
(
    EmployeeId int not null,
    LastName nvarchar(64) not null,
    FirstName nvarchar(64) not null,
    DateOfBirth date not null,
    Phone varchar(20) null,
    Picture varbinary(max) null
);
 
create unique clustered index IDX_Employee_EmployeeId 
on dbo.Employee(EmployeeId);
 
create nonclustered index IDX_Employee_LastName_FirstName
on dbo.Employee(LastName, FirstName);
 
create nonclustered index IDX_Employee_LastName
on dbo.Employee(LastName);

SQL Server can utilize the IDX_Employee_LastName_FirstName index if a query has a predicate on LastName, regardless of whether there is a predicate on FirstName. For example, both of the queries below would be able to use that index:

select EmployeeId, LastName, FirstName, DateOfBirth
from dbo.Employee
where LastName = @LastName and FirstName = @FirstName

select EmployeeId, LastName, FirstName, DateOfBirth
from dbo.Employee
where LastName = @LastName

Thus, the index IDX_Employee_LastName is redundant and can be dropped. There are, of course, always exceptions to the rule. The IDX_Employee_LastName index stores less data and, therefore, it is more compact. If you have a process that constantly scans the index, the smaller index size could be beneficial. However, those situations are very rare, and usually the update overhead of the extra index is not worth such a small performance improvement on SELECT queries.
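
If you want a cautious way to remove such an index, one option is to disable it first and drop it only after the workload has run without regressions. This is a sketch of that workflow, not the only way to do it.

```sql
/* Step 1: disable the index. Its data is deallocated,
   but the definition stays in the metadata */
alter index IDX_Employee_LastName on dbo.Employee disable;

/* To bring the index back, rebuild it:
   alter index IDX_Employee_LastName on dbo.Employee rebuild; */

/* Step 2: once you are confident the index is not needed, drop it */
drop index IDX_Employee_LastName on dbo.Employee;
```

Disabling already frees the space used by the nonclustered index, while keeping the definition makes it easy to restore the index with a rebuild.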

The script below shows you potentially redundant indexes by checking the indexes that have the same leftmost columns.

select
    s.Name + N'.' + t.name as [Table]
    ,i1.index_id as [Index1 ID], i1.name as [Index1 Name]
    ,dupIdx.index_id as [Index2 ID], dupIdx.name as [Index2 Name] 
    ,c.name as [Column]
from 
    sys.tables t join sys.indexes i1 on
        t.object_id = i1.object_id
    join sys.index_columns ic1 on
        ic1.object_id = i1.object_id and
        ic1.index_id = i1.index_id and 
        ic1.index_column_id = 1  
    join sys.columns c on
        c.object_id = ic1.object_id and
        c.column_id = ic1.column_id      
    join sys.schemas s on 
        t.schema_id = s.schema_id
    cross apply
    (
        select i2.index_id, i2.name
        from
            sys.indexes i2 join sys.index_columns ic2 on       
                ic2.object_id = i2.object_id and
                ic2.index_id = i2.index_id and 
                ic2.index_column_id = 1  
        where	
            i2.object_id = i1.object_id and 
            i2.index_id > i1.index_id and 
            ic2.column_id = ic1.column_id
    ) dupIdx     
order by
    s.name, t.name, i1.index_id

For example, for the dbo.Employee table, the script would provide the output shown in Figure 8.

08. Potentially Redundant Indexes

You can use this information for further analysis and index consolidation. In some cases, consolidation is trivial. For example, if a system has two indexes: IDX1(LastName, FirstName) include (Phone) and IDX2(LastName) include(DateOfBirth), you can consolidate them as IDX3(LastName, FirstName) include(DateOfBirth, Phone).
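
In T-SQL, the trivial consolidation could be sketched as follows, assuming that IDX1 and IDX2 exist on the dbo.Employee table from the earlier example. The index names are the illustrative ones used in the text.

```sql
/* Create the consolidated index first, so queries always have an index to use */
create nonclustered index IDX3
on dbo.Employee(LastName, FirstName)
include(DateOfBirth, Phone);

/* Then remove the two indexes it replaces */
drop index IDX1 on dbo.Employee;
drop index IDX2 on dbo.Employee;
```

Creating IDX3 before dropping the old indexes temporarily requires extra space, but it avoids a window in which the queries have no suitable index.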

In other cases, consolidation requires further analysis. For example, if a system has two indexes: IDX1(OrderDate, WarehouseId) and IDX2(OrderDate, OrderStatus), you have three options. You can consolidate them as IDX3(OrderDate, WarehouseId) include(OrderStatus) or as IDX4(OrderDate, OrderStatus) include(WarehouseId). Finally, you can leave both indexes in place. The decision primarily depends on the selectivity of the leftmost column and index usage statistics.

5. Implementing Filtered Indexes

Filtered indexes, introduced in SQL Server 2008, allow you to index only a subset of the data and, therefore, reduce the index size. Consider a table with some data that needs to be processed as an example. This table can have a Processed bit column, which indicates the processing status, as shown below.

create table dbo.Data
(
    RecId int not null,
    Processed bit not null,
    /* Other Columns */
)

Let’s assume that you have a backend process that loads unprocessed data based on the following query.

select top 1000 RecId /* , Other Columns */
from dbo.Data
where Processed = 0
order by RecId;

This query can benefit from the following index: CREATE NONCLUSTERED INDEX IDX_Data_Processed_RecId ON dbo.Data(Processed, RecId). Even though SQL Server rarely uses indexes on bit columns due to their low selectivity, such a scenario might be an exception if there are just a handful of unprocessed rows. SQL Server can use that index to select them; however, the index will never be used for selection of processed rows if a Key Lookup is required.

As a result, all index rows with a key value of Processed=1 would be useless. They will increase the index’s size, waste storage space, and introduce additional overhead during index maintenance.

Filtered indexes solve that problem by allowing you to index just unprocessed rows, making the index small and efficient as it is shown below.

create nonclustered index IDX_Data_Unprocessed_Filtered
on dbo.Data(RecId)
include(Processed)
where Processed = 0;

Obviously, there are several catches to remember. First, Query Optimizer has a design limitation that can lead to suboptimal execution plans when columns from the filter are not present in the leaf-level index rows. Always add all columns from the filter to the index, either as key or as included columns. In our example, the Processed column is added as an included column.

The second problem is filtered index statistics. SQL Server does not count changes to the filter columns toward the statistics update threshold, which can lead to very inaccurate statistics. You should factor this behavior into your statistics maintenance strategy and, perhaps, update statistics manually on a regular basis.
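A manual update for the filtered index from the example could look like the sketch below. Whether to use a full scan or sampling, and how often to run it, are assumptions you need to validate against your own data volumes.

```sql
-- Refresh statistics on the filtered index only.
-- FULLSCAN is affordable here because the filtered index is small by design.
update statistics dbo.Data IDX_Data_Unprocessed_Filtered
with fullscan;
```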

Finally, SQL Server is very conservative about using filtered indexes when plans are cached. It will not generate and cache a plan that uses a filtered index if there is a possibility that the plan would be invalid for some parameter values. For example, our filtered index would not be used in the case below:

select top 1000 RecId, /* Other Columns */
from dbo.Data
where Processed = @Processed
order by RecId;

As you can guess, auto-parameterization makes matters worse. The bottom line: you should carefully test your system after implementing filtered indexes to make sure that there are no plan regressions.
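One possible way to keep the filtered index usable with a parameter is to branch on its value and use a literal in each query. This is only a sketch; the @Processed variable stands in for the parameter of a hypothetical calling procedure.

```sql
-- @Processed stands in for a stored procedure parameter (hypothetical).
declare @Processed bit = 0;

if @Processed = 0
    select top 1000 RecId
    from dbo.Data
    where Processed = 0 -- Literal matches the index filter
    order by RecId;
else
    select top 1000 RecId
    from dbo.Data
    where Processed = 1 -- Filtered index does not apply to this branch
    order by RecId;
```

Each statement gets its own cached plan, so the first branch can safely rely on the filtered index.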

Finally, if you are using XML indexes in SQL Server 2012 and above, you can reduce their size (which, by the way, could be gigantic) by implementing Selective XML Indexes, which index just a subset of the data. The approach is pretty much the same as with filtered indexes.

6. Using Appropriate Data Types

All the approaches we have discussed so far can be implemented transparently to client applications. Obviously, transparency here is very misleading: index tuning and partitioning require careful regression testing. However, all those changes are located in the database tier and you do not need to change any application code.

Now it is time to talk about several approaches that do require such code changes. I would like to start with a general design principle that is often ignored during the database design stage. The principle is simple: you should choose appropriate data types for the job.

As an example, let’s consider a system that collects GPS location information from multiple devices. The main transaction entity in such a system is Positions. One way to define such a table is the following (just a few of its columns are shown):

create table dbo.Positions
(
    ATime datetime not null, -- 8 bytes 
    Latitude float not null, -- 8 bytes 
    Longitude float not null, -- 8 bytes 
    IsPrecise int not null,	-- 4 bytes
    IsAssistanceUsed int not null, -- 4 bytes
    -- Total: 32 bytes
    ...
)

Alternatively, you can define the same table a little bit differently:

create table dbo.Positions
(
    ATime datetime2(0) not null, -- 6 bytes 
    Latitude decimal(9,6) not null, -- 5 bytes 
    Longitude decimal(9,6) not null, -- 5 bytes 
    IsPrecise bit not null,	-- 1 byte
    IsAssistanceUsed bit not null, -- 0 bytes (shares the byte with IsPrecise)
    -- Total: 17 bytes
    ...
)

As you see, even within the scope of those 5 columns you can save 15 bytes per row. It does not sound like a significant saving; however, it quickly adds up as the amount of data grows. For example, if such a system collects 1M rows per day and stores them for a year, 15 bytes per row becomes ~5.5GB of data (1,000,000 × 365 × 15 bytes) on the leaf level of the index, without counting any fragmentation overhead. And trust me, 1M positions per day is a very small number for such systems.
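The arithmetic behind that estimate can be reproduced with a short query. The variable names are mine; the numbers are the ones used in the example.

```sql
declare
    @RowsPerDay bigint = 1000000
    ,@Days int = 365
    ,@SavedBytesPerRow int = 15; -- 32 bytes - 17 bytes

select
    @RowsPerDay * @Days * @SavedBytesPerRow as [Saved Bytes] -- 5,475,000,000
    ,@RowsPerDay * @Days * @SavedBytesPerRow / 1000000000.0 as [Saved GB (decimal)];
```

Keep in mind that every nonclustered index that includes those columns multiplies the saving.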

While row compression can help to address some of the overhead, it does not help much when data types store excessive information. For example, row compression would cut the extra space from boolean data stored in int columns; however, it would not help much with datetime when it has more precision than needed. Moreover, compression is an Enterprise Edition feature, so it would not help you with the other editions.

One of the questions you should answer during the database design stage is how precise the information should be. The answer helps you choose the correct data type for the column. As an example, consider the OrderDate column in an Order Entry/Shopping Cart system. Do you really need to store the time when the order was placed with the 3-millisecond precision provided by a datetime column (8 bytes)? If not, you could use the 1-second precision of the datetime2(0) type (6 bytes). Or, if 1-minute precision is enough, you can end up with the 4-byte smalldatetime type.
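If you want to check those storage sizes yourself, a quick sketch like the one below should do. It relies on the DATALENGTH function, which I would expect to report 8, 6, and 4 bytes for these three types, matching the sizes quoted above.

```sql
declare @Now datetime = getdate();

select
    datalength(@Now) as [datetime bytes]
    ,datalength(cast(@Now as datetime2(0))) as [datetime2(0) bytes]
    ,datalength(cast(@Now as smalldatetime)) as [smalldatetime bytes];
```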

You should also remember that smaller data rows help performance during scans. A table with smaller rows stores more rows per data page and, therefore, has fewer data pages in the index. Queries that need to perform scans (including range scans) will be faster because fewer I/O operations are involved. Last but not least, such indexes use less memory in the buffer pool, which allows SQL Server to cache more data and reduces the number of physical I/O operations in the system.

Finally, you should remember that altering a table never decreases the size of the existing data rows. You need to rebuild the indexes that reference the altered columns in order to see the space saving.
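A sketch of that sequence for the Positions table is shown below. It assumes the table was created with the int flags from the first definition, that it has a clustered index, and that no index or constraint references the altered columns directly.

```sql
-- Change the data types; existing rows keep their old physical size for now.
alter table dbo.Positions alter column IsPrecise bit not null;
alter table dbo.Positions alter column IsAssistanceUsed bit not null;
go

-- Rebuild the indexes to physically reclaim the space.
alter index all on dbo.Positions rebuild;
```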

7. Storing LOB data outside of the database

So far, we have discussed how to reduce the size of IN_ROW data. Let’s talk a bit about LOB data. First, let’s discuss binary data that does not require any in-database processing, for example images, binary documents, and other similar entities. With such entities, you always face the question of how to store them: either within the database or externally, keeping just a reference (perhaps a file name) in the database.

There was a rule of thumb introduced by Microsoft at the time of the FILESTREAM release: binary data greater than 1MB would benefit from external storage, while data smaller than 200KB should live within the database. Everything in between is a grey area. While I do not want to challenge the numbers, there are usually more factors involved than just the size. Obviously, I am not talking about huge binary objects, where the choice is obvious, but in general you should make the decision on a case-by-case basis.

In-database storage of binary data is usually the simplest solution to implement. The obvious downside of this approach is the increase in database size. However, you can mitigate it to a degree with proper architecture. For example, you can put binary data into separate filegroup(s) that reside on slower disk arrays. You can also implement partial backups and exclude static binary data from the daily backup files. In Enterprise Edition of SQL Server, you can utilize piecemeal restore and meet strict RTO requirements even with the binary data in the database; however, in non-Enterprise editions, RTO requirements could become the deal breaker. Binary data could significantly increase restore time (due to the database and backup size), which can prevent you from meeting RTO requirements.
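As an illustration of the separate-filegroup idea, the sketch below places LOB data on its own filegroup with the TEXTIMAGE_ON clause. The database, filegroup, file path, and table names are all hypothetical.

```sql
-- Hypothetical names and path; adjust to your environment.
alter database MyDb add filegroup LOBData;

alter database MyDb add file
(
    name = N'MyDb_LOBData'
    ,filename = N'S:\SlowArray\MyDb_LOBData.ndf'
) to filegroup LOBData;
go

create table dbo.Documents
(
    DocumentId int not null,
    Content varbinary(max) null,
    constraint PK_Documents primary key clustered(DocumentId)
)
on [PRIMARY] -- IN_ROW data
textimage_on [LOBData]; -- LOB data
```

The LOB filegroup has to be chosen when the table is created, which is one reason to plan the layout upfront.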

If you decide to store binary data outside of the database, there are several questions to answer. The first, and perhaps the most important one, is how to handle redundancy and high availability of the external data. For example, if you store binary data as files and reference them in the database, you need to make sure that such a design is compatible with your SQL Server High Availability solution and that the file storage itself is redundant. The redundancy question mainly falls on storage administrators; however, the High Availability aspect could be tricky in this scenario, especially if you have geo-redundancy and/or hybrid solutions in place. Obviously, you can implement something based on SAN replication, moving files across multiple data centers; however, it requires significant investment in hardware and software and incurs implementation cost.

Consistency of the data is another important question. If binary data needs to be transactionally consistent, you have little choice but to use FILESTREAM. It is technically possible to implement consistency in the code without FILESTREAM: for example, when a transaction modifies the data, the application generates another file and replaces the reference to the file in the database. However, it would be extremely hard to support disaster recovery in this scenario. As you can guess, you can easily end up with an “out of sync” situation when restoring data from a backup.

FILESTREAM could help you here; however, it has a few caveats. It is incompatible with some SQL Server features, for example database mirroring, and in some cases it is complicated to implement. Performance-wise, you should use the Streaming API on the client side to get the most from it.

As I already said, there are no right or wrong solutions. You should weigh the pros and cons of all approaches against the other requirements of the system. With Enterprise Edition, I personally prefer to store relatively small (up to a few MB) data in the database, carefully architecting the filegroup layout and backup/restore strategy. With Standard Edition, the choices are much more limited.

8. Compressing LOB data in the database

As you already know, data compression is an Enterprise Edition feature that compresses IN_ROW data only. However, it is entirely possible that a large amount of space in the database is consumed by LOB data. Do not forget that there are plenty of data types that are, in a nutshell, LOBs. Think about XML as an example: it is not uncommon to see XML-centric systems where XML data consumes a large amount of space.

One of the approaches to address such overhead is to compress LOB data manually in the code. It is very easy to create methods that compress and decompress data utilizing one of the classes from the System.IO.Compression namespace, for example the GZipStream class. Moreover, those methods can be implemented as CLR routines and used directly in T-SQL code.

I am not going to provide examples of how to code that; you should be able to find quite a few of them on the Internet. I would like, however, to discuss a couple of implementation-related questions.

First, compression is CPU intensive. It is better to run such code on the client whenever possible. I would still, however, suggest implementing CLR routines in the database and making them available to T-SQL code. This helps to address use cases where a client needs to work with uncompressed data. Consider, for example, external analytical or reporting tools that query the database directly. You can create a view that calls the CLR function and decompresses the data on the fly before providing it to the clients.

You should be careful with version management in such a scenario. Make sure that the code is the same on both the client side and in the database, that the algorithms remain the same, and that the data can be decompressed on either side.

The second important consideration is performance. Obviously, decompression adds overhead, which you would like to avoid on a large scale. For example, it is a bad idea to have a query that scans a large amount of data and performs decompression on every row to evaluate a predicate against one of the compressed attributes. The query shown below would be highly inefficient.

select 
from T
where convert(xml,dbo.DecompressData(CompressedXML)).value('..') = 1

One of the ways to address such an issue is to create persisted calculated columns for the attributes that are used in the where clauses of the queries. The downside of this approach is that SQL Server would not be able to use parallel execution plans in such queries, which is one of the limitations of Query Optimizer when you are using columns calculated with scalar UDFs. However, it is often a small price to pay compared to the constant decompression overhead.

With all that being said, compressing LOB data manually is definitely an option worth considering. You can use the sys.dm_db_index_physical_stats function to evaluate the amount of such data on a per-index basis. Obviously, the Pareto 80/20 principle still applies: do not add extra complexity if the benefits are not worth it.
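A sketch of such an evaluation is shown below. The table name dbo.MyTable is a placeholder, and I am assuming DETAILED mode here so that LOB_DATA and ROW_OVERFLOW_DATA allocation units are reported; that mode reads every page, so run it with care on large tables.

```sql
-- Page counts per index and allocation unit type for a single table.
select
    ips.index_id
    ,ips.alloc_unit_type_desc
    ,ips.page_count
    ,ips.page_count * 8 / 1024 as [Size (MB)]
from sys.dm_db_index_physical_stats
    (db_id(), object_id(N'dbo.MyTable'), null, null, 'DETAILED') ips
where ips.index_level = 0 -- Leaf level and LOB/row-overflow allocation units
order by ips.index_id, ips.alloc_unit_type_desc;
```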

UPDATE (2015-04-07): More details about this method are here.

9. Storing Data in Clustered Columnstore Indexes

With Enterprise Edition of SQL Server 2014, you have another option to consider. You can store some of the data in columnstore format utilizing Clustered Columnstore Indexes. This format can provide significant space savings compared to regular B-Tree row-based storage. Moreover, you can also utilize Archival Columnstore Compression, which applies gzip-like compression to columnstore data and reduces the size even further at the cost of extra CPU load.
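For reference, creating a clustered columnstore index with archival compression could look like the sketch below. The table and index names are hypothetical.

```sql
-- Hypothetical table and index names.
create clustered columnstore index CCI_FactSales
on dbo.FactSales
with (data_compression = columnstore_archive);
```

Using data_compression = columnstore instead gives you the regular columnstore compression without the extra CPU cost.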

Just to give you an example, Figure 9 below shows the difference in storage space between row-based and column-based storage. The table in this example was generated based on the FactSalesBig table from the AdventureWorksDW2012 database. Obviously, different data leads to different results; however, in most cases clustered columnstore indexes would give you an order of magnitude decrease in storage space. It is also worth mentioning that nonclustered indexes on B-Tree tables contribute additional storage space, which is not the case with clustered columnstore indexes, which are the single copy of the data in the table.

09. Storage Size based on different compression methods and storage formats

Obviously, clustered columnstore indexes are not for everyone. They are very beneficial for Data Warehouse workloads that need to scan and aggregate large amounts of data. At the same time, they are a very bad choice for OLTP workloads because they are not designed for point lookups or small range scans.

That said, it is not uncommon to have different use cases for the old and the new data in OLTP systems. Customers can generate OLTP workload to support day-to-day operations with the new data, while the old data is used for analysis and reporting, which is mainly a Data Warehouse workload. In such a scenario, you can consider partitioning your data into multiple tables and using columnstore format for the tables with the old data. You can abstract all those changes via partitioned views, making the differences in schema and storage format transparent to the clients.
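To make the idea more concrete, here is a minimal sketch of such a design. All table, constraint, and column names are hypothetical, and I am assuming that CHECK constraints can coexist with the clustered columnstore index in your version of SQL Server, so verify that before relying on it.

```sql
-- Old data: columnstore storage.
create table dbo.Orders2014
(
    OrderId int not null,
    OrderDate datetime2(0) not null,
    Amount money not null,
    constraint CHK_Orders2014 check
        (OrderDate >= '2014-01-01' and OrderDate < '2015-01-01')
);
create clustered columnstore index CCI_Orders2014 on dbo.Orders2014;

-- Current data: regular B-Tree storage for the OLTP workload.
create table dbo.Orders2015
(
    OrderId int not null,
    OrderDate datetime2(0) not null,
    Amount money not null,
    constraint CHK_Orders2015 check (OrderDate >= '2015-01-01'),
    constraint PK_Orders2015 primary key clustered(OrderDate, OrderId)
);
go

-- Partitioned view that hides the storage differences from the clients.
create view dbo.Orders
as
    select OrderId, OrderDate, Amount from dbo.Orders2015
    union all
    select OrderId, OrderDate, Amount from dbo.Orders2014;
go
```

The CHECK constraints on OrderDate are what allow SQL Server to skip tables that cannot contain the requested date range.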

This is by no means the simplest thing to implement. However, such a design could lead to significant performance improvements and storage space savings for certain kinds of workloads.

10. Reducing amount of free space in the database

Finally, let’s discuss what we can do when data files have a large amount of free space.

As strange as it sounds, one of the best possible options in that case is leaving everything as is. Consider a system that implements a sliding-window pattern, keeping one year's worth of data. Typically, such systems purge the data on some schedule. For example, on the 1st day of every month the system may purge the 13th month of data, which is the data older than one year.

Let’s assume that the system collects 500GB of data per month. In this scenario, if you measured the amount of free space in the data files right after the purge, you would notice that the files have more than 500GB of free space available. Obviously, you can shrink the files and release that space to the OS; however, the database would reclaim it as the data grows.
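If you want to measure the free space in the data files of the current database, a sketch like the following can be used. Sizes are stored in 8KB pages, hence the division by 128 to get megabytes.

```sql
select
    name as [Logical File]
    ,size / 128.0 as [Size (MB)]
    ,fileproperty(name, 'SpaceUsed') / 128.0 as [Used (MB)]
    ,(size - fileproperty(name, 'SpaceUsed')) / 128.0 as [Free (MB)]
from sys.database_files
where type_desc = 'ROWS'; -- Data files only
```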

However, for the purpose of this discussion, let’s assume that we have a legitimate case for decreasing the size of the files. Unfortunately, it is not very easy to do. The DBCC SHRINKFILE command is a terrible way to reduce the size of the database. That command works in a very simple manner: it moves the allocated extents from the end of the file to unallocated space at the beginning of the file. As you can guess, this leads to terrible index fragmentation. Moreover, it generates an excessive amount of log records, which can affect the system in multiple ways.

Obviously, you can perform index maintenance after you are done with the shrink. However, there is a catch: an index rebuild will grow the file again because it needs space to accommodate the new version of the index. Index reorganize could be the better choice in this scenario, even though its results are not on par with an index rebuild. In the end, everything depends on the amount of space you are clearing and the size of your data. For example, if you had terabytes of free space and your biggest index is just a couple hundred gigabytes, you could consider shrinking and running an index rebuild afterwards. The files would grow; however, that growth is much smaller than the space saved by the shrink.
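The shrink followed by a reorganize could be sketched as shown below. The logical file name, target size, index, and table names are hypothetical placeholders.

```sql
-- Shrink the data file to a target size of 512,000MB (hypothetical values).
dbcc shrinkfile(N'MyDb_Data', 512000);
go

-- Reduce the fragmentation introduced by the shrink without growing the file much.
alter index IDX_CI on dbo.MyTable reorganize;
```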

Another, and often better, way to accomplish the task is moving all the data to another filegroup and dropping the original filegroup afterwards. The actual implementation varies based on the version and edition of SQL Server. In Enterprise Edition, you can perform an online index rebuild to a different filegroup, which keeps the system available during the process. In Standard Edition, you can rebuild indexes only offline.
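A minimal sketch of moving a clustered index to another filegroup is shown below. The index definition (a unique clustered index on RecId) and the filegroup name FG2 are assumptions for illustration only; the definition has to match your existing index.

```sql
-- Recreate the existing clustered index on the new filegroup.
-- online = on requires Enterprise Edition; use online = off otherwise.
create unique clustered index IDX_CI
on dbo.MyTable(RecId)
with (drop_existing = on, online = on)
on [FG2];
```

Nonclustered indexes need to be moved separately in the same way.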

There is another catch though: an index rebuild does not move LOB data between filegroups by default. The only way to work around it is by rebuilding the index on a new partition scheme instead of a filegroup. However, that requires Enterprise Edition, which supports partitioning. Unfortunately, in Standard Edition you are out of luck.

I am not providing examples here; however, I would like to reference my book again, where I have discussed it in detail. Alternatively, you can download the book demo scripts and see how data movement works in action.

Lastly, there is always the option of creating another table on another filegroup, copying the data there, dropping the original table, and renaming the new table afterwards. This approach works in either edition; however, in most cases it needs to be done offline. An online implementation is, of course, possible, but it is usually complicated if the table has volatile data.

Wrapping Up

This blog post ended up being much bigger than I expected. Unfortunately, even at this size it was impossible to cover all the details of some of the methods. Anyway, I hope you found this information useful, at least as a high-level overview.

Please do not take the order in which I outlined the approaches as a guideline. Every system is unique, and you need to design a solution targeted to the particular system, taking hardware, software, and business requirements into consideration.

Reducing Offline Index Rebuild and Table Locking Time in SQL Server

Enterprise Edition of SQL Server is all about availability features. It allows you to build robust High Availability and Database Maintenance strategies, which are a must for mission-critical systems. Unfortunately, all those benefits come at a cost. At a very high cost. Enterprise Edition of SQL Server is anything but cheap. Enterprise Edition licenses are usually several times more expensive than Standard Edition licenses. This applies to both on-premises and Cloud installations. For example, in Microsoft Azure, VMs with Enterprise Edition are about 4 times more expensive than Standard Edition VMs.

Not every business can afford Enterprise Edition. And, to be honest, not every business even needs Enterprise Edition. It is entirely possible to build relatively big and complex solutions with Standard Edition of SQL Server, especially with SQL Server 2014, which can utilize 128GB of RAM. I have personally worked with several multi-TB databases using Standard Edition of SQL Server.

One of the common challenges with non-Enterprise editions, however, is index maintenance. Enterprise Edition allows you to rebuild indexes while keeping the table online. Under the hood, it uses row versioning similar to optimistic transaction isolation levels. With the exception of a short-lived shared (S) lock at the beginning and a schema modification (Sch-M) lock at the end of the execution, no full table locks are held. Other users can access the table and read and modify its data.

Unfortunately, this is not the case with Standard Edition, where only offline index rebuild is supported. An offline index rebuild acquires a schema modification (Sch-M) lock for the duration of the execution, which blocks all access to the table, even in the read uncommitted transaction isolation level. Depending on the index size, a rebuild can be a very time-consuming operation. Obviously, you can use index reorganize, which is an online operation, instead of a rebuild. However, it does not work as well as an index rebuild on heavily fragmented tables.

Today, I am going to show you a very simple technique that can help decrease offline index rebuild time and, therefore, the time during which the table is inaccessible to other sessions. It is not a silver bullet that helps everywhere. In fact, in some cases it could even hurt rather than help. We will discuss that later in the post.

I am going to use a table from one of my production systems. That table has a couple dozen columns, including one nvarchar(max) column that hosts some LOB data. I rebuilt the index before the tests to make sure that initial fragmentation does not affect the execution time of the first test run.

As you see in Figure 1 below, the leaf level of the clustered index for IN_ROW data consists of 20,374 pages, which is about 159MB of data. LOB data consists of 76,708 pages, which is about 600MB of data.

01. Index Physical Stats

As the first step, let’s run the ALTER INDEX REBUILD statement and measure the execution time of the operation. I am clearing the buffer pool to emulate a cold cache before the test (DO NOT RUN IT IN PRODUCTION!). This is, obviously, the worst-case scenario; in real life, at least some of the data pages would be cached at the time of the index rebuild operation.

I am also running the index rebuild without parallelism, just to make sure that the results are consistent and to simplify the analysis. As a side note, my test server has a pretty slow I/O subsystem, and the execution time without the MAXDOP=1 hint is just about the same as with the hint.

-- Clearing buffer pool. DO NOT RUN IN PRODUCTION
checkpoint
go

dbcc dropcleanbuffers
go

-- Rebuilding Index
alter index IDX_CI on dbo.MyTable rebuild with (maxdop = 1)

As you can see in Figure 2, the index rebuild time in my system is 1 minute and 24 seconds. As you know, other sessions are unable to access the table during this time due to the schema modification (Sch-M) lock held on the table.

02. Execution Time of Index Rebuild with the Cold Cache

Let’s repeat our test (again, with a cold cache) and analyze session waits during the index rebuild. We can do it by creating an Extended Events session with the code shown below. This code works in SQL Server 2012 and 2014 and requires slight modifications in SQL Server 2008/2008 R2.

create event session [Waits] on server
add event sqlos.wait_info
(
	action (sqlserver.session_id)
	where
	(
		sqlserver.session_id = 141 -- @@SPID of the session running the index rebuild
		and opcode = 1 -- END of wait
	)
)
add target package0.event_file(SET filename=N'C:\ExtEvents\Waits.xel',max_rollover_files=(0))
with
(
	max_memory=4096 KB
	,event_retention_mode=allow_single_event_loss
	,max_dispatch_latency=10 SECONDS
)
go

alter event session [Waits] on server state = start

After the index rebuild command completes, let’s stop the session and look at the waits that occurred during the index rebuild.

alter event session [Waits] on server
state = stop
go

;with EventData(event_data)
as
(
	select cast(event_data AS XML) AS event_data
	from sys.fn_xe_file_target_read_file
		('C:\ExtEvents\Waits*.xel',null,null,null)
)
,Waits
as
(
	select
        event_data.value ('(/event/@timestamp)[1]'
			, 'datetime') as [Time],
        event_data.value ('(/event/data[@name=''wait_type'']/text)[1]'
			,'varchar(255)') AS [Wait],
        event_data.value ('(/event/data[@name=''duration'']/value)[1]'
			,'bigint') AS [Duration],
        event_data.value ('(/event/data[@name=''signal_duration'']/value)[1]'
			,'bigint') AS [Signal]
	from EventData
)
select 
	w.[Wait]
	,count(*) as [Count]
	,sum(w.Duration) as [Total Duration (ms)]
	,sum(w.Duration) - sum(w.Signal) as [Resource Waits (ms)]
	,sum(w.Signal) as [Signal Waits (ms)]
from Waits w
group by w.[Wait]

As you can see in Figure 3 below, the vast majority of the waits are PAGEIOLATCH*, which occur when SQL Server is waiting for a data page to be read from disk.

03. Waits during Index Rebuild with the Cold Cache

That discovery points us in the right direction. If a large portion of the index rebuild was spent waiting for data to be read from disk, we could speed up the process by pre-fetching and caching the data in memory. SQL Server would not need to perform physical read operations, which should reduce the rebuild time.

Let’s try that and run the select below (again, making sure that we are using a cold cache for the sake of the test):

checkpoint
go

dbcc dropcleanbuffers
go


declare
	@T1 bigint, @T2 bigint

select 
	@T1 = avg(RecId) -- Making sure that IN_ROW pages are read
	,@T2 = avg(len(LOB_Column)) -- Making sure that LOB pages are read
from dbo.MyTable with (nolock, index=IDX_CI) 
option (maxdop 1)

I am forcing a scan of the clustered index, which is the one we are going to rebuild. I am using the read uncommitted transaction isolation level with the NOLOCK hint. As you can guess, in this situation I do not worry about any data consistency issues that can arise.

However, there is a very important part: we need to make sure that SQL Server reads LOB data pages in addition to IN_ROW data pages. We can force it by accessing our LOB column in the select list. You can use the same technique to deal with ROW_OVERFLOW pages if the index has such allocation units.

Finally, I am also referencing one of the columns from the IN_ROW data to make sure that those pages are read into memory as well. In some cases, with large tables and the read uncommitted transaction isolation level, SQL Server can choose to use an allocation unit scan instead of an index scan. I seriously doubt that SQL Server can perform such an operation and scan only LOB data pages in our select, especially with the index hint in use, but it is better to be safe than sorry.

The execution time of the select in my system is 1 minute and 3 seconds as it is shown in Figure 4 below.

04. Execution Time of Select Statement

Now, let’s run the index rebuild again. With the index pages cached in memory, the index rebuild time is just 10 seconds, as shown in Figure 5. Even though the total time of both operations (select and index rebuild) is comparable with the original index rebuild over a cold cache, the time during which the table was inaccessible due to the schema modification (Sch-M) lock is about 8 times shorter than before.

05. Execution Time of Index Rebuild when Data is Cached

If you looked at the waits during the last index rebuild, you would see that PAGEIOLATCH* waits practically disappeared, as shown in Figure 6.

06. Waits During Index Rebuild with the Warm Cache

While this technique can significantly reduce blocking during an offline index rebuild, it is not for everyone. First and foremost, you should be careful with large indexes, especially when the system has a limited amount of memory and/or low page life expectancy. Caching large indexes requires a significant amount of memory and, therefore, can lead to situations where you flush the buffer pool, possibly flushing the very index pages you just read. This is especially important if you expect data growth in the future. Small indexes today can become large ones tomorrow.
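As a quick check before trying the technique, you could look at the current page life expectancy. The sketch below reads it from the performance counter DMV; what counts as a low value depends on your system, so treat any threshold as your own assumption.

```sql
select
    object_name
    ,cntr_value as [Page Life Expectancy (sec)]
from sys.dm_os_performance_counters
where
    counter_name = N'Page life expectancy' and
    object_name like N'%Buffer Manager%';
```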

Second, it is beneficial to analyze how many pages of the index are typically cached in memory. Obviously, if most of the pages are already cached, you would not gain much from running the select and pre-fetching the remaining pages into memory. On the other hand, logical reads are fast anyway, so a select over a warm cache would not take much time.

You can use the script below to analyze the number of index pages that have been cached. You can compare those numbers with the results of sys.dm_db_index_physical_stats to estimate what share of the pages in each allocation unit of the index is cached.

select i.name, au.type_desc, count(*) as [Cached Pages]
from 
	sys.dm_os_buffer_descriptors bd with (nolock) 
        join sys.allocation_units au with (nolock) on 
		    bd.allocation_unit_id = au.allocation_unit_id
	join sys.partitions p with (nolock) on
		(au.type in (1,3) and au.container_id = p.hobt_id) or
		(au.type = 2 and au.container_id = p.partition_id)
	join sys.indexes i with (nolock) on
		p.object_id = i.object_id and 
		p.index_id = i.index_id
where
	bd.database_id = db_id() and
	p.object_id = object_id (N'dbo.MyTable') and 
	p.index_id = 1 -- ID of the index
group by
	i.name, au.type_desc

Finally, this technique is useful when the biggest bottleneck in the system is I/O performance, as it was in my case. This often happens in Cloud-based systems (SQL Server in VMs), where I/O latency is usually high and Standard Edition of SQL Server is commonly used due to its cost. In that scenario, pre-fetching data into the cache could significantly reduce the locking time of an offline index rebuild, provided that you have enough memory for the index data.

Plan Cache: Plan Reuse

Last time, we discussed how parameter sniffing can affect the quality of generated execution plans. Today, I would like to talk about another aspect of plan caching, which is plan reuse. Plans cached by SQL Server must be valid for any combination of parameters during future calls that reuse the plan. In some cases, this can lead to situations where a cached plan is suboptimal for a specific set of parameter values.

One of the common code patterns that leads to such situations is a stored procedure that searches for data based on a set of optional parameters. Let’s look at a typical implementation of such a stored procedure, shown in the code below. The code uses the dbo.Employees table from the “parameter sniffing” post and also creates two nonclustered indexes on that table.

create proc dbo.SearchEmployee
(
	@Number varchar(32) = null
	,@Name varchar(100) = null
)
as
begin
	select Id, Number, Name, Salary, Country
	from dbo.Employees
	where 
		((@Number is null) or (Number=@Number)) and
		((@Name is null) or (Name=@Name))
end
go

create unique nonclustered index IDX_Employees_Number
on dbo.Employees(Number);

create nonclustered index IDX_Employees_Name
on dbo.Employees(Name);

There are several different ways SQL Server can execute the query from the stored procedure, depending on which parameters were provided. In a large number of cases, the most efficient approach would be using Nonclustered Index Seek and Key Lookup operators. Let’s run the stored procedure several times with different parameter combinations and check the execution plans:

exec dbo.SearchEmployee @Number = '10000';
exec dbo.SearchEmployee @Name = 'Canada Employee: 1';
exec dbo.SearchEmployee @Number = '10000', @Name = 'Canada Employee: 1';
exec dbo.SearchEmployee @Number = NULL, @Name = NULL;

01.Execution Plans When Plans Are Cached

SQL Server compiles the stored procedure at the time of the first call, when only the @Number parameter was provided. Even though the most efficient execution plan for this case is an IDX_Employees_Number Index Seek operation, SQL Server cannot cache such a plan because it would not be valid when the @Number parameter is NULL. Therefore, SQL Server generated and cached an execution plan that utilizes an Index Scan operation, which is highly inefficient when the @Number parameter is not provided. In that case, SQL Server performs a Key Lookup operation for every row in the nonclustered index, optionally evaluating the predicate on the @Name parameter afterwards.

Similar to parameter sniffing issues, you can address this problem with a statement-level recompile, as shown below. SQL Server recompiles the query on every call, and therefore it can choose the most beneficial execution plan for every parameter set.

It is also worth mentioning that the plans are not cached in cases where statement-level recompile is used.

alter proc dbo.SearchEmployee
(
	@Number varchar(32) = null
	,@Name varchar(100) = null
)
as
begin
	select Id, Number, Name, Salary, Country
	from dbo.Employees
	where 
		((@Number is null) or (Number=@Number)) and
		((@Name is null) or (Name=@Name))
	option (recompile)
end

02. Execution Plans With Statement-Level Recompile

As we already discussed, query recompilation adds CPU overhead. That overhead can be acceptable for large and complex queries, when the compilation time is just a fraction of the query execution time and the system is not CPU-bound. In that case, recompilation can even help to produce better execution plans, which are generated based on the current parameter values, especially if table variables are involved. However, this recompilation overhead is usually not acceptable for OLTP queries that are called very often.

One of the options you can use to address the issue is writing multiple queries using IF statements that cover all possible combinations of parameters. SQL Server would cache the plan for each statement in the procedure. The listing below shows such an approach; however, it quickly becomes unmanageable with a large number of parameters. Each optional parameter is either provided or not, so the number of combinations to cover is two to the power of the number of parameters: 4 branches for 2 parameters, but already 32 for 5.

alter proc dbo.SearchEmployee
(
	@Number varchar(32) = null
	,@Name varchar(100) = null
)
as
begin
	if @Number is null and @Name is null
		select Id, Number, Name, Salary, Country
		from dbo.Employees
	else if @Number is not null and @Name is null
		select Id, Number, Name, Salary, Country
		from dbo.Employees
		where Number=@Number
	else if @Number is null and @Name is not null
		select Id, Number, Name, Salary, Country
		from dbo.Employees
		where Name=@Name
	else 
		select Id, Number, Name, Salary, Country
		from dbo.Employees
		where Number=@Number and Name=@Name
end
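
To get a feel for how quickly the IF-based approach grows, here is a small illustrative calculation. Every optional parameter is either supplied or NULL, so N parameters give 2^N branches. The parameter counts in the VALUES list are arbitrary examples.

```sql
-- Number of IF branches needed to cover every combination of optional parameters
select
	P.ParamCount,
	power(2, P.ParamCount) as [IF Branches]
from (values (2), (3), (5), (10)) as P(ParamCount);
-- Returns 4, 8, 32 and 1024 branches respectively
```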

In the case of a large number of parameters, dynamic SQL becomes the only option. SQL Server will cache the execution plans for each dynamically generated SQL statement. Remember that using dynamic SQL breaks ownership chaining, and it always executes in the security context of CALLER. You should also always use parameters with dynamic SQL to avoid SQL Injection.

alter proc dbo.SearchEmployee
(
	@Number varchar(32) = null
	,@Name varchar(100) = null
)
as
begin
	declare
		@SQL nvarchar(max) = N'	
select Id, Number, Name, Salary, Country
from dbo.Employees
where 1=1'
	
	if @Number is not null
		select @SQL = @SQL + N' and Number=@Number'
	if @Name is not null
		select @SQL = @SQL + N' and Name=@Name'
	exec sp_executesql @SQL, N'@Number varchar(32), @Name varchar(100)'
		,@Number=@Number, @Name=@Name
end
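
One way to see that each generated statement gets its own cached plan is to call the procedure with different parameter combinations and then look at the prepared plans in the cache. This is just a sketch: the LIKE filter is a crude way to find the statements and may also match unrelated queries against dbo.Employees on a busy server.

```sql
exec dbo.SearchEmployee @Number = '10000';
exec dbo.SearchEmployee @Name = 'Canada Employee: 1';

-- Statements executed through sp_executesql are cached as 'Prepared' plans
select cp.usecounts, cp.objtype, t.text
from sys.dm_exec_cached_plans cp
	cross apply sys.dm_exec_sql_text(cp.plan_handle) t
where
	cp.objtype = 'Prepared' and
	t.text like N'%from dbo.Employees%';
```

You should see a separate row for every distinct statement text, with usecounts increasing as the same parameter combination is used again.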

While most of us are aware of the danger of using optional parameters and OR predicates in queries, there is another, less-known issue associated with filtered indexes. SQL Server will not generate and cache a plan that uses a filtered index when that index cannot be used with some combination of parameter values.

The listing below shows an example. SQL Server will not generate a plan that uses the IDX_RawData_UnprocessedData index, even when the @Processed parameter is set to zero, because this plan would not be valid for a non-zero @Processed parameter value.

create unique nonclustered index IDX_RawData_UnprocessedData
on dbo.RawData(RecID)
include(Processed)
where Processed = 0;

-- Compiled Plan for the query would not use filtered index
select top 100 RecId, /* Other columns */
from dbo.RawData
where RecID > @RecID and Processed = @Processed
order by RecID;

In that particular case, rewriting the query with an IF statement would be, perhaps, the best option.

if @Processed = 0
	select top 100 RecId, /* Other Columns */
	from dbo.RawData
	where RecId > @RecId and Processed = 0
	order by RecId;
else 
	select top 100 RecId, /* Other Columns */
	from dbo.RawData
	where RecId > @RecId and Processed = 1
	order by RecId;
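
If the query does not run very often, a statement-level recompile is another alternative worth testing. With OPTION (RECOMPILE), the optimizer can treat the current parameter values as constants, which should make the filtered index usable when @Processed is zero. Here is a minimal sketch; the local variables stand in for the parameters of the original query, and it assumes the dbo.RawData table with a filtered index on Processed = 0.

```sql
-- Sketch only: local variables play the role of the query parameters
declare
	@RecId int = 0,
	@Processed bit = 0;

select top 100 RecId /* , other columns */
from dbo.RawData
where RecId > @RecId and Processed = @Processed
order by RecId
option (recompile);	-- the plan is built for the current values and is not reused
```

The usual trade-off applies: you pay the compilation cost on every execution.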

Unfortunately, an IF statement does not always help. In some cases, SQL Server can auto-parametrize the queries and replace some of the constants with auto-generated parameters. That behavior allows SQL Server to reduce the size of the plan cache; however, it could lead to all the plan reuse issues we have already discussed.

By default, SQL Server uses SIMPLE parametrization and parametrizes only simple queries. However, if the database or the query uses FORCED parametrization, that behavior can become an issue. Let’s look at a particular example: we will create a database and a table with a filtered index, and populate it with some data, as shown below.

use master
go

create database ParameterizationTest
go

use ParameterizationTest
go

create table dbo.RawData
(
	RecId int not null identity(1,1), 
	Processed bit not null, 
	Placeholder char(100),
	constraint PK_RawData
	primary key clustered(RecId)
);

/* Inserting:
	Processed = 1: 65,536 rows
	Processed = 0: 16 rows */
;WITH N1(C) AS (SELECT 0 UNION ALL SELECT 0) -- 2 rows
,N2(C) AS (SELECT 0 FROM N1 AS T1 CROSS JOIN N1 AS T2) -- 4 rows
,N3(C) AS (SELECT 0 FROM N2 AS T1 CROSS JOIN N2 AS T2) -- 16 rows
,N4(C) AS (SELECT 0 FROM N3 AS T1 CROSS JOIN N3 AS T2) -- 256 rows
,N5(C) AS (SELECT 0 FROM N4 AS T1 CROSS JOIN N4 AS T2 ) -- 65,536 rows
,IDs(ID) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM N5)
insert into dbo.RawData(Processed)
	select 1
	from Ids;

insert into dbo.RawData(Processed)
	select 0
	from dbo.RawData
	where RecId <= 16;

create unique nonclustered index IDX_RawData_Processed_Filtered 
on dbo.RawData(RecId)
include(Processed)
where Processed = 0;

As the next step, let’s run the queries that count the number of unprocessed rows in both SIMPLE and FORCED parametrization modes.

select count(*)
from dbo.RawData
where Processed = 0			
go

alter database ParameterizationTest set parameterization forced
go

select count(*)
from dbo.RawData
where Processed = 0

If you examine the execution plans shown in Figure 3, you will notice that SQL Server utilized the filtered index with SIMPLE parametrization. SQL Server can cache this plan because of the constant in the Processed=0 predicate. With FORCED parametrization, in contrast, SQL Server parametrizes the query and replaces the constant with a parameter, producing the Processed=@0 predicate. Therefore, it cannot cache the plan with the filtered index because it would not be valid for the case when a query selects processed (Processed=1) rows. SQL Server generated the execution plan with a Clustered Index Scan, which is far less efficient in this case.

03. Execution Plans and Parametrization

The workaround in this case is to use SIMPLE parametrization for the database, or to force it at the query level with a plan guide (more on it later). In some cases, you would also need to rewrite the query and use one of the constructs that prevent parametrization in SIMPLE parametrization mode, such as IN, TOP, DISTINCT, JOIN, UNION, subqueries and quite a few others.
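
As a rough illustration of the query-level workaround, a TEMPLATE plan guide can override the database setting for one class of queries. The sketch below assumes the ParameterizationTest database from the example above; the plan guide name is hypothetical.

```sql
-- Sketch: override FORCED parametrization for one query template
declare
	@Stmt nvarchar(max),
	@Params nvarchar(max);

-- Obtain the parametrized form of the query and its parameter list
exec sp_get_query_template
	N'select count(*) from dbo.RawData where Processed = 0',
	@Stmt output,
	@Params output;

exec sp_create_plan_guide
	@name = N'RawData_Unprocessed_SimpleParam',	-- hypothetical name
	@stmt = @Stmt,
	@type = N'TEMPLATE',
	@module_or_batch = null,
	@params = @Params,
	@hints = N'option (parameterization simple)';

-- The database-level alternative:
-- alter database ParameterizationTest set parameterization simple;
-- To remove the plan guide later:
-- exec sp_control_plan_guide N'DROP', N'RawData_Unprocessed_SimpleParam';
```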

Plan caching and plan reuse are great features that help to reduce CPU load on the server. However, they introduce several side effects that you need to keep in mind when you write queries and stored procedures.

Source code is available for download

Plan Cache: Parameter Sniffing

There are several reasons why it is beneficial to use parameters in queries. Parameters mitigate security risks by protecting from SQL Injection. Moreover, they help with performance. Query optimization is an expensive and resource-intensive process. SQL Server reduces that cost by keeping execution plans in a special part of memory called the Plan Cache and reusing them on subsequent calls.

Even though this approach benefits the system by reducing the number of query compilations, it could also introduce some issues. Cached plans can become inefficient when the data distribution and/or parameter values change. SQL Server tracks changes in the data by counting the number of changes to the index columns, outdating statistics and recompiling queries when the number of changes reaches about 20% of the number of rows in the table (trace flag 2371 can reduce that threshold). However, tracking changes in parameter values is more complicated.

When SQL Server compiles a parametrized query or stored procedure, it analyzes the parameters and generates an execution plan that is optimal for those specific values. This process is called Parameter Sniffing. Furthermore, generated plans are cached and reused regardless of the parameter values at the time of reuse. That behavior can lead to situations when a query is compiled and the plan is generated based on an atypical parameter set. The cached plan, in this case, would be suboptimal for the calls that reuse it.

All of us have experienced situations when a query or stored procedure suddenly became very slow even though it ran fine before. Parameter sniffing could be the reason why it happened: SQL Server decided to recompile the query due to a statistics update, and the query that triggered the recompilation had atypical parameter values. Even though the generated execution plan was efficient for the parameters at the time of compilation, that cached plan is not necessarily efficient for the other queries that reuse it.

Historically, parameter sniffing has had a negative connotation in the SQL Server community. It is used to explain why queries have suboptimal execution plans, like I just did in the previous paragraph. This is not accurate, however. Parameter sniffing is just the name for the SQL Server behavior of analyzing parameters during query compilation.

Let’s look at a few examples of parameter sniffing and discuss how to address the issues it can introduce. In this post, I am using a stored procedure; however, everything would work the same with parametrized queries.

As the first step, let’s create the table dbo.Employees. We will populate it in such a way that most rows have a Country column value of USA, with a few employees having a Country value of Canada. The listing below shows the code.

create table dbo.Employees
(
	ID int not null,
	Number varchar(32) not null,
	Name varchar(100) not null,
	Salary money not null,
	Country varchar(64) not null,

	constraint PK_Employees
	primary key clustered(ID)
);

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2 ) -- 65,536 rows
,Nums(Num) as (select row_number() over (order by (select null)) from N5)
insert into dbo.Employees(ID, Number, Name, Salary, Country)
	select 
		Num, 
		convert(varchar(5),Num), 
		'USA Employee: ' + convert(varchar(5),Num), 
		40000,
		'USA'
	from Nums;

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,Nums(Num) as (select row_number() over (order by (select null)) from N3)
insert into dbo.Employees(ID, Number, Name, Salary, Country)
	select 
		65536 + Num, 
		convert(varchar(5),65536 + Num), 
		'Canada Employee: ' + convert(varchar(5),Num), 
		40000,
		'Canada'
	from Nums;

create nonclustered index IDX_Employees_Country
on dbo.Employees(Country);

As a next step, let’s create a stored procedure that calculates the average salary for employees in a specific country.

create proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @Country
end

As you can guess, when we call this stored procedure with the parameter @Country='USA', it needs to touch almost every row in the table, and the optimal execution plan is a Clustered Index Scan. However, when we call the stored procedure with the parameter @Country='Canada', the most efficient execution plan is a Nonclustered Index Seek with Key Lookup operators.

Let’s call the stored procedure twice: the first time with @Country='USA' and the second time with @Country='Canada', as shown below.

exec dbo.GetAverageSalary @Country='USA';
exec dbo.GetAverageSalary @Country='Canada';

As you can see in Figure 1, SQL Server compiles the stored procedure and caches the plan with the first call and reuses it later. Even though such a plan is less efficient with the @Country='Canada' parameter value, it may be acceptable when those calls are rare, which is expected with such a data distribution.

01. Execution plans when the stored procedure is compiled with @Country=’USA’

Now let’s take a look at what happens if we swap those calls when the plan is not cached. We will use the DBCC FREEPROCCACHE command, which clears the plan cache (DO NOT RUN IT ON A PRODUCTION SERVER). As I already mentioned, another instance when this might happen is a statistics update that forces the query to recompile.

dbcc freeproccache
go

exec dbo.GetAverageSalary @Country='Canada';
exec dbo.GetAverageSalary @Country='USA';

As you can see in Figure 2, SQL Server now caches the plan based on the @Country='Canada' parameter value. Even though this plan is more efficient when the stored procedure is called with @Country='Canada', it is highly inefficient for @Country='USA' calls.

02. Execution plans when the stored procedure is compiled with @Country=’Canada’
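
If you need to reproduce this behavior on a shared server, clearing the entire plan cache is rarely a good idea. A gentler option is to remove only the plan of the procedure in question. The following is a sketch under the assumption that the procedure has a single cached plan in the current database; the variable name is mine.

```sql
-- Sketch: evict only the cached plan of dbo.GetAverageSalary
declare @PlanHandle varbinary(64);

select @PlanHandle = ps.plan_handle
from sys.dm_exec_procedure_stats ps
where
	ps.database_id = db_id() and
	ps.object_id = object_id(N'dbo.GetAverageSalary');

if @PlanHandle is not null
	dbcc freeproccache(@PlanHandle);

-- Alternatively, mark the procedure for recompilation on its next call:
-- exec sp_recompile N'dbo.GetAverageSalary';
```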

There are a few ways to address the issue. You can force recompilation of either the entire stored procedure, using EXECUTE WITH RECOMPILE, or of an individual statement, using the OPTION (RECOMPILE) clause. Obviously, a statement-level recompile is better because it performs the recompilation in a smaller scope. SQL Server sniffs the parameter values at the time of the recompilation, generating the optimal execution plan for each parameter value. It is also worth mentioning that a statement-level recompile allows SQL Server to obtain the number of rows stored in table variables, which can help to generate more efficient execution plans in some cases.

alter proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @Country
	option (recompile)
end
go

exec dbo.GetAverageSalary @Country='Canada';
exec dbo.GetAverageSalary @Country='USA';

As you can see in Figure 3, SQL Server recompiles the statement on every call, and it generates the most efficient execution plan for every parameter value. It is also worth noting that SQL Server does not cache plans for the statements when statement-level recompile is used.

03. Execution plans with OPTION (RECOMPILE) hint

A statement-level recompile may be a good option when the queries do not execute very often, or in the case of complex queries, when the compilation time is just a fraction of the total execution time. However, it is hardly the best approach for OLTP queries, which are constantly running in the system, due to the extra CPU load that recompilation introduces.
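
For comparison, this is how the procedure-level recompilation mentioned earlier looks on the caller's side. It is a short sketch that does not require any changes to the stored procedure itself.

```sql
-- Recompile the entire procedure for this call only;
-- the plan compiled for this call is not cached for reuse
exec dbo.GetAverageSalary @Country = 'Canada' with recompile;
```

This can be handy for a one-off call with atypical parameter values, because the plan cached for the typical values stays untouched.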

Another option is using an OPTIMIZE FOR hint, which forces SQL Server to optimize a query for the specific parameter values provided in the hint. Listing below illustrates such an approach.

alter proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @Country
	option (optimize for(@Country='USA'))
end
go

exec dbo.GetAverageSalary @Country='Canada';
exec dbo.GetAverageSalary @Country='USA';

As you can see in Figure 4, SQL Server ignores the parameter value during the compilation and optimizes the query for the @Country='USA' value.

04. Execution plans with OPTIMIZE FOR hint

Unfortunately, the OPTIMIZE FOR hint introduces supportability issues, and it can lead to suboptimal execution plans in cases where the data distribution has changed. Let’s consider the situation, albeit unrealistic, when a company and all of its employees moved from the United States to Germany.

update dbo.Employees set Country='Germany' where Country='USA';

exec dbo.GetAverageSalary @Country='Germany';

The update outdates the statistics, which forces SQL Server to recompile the statement in the stored procedure. At this point, there are no rows in the table with Country='USA', and the recompilation produces a suboptimal execution plan, as shown in Figure 5. As a side note, the query uses more reads than before due to the index fragmentation introduced by the update.

05. Inefficient execution plan after data distribution change
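
To check how many changes have accumulated since the statistics were last updated, you can look at the modification counter. This is a minimal sketch, assuming a SQL Server version that provides sys.dm_db_stats_properties (it is not available in the earliest builds of SQL Server 2008 R2 and 2012).

```sql
-- Sketch: statistics on dbo.Employees and the changes since their last update
select
	s.name as [Stats Name],
	sp.last_updated,
	sp.rows,
	sp.modification_counter
from sys.stats s
	cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
where s.object_id = object_id(N'dbo.Employees');
```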

SQL Server 2008 introduced another optimization hint, OPTIMIZE FOR UNKNOWN, which helps to address such situations. With this hint, SQL Server ignores the actual parameter value and performs the optimization based on the average data distribution from the statistics (the density of the column values). Listing below shows the code involved in doing this.

alter proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @Country
	option (optimize for(@Country UNKNOWN))
end
go

exec dbo.GetAverageSalary @Country='Canada';

Figure 6 illustrates the execution plan. The Country column has very few distinct values, so the density-based estimate covers a large part of the table, and SQL Server generates an execution plan with a Clustered Index Scan, which is also optimal for the most common Country='Germany' value.

06. Execution plan with OPTIMIZE FOR UNKNOWN hint
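
If you are curious about the numbers behind that decision, you can look at the density information in the index statistics. The sketch below assumes the IDX_Employees_Country index created earlier in this post.

```sql
-- Density information for the statistics on the Country index.
-- The value in the "All density" column is 1 / (number of distinct values).
dbcc show_statistics ('dbo.Employees', IDX_Employees_Country)
with density_vector;
```

The fewer distinct values the column has, the higher the density and the larger the row count estimate for an unknown value.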

You can achieve the same results as with the OPTIMIZE FOR UNKNOWN hint by using local variables instead of parameters. This method also works with SQL Server 2005, where the OPTIMIZE FOR UNKNOWN hint is not supported. Listing below illustrates that approach. It introduces the same execution plan with a Clustered Index Scan, as shown in Figure 6 above.

alter proc dbo.GetAverageSalary @Country varchar(64)
as
begin
	declare
		@CountryTmp varchar(64)
	set @CountryTmp = @Country

	select Avg(Salary) as [Avg Salary]
	from dbo.Employees
	where Country = @CountryTmp
end

Finally, when you are dealing with third-party systems and cannot add optimization hints to the code, you could use plan guides to achieve the same goal. We will discuss plan guides in one of the future blog posts.

Source code is available for download.

Partial Database Backup and Piecemeal Restore in Microsoft SQL Server

Unfortunately, disasters happen. Regardless of how good your High Availability strategy is, there is always a chance that one day you will need to move the database to another server and quickly bring the system online. And chances are you will have to do it under stress, with your phone ringing every few minutes and your manager asking about the ETA. It does not help that such a process can be very time-consuming in the case of large databases.

The Enterprise Edition of SQL Server supports the concept of piecemeal restore and allows you to restore the database on a filegroup-by-filegroup basis, keeping the database online during the process. Queries that access data from online filegroups would work just fine. This is a great technique that can dramatically reduce system downtime.

In a large number of cases, the size of the operational data that is required for the system to be operational is relatively small. Historical data, on the other hand, is often kept for regulatory/compliance reasons and is rarely accessed by the applications. As you can guess, this allows you to quickly bring online the part of the database that supports operational activity, and work on the remaining historical data afterwards while customers can connect and use the system.

Obviously, you should design data placement in a way that supports piecemeal restore. In a nutshell, it means separating the operational and historical data across different filegroups. As an example, let’s consider an Order Entry/Shopping Cart system that stores data for several years. One of the data layout designs could be the following:

  • Empty Primary FG. The primary filegroup must be online in order for the database to be online. It is a good idea to keep the primary filegroup empty and not place any objects there.
  • Entities FG. This filegroup could store catalog tables, such as Customers, Articles and others.
  • One or more filegroups for the operational data. For example, if the operational period is the current year, this filegroup can store Orders, OrderLineItems and related entities that store current-year data.
  • One or more filegroups for the historical data. Those filegroups store data that is not required to support operational activity in the system.

A piecemeal restore strategy will require you to bring the Primary, Entities and operational data filegroups online first. The system will be available to the customers at this point. After that, you can work on restoring the historical data filegroups, which, in most cases, will be significantly larger than the operational data and, therefore, will take longer to restore.

It is also worth noting that with SQL Server 2014 In-Memory OLTP, you should also have the Hekaton filegroup online before the database becomes available to the users. Usually, in-memory tables keep operational data anyway, so it should not be a problem in most cases.

Let’s look at an example and create a database with the structure outlined above. For simplicity’s sake, every filegroup has only one data file. However, in real life you should consider creating multiple files to reduce allocation contention in the filegroups with volatile data.

create database [MyBigOrderDb]
on primary
(name = N'MyBigOrderDb', filename = N'c:\db\MyBigOrderDb.mdf'), 
filegroup [Entities] 
(name = N'MyBigOrderDB_Entities', filename = N'c:\db\MyBigOrderDB_Entities.ndf'), 
filegroup [FG2013] 
(name = N'MyBigOrderDB_FG2013', filename = N'c:\db\MyBigOrderDB_FG2013.ndf'), 
filegroup [FG2014] 
(name = N'MyBigOrderDB_FG2014', filename = N'c:\db\MyBigOrderDB_FG2014.ndf')
log on
(name = N'MyBigOrderDb_log', filename = N'c:\db\MyBigOrderDb_log.ldf')
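
As an aside, this is roughly how a second data file could be added to a volatile filegroup. It is only a sketch: the logical and physical file names are hypothetical, and the restore scripts later in this post assume one file per filegroup, so do not run it if you want to follow the demo as is.

```sql
-- Sketch: add a second data file to the FG2014 filegroup (hypothetical names)
alter database [MyBigOrderDb]
add file
(
	name = N'MyBigOrderDB_FG2014_2',
	filename = N'c:\db\MyBigOrderDB_FG2014_2.ndf'
)
to filegroup [FG2014];
```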

As the next step, let’s create a few tables, including the partitioned table Orders.

create table dbo.Customers
(
    CustomerId int not null,
    CustomerName nvarchar(64) not null,
)
on [Entities];

create table dbo.Articles
(
    ArticleId int not null,
    ArticleName nvarchar(64) not null,
)
on [Entities];

create partition function pfOrders(smalldatetime)
as range right
for values('2014-01-01');

create partition scheme psOrders
as partition pfOrders
to (FG2013,FG2014)
go

create table dbo.Orders
(
    OrderId int not null,
    OrderDate smalldatetime not null,
    OrderNum varchar(32) not null,
    constraint PK_Orders
    primary key clustered(OrderDate, OrderId)
    on psOrders(OrderDate)
)
go

insert into dbo.Customers(CustomerId, CustomerName) values(1,'Customer 1');
insert into dbo.Orders(OrderDate, OrderId, OrderNum)
values
    ('2013-01-01',1,'Order 1'),
    ('2013-02-02',2,'Order 2'),
    ('2014-01-01',3,'Order 3'),
    ('2014-02-02',4,'Order 4')

Next, let’s create the backup chain and perform FULL, DIFFERENTIAL and LOG backups.

-- Full backup
backup database [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDb_Full.bak' 
with noformat, init, name = N'MyBigOrderDb-Full Database Backup', 
    compression, stats = 2
go

-- Differential backup
backup database [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDb_Diff.bak' 
with differential, noformat, init, 
    name = N'MyBigOrderDb-Differential Database Backup', 
    compression, stats = 2
go

-- Transaction log
backup log [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDb_Log.trn' 
with noformat, init, name = N'MyBigOrderDb-Tran Log', 
    compression, stats = 2
go

At this point, let’s assume that disaster strikes and we need to move the database to another server. In this example, I assume that we still have access to the transaction log of the original database, and we will perform a tail-log backup to avoid any data loss.

backup log [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDb_TailLog.trn' 
with no_truncate, noformat, init, name = N'MyBigOrderDb-Tail Log', 
 compression, norecovery, stats = 2

At this point, we need to copy all files from the backup chain to another server and start the restore process. We will perform a piecemeal restore of the Primary, Entities and FG2014 filegroups to support the operational activity of the system without bringing the historical FG2013 data online. The first operation is restoring the FULL database backup, specifying just the filegroups we need to restore.

-- Restoring on another server (Same folder structure for demo sake)
-- Full Backup
restore database [MyBigOrderDb] 
FILEGROUP = 'primary', FILEGROUP = 'Entities', FILEGROUP = 'FG2014'
from disk = N'C:\DB\MyBigOrderDb_Full.bak' with file = 1,
move N'MyBigOrderDB' to N'c:\db\MyBigOrderDb.mdf', 
move N'MyBigOrderDB_Entities' to N'c:\db\MyBigOrderDb_Entities.ndf', 
move N'MyBigOrderDB_FG2014' to N'c:\db\MyBigOrderDb_2014.ndf', 
move N'MyBigOrderDb_log' to N'c:\db\MyBigOrderDb.ldf', 
NORECOVERY, partial, stats = 2;

It is worth mentioning that I am using the WITH NORECOVERY clause in all RESTORE commands, including the tail-log backup restore. In the end, I recover the database with a separate RESTORE statement. This is just a good practice and a safety measure. By default, Management Studio uses the WITH RECOVERY option with restore, which can lead to a situation where you accidentally recover the database at an intermediate restore stage. You would not be able to restore further backups after that and would be forced to start from scratch.
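
If you are not sure about the logical file names to use in the MOVE clauses of a restore like the one above, you can read them from the backup file first. Here is a short sketch that uses the same backup path as the demo.

```sql
-- List the files (logical names, filegroups, sizes) stored in the backup
restore filelistonly
from disk = N'C:\DB\MyBigOrderDb_Full.bak' with file = 1;

-- List the backup sets in the file (backup type, LSNs, dates)
restore headeronly
from disk = N'C:\DB\MyBigOrderDb_Full.bak';
```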

As the next steps, we need to restore the DIFFERENTIAL, LOG and tail-log backups, as shown below. You do not need to specify filegroups as part of the restore statement anymore: SQL Server performs the restore only in the scope of the filegroups we are working with, which are in the RESTORING state.

-- Diff Backup
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_Diff.bak' with file = 1,
NORECOVERY, stats = 2;

-- Tran Log
restore log [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_Log.trn' with file = 1,
NORECOVERY, stats = 2;

-- Tail-log
restore log [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_TailLog.trn' with file = 1,
NORECOVERY, stats = 2;

-- Recovery
restore database [MyBigOrderDb] with RECOVERY;

At this point, our database is partially online. We can query the operational data, as shown in Figure 1 below.

select * from MyBigOrderDb.dbo.Customers
select * from MyBigOrderDb.dbo.Orders where OrderDate >= '2014-01-01'

01. Querying Data from Operational Filegroups

However, if we try to query historical data, we will get an error, as shown in Figure 2.

select * from MyBigOrderDb.dbo.Orders where OrderDate < '2014-01-01'

02. Querying Data from Historical Filegroup

You can check the status of the database filegroups with the following query. Figure 3 shows that three filegroups are online while the FG2013 filegroup is still in the RECOVERY_PENDING state.

select file_id, name, state_desc, physical_name
from MyBigOrderDb.sys.database_files

03. Filegroup Status Before Historical Filegroup Restore
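
The query above works at the file level. If you prefer to see the filegroup names next to the file states, a join with sys.filegroups along these lines should do. This is a sketch; the log file is left out because it does not belong to any filegroup.

```sql
-- Sketch: file state per filegroup
select
	fg.name as [Filegroup],
	fg.is_read_only,
	f.name as [File],
	f.state_desc
from MyBigOrderDb.sys.filegroups fg
	join MyBigOrderDb.sys.database_files f on
		f.data_space_id = fg.data_space_id
order by fg.data_space_id;
```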

As you can see, customers can use the system while we are working on restoring the FG2013 filegroup. We can perform this with the following script:

-- Full Backup (restoring individual filegroup)
restore database [MyBigOrderDb] 
FILEGROUP = 'FG2013'
from disk = N'C:\DB\MyBigOrderDb_Full.bak' with file = 1,
move N'MyBigOrderDB_FG2013' to N'c:\db\MyBigOrderDb_2013.ndf',  
stats = 2;

-- Diff Backup
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_Diff.bak' with file = 1,
stats = 2;

-- Tran Log
restore log [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_Log.trn' with file = 1,
stats = 2;

-- Tail-log
restore log [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDb_TailLog.trn' with file = 1,
stats = 2;

Now the entire database is online, as shown in Figure 4.

04. Filegroups Status After Historical Filegroup Restore

Piecemeal restore can significantly decrease database restore time. However, there is still a time-consuming part in this process. Regardless of how many filegroups you are restoring, you have to copy/move the backup file that contains the initial FULL database backup to the new server. The time of this operation depends on network throughput, and a slow network and/or large backup files can lead to very long delays.

Obviously, you can take several approaches. You can always “be prepared” and copy files to a standby server after each backup. With such a strategy, you would already have the backup files in place when they need to be restored if/when disaster occurs.

However, you can use another approach if the historical data is read-only. This approach is called “partial database backup”. Even though Microsoft states that it is designed for the SIMPLE recovery model, it works just fine with the FULL recovery model, as long as you implement it correctly.

Let’s look at how it works. As the first step, let’s mark our FG2013 filegroup as read-only:

alter database MyBigOrderDb modify filegroup FG2013 read_only

At this point, you should start a new backup chain, which, in a nutshell, consists of two different sets of backup files. One set includes the backup files for the read-write filegroups (FULL and DIFFERENTIAL backups taken with the READ_WRITE_FILEGROUPS option, plus regular LOG backups). In addition, you need to back up the read-only filegroup. You can do it with the script shown below:

-- Backing Up Read-Only Filegroup
backup database [MyBigOrderDb] 
FILEGROUP = N'FG2013'
to disk = N'c:\db\MyBigOrderFG2013.bak' 
with noformat, init, 
	name = N'MyBigOrderDb-FG2013 FG backup', 
	compression, stats = 2
go

-- Full backup of read_write filegroups
backup database [MyBigOrderDb] READ_WRITE_FILEGROUPS
to disk = N'c:\db\MyBigOrderDbRW_Full.bak' 
with noformat, init, name = N'MyBigOrderDb-Full Database Backup (R/W FG)', 
	compression, stats = 2
go

-- Differential backup of read_write filegroups
backup database [MyBigOrderDb] READ_WRITE_FILEGROUPS
to disk = N'c:\db\MyBigOrderDbRW_Diff.bak' 
with differential, noformat, init, 
	name = N'MyBigOrderDb-Differential Database Backup (R/W FG)', 
	compression, stats = 2
go

-- Transaction log
backup log [MyBigOrderDb] 
to disk = N'c:\db\MyBigOrderDbRW_Log.trn' 
with noformat, init, name = N'MyBigOrderDb-Tran Log', 
	compression, stats = 2
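
To double-check what kind of backups ended up in the chain, you can look at the backup history in msdb. This is a sketch that assumes the history for the database has not been purged on the server where the backups were taken.

```sql
-- Sketch: backup history for the demo database
select
	bs.backup_finish_date,
	bs.type,	-- D: full, I: differential, L: log, F: file/filegroup,
			-- P: partial, Q: differential partial
	bs.name,
	bs.backup_size
from msdb.dbo.backupset bs
where bs.database_name = N'MyBigOrderDb'
order by bs.backup_finish_date;
```

A backup taken with the READ_WRITE_FILEGROUPS option should be reported with the partial (P) or differential partial (Q) type.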

The beauty of this approach is that now you have the data backups separated. You have relatively small backup chain file(s) for the operational data, as well as a large static backup file for the read-only historical data. This dramatically reduces the time required to move the operational data backup files over the network. It also helps with the day-to-day backup strategy and reduces the time of backup operations, the server and network load, and the storage space required to store the files. You can create new backup chains for the operational data without taking a new backup of the historical data, as long as those filegroups stay read-only.

If disaster occurs and you need to move the database to another server, you can start with the operational data backup chain, as shown below. For simplicity’s sake, I am omitting the tail-log backup; however, in real life you should always obtain it to avoid data loss.

-- Restoring on another server (Same folder structure for demo sake)
-- Full Backup (R/W filegroups only)
restore database [MyBigOrderDb] 
filegroup = 'primary', filegroup = 'Entities', filegroup = 'FG2014'
from disk = N'C:\DB\MyBigOrderDbRW_Full.bak' with file = 1,
move N'MyBigOrderDB' to N'c:\db\MyBigOrderDb.mdf',  
move N'MyBigOrderDB_Entities' to N'c:\db\MyBigOrderDb_Entities.ndf',  
move N'MyBigOrderDB_FG2014' to N'c:\db\MyBigOrderDb_2014.ndf',  
move N'MyBigOrderDb_log' to N'c:\db\MyBigOrderDb.ldf',  
norecovery, partial, stats = 2;

-- Diff Backup
restore database [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDbRW_Diff.bak' with file = 1,
norecovery, stats = 2;

-- Tran Log
restore log [MyBigOrderDb] 
from disk = N'C:\DB\MyBigOrderDbRW_Log.trn' with file = 1,
norecovery, stats = 2;

-- Recovery
restore database [MyBigOrderDb] with recovery;

If you queried the status of the database filegroups after the restore, you would see a very similar picture: all operational filegroups are online and the historical filegroup is in the RECOVERY_PENDING state. Figure 5 illustrates that.

05. Filegroup Status During Partial Restore
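A query along the following lines is one way to check that status yourself. It is a minimal sketch rather than the exact script behind the figure; it relies only on the sys.filegroups and sys.database_files catalog views and assumes that you run it in the context of the restored database.

```sql
use [MyBigOrderDb];
go

-- One row per data file, with the read-only flag of its filegroup
-- and the current state of the file (ONLINE, RECOVERY_PENDING, etc.)
select
    fg.name as [Filegroup]
    ,fg.is_read_only
    ,f.name as [Logical File]
    ,f.state_desc
from
    sys.filegroups fg join sys.database_files f on
        fg.data_space_id = f.data_space_id
order by
    fg.data_space_id;
```

The transaction log file does not belong to any filegroup, so the join leaves it out of the output.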

Again, at this point the system is available to users, and you can continue restoring historical data with the following script while the system is operational.

-- Restoring Read/Only Filegroup
restore database [MyBigOrderDb] 
filegroup = 'FG2013'
from disk = N'C:\DB\MyBigOrderFG2013.bak' with file = 1,
move N'MyBigOrderDB_FG2013' to N'c:\db\MyBigOrderDb_2013.ndf',  
recovery, stats = 2;

Be careful when you change the read-only status of filegroups when partial backups are used. One example of such a scenario is when the operational period changes and you need to move former operational data that has become historical to another filegroup. One way to accomplish this is to make the historical filegroup read-write, copy the data there and make the filegroup read-only again. You can still restore the database using the old read-only filegroup backup as long as you have a backup chain with LOG backups that cover the data movement operations. However, you would not be able to recover the historical data if you started a new backup chain for the updateable filegroups after the fact and did not take a new backup of the read-only filegroup.

As a general recommendation, it is safer to start a new backup chain together with a backup of the historical filegroup after you have made the historical filegroup read-only again. And, most importantly, regardless of the solution you are using, test your backup and restore strategies. This will help you avoid unpleasant surprises when things go south.
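The recommendation above can be sketched as the following sequence. Treat it as an illustration under assumptions rather than a ready-to-use script: FG2013 stands for the historical filegroup that was temporarily switched to read-write, and the new backup file names are hypothetical.

```sql
-- Assumption: data movement into FG2013 is finished.
-- Changing the read-only status requires exclusive access to the database.
alter database [MyBigOrderDb] modify filegroup [FG2013] read_only;
go

-- New backup of the historical filegroup (hypothetical file name)
backup database [MyBigOrderDb]
filegroup = N'FG2013'
to disk = N'c:\db\MyBigOrderFG2013_New.bak'
with noformat, init,
	name = N'MyBigOrderDb-FG2013 (R/O FG)',
	compression, stats = 2;
go

-- New full backup of R/W filegroups, which starts the new backup chain
-- (hypothetical file name)
backup database [MyBigOrderDb] read_write_filegroups
to disk = N'c:\db\MyBigOrderDbRW_Full_New.bak'
with noformat, init,
	name = N'MyBigOrderDb-Full Database Backup (R/W FG)',
	compression, stats = 2;
go
```

Taking both backups at the same point keeps the read-only filegroup backup and the operational backup chain consistent with each other.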

Clustered Columnstore Indexes: Performance Considerations

Last time we looked at the internal structure of the delta store and delete bitmap in clustered columnstore indexes. Today, I would like to discuss a few practical aspects affecting the performance of ETL processes and queries against tables with clustered columnstore indexes.

There are two different ways to import data into a table with a clustered columnstore index. The first is bulk insert, which can be done with the bcp utility, the BULK INSERT command and other applications that utilize the bulk insert API. The second, called trickle inserts, consists of regular INSERT operations that do not use the bulk insert API.

Bulk insert operations provide the number of rows in the batch as part of the API call. SQL Server inserts data directly into newly created row groups if that size reaches the threshold of 102,400 rows; smaller batches go to the delta store. Depending on the size of the batch, one or more row groups can be created and some rows may be stored in the delta store.

Figure 1 below illustrates how data from the different batches are distributed between row groups and delta stores based on batch size.

01. Batch size and data distribution during bulk insert
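With the BULK INSERT command, the batch size is controlled by the BATCHSIZE option (the bcp utility has the -b switch for the same purpose). The statement below is a minimal sketch that targets the dbo.FactSalesBig table used in the tests that follow; the file path and the CSV format options are hypothetical.

```sql
-- Assumptions: the file path and the field/row terminators are hypothetical
bulk insert dbo.FactSalesBig
from 'c:\data\FactSalesBig.csv'
with
(
    fieldterminator = ','
    ,rowterminator = '\n'
    ,batchsize = 1000000 -- well above the row group threshold
);
```

Each batch is committed as a separate transaction, so a failure in the middle of the import does not roll back the batches that have already been loaded.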

Let’s run some tests now and see how performance is affected by the batch size and, therefore, by the number of row groups in the table. For those tests, I created a set of tables with a structure similar to the one shown below.

create table dbo.FactSalesBig 
( 
        ProductKey int not null, 
        OrderDateKey int not null, 
        DueDateKey int not null, 
        ShipDateKey int not null, 
        CustomerKey int not null, 
        PromotionKey int not null, 
        CurrencyKey int not null, 
        SalesTerritoryKey int not null, 
        SalesOrderNumber nvarchar(20) not null, 
        SalesOrderLineNumber tinyint not null, 
        RevisionNumber tinyint not null, 
        OrderQuantity smallint not null, 
        UnitPrice money not null, 
        ExtendedAmount money not null, 
        UnitPriceDiscountPct float not null, 
        DiscountAmount float not null, 
        ProductStandardCost money not null, 
        TotalProductCost money not null, 
        SalesAmount money not null, 
        TaxAmt money not null, 
        Freight money not null, 
        CarrierTrackingNumber nvarchar(25) null, 
        CustomerPONumber nvarchar(25) null, 
        OrderDate datetime null, 
        DueDate datetime null, 
        ShipDate datetime null 
)

As the first step, I created a CSV file with about 62M rows generated from the dbo.FactResellerSales table in the AdventureWorksDW2012 database and measured the performance of the bulk import with the bcp utility using 1,000,000-row batches and 102,500-row batches respectively on a 4-CPU virtual machine with 8GB of RAM allocated.

You can see the row group statistics after the imports in Figure 2 below. The first import generated 62 1,000,000-row row groups while the second import ended up with 604 102,500-row row groups.

02. Row groups after insert
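If you want to collect similar statistics in your own environment, a summary query against the sys.column_store_row_groups view is one option. This is a sketch and not necessarily the query behind the figure; it only assumes the dbo.FactSalesBig table name from the listing above.

```sql
-- Row group summary per state (OPEN, CLOSED, COMPRESSED)
select
    state_description
    ,count(*) as [Row Groups]
    ,sum(total_rows) as [Total Rows]
    ,min(total_rows) as [Min Rows]
    ,max(total_rows) as [Max Rows]
    ,sum(size_in_bytes) / 1024 / 1024 as [Size MB]
from sys.column_store_row_groups
where object_id = object_id(N'dbo.FactSalesBig')
group by state_description;
```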

The performance of the import operation was affected by the batch size. The bcp utility was able to process about 103,500 rows per second with 1,000,000-row batches. With 102,500-row batches, the throughput was about 94,300 rows per second, which is about 9% slower.

It is also worth noting that with smaller batches, SQL Server imports data into the delta stores and converts them to fully populated row groups later. While this eventually produces efficient row groups, it significantly degrades the performance of the insert process. For example, with 99,999-row batches, the throughput in my environment was only 37,500 rows per second.

As the next test, I checked how partially populated row groups affected performance of the queries using the query shown below. That query performs a MAX() aggregation on 20 columns from a table. The result of the query is meaningless; however, it forces SQL Server to read data from 20 different column segments in each row group in the table.

select  
        max(ProductKey),max(OrderDateKey),max(DueDateKey) 
        ,max(ShipDateKey),max(CustomerKey),max(PromotionKey) 
        ,max(CurrencyKey),max(SalesTerritoryKey),max(SalesOrderLineNumber) 
        ,max(RevisionNumber),max(OrderQuantity),max(UnitPrice) 
        ,max(ExtendedAmount),max(UnitPriceDiscountPct),max(DiscountAmount) 
        ,max(ProductStandardCost),max(TotalProductCost),max(SalesAmount) 
        ,max(TaxAmt),max(Freight)  
from dbo.FactSalesBig

Figure 3 illustrates execution statistics of the query against tables with fully and partially populated row groups (shown in Figure 2). As you can see, the query against a table with partially populated row groups took a considerably longer time to execute.

03. Execution Statistics in case of fully and partially populated row groups

In the next step, let’s check how a large delta store affects the performance of queries. For that test, I inserted one million rows into the table using small batches and ran the test query. After that, I rebuilt the columnstore index and compared the execution time of the test query before and after the index rebuild.

The index rebuild process moved all data from the delta store to row groups. You can see the status of row groups and the delta store before (on the left side) and after (on the right side) the index rebuild in Figure 4.

04. Row groups and delta store after insertion of 1,000,000 rows

Figure 5 illustrates the execution times of the test query in both scenarios, and it shows the overhead introduced by the large delta store scan during query execution.

05. Execution time and delta store size

Finally, let’s see how delete bitmaps affect query performance. For that test, I deleted almost 30,000,000 rows from a table (the one where I just rebuilt the index). You can see row groups’ information in Figure 6.

06. Row groups after deletion of 30,000,000 rows

The test query needs to validate that rows have not been deleted during query execution. Similar to the previous test, this adds considerable overhead. Figure 7 shows the execution time of the test query, comparing it to the execution time of the query before the data deletion.

07. Execution time and delete bitmap

The bottom line: partially populated row groups, large delta stores and large delete bitmaps all negatively affect the performance of systems that use clustered columnstore indexes. You can address all of these performance issues by rebuilding the columnstore index, which you can trigger with the ALTER INDEX REBUILD command. The index rebuild forces SQL Server to remove deleted rows physically from the index and to merge the delta stores’ and row groups’ data. All column segments are recreated with row groups fully populated.

Similar to index creation, the index rebuild process is very resource intensive. Moreover, it prevents any data modifications in the table by holding shared (S) table lock. However, other sessions can still read data from a table while the rebuild is running.

One of the methods you can use to mitigate the overhead of index rebuild is table/index partitioning. You can rebuild indexes on a partition-basis and only perform it for partitions that have volatile data. Old facts table data in most Data Warehouse solutions is relatively static, and ETL processes usually load new data only. Partitioning by date in this scenario localizes modifications within the scope of one or very few partitions. This can help you dramatically reduce the overhead of an index rebuild.
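As an illustration, both forms of the rebuild are shown below. The index name and the partition number are hypothetical, and the second statement applies only when the table is partitioned.

```sql
-- Assumptions: IDX_FactSalesBig_CCI is a hypothetical index name and
-- partition 12 is a hypothetical partition that holds volatile data

-- Rebuilding the index on all partitions
alter index IDX_FactSalesBig_CCI on dbo.FactSalesBig rebuild;

-- Rebuilding the index on a single partition only
alter index IDX_FactSalesBig_CCI on dbo.FactSalesBig
rebuild partition = 12;
```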

A columnstore index maintenance strategy should depend on the volatility of the data and the ETL processes implemented in the system. You should rebuild indexes when a table has a considerable number of deleted rows and/or a large number of partially populated row groups.
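One possible way to collect the numbers for that decision is to aggregate sys.column_store_row_groups per partition. The query below is a sketch; what counts as "considerable" is left to you, and the table name is simply the one used in the tests above.

```sql
-- Per-partition view of deleted rows, partially populated row groups
-- and rows that still live in delta stores
select
    partition_number
    ,count(*) as [Row Groups]
    ,sum(total_rows) as [Total Rows]
    ,sum(isnull(deleted_rows, 0)) as [Deleted Rows]
    ,sum(case
            when state_description = 'COMPRESSED' and total_rows < 1048576
            then 1 else 0
         end) as [Partially Populated Row Groups]
    ,sum(case
            when state_description in ('OPEN', 'CLOSED')
            then total_rows else 0
         end) as [Delta Store Rows]
from sys.column_store_row_groups
where object_id = object_id(N'dbo.FactSalesBig')
group by partition_number
order by partition_number;
```

Partitions where deleted rows or delta store rows make up a large share of the total are the natural candidates for a partition-level rebuild.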

To summarize:

  1. You should design ETL processes in a way that bulk imports data in batches as close to 1,048,576 rows as possible. This allows every batch to become a separate and fully populated row group. Do not exceed this size, and avoid spilling batches across multiple row groups.
  2. Even though clustered columnstore indexes are updateable, you should minimize such updates. Large delta stores and/or delete bitmaps negatively affect query performance. You should monitor their sizes and design an index maintenance strategy that keeps them as small as possible.
  3. A columnstore index rebuild is very resource-intensive. Table partitioning helps you mitigate the performance impact by allowing an index rebuild in the scope of one or very few partitions. You should design a partitioning strategy that limits data modifications and/or imports to a small subset of partitions, and rebuild those partitions afterwards.

Clustered Columnstore Indexes: Exploring Delta Store and Delete Bitmap

I am an OLTP guy. I cannot grasp the concept of columnstore indexes: indexes that do not care about the order of columns in the definition. That was the reason why Data Warehouses and I lived happily ever after. Just apart from each other.

It was not hard, after all. Even though you can find some use cases for nonclustered columnstore indexes in an OLTP environment, the inability to modify data after the index has been created makes those use cases quite rare. Technically, you can use data partitioning and partitioned views and benefit from them in some scenarios; however, such an implementation requires a large amount of work.

The situation changed with the SQL Server 2014 release. Now you can define clustered columnstore indexes, which are updateable. That does not make them suited for OLTP environments; you should remember that they are optimized for large SCAN workloads. Moreover, as opposed to nonclustered columnstore indexes, they are the only indexes you can define on the table. It is impossible to define B-Tree indexes on the same table and support both environments.

Anyway, I decided to spend some time and explore how clustered columnstore indexes handle data modifications. It was very easy to find some references online; however, none of them talks about the internal implementation of the indexes. Today, we will try to close this gap.

As the first step, let’s talk about the high-level structure of clustered columnstore indexes. They use the same storage format as nonclustered columnstore indexes, storing columnstore data in row groups. Each row group stores data for up to 1,048,576 rows in column-based format. Data from each column is stored separately in a highly compressed fashion.

Clustered columnstore indexes have two additional elements to support data modifications. The first is the delete bitmap, which indicates what rows were deleted from a table. The second is the delta store, which includes newly inserted rows. Both the delta store and the delete bitmap use the B-Tree format to store data.

SQL Server works with the delete bitmap and delta stores transparently to users, which makes the terminology confusing. You can often see delta stores referenced as just another row group in the documentation and technical articles. Moreover, the delete bitmap is often considered a part of the delta store and/or row groups. I will use the following terminology today to avoid confusion. The term row group references data stored in column-based storage format. I will explicitly reference delta stores and the delete bitmap as two separate sets of internal objects whenever needed.

You can see an example of the structure of a clustered columnstore index in a table that has two partitions in Figure 1 below. Each partition can have a single delete bitmap and multiple delta stores. It is worth mentioning that the delete bitmap and delta stores are created on demand; for example, the delete bitmap would not be created unless some of the rows in the row groups were deleted.

01. Clustered Columnstore Index Structure

Every time you delete a row that is stored in a row group (not in a delta store), SQL Server adds information about the deleted row to the delete bitmap. Nothing happens to the original row. It is still stored in the row group; however, SQL Server checks the delete bitmap during query execution and excludes deleted rows from processing. When you insert data into a columnstore index, it goes into a delta store. Updating a row that is stored in a row group does not change the row data either. Such an update triggers the deletion of the row, which is, in fact, an insertion into the delete bitmap, and the insertion of a new version of the row into a delta store. However, any modifications of the rows in a delta store are done the same way as in regular B-Tree indexes, by updating and deleting the actual rows there. You will see one such example later.

Each delta store can be in either the open or the closed state. Open delta stores accept new rows and allow modifications and deletions of the data. SQL Server closes a delta store when it reaches 1,048,576 rows, which is the maximum number of rows that can be stored in a row group. A separate SQL Server process, called the tuple mover, runs every five minutes and converts closed delta stores to row groups that store data in column-based storage format.

You can examine the state of row groups and delta stores with the sys.column_store_row_groups view. Figure 2 illustrates the output of this view, which returns combined information about all columnstore index objects. Rows in the OPEN or CLOSED state correspond to delta stores. Rows in the COMPRESSED state correspond to row groups with data in column-based storage format. Finally, the deleted_rows column provides statistics about deleted rows stored in the delete bitmap.

02. Row Groups and Delta Stores

As you see, the second row in the view output shows a closed delta store that has yet to be picked up by the tuple mover process. The situation changes after the tuple mover process converts the closed delta store to a row group. Figure 3 illustrates the output from the view after that has happened. It is worth mentioning that the row_group_id of the converted row group changed. The tuple mover created a new row group and dropped the closed delta store afterwards.

03. Row Groups and Delta Store After Tuple Mover Process
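If you do not want to wait for the tuple mover while experimenting, one option I am aware of in SQL Server 2014 is ALTER INDEX REORGANIZE, which compresses closed delta stores on demand. The sketch below borrows the dbo.CCI table and IDX_CS_CLUST index names from the listing later in this post; it is an illustration and not part of the original test.

```sql
-- Converting closed delta stores to row groups without waiting
-- for the next run of the tuple mover
alter index IDX_CS_CLUST on dbo.CCI reorganize;

-- Checking the state of row groups and delta stores afterwards
select state_description, row_group_id, total_rows, deleted_rows
from sys.column_store_row_groups
where object_id = object_id(N'dbo.CCI')
order by row_group_id;
```

Open delta stores are not affected; they stay in the row-based format until they are closed.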

Let’s look at the structure of delta store and delete bitmap rows. The listing below creates a table, populates it with data and creates a clustered columnstore index afterwards. I am using the MAXDOP=1 option to reduce the number of partially populated row groups.

create table dbo.CCI
(
    Col1 int  not null,
    Col2 varchar(4000) not null
);

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,N6(C) as -- 1,048,592 rows
(
    select 0 from N5 as T1 cross join N3 as T2
    union all
    select 0 from N3
)
,IDs(ID) as (select ROW_NUMBER() over (order by (select NULL)) from N6)
insert into dbo.CCI(Col1,Col2)
    select ID, 'aaa'
    from IDs
go

create clustered columnstore index IDX_CS_CLUST on dbo.CCI
with (maxdop=1)
go

select g.state_description, g.row_group_id, s.column_id
    ,s.row_count, s.min_data_id, s.max_data_id, g.deleted_rows
from
    sys.column_store_segments s join sys.partitions p on
        s.partition_id = p.partition_id
    join sys.column_store_row_groups g on
        p.object_id = g.object_id and
        p.index_id = g.index_id and
        p.partition_number = g.partition_number and
        s.segment_id = g.row_group_id
where
    p.object_id = object_id(N'dbo.CCI')
order by
    g.row_group_id, s.column_id;

As you see in Figure 4, the columnstore index has two row groups and has neither a delta store nor a delete bitmap. You can see the Col1 values that are stored in both row groups in the min_data_id and max_data_id columns for the rows that have column_id=1.

04. Row Groups after Clustered Columnstore Index Creation

As the next step, let’s perform some data modifications in the table. The first statement inserts two new rows into the table. The second statement deletes three rows, including one row we just inserted. Finally, we will update the other newly inserted row.

insert into dbo.CCI(Col1,Col2) 
values
    (2000000,REPLICATE('c',4000)),
    (2000001,REPLICATE('d',4000));

delete from dbo.CCI 
where Col1 in 
    (
    100  		-- Row group 0
    ,16150  		-- Row group 1
    ,2000000	  -- Newly inserted row (Delta Store)
    );

update dbo.CCI 
set Col2 = REPLICATE('z',4000) 
where Col1 = 2000001; -- Newly inserted row (Delta Store)

Now it is time to find the data pages that are used by the delta store and delete bitmap. We will use the undocumented sys.dm_db_database_page_allocations system function as shown below.

select object_id, index_id, partition_id
    ,allocation_unit_type_desc as [Type]
    ,is_allocated,is_iam_page,page_type,page_type_desc
    ,allocated_page_file_id as [FileId]
    ,allocated_page_page_id as [PageId]
from sys.dm_db_database_page_allocations
    (db_id(), object_id('dbo.CCI'),NULL, NULL, 'DETAILED')

You can see the output of the query in Figure 5. SQL Server stores columnstore segments in LOB_DATA allocation units. The delta store and delete bitmap use IN_ROW_DATA allocation units.

05. Table Allocation Units

Let’s look at the data pages using another undocumented command, DBCC PAGE, with the code shown below. Obviously, in your environment, the database, file and page IDs would be different.

-- Redirecting output to console
dbcc traceon(3604)

-- Analyzing content of a page
dbcc page
(
	9	-- Database Id
	,1	-- FileId
	,306	-- PageId
	,3	-- Output style
)

Figure 6 shows the partial content of a data page, which is a delta store page. As you can see, SQL Server stores data in regular row-based storage. There is one internal column, CSILOCATOR, in addition to the two table columns. CSILOCATOR is used as an internal unique identifier of the row in the delta store. Finally, it is worth mentioning that the row with Col1=2000000, which we inserted and deleted after the clustered columnstore index was created, is not present in the delta store. SQL Server deletes (and updates) rows in the B-Tree delta store the same way as in regular B-Tree tables.

06. Delta Store Data Page

You can use the same approach to examine the content of a delete bitmap data page. In my case, the page id is 308. Figure 7 shows the partial output of the DBCC PAGE command. As you see, the delete bitmap includes two columns, which uniquely identify a row. The first column is the row group id and the second column is the offset of the row in the segment. Do not be confused by the fact that the column names match the table columns. DBCC PAGE uses table metadata to prepare the output.

07. Delete Bitmap Data Page

As you see, both the delta store and delete bitmap pages were using row compression in our example, which means that the delta store and delete bitmap are either row- or page-compressed. As you know, in the case of page compression, SQL Server performs page compression only when a page is full and retains it only if it provides significant space savings. Otherwise, data is kept in row-compressed format even when the index is defined with page compression.

Let’s run a test that inserts a large batch of rows that can benefit from page compression using the code shown in the listing below.

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select ROW_NUMBER() over (order by (select NULL)) from N5)
insert into dbo.CCI(Col1,Col2) 
	select ID, REPLICATE('a',255)
	from IDs

Figure 8 illustrates the content of a data page from the delta store after the insert. The presence of the compression info record indicates that the delta store is using page compression.

08. Delta Store Data Page (with Page Compression)

Let’s examine what happens with the delete bitmap and delete all rows from the compressed row groups with the code shown in the listing below.

delete from dbo.CCI

As you can see in Figure 9 below, the page still uses row compression even though it is now fully populated. Obviously, we cannot guarantee that the delete bitmap is not defined with page compression (after all, it is not documented); however, it would be logical to use row compression in this case, when we have two small integer values. Row compression would perform perfectly here.

09. Delete Bitmap Data Page (Full with Row Compression)

I hope this information sheds some light on the internal structure of clustered columnstore indexes.

Next: Clustered Columnstore Indexes: Performance Considerations

Locking in Microsoft SQL Server (Part 19) – Concurrency model in in-memory OLTP (Hekaton)

It is impossible to resist the urge to explore the in-memory OLTP Engine (code name Hekaton) released as part of SQL Server 2014. This technology can give you a huge performance boost, assuming, of course, that you can live within its surface area limitations. Regardless, the internal implementation of in-memory OLTP is fascinating. Almost everything is done differently from what you are used to with the SQL Server Storage Engine. To put things into perspective, I seriously considered naming this post “Concurrency – upside down”. 🙂

Today, I want to focus on one particular aspect of in-memory OLTP: its concurrency model. While the implementation of SNAPSHOT isolation is more or less obvious, I was intrigued by how higher isolation levels, such as REPEATABLE READ and SERIALIZABLE, would work in a latch- and lock-free environment.

I assume that you have a basic understanding of the key principles used in in-memory OLTP. Otherwise, consider reading the MSDN documentation and Kalen Delaney’s whitepaper first.

Even though I am not going to focus much on in-memory OLTP indexes and access methods, I would like to reiterate how Hekaton works with data. It uses a completely different mechanism compared to regular on-disk tables. The data rows live in memory and are linked to each other in singly linked lists of pointers, one pointer chain per index.

The concurrency model in in-memory OLTP is version-based; it supports multiple versions of rows with different lifetimes. SQL Server maintains two different unique values:

  • Global Transaction Timestamp is an auto-incremented value that uniquely identifies every transaction in the system. SQL Server increments this value at the transaction pre-commit stage.
  • TransactionId is another identifier (timestamp) that also uniquely identifies a transaction. SQL Server obtains and increments its value at the moment when the transaction starts.

Every row has BeginTs and EndTs timestamps, which correspond to the Global Transaction Timestamp of the transactions that created and deleted this version of the row. A special timestamp value, called Infinity, is used to indicate rows that have not been deleted (EndTs=Infinity). SQL Server never updates rows in place. When a row needs to be modified, it deletes the original row version (updates its EndTs) and creates a new row version with a new BeginTs timestamp and EndTs of Infinity.

A transaction can only see rows that existed at the time the transaction started, which is similar to the SNAPSHOT isolation level for on-disk tables. However, for in-memory data that behavior does not change with the isolation level. REPEATABLE READ and SERIALIZABLE isolation levels follow exactly the same rules.

Figure 1 illustrates an example of data access and visibility. It shows a hash index on Name (on the left side) and multiple data rows linked into that index pointer chain. Again, if you do not know what a hash index is, consider reading about it in the documentation. For simplicity's sake, let’s assume that the hash function is based on the first letter of the Name.

01. Hash index and data rows

Let’s assume that we need to run a query that selects all rows with Name='Ann' in a transaction that started when the Global Transaction Timestamp was 65. SQL Server calculates the hash value for Ann, which is 'A', and finds the corresponding bucket in the hash index. It follows the pointer from that bucket, which references a row with Name='Adam'. This row has BeginTs of 10 and EndTs of Infinity; therefore, it is visible to the transaction. However, the Name value does not match the predicate and the row is ignored.

As the next step, SQL Server follows the pointer from the index pointer array of the Adam row, which references the first Ann row. This row has BeginTs of 50 and EndTs of Infinity; therefore, it is visible to the transaction and needs to be selected.

As the final step, SQL Server follows the next pointer in the index. Even though the last row also has Name='Ann', it has EndTs of 50, which indicates that the row was deleted before the transaction started, so it is invisible to the transaction.
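The visibility check from this walkthrough can be modeled with a regular table variable. This is only an illustration of the rule and not how Hekaton stores or scans rows: NULL stands for Infinity, the BeginTs value of the deleted Ann version is an assumption (the text above does not state it), and the exact handling of boundary values is also assumed.

```sql
-- Global Transaction Timestamp at the time when the transaction started
declare @Ts bigint = 65;

-- Row versions from the example; NULL in EndTs stands for Infinity.
-- BeginTs = 30 for the deleted Ann version is an assumed value.
declare @Rows table
(
    Name varchar(16) not null,
    BeginTs bigint not null,
    EndTs bigint null
);

insert into @Rows(Name, BeginTs, EndTs)
values ('Adam', 10, null), ('Ann', 50, null), ('Ann', 30, 50);

-- A row version is visible when the transaction timestamp
-- falls between BeginTs and EndTs
select Name, BeginTs, EndTs
from @Rows
where
    Name = 'Ann' and
    BeginTs <= @Ts and
    (EndTs is null or EndTs > @Ts);
```

The query returns only the Ann version with BeginTs of 50; the version with EndTs of 50 is filtered out because it was deleted before timestamp 65.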

I hope that gives you a very basic example of the access methods and data visibility used in in-memory OLTP. However, before we start diving deeper into the internal implementation of the concurrency model in Hekaton, I would like to recall the logical data consistency rules provided by different transaction isolation levels.

Every transaction isolation level resolves write/write conflicts. Multiple transactions cannot update the same row simultaneously. Different outcomes are possible. In some cases, SQL Server uses blocking and prevents transactions from accessing uncommitted changes until the transaction that made those changes is committed. In other cases, SQL Server rolls back one of the transactions due to an update conflict. In-memory OLTP uses the latter method to resolve write/write conflicts and aborts the transaction. We will discuss this situation later; for now, let’s focus on read data consistency.

There are three major data inconsistency issues possible in multi-user environments:

Dirty Reads: A transaction reads uncommitted (dirty) data from other uncommitted transactions.

Non-Repeatable Reads: Subsequent attempts to read the same data from within the same transaction return different results. This data inconsistency issue arises when other transactions modified, or even deleted, data between the reads done by the affected transaction.

Phantom Reads: This phenomenon occurs when subsequent reads within the same transaction return new rows (the ones the transaction did not read before). This happens when another transaction inserted new data in between the reads done by the affected transaction.

Figure 2 below shows data inconsistency issues that are possible for different transaction isolation levels.

02. Transaction isolation levels and data consistency

With the exception of the SNAPSHOT isolation level, SQL Server uses locking to address data inconsistency issues when dealing with on-disk tables. It blocks sessions from reading or modifying data to prevent data inconsistency. Such behavior also means that in the case of a write/write conflict, the last modification wins. For example, when two transactions are trying to modify the same row, SQL Server blocks one of the transactions until the other transaction is committed, allowing the blocked transaction to modify the data afterwards. No errors or exceptions would be raised; however, the changes from the first transaction would be lost.

The SNAPSHOT isolation level uses a row-versioning model where all data modifications done by other transactions are invisible to the transaction. It is implemented differently for on-disk and memory-optimized tables; logically, however, it behaves the same. Write/write conflicts in that model are resolved by aborting and rolling back the transactions.

It is also worth mentioning that even though SERIALIZABLE and SNAPSHOT isolation levels provide the same level of protection against data inconsistency issues, there is a subtle difference in their behavior. With the SNAPSHOT isolation level, a transaction sees data as of the beginning of the transaction. With the SERIALIZABLE isolation level, a transaction sees data as of the time when the data was accessed for the first time. Consider a situation when a session is reading data from a table in the middle of a transaction. If another session changed data in that table after the transaction started but before the data was read, a transaction in the SERIALIZABLE isolation level would see the changes while a SNAPSHOT transaction would not.

As I already mentioned, in-memory OLTP supports three transaction isolation levels: SNAPSHOT, REPEATABLE READ and SERIALIZABLE. However, in-memory OLTP uses a completely different approach to enforce data consistency rules compared to on-disk tables. Rather than blocking or being blocked by other sessions, in-memory OLTP validates data consistency at transaction commit time, throwing an exception and rolling back the transaction if the rules were violated. This behavior is very confusing compared to on-disk tables: the transaction continues working without being blocked and returns data to clients, yet it fails to commit in the end.

Let’s look at a few examples that demonstrate such behavior. As the first step, let’s create a memory-optimized table and insert a few rows there.

create table dbo.HKData
(
     ID int not null,
     Col int not null,
     constraint PK_HKData
     primary key nonclustered hash(ID)
     with (bucket_count=64)
)
with (memory_optimized=on, durability=schema_only);

insert into dbo.HKData(ID, Col)
values(1,1),(2,2),(3,3),(4,4),(5,5);

Figure 3 shows two examples of how REPEATABLE READ transactions handle non-repeatable and phantom reads. The Session 1 transaction starts at the time when the first SELECT operator executes. Remember that SQL Server starts a transaction at the moment of the first data access rather than at the time of the BEGIN TRAN statement.

03. REPEATABLE READ behavior

As you see, with memory-optimized tables, other sessions were able to modify data that was read by an active REPEATABLE READ transaction, which led to a transaction abort at the time of commit. This is completely different behavior from on-disk tables, where other sessions would be blocked until the REPEATABLE READ transaction successfully commits.
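A test of this kind can be sketched as follows, using the dbo.HKData table created above. This is my reconstruction of the scenario rather than the exact script from the figure; the statements have to be run in two separate sessions in the order shown, and the new Col value is arbitrary.

```sql
-- Session 1: the transaction starts at the first data access
begin tran;
    select ID, Col
    from dbo.HKData with (repeatableread);

-- Session 2: is not blocked and modifies a row that Session 1 has read
update dbo.HKData
set Col = -2
where ID = 2;

-- Session 1: still sees the same row versions as before
    select ID, Col
    from dbo.HKData with (repeatableread);
commit;
-- Expected: the commit fails with the repeatable read
-- validation error (41305) and the transaction is rolled back
```

Session 2 runs an autocommitted statement, so it does not need an isolation level hint.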

It is also worth mentioning that with memory-optimized tables, the REPEATABLE READ isolation level protects you from the Phantom Read phenomenon, which is not the case with on-disk tables.

As the next step, let’s repeat our tests in the SERIALIZABLE isolation level. You can see the code and the results of the execution in Figure 4.

04. SERIALIZABLE behavior

As you can see, the SERIALIZABLE isolation level prevents the session from committing the transaction when another session inserted a new row and violated serializable validation. Similar to the REPEATABLE READ isolation level, this behavior is different from on-disk tables, where the SERIALIZABLE transaction would successfully commit, blocking other sessions until it is done.
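
The same kind of test can be sketched for SERIALIZABLE. Again, this is an illustration under assumptions and not the code from Figure 4: it uses the dbo.HKData table from above, two connections, and a new ID value (6) picked only for this example.

```sql
-- Session 1: start a transaction and scan the table in SERIALIZABLE
begin tran;

select ID, Col
from dbo.HKData with (serializable);

-- Session 2: insert a row that the scan above would have returned.
-- The insert is not blocked and commits right away.
insert into dbo.HKData(ID, Col)
values(6, 6);

-- Session 1: validation runs at commit time.
-- Expected outcome: the commit fails with a serializable
-- validation error (41325) and the transaction is rolled back.
commit;
```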

Finally, let’s repeat our tests in the SNAPSHOT isolation level. The code and results are shown in Figure 5.

05. SNAPSHOT behavior

The SNAPSHOT isolation level works similarly to on-disk tables and protects from the Non-Repeatable Read and Phantom Read phenomena. As you can guess, it does not need to perform repeatable read and serializable validations at the commit stage and, therefore, reduces the load on SQL Server.
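
For comparison, here is a sketch of the same test in SNAPSHOT. It is not the code from Figure 5; it assumes the dbo.HKData table from above, two connections, and values (ID 5 and ID 7) chosen only for illustration.

```sql
-- Session 1: start a transaction and read the table in SNAPSHOT
begin tran;

select ID, Col
from dbo.HKData with (snapshot);

-- Session 2: change an existing row and add a new one
update dbo.HKData
set Col = -5
where ID = 5;

insert into dbo.HKData(ID, Col)
values(7, 7);

-- Session 1: the second read is expected to return the same rows
-- and values as the first one, and the commit is expected to
-- succeed because there is nothing to validate.
select ID, Col
from dbo.HKData with (snapshot);

commit;
```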

Write/write conflicts work the same way regardless of the transaction isolation level in in-memory OLTP. SQL Server does not allow a transaction to modify a row that has been modified by other uncommitted transactions. Figures 6 and 7 illustrate such behavior. They use the SNAPSHOT isolation level; however, the behavior does not change in different isolation levels.

06. Write/write conflict (1)

07. Write/write conflict (2)
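
One way to provoke such a conflict is sketched below. It is not the code from the figures and covers only the case of a conflict with a still active transaction; it assumes the dbo.HKData table from above and two connections, and the error number is the one I would expect for a write/write conflict.

```sql
-- Session 1: update a row and keep the transaction open
begin tran;

update dbo.HKData with (snapshot)
set Col = -3
where ID = 3;

-- Session 2: try to update the same row while Session 1 is active.
-- Expected outcome: the statement does not wait; it fails right away
-- with a write/write conflict error (41302).
update dbo.HKData
set Col = -30
where ID = 3;

-- Session 1: commits successfully
commit;
```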

Now, let’s dive deeper and look at what happens under the hood. Figure 8 illustrates the lifetime of an in-memory OLTP transaction.

08. In-memory OLTP transaction lifetime

When a new transaction starts, it generates a new TransactionId and obtains the current Global Transaction Timestamp value. The Global Transaction Timestamp value dictates which versions of the rows are visible to the transaction: the timestamp value should be between BeginTs and EndTs for a row to be visible. During data modifications, however, the transaction checks whether there are any uncommitted versions of the rows, which prevents write/write conflicts when multiple sessions modify the same data.
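
The visibility rule can be modeled with an ordinary table variable. This is a toy model and not how SQL Server stores or checks row versions internally: the @Versions table, the timestamp values, the use of NULL for Infinity and the exact boundary handling (BeginTs inclusive, EndTs exclusive) are all assumptions made for the illustration.

```sql
-- Toy model of row versions; NULL in EndTs stands for Infinity
declare @Versions table
(
    ID int not null,
    Col int not null,
    BeginTs bigint not null,
    EndTs bigint null
);

insert into @Versions(ID, Col, BeginTs, EndTs)
values
    (1, 1, 5, null),   -- current version of the row
    (2, 2, 5, 8),      -- old version, replaced at timestamp 8
    (2, -2, 8, null),  -- new version created at timestamp 8
    (3, 3, 12, null);  -- row created after timestamp 10

-- Timestamp obtained by the transaction at its start
declare @Ts bigint = 10;

-- A version is visible when @Ts falls between BeginTs and EndTs
select ID, Col
from @Versions
where BeginTs <= @Ts
    and (EndTs is null or EndTs > @Ts)
order by ID;
```

With these sample values, the query returns the rows (1, 1) and (2, -2): the old version of ID=2 has already ended, and the row with ID=3 did not exist yet at timestamp 10.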

When a transaction needs to delete a row, it updates the EndTs timestamp with the TransactionId value, which also has an indicator that the timestamp contains a TransactionId rather than a Global Transaction Timestamp. An insert operation creates a new row with a BeginTs of TransactionId and an EndTs of Infinity. Finally, an update operation consists of delete and insert operations internally.

Figure 9 shows the data rows after we created and populated the dbo.HKData table. I am omitting the hash index structure for simplicity’s sake.

09. Data rows after table creation

Let’s assume that we have a transaction that started at the time when the Global Transaction Timestamp value was 10 and the TransactionId was generated as -5. I am using a negative value for TransactionId to illustrate the difference between the two values in the figures below.

Let’s assume that the transaction performs a few data modification operations as shown below.

insert into dbo.HKData with (snapshot)
(ID, Col)
values(10,10);

update dbo.HKData with (snapshot)
set Col = -2
where ID = 2;

delete from dbo.HKData with (snapshot)
where ID = 4;

Figure 10 illustrates the state of the data after the data modifications. The INSERT statement created a new row, the DELETE statement updated the EndTs value in the row with ID=4, and the UPDATE statement changed the EndTs value of the row with ID=2 and created a new version of the row with the same ID.

It is important to mention that a transaction maintains a write set: pointers to rows that have been inserted and deleted by the transaction. Moreover, in the SERIALIZABLE and REPEATABLE READ isolation levels, transactions maintain a read set of the rows that were read by the transaction. The write set is used to generate transaction log records, while the read set is used to perform REPEATABLE READ and SERIALIZABLE rules validation.

10. Data Rows after update (transaction is active)

When a COMMIT request is issued, the transaction starts the validation phase. First, it generates a new Global Transaction Timestamp value and replaces TransactionId with this value in all BeginTs and EndTs timestamps in the rows it modified. Figure 11 illustrates that, assuming that the Global Transaction Timestamp value is 11.

11. Committing transaction (pre-commit stage)
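
The timestamp replacement can be modeled in plain T-SQL as well. This is a toy model only: the @Versions table, the use of NULL for Infinity and the BeginTs value of 5 for the original rows are assumptions for the illustration, while the TransactionId of -5 and the commit timestamp of 11 come from the text above.

```sql
declare @TransactionId bigint = -5;
declare @CommitTs bigint = 11;

-- Toy model of row versions; NULL in EndTs stands for Infinity
declare @Versions table
(
    ID int not null,
    Col int not null,
    BeginTs bigint not null,
    EndTs bigint null
);

-- State of the rows while the transaction is still active
insert into @Versions(ID, Col, BeginTs, EndTs)
values
    (1, 1, 5, null),
    (2, 2, 5, @TransactionId),      -- old version ended by the update
    (2, -2, @TransactionId, null),  -- new version created by the update
    (3, 3, 5, null),
    (4, 4, 5, @TransactionId),      -- deleted row
    (5, 5, 5, null),
    (10, 10, @TransactionId, null); -- inserted row

-- Pre-commit stage: replace TransactionId with the new timestamp
update @Versions
set BeginTs = iif(BeginTs = @TransactionId, @CommitTs, BeginTs),
    EndTs = iif(EndTs = @TransactionId, @CommitTs, EndTs)
where BeginTs = @TransactionId
    or EndTs = @TransactionId;

select ID, Col, BeginTs, EndTs
from @Versions
order by ID, BeginTs;
```

After the update, no row in the model carries -5 any longer; every BeginTs or EndTs that held the TransactionId now holds 11.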

At this moment, the rows modified by the transaction become visible to other transactions in the system even though the transaction has yet to be committed. Other transactions can see uncommitted rows, which leads to a situation called commit dependency. Those transactions are not blocked at the time when they access those rows; however, they do not return data to clients nor commit until the original transaction they have a commit dependency on commits itself. If, for some reason, that transaction fails to commit, the dependent transactions are rolled back and an error is generated.

Commit dependency is technically a case of blocking in in-memory OLTP. However, the validation and commit phases of transactions are relatively short, and that blocking should not be excessive.

After the timestamps in the rows have been replaced, the transaction validates REPEATABLE READ and SERIALIZABLE rules and waits for commit dependencies to clear. When it is done, the transaction moves to the commit phase, generates one or more log records, saves them to the transaction log and completes the transaction.

Obviously, the validation phase of transactions in the REPEATABLE READ and SERIALIZABLE isolation levels is longer than in the SNAPSHOT isolation level due to rules validation. Do not use them unless you have a legitimate use case for such data consistency. To be frank, I do not see many use cases for them besides importing and exporting data to/from in-memory tables.

Writing Triggers in the Right Way

The only correct way of writing triggers is not writing them. I will repeat: the only correct way of writing triggers is not writing them. Triggers negatively affect performance. AFTER triggers introduce fragmentation and increase tempdb load due to version store usage. Triggers run in the context of a transaction, increase the time during which locks are held, and contribute to locking and blocking in the system. However, in some cases, we have to use triggers, and it is important to develop them correctly.

SQL Server fires triggers at the statement level rather than the row level. For example, if an update statement modified 10 rows, the AFTER UPDATE and/or INSTEAD OF UPDATE trigger would fire once rather than 10 times. Therefore, a trigger must handle situations when the inserted and/or deleted tables have multiple rows. Let’s look at an example and create a table with an AFTER INSERT trigger as shown below:

create table dbo.Data 
( 
	ID int not null, 
	Value varchar(32) 
) 
go 

create trigger trgData_AI on dbo.Data 
after insert 
as 
	/* Some code */ 
	declare 
		@ID int 
	set @ID = (select ID from inserted) 
	/* Some code */

Everything is fine when you insert a single row. However, a multi-row insert would fail with the exception shown below.

Msg 512, Level 16, State 1, Procedure trgData_AI, Line 9
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
The statement has been terminated.
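
A set-based version of the trigger avoids the problem because it never assumes that inserted holds exactly one row. The sketch below is only an illustration: the dbo.DataLog table and its columns are hypothetical and exist solely to give the trigger something to do with the inserted rows.

```sql
-- Hypothetical log table, introduced only for this sketch
create table dbo.DataLog
(
    ID int not null,
    LoggedOn datetime2(0) not null
);
go

alter trigger trgData_AI on dbo.Data
after insert
as
begin
    -- Set-based: one statement handles one or many inserted rows
    insert into dbo.DataLog(ID, LoggedOn)
        select ID, sysutcdatetime()
        from inserted;
end
go

-- A multi-row insert no longer raises error 512
insert into dbo.Data(ID, Value)
values(1, 'A'), (2, 'B'), (3, 'C');
```

If row-by-row processing is truly unavoidable, the rows from inserted still have to be read as a set first, for example with a loop over them, rather than assigned to a scalar variable.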

And there is another consequence of the statement-level nature of triggers. They fire even if the DML statement did not modify any rows. Whether it introduces any issues in the system depends on the implementation. However, it always leads to unnecessary performance overhead.

Let’s look at an example. The code below creates another table and changes the trigger in a way that mimics the behavior of audit functionality, which is frequently implemented based on triggers.

create table dbo.Audit 
( 
	ID int not null identity(1,1), 
	OnDate datetime2(0) not null 
		constraint DEF_Audit_OnDate  
		default getutcdate(), 
	Message varchar(64) not null 
)
go

alter trigger trgData_AI on dbo.Data 
after insert 
as 
begin
	declare 
		@Msg varchar(64) = 'Triggers are bad. And @@ROWCOUNT=' + convert(varchar(10),@@rowcount) 

	insert into dbo.Audit(Message) values(@Msg); 
end

Now let’s run an insert statement that does not insert any rows into the table.

insert into dbo.Data(ID, Value) 
	select 1, 'ABC'  
	where 1 = 0

If you check the content of the Audit table, you will see that the trigger was fired:

01. Content of Audit table

So the first important conclusion is that every trigger must check the @@ROWCOUNT variable as the very first statement in its implementation.

SET NOCOUNT ON should be the second action the trigger performs. Without it, SQL Server returns the number of affected rows for each operator in the trigger in addition to the original DML statement. Some client libraries rely on a single message in the output and would not work correctly in the case of multiple messages. Therefore, the first two statements in the trigger should look as shown below.

alter trigger trgData_AI on dbo.Data 
after insert 
as 
begin
	if @@ROWCOUNT = 0
		return
	set nocount on

	/* Some Code Here */
end

Finally, there is another caveat. While the implementation above works for insert, update and delete operators, it is not the case with merge. @@ROWCOUNT in this case represents the total number of rows affected by the merge statement rather than by the individual insert, update or delete action in the trigger. Let’s prove it with an example.

create table dbo.Data2(Col int not null)
go

create trigger trg_Data_AI on dbo.Data2
after insert 
as
	select 
		'After Insert' as [Trigger]
		,@@RowCount as [RowCount]
		,(select count(*) from inserted) as [Inserted Cnt]
		,(select count(*) from deleted) as [Deleted Cnt]
go

create trigger trg_Data_AU on dbo.Data2
after update
as
	select 
		'After Update' as [Trigger]
		,@@RowCount as [RowCount]
		,(select count(*) from inserted) as [Inserted Cnt]
		,(select count(*) from deleted) as [Deleted Cnt]
go

create trigger trg_Data_AD on dbo.Data2
after delete
as
	select 
		'After Delete' as [Trigger]
		,@@RowCount as [RowCount]
		,(select count(*) from inserted) as [Inserted Cnt]
		,(select count(*) from deleted) as [Deleted Cnt]
go

Now let’s run the MERGE statement as shown below:

merge into dbo.Data2 as Target
using (select 1 as [Value]) as Source
on Target.Col = Source.Value
when not matched by target then
	insert(Col) values(Source.Value)
when not matched by source then
	delete
when matched then
	update set Col = Source.Value;

The table is empty; therefore, only one insert statement would be executed. However, as you can see below, all three triggers were fired. It is worth mentioning that the previous code example demonstrated the very bad practice of returning result sets from a trigger. You should avoid doing it in production code.

02. Triggers and MERGE statement

The right way to deal with such a situation is to check the content of the inserted and deleted tables as shown below.

alter trigger trg_Data_AI on dbo.Data2
after insert 
as
	if @@rowcount = 0 
		return
	set nocount on
	-- MERGE fires this trigger even when it inserted no rows
	if not exists(select * from inserted)
		return
	/* Some Code Here */      
go

alter trigger trg_Data_AU on dbo.Data2
after update
as
	if @@rowcount = 0 
		return
	set nocount on
	-- MERGE fires this trigger even when it updated no rows
	if	not exists(select * from inserted) or 
		not exists(select * from deleted)
		return
	/* Some Code Here */      
go

alter trigger trg_Data_AD on dbo.Data2
after delete
as
	if @@rowcount = 0 
		return
	set nocount on
	-- MERGE fires this trigger even when it deleted no rows
	if	not exists(select * from deleted)
		return
	/* Some Code Here */

I hope that those tips will help you to write triggers in the most optimal way. However, I will repeat: the best way of writing triggers is not writing them at all.