Tag Archives: Internal Structure

LOB and Row-Overflow Storage in In-Memory OLTP in SQL Server 2016

I think many of us felt quite excited and the same time disappointed with In-Memory OLTP in SQL Server 2014. It was the great and promising technology but it had way too many limitations. The biggest one for me, perhaps, was inability to use data types that required off-row storage. It was possible to address that by changing database schema, implementing data partitioning or performing other tricks; however, it required complex development efforts and often made In-Memory OLTP migration cost ineffective.

SQL Server 2016 removes this and many other limitations. Now we can create tables with (max) columns and with data rows that exceed 8,060 bytes. There is the catch, however. Off-row storage in In-Memory OLTP works very differently comparing to on-disk tables and incorrect design decisions could and would affect performance of the system. Today we will look at that in details.

As all of us know, In-Memory OLTP does not use the data pages. Well, it uses data pages in nonclustered range indexes but the data rows are stored as the separate in-memory objects. They are linked into the row chains through the regular memory pointers. Every index in In-Memory OLTP adds another pointer and creates another chain of the rows.

There are two types of indexes supported in In-Memory OLTP – hash and nonclustered (range) indexes. I do not want to dive into all the details but hash index, in the nutshell, consists of in-memory hash table where each element stores the pointer to the data row chain with the same hash value of the key. You can see the simplified version in Figure 1, which shows the table with two hash indexes on Name and City columns and assumes that hash function generates the hash based on the first letter of the string.

01. Hash Indexes

01. Hash Indexes

Each data row has two timestamps that indicate its lifetime. They store the Global Transaction Timestamp of the transactions that inserted and deleted them. For example, you can see two “Ann” rows in the diagram. The first one, with City = “New York” has been created by a transaction with timestamp of 10. The city was updated to Cincinnati by transaction with timestamp of 50, which deleted the old and created the new versions of the row.

The second In-Memory OLTP index type – range index is very similar to B-Tree index defined on on-disk table. The range index consists of the data pages on root, intermediate and leaf levels. On root and intermediate levels, every index row points to the data page on the next level. On the leaf level, index rows point to the actual data rows with the same value of index key. The data pages in the index reference each other through another array-life structure called the Mapping Table as it illustrated in Figure 2. For example, the index row Kevin on the root page references the first element (PID = 1) in the mapping table, which, in turn, stores the pointer to intermediate data page with the highest key value of Kevin.

02. Nonclustered (range) indexes

02. Nonclustered (Range) Indexes

One of very key elements in this schema is that every index, in the nutshell, is covering. Every memory pointer references the actual data row structure regardless of how many columns were included to the index keys.

Every In-Memory OLTP object uses memory and is called a memory consumer. Memory Consumers allocate memory from the structures called varheaps – one varheap per In-Memory OLTP object. You can think about varheaps as the data structures that respond to and track memory allocation requests and can grow and shrink in size when needed. You can track detail memory-allocation information per-memory consumer using sys.dm_db_xtp_memory_consumers view.

Let’s look at the example and create the table with one hash and one nonclustered indexes and look at memory consumers as shown below.

create table dbo.MemoryConsumers
    ID int not null
        constraint PK_MemoryConsumers
        primary key nonclustered hash with (bucket_count=1024),
    Name varchar(256) not null,
    index IDX_Name nonclustered(Name)
with (memory_optimized=on, durability=schema_only);

    i.name as [Index], i.index_id, a.xtp_object_id, a.type_desc, a.minor_id
    ,c.memory_consumer_id, c.memory_consumer_type_desc as [mc type]
    ,c.memory_consumer_desc as [description], c.allocation_count as [allocs]
    ,c.allocated_bytes, c.used_bytes
    sys.dm_db_xtp_memory_consumers c join
        sys.memory_optimized_tables_internal_attributes a on
            a.object_id = c.object_id and a.xtp_object_id = c.xtp_object_id
    left outer join sys.indexes i on
            c.object_id = i.object_id and 
            c.index_id = i.index_id and
            a.minor_id = 0 
    c.object_id = object_id('dbo.MemoryConsumers');

You can see the output of the query in Figure 3. The xtp_object_id column represents internal In-Memory OLTP object_id, which is different than SQL Server object_id.

03. Memory Consumers (In-Row Storage Only)

03. Memory Consumers (In-Row Storage Only)

As you can see, the table has three memory consumers. The Range Index Heap stores internal and leaf pages of nonclustered index. The Hash Index Heap stores the hash table of the index. Finally, the Table Heap stores actual table rows. Figure 4 illustrates that.

04. Memory Consumers

04. Memory Consumers

Now let’s see what will happen if we alter our table and add row-overflow and LOB columns using ALTER TABLE statement shown below. This alteration is offline operation, which rebuilds the table in the background.

alter table dbo.MemoryConsumers add
    RowOverflowCol varchar(8000),
    LOBCol varchar(max);

Now, if you get the list of memory consumers using the query from the first listing again, you would see the output as shown in Figure 5.

05. Memory Consumers with Off-Row Storage

05. Memory Consumers with Off-Row Storage

As you can see, both off-row columns introduce their own Range Index Heap and Table Heap memory consumers. In addition, LOB column adds LOB Page Allocator memory consumer (more about it later). The minor_id column indicates the column_id in the table to which memory consumers belong.

As you can guess from the output, SQL Server 2016 stores both, row-overflow and LOB columns in the separate internal tables. Those tables consist of 8-byte artificial primary key implemented as nonclustered index and off-row column value. The main row references off-row column through that artificial key, which is generated when row is created. It is worth repeating that this reference is done though the artificial value rather than the memory pointer.

This approach allows In-Memory OLTP to decouple off-row columns from the main row using different lifetime for them. For example, if you update the main row data without touching off-row columns, SQL Server would not generate new versions of off-row column rows avoiding large memory allocations. Vice versa, when only off-row data is modified, the main row stays intact.

In-Memory OLTP stores LOB data in the memory provided by LOB Page Allocator. That consumer is not limited to 8,060-byte row allocations and can allocate large amount of memory to store the data. The rows in the Table Heap of LOB columns contains pointers to the row data in LOB Page Allocator.

Let’s assume that we run several DML statements with imaginary Global Transaction Timestamp values as shown below.

-- Global Transaction Timestamp: 100
insert into dbo.MemoryConsumers(ID, Name, RowOverflowCol, LobCol)

-- Global Transaction Timestamp: 110
update dbo.MemoryConsumers set RowOverflowCol = 'B2' where ID = 2;

-- Global Transaction Timestamp: 120
update dbo.MemoryConsumers set Name= 'Greg' where ID = 2;

-- Global Transaction Timestamp: 130
update dbo.MemoryConsumers set LobCol = replicate(convert(varchar(max),'3'),100000) where ID = 1;

-- Global Transaction Timestamp: 140
delete from dbo.MemoryConsumers where ID = 1;

Figure 6 illustrates the state of the data and links between the rows. It is omitting hash table and nonclustered index structures in the main table along with internal pages of nonclustered indexes for off-row columns for simplicity sake.

06. In-Row and Off-Row Rows - Decoupled

06. In-Row and Off-Row Rows – Decoupled

Decoupling of in-row and off-row data reduces the overhead of creating extra row versions during data modifications. However, it will add additional overhead when you insert and delete the data. SQL Server should create several row objects on insert stage and update end timestamp of multiple rows during deletion. It also needs to maintain nonclustered indexes for off-row column tables.

There is also considerable overhead in terms of memory usage. Every non-empty off-row value adds 50+ bytes of the overhead regardless of its size. Those 50+ bytes consist of three artificial ID values (in-row, off-row in data row and leaf-level of the range index) and off-row data row structure. It is even larger in case of LOB columns where data is stored in LOB Page Allocator.

One of the key points to remember that decision which columns go off-row is made based on the table schema. This is very different from on-disk tables, where such decision is made on per-row basis and depends on the data row size. With on-disk tables, data is stored in row when it fits on the data page.

In-Memory OLTP works in the different way. (Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.

Let’s look at the example and create two tables of the similar schema. One of the tables has twenty varchar(3) columns while another uses twenty varchar(max) columns. Let’s populate those tables with 100,000 rows with 1-character value in each column. The code is shown in listing below.

create table dbo.DataInRow
    ID int not null
        constraint PK_DataInRow
        primary key nonclustered hash(ID)
        with (bucket_count = 262144)
    ,Col1 varchar(3) not null
    ,Col2 varchar(3) not null
    ,Col3 varchar(3) not null
    ,Col4 varchar(3) not null
    ,Col5 varchar(3) not null
    ,Col6 varchar(3) not null
    ,Col7 varchar(3) not null
    ,Col8 varchar(3) not null
    ,Col9 varchar(3) not null
    ,Col10 varchar(3) not null
    ,Col11 varchar(3) not null
    ,Col12 varchar(3) not null
    ,Col13 varchar(3) not null
    ,Col14 varchar(3) not null
    ,Col15 varchar(3) not null
    ,Col16 varchar(3) not null
    ,Col17 varchar(3) not null
    ,Col18 varchar(3) not null
    ,Col19 varchar(3) not null
    ,Col20 varchar(3) not null
with (memory_optimized = on, durability = schema_only);

create table dbo.DataOffRow
    ID int not null
        constraint PK_DataOffRow
        primary key nonclustered hash(ID)
        with (bucket_count = 262144)
    ,Col1 varchar(max) not null
    ,Col2 varchar(max) not null
    ,Col3 varchar(max) not null
    ,Col4 varchar(max) not null
    ,Col5 varchar(max) not null
    ,Col6 varchar(max) not null
    ,Col7 varchar(max) not null
    ,Col8 varchar(max) not null
    ,Col9 varchar(max) not null
    ,Col10 varchar(max) not null
    ,Col11 varchar(max) not null
    ,Col12 varchar(max) not null
    ,Col13 varchar(max) not null
    ,Col14 varchar(max) not null
    ,Col15 varchar(max) not null
    ,Col16 varchar(max) not null
    ,Col17 varchar(max) not null
    ,Col18 varchar(max) not null
    ,Col19 varchar(max) not null
    ,Col20 varchar(max) not null
with (memory_optimized = on, durability = schema_only);

set statistics time on
insert into dbo.DataInRow(ID,Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,Col10,Col11,Col12,Col13,Col14,Col15,Col16,Col17,Col18,Col19,Col20)
    select Num,'0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0'
    from dbo.Numbers
    where Num <= 100000;

insert into dbo.DataOffRow(ID,Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,Col10,Col11,Col12,Col13,Col14,Col15,Col16,Col17,Col18,Col19,Col20)
    select Num,'0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0'
    from dbo.Numbers
    where Num <= 100000;
set statistics time off

Figure 7 illustrates memory consumers in this scenario (in-row at top and off-row at the bottom). As you can see, every varchar(max) column adds another internal table to the mix.

07. Test Table Memory Consumers

07. Test Table Memory Consumers

The execution times of INSERT statements n my environment are 153 and 7,722 milliseconds respectively. With off-row storage, In-Memory OLTP needs to add data to twenty other internal tables, which is 40 times slower comparing to in-row storage.

Let’s look at the total memory usage of both tables using the queries below.

   sum(c.allocated_bytes) / 1024 as [Allocated KB]
    ,sum(c.used_bytes) / 1024 as [Used KB]	
    sys.dm_db_xtp_memory_consumers c join
        sys.memory_optimized_tables_internal_attributes a on
            a.object_id = c.object_id and a.xtp_object_id = c.xtp_object_id
    left outer join sys.indexes i on
            c.object_id = i.object_id and c.index_id = i.index_id
    c.object_id = object_id('dbo.DataInRow');

   sum(c.allocated_bytes) / 1024 as [Allocated KB]
    ,sum(c.used_bytes) / 1024 as [Used KB]	
    sys.dm_db_xtp_memory_consumers c join
        sys.memory_optimized_tables_internal_attributes a on
            a.object_id = c.object_id and a.xtp_object_id = c.xtp_object_id
    left outer join sys.indexes i on
            c.object_id = i.object_id and c.index_id = i.index_id
    c.object_id = object_id('dbo.DataOffRow');

As you can see in Figure 8, off-row storage uses about 252MB of RAM comparing to 12MB of RAM with in-row storage.

08. Test Tables Memory Usage

08. Test Tables Memory Usage

There is another important implication. Indexes defined on the table are not covering  off-row data. SQL Server needs to traverse nonclustered indexes on off-row column tables to obtain their values. Conceptually, it looks very similar to Key Lookup operations in on-disk tables done in reverse direction – from clustered to nonclustered indexes. Even though the overhead is significantly smaller comparing to on-disk tables, it is still the overhead you’d like to avoid.

You can see this overhead by running SELECT statements shown below. In case of off-row data, SQL Server will have to go through every internal table for each row.

select count(*)
from dbo.DataInRow
where Col1='0' and Col2='0' and Col3='0' and Col4='0' and Col5='0' and Col6='0' and Col7='0' and Col8='0' and Col9='0' and Col10='0' and Col11='0' and Col12='0' and Col13='0' and Col14='0' and Col15='0' and Col16='0' and Col17='0' and Col18='0' and Col19='0' and Col20='0';

select count(*)
from dbo.DataOffRow
where Col1='0' and Col2='0' and Col3='0' and Col4='0' and Col5='0' and Col6='0' and Col7='0' and Col8='0' and Col9='0' and Col10='0' and Col11='0' and Col12='0' and Col13='0' and Col14='0' and Col15='0' and Col16='0' and Col17='0' and Col18='0' and Col19='0' and Col20='0';

Execution time in my environment is 73 milliseconds for in-row data and 1,662 milliseconds for off-row data.

Finally, let’s look what happen when we delete the data from the tables by using the code below.

delete from dbo.DataInRow;
delete from dbo.DataOffRow;

As before, in-row storage is significantly faster – 28 milliseconds vs. 1,290 milliseconds. As you can guess In-Memory OLTP had to update end timestamp in every row in off-row tables.

You should remember about this behavior and avoid off-row storage unless you have legitimate reasons to use such columns. It is clearly the bad idea to define text columns as (n)varchar(max) just in case – when you do not store large amount of data there. As you see, off-row storage comes at very high cost.

Source code is available for download.

Slide deck on the topic is also available.

Size does matter: 10 ways to reduce the database size and improve performance in SQL Server

As the database professional, I like multi-terabyte databases. They are fun to deal with; they give you priceless experience and look cool in your resume. Last but not least, customers with multi-terabyte databases do not have problems with multi-kilodollar invoices. Customers usually understand amount of work involved in such projects.

The problem, however, is that the large databases are not good for the customers. Those databases are more expensive to support and maintain. You need to have highly skilled professionals in the team. Professionals, who can design nontrivial solutions in all areas – architecture, availability, maintenance, performance tuning, to name just a few.

Large databases usually require powerful hardware to run. There is, of course, very subtle difference between the size of the database and size of the active (hot) data. It is entirely possible that applications deal only with the fraction of the data stored in the database and, therefore, even mediocre server can handle the load. However, on the bare minimum, there is always the storage cost.

The projects when you have to reduce the size of the databases are very common. Even though size reduction is rarely the primary objectives of such projects, reducing the size often helps to achieve other goals. Think about designing Disaster Recovery (DR) strategy. Plenty of things that can help to meet strict RTO requirements and smaller database size definitely helps.

Today, I am going to discuss several methods that can help in reducing database size. Some of them are fully transparent to the client applications; others require regression testing and/or code refactoring. I would also focus on the data files only – troubleshooting and reducing transaction log size is the different topic.

0. Find the worst offenders

In the nutshell, the database files are just the containers for the data. Some space in the data files is allocated and used by the database objects; however, there is usually some unallocated space. For example, if you created the database with one 1GB data file, you would have 1 GB file on disk. However, immediately after creation, it would have just a handful of the data pages allocated inside the file and most part of the file would be free (unallocated).

It is completely normal to have free space in the data files, especially if amount of data is constantly growing. However, excessive amount of the free space could consume unnecessary space on the disks. The script below helps you to analyze amount of allocated and unallocated space on per-database file basis.

    f.type_desc as [Type]
    ,f.name as [FileName]
    ,fg.name as [FileGroup]
    ,f.physical_name as [Path]
    ,f.size / 128.0 as [CurrentSizeMB]
    ,f.size / 128.0 - convert(int,fileproperty(f.name,'SpaceUsed')) / 
        128.0 as [FreeSpaceMb]
    sys.database_files f with (nolock) left outer join 
        sys.filegroups fg with (nolock) on
            f.data_space_id = fg.data_space_id
option (recompile)

Figure 1 illustrates the output of the script from one of the production servers. Even though the output shows significant amount of free space in the data files, it may or may not be a problem. The decision if that space needs to be deallocated depends on the other factors, such as data growth expectations, disk system configuration and a few others.

01. Allocated and Free space in the database files

We will talk how to handle the situations when free space needs to be deallocated later, for now, let’s focus on the data, and discuss what we can do to reduce its size.

All of us are familiar with Pareto principle, which is also known as 80/20 rule. To simplify, in most projects, you can achieve 80% of improvements by spending 20% of time or resources. That ratio is even more severe when we search for the most space consuming objects in the database. Even with very large databases, usually the most part of the space is consumed by just a handful of tables. Obviously, we would like to focus on them – at least at the beginning of the process.

Let’s do one step backwards, however, and remember how SQL Server stores the data. Every on-disk table has the main copy of the data, which stored either in the clustered index or in heap. In addition, every table can have the set of nonclustered indexes that store the copy of the data for some of columns and reference the main copy of the rows (in the clustered index or heap). For the purpose of this discussion, let’s talk about generic indexes without any further differentiation between their types.

Every index consists of the data pages that can store the data that belong to the different allocation units. IN_ROW allocation unit contains main data rows, which includes internal row structures, fixed-length column data and variable-length column data that fits in IN_ROW data page. LOB allocation unit contains data for variable-length column data (including data types such as XML, CLR UDT, etc)that is greater than 8,000 bytes in size. Finally, ROW_OVERFLOW allocation unit contains data for variable-length column data that does not exceed 8,000 bytes but does not fit IN_ROW.

For example, if you created the table of the following structure and inserted one row there as it is shown below, you would have data for column C5 stored in LOB allocation units, data for one of either C3 or C4 columns stored in ROW_OVERFLOW allocation unit and data for all other columns stored in IN_ROW allocation unit. It is also worth mentioning that main row data in IN_ROW data would have the pointers to the data stored in the other allocation unit.

create table dbo.T1
    C1 int,
    C2 datetime,
    C3 varchar(5000),
    C4 varchar(5000),
    C5 varchar(max)

insert into dbo.T1(C1, C2, C3, C4, C5)
    1 /* C1 */
    ,GetUtcDate() /* C2 */
    ,Replicate('A',5000) /* C3 */
    ,Replicate('B',5000) /* C4 */
    ,Replicate(convert(varchar(max),'A'),10000) /* C5 */

You can read more about row structure and allocation units here: http://aboutsqlserver.com/storage-engine/

The script below provides you the information about space allocation on per-index basis in the database. You can easily modify it to aggregate on per-table basis or drill-down on per-allocation unit basis; however, at that stage I usually prefer to work on the index level.

;with SpaceInfo(ObjectId, IndexId, TableName, IndexName
    ,Rows, TotalSpaceMB, UsedSpaceMB)
        t.object_id as [ObjectId]
        ,i.index_id as [IndexId]
        ,s.name + '.' + t.Name as [TableName]
        ,i.name as [Index Name]
        ,sum(p.[Rows]) as [Rows]
        ,sum(au.total_pages) * 8 / 1024 as [Total Space MB]
        ,sum(au.used_pages) * 8 / 1024 as [Used Space MB]
        sys.tables t with (nolock) join 
            sys.schemas s with (nolock) on 
                s.schema_id = t.schema_id
            join sys.indexes i with (nolock) on 
                t.object_id = i.object_id
            join sys.partitions p with (nolock) on 
                i.object_id = p.object_id and 
                i.index_id = p.index_id
            cross apply
                    sum(a.total_pages) as total_pages
                    ,sum(a.used_pages) as used_pages
                from sys.allocation_units a with (nolock)
                where p.partition_id = a.container_id 
            ) au
        i.object_id > 255
    group by
        t.object_id, i.index_id, s.name, t.name, i.name
    ObjectId, IndexId, TableName, IndexName
    ,Rows, TotalSpaceMB, UsedSpaceMB
    ,TotalSpaceMB - UsedSpaceMB as [ReservedSpaceMB]
order by
    TotalSpaceMB desc
option (recompile)

Figure 2 shows the output of this script (table and index names are obscured). As you can see, this could quickly pinpoint the indexes that consume most part of the space in the database.

02. Indexes that consume the most space in the database

Now, let’s see what we can do to reduce their size.

1. Reducing Index Fragmentation

As you know, SQL Server stores on-disk table data on the 8KB data pages. Each data page contains data for one or multiple rows. With the exception of index creation or rebuild, SQL Server tries to populate pages in full during normal data modification operations. When data does not fit, for example, when the data pages does not have enough space to accommodate the new row, SQL Server performs the page split operation. In the nutshell, SQL Server allocates another data page and moves about half of the data from original to the new page, which frees up some space to accommodate the new row on the original data page.

Page split operations lead to the index fragmentation, which exists in two kinds. External fragmentation means that the logical order of the pages does not match their physical order, and/or logically subsequent pages are not located in the same or adjacent extents (extent is the group of 8 pages). Such fragmentation forces SQL Server to jump around reading the data from the disk, which makes read-ahead less efficient and increases the number of physical reads required. Moreover, it increases random disk I/O, which is far less efficient when compared to sequential I/O in the case of magnetic hard drives.

Internal fragmentation, on the other hand, means that data pages in the index have free space. As a result, the index uses more data pages to store data. It also increases the number of reads during query execution and amount of memory in buffer pool to cache index pages.

A small degree of internal fragmentation is not necessarily bad. It reduces page splits during insert and update operations when data is inserted into or updated from different parts of the index. Nonetheless, a large degree of internal fragmentation wastes index space and reduces the performance of the system. Moreover, for indexes with ever-increasing keys, for example on identity columns, internal fragmentation is not desirable because the data is always inserted at the end of the index.

You can monitor both, internal and external fragmentation with sys.dm_db_index_physical_stats DMV. Internal fragmentation can be monitored with avg_page_space_used_in_percent column. Lower value in the column indicates higher degree of internal fragmentation.

Let’s take a look at the example and analyze internal fragmentation of one of the indexes with the script below. For simplicity sake, I am using relatively small table; however, you would obviously like to focus on the largest indexes during the tuning process.

    index_id, partition_number, alloc_unit_type_desc
    ,index_level, page_count, avg_page_space_used_in_percent
        db_id() /*Database */
        ,object_id(N'dbo.MyTable') /* Table (Object_ID) */
        ,1 /* Index ID */
        ,null /* Partition ID – NULL – all partitions */
        ,'detailed' /* Mode */

Figure 3 illustrates partial output of the script. The table is partitioned and, as result, you will see separate rows in the result – one per partition per allocation unit.

03. Internal Fragmentation in the Index

You can remove internal fragmentation by rebuilding the index. Figure 4 illustrates the output of sys.dm_db_index_physical_stats after the index rebuild with FILLFACTOR=100 (more on it later)

04. Internal Fragmentation after Index Rebuild

Figure 5 illustrates amount of space index used before (on the left side) and after (on the right side) rebuild. As you can see, index rebuild was able to free up more than 40% of the space index consumed before the rebuild.

05. Index Size before and after rebuild

FILLFACTOR parameter controls amount of free space SQL Server reserves on the pages during index creation and rebuild operation. For example, FILLFACTOR=80 forces SQL Server to make data pages about 80% full keeping 20% of the page space reserved. This could reduce the number of page splits and internal fragmentation when new rows are inserted to the middle of the index or updated in the way that increase their size. It is very important to remember that FILLFACTOR is applied only during index creation or rebuild stages. After index is created, SQL Server continue populates pages in full up to 100% performing page splits as needed.

As you can guess, the optimal solution would require fine-tuning FILLFACTOR and designing index maintenance strategy in the way that keeps internal fragmentation at minimum most part of the time. Unfortunately, there is no “one size fits all” advice in terms of FILLFACTOR. You should try to figure out the most optimal value by using different FILLFACTOR values and monitor how it would affect your fragmentation. You can start with FILLFACTOR value close to 100 and gradually decrease it by 5 until you find the sweet spot. Alternatively, you can monitor page splits in real time using transaction_log extended event tracking LOP_DELETE_SPLIT operation changing value based on amount of splits (you can see more on it at Jonathan Kehayias’ blog),

Lastly, you should remember than index rebuild creates another copy of the index during the process. In fact, it could increase the size of the data file on disk during the operation. Moreover, it generates large amount of transaction log records that could also affect transaction log size, network load and size of send and redo queues if database mirroring or AlwaysOn AG are in use.

2. Implementing Data Compression

If you are the lucky enough to have Enterprise Edition of SQL Server, you can reduce the size of the data by implementing Data Compression. There are two types of Data Compression supported in SQL Server – Row and Page.

Row compression addresses the storage inefficiency introduced by fixed-length data types. By default, in non-compressed row, size of the fixed-length data is based on the data type size. For example, INT column would always use 4 bytes, regardless of the value – even when it is NULL. Row compression addresses that and removes such an overhead. For example, INT value of 255 would use just 1 byte rather than 4 bytes.

Page compression goes one-step further and implements dictionary-based compression removing repetitive sequences of bytes on the page. I am not going to dive deep in the storage format of the compressed rows here and will follow up with additional blog post at some point. Alternatively, you can read about it in my book.

The actual results would greatly depend on the data and the schema. As you can guess, row compression would be beneficial when table has fixed-length data columns. More fixed-length columns you have, better the space savings are. Results of the page compression, on the other hand, depend on how repetitive is the data on the page. You can use sp_estimate_data_compression_savings system stored procedure to estimate compression results for your data. That procedure works by copying and compressing the sample of your data in tempdb measuring compression results.

It is also very important to remember that data compression works with IN_ROW allocation units only. It does not compress LOB nor ROW_OVERFLOW data.

Obviously, there is an overhead. Compression and decompression adds additional CPU load to the system. That overhead is relatively small for the ROW compression, especially when you read the data; however, for PAGE compression that overhead is more significant. There is the catch, though. While compression adds the load to CPU, it reduces I/O load in the system – SQL Server needs to issue less I/O operations due to the smaller data size. In the end, the queries could execute even faster especially on the systems that are not heavily CPU bound.

There is also the overhead during the batch operations. Batch inserts and updates could take more time when data is compressed. The same applies to the index maintenance. Just to give you some numbers, I ran a few tests at time when I worked on the book. I was using the data from one of the production tables with a decent number of fixed- and variable-length columns. Obviously, different table schema and data distribution will lead to slightly different results. However, in most cases, you would see similar patterns.

At the beginning of the tests, I have created three different heap tables and inserted one million rows into each of them. After that, I created clustered indexes with different compression settings and FILLFACTOR=100. This workflow led to zero index fragmentation and fully populated data pages.

During the first test, I ran SELECT statements to scan all of the clustered indexes accessing some row data. The second test updated every row in the tables, changing the value of the fixed-length column in a way that did not increase the row size. The third test inserted another batch of a million rows in the tables. Finally, I rebuilt all of the clustered indexes. You can see the execution statistics in Figure 5 below. All tests ran with warm cache with the data pages cached in the buffer pool. Cold cache could reduce the difference in execution times for the queries against compressed and non-compressed data, because queries against compressed data perform less physical I/O.

All statements were forced to run on a single CPU by using a MAXDOP 1 query hint. Using parallelism would decrease the query execution times, however it would add the overhead of parallelism management during query execution, which I wanted to avoid during the tests.

Figure 6 demonstrates the results of the tests.

6. Data Compression – Storage Size and Performance

Obviously, it is impossible to provide generic advice how and when to compress the data. However, in case, if system is not heavily CPU bound, I would usually suggest implementing ROW compression on the indexes with volatile data. PAGE compression, on the other hand, could be the good choice for the old static data, especially when that data accessed infrequently.

It is also impossible to avoid mentioning data partitioning when we are discussing compression. It is very common to have the situation when just small subset of the data in the table is volatile. Unfortunately, you cannot apply different compression methods to hot and cold data unless the data is partitioned (either with table partitioning or with multiple tables utilizing partitioned views). Such partitioning helps you to implement different compression schemas to different table partitions (or tables) and will allow you to reduce index maintenance overhead by rebuilding the index on partition scope.

A word of caution, however. Partitioning is the great tool that can help you to address multiple challenges especially in database administration area. Even though, table partitioning can be implemented transparently to the client applications, it could introduce plan regressions. One of such examples is described here: http://aboutsqlserver.com/2012/07/10/cautionary-tale-about-triggers-version-store-and-fragmentation/. Make sure to carefully test your application if you decided to implement table partitioning.

Finally, if you are interested in data partitioning, I would like to reference my book again. The chapter on data partitioning is the largest one there and it discusses various examples and approaches of partitioning in various editions of SQL Server.

3. Removing unused indexes

It is often possible to reduce the size of the data during the index tuning process. The main goal of the index tuning is creating the right set of indexes, which also requires you to drop existing unused and redundant indexes. Keep in mind that you always need to carefully test your system when you change the indexes making sure that there is no plan regressions after the tuning.

There are two data management views that can help you to detect non-efficient indexes. The first one, sys.dm_db_index_usage_stats, shows you the statistics on the various index operations, such as index seek and scan, index updates and a few others, along with the time of the last operation. The second DMV, sys.dm_db_index_operational_stats, dives deeper and provides an information on I/O, access methods and locking statistics on the index.

The key difference between two DMOs is how they collect data. Sys.dm_db_index_usage_stats tracks how many times an operation appeared in the execution plan. Alternatively, sys.dm_db_index_operation_stats tracks operations at the row level. For example, if query execution plan includes Key Lookup operation and SQL Server ran it twice during query execution, sys.dm_db_index_usage_stats would track the single lookup operation, while sys.dm_index_operation_stats would track two of them.

You can obtain the information about index usage with sys.dm_db_index_usage_stats by running the statement below. You can use similar approach with sys.dm_db_index_operation_stats if you need more detailed analysis.

    s.Name + N'.' + t.name as [Table]
    ,i.name as [Index] 
    ,i.is_unique as [IsUnique]
    ,ius.user_seeks as [Seeks], ius.user_scans as [Scans]
    ,ius.user_lookups as [Lookups]
    ,ius.user_seeks + ius.user_scans + ius.user_lookups as [Reads]
    ,ius.user_updates as [Updates], ius.last_user_seek as [Last Seek]
    ,ius.last_user_scan as [Last Scan], ius.last_user_lookup as [Last Lookup]
    ,ius.last_user_update as [Last Update]
    sys.tables t with (nolock) join sys.indexes i with (nolock) on
        t.object_id = i.object_id
    join sys.schemas s with (nolock) on 
        t.schema_id = s.schema_id
    left outer join sys.dm_db_index_usage_stats ius on
        ius.database_id = db_id() and
        ius.object_id = i.object_id and 
        ius.index_id = i.index_id
order by
    s.name, t.name, i.index_id
option (recompile)

As you can see in the output in Figure 7, you can easily pinpoint the problematic indexes. Indexes in red ovals were not used in any queries for the duration of statistics collection. Those indexes consume space in the database. Moreover, they introduce update overhead in the system. As the side note, it is also beneficial to look at the indexes where update overhead exceeds their usefulness – for example, index in green oval is used only for scans even though it is constantly updated.

07. Index Usage Statistics

Obviously, you need to be careful making sure that you captured valid usage statistics. SQL Server does not persist statistics and it resets on restarts. Statistics is also cleared whenever the database is detached or shut down when the AUTO_CLOSE database property is on. Moreover, in SQL Server 2012 and 2014, statistics resets when the index is rebuilt.

You must keep this behavior in mind during index analysis. It is not uncommon to have indexes to support queries that execute on a given schedule. As an example, you can think about an index that supports a payroll process running on a bi-weekly or monthly basis. Index statistics information could indicate that the index has not been used for reads if SQL Server was recently restarted or, in the case of SQL Server 2012 and 2014, if index was recently rebuilt.

One of the ways to address statistics reset is collecting usage statistics based on some schedule and persists results in one of the tables in the database. This could help to catch the situations when index is required to support some of the rarely executed processes. As the side note, you can consider to recreate such indexes as part of the process dropping them when processes are completed.

You should also be careful with unique indexes. It is entirely possible that such indexes are created to support uniqueness constraints and removal of such indexes would violate business requirements for the system.

4. Removing Redundant Indexes

As you know, SQL Server can use composite index for an Index Seek operation as long as a query has a SARGable predicate on the leftmost query column. Ok, I know, it is confusing so let’s look at the example and create a table with clustered and two nonclustered indexes.

create table dbo.Employee
    EmployeeId int not null,
    LastName nvarchar(64) not null,
    FirstName nvarchar(64) not null,
    DateOfBirth date not null,
    Phone varchar(20) null,
    Picture varbinary(max) null
create unique clustered index IDX_Employee_EmployeeId 
on dbo.Employee(EmployeeId);
create nonclustered index IDX_Employee_LastName_FirstName
on dbo.Employee(LastName, FirstName);
create nonclustered index IDX_Employee_LastName
on dbo.Employee(LastName);

SQL Server can utilize IDX_Employe_LastName_FirstName index if query has the predicate on the LastName regardless of existence of the predicate on the FirstName. For example, both of the queries below would be able to use that index:

select EmployeeId, LastName, FirstName, DateOfBirth
from dbo.Employee
where LastName = @LastName and FirstName = @FirstName

select EmployeeId, LastName, FirstName, DateOfBirth
from dbo.Employee
where LastName = @LastName

Thus, the index IDX_Employee_LastName is redundant and can be dropped. There are, of course, always exceptions from the rule. IDX_Employee_LastName index stores less data and, therefore, it is more compact. If you have a process that constantly scan the index, smaller index size could be beneficial. However, those situations are very rare and usually update overhead of the extra index is not worth such small performance improvement on SELECT queries.

The script below shows you potentially redundant indexes by checking the indexes that have the same leftmost columns.

    s.Name + N'.' + t.name as [Table]
    ,i1.index_id as [Index1 ID], i1.name as [Index1 Name]
    ,dupIdx.index_id as [Index2 ID], dupIdx.name as [Index2 Name] 
    ,c.name as [Column]
    sys.tables t join sys.indexes i1 on
        t.object_id = i1.object_id
    join sys.index_columns ic1 on
        ic1.object_id = i1.object_id and
        ic1.index_id = i1.index_id and 
        ic1.index_column_id = 1  
    join sys.columns c on
        c.object_id = ic1.object_id and
        c.column_id = ic1.column_id      
    join sys.schemas s on 
        t.schema_id = s.schema_id
    cross apply
        select i2.index_id, i2.name
            sys.indexes i2 join sys.index_columns ic2 on       
                ic2.object_id = i2.object_id and
                ic2.index_id = i2.index_id and 
                ic2.index_column_id = 1  
            i2.object_id = i1.object_id and 
            i2.index_id > i1.index_id and 
            ic2.column_id = ic1.column_id
    ) dupIdx     
order by
    s.name, t.name, i1.index_id

For example, for dbo.Employee table, script would provide the output shown in Figure 8.

08. Potentially Redundant Indexes

You can use such information for further analysis performing further index consolidation. In some cases, consolidation is trivial. For example, if a system has two indexes: IDX1(LastName, FirstName) include (Phone) and IDX2(LastName) include(DateOfBirth), you can consolidate them as IDX3(LastName, FirstName) include(DateOfBirth, Phone).

In the other cases, consolidation requires further analysis. For example, if a system has two indexes: IDX1(OrderDate, WarehouseId) and IDX2(OrderDate, OrderStatus), you have three options. You can consolidate it as IDX3(OrderDate, WarehouseId) include(OrderStatus) or as IDX4(OrderDate, OrderStatus) include(WarehouseId). Finally, you can leave both indexes in place. The decision primarily depends on the selectivity of the leftmost column and index usage statistics.

5. Implementing Filtered Indexes

Filtered indexes, introduced in SQL Server 2008, allowed you to index only a subset of the data and, therefore, reduce the index size. Consider a table with some data that needs to be processed as an example. This table can have a Processed bit column, which indicates the processing status as it is shown below.

create table dbo.Data
    RecId int not null,
    Processed bit not null,
    /* Other Columns */

Let’s assume that you have a backend process that loads unprocessed data based on the following query.

select top 1000 RecId, /* Other Columns */
from dbo.Data
where Processed = 0
order by RecId;

This query can benefit from the following index: CREATE NONCLUSTERED INDEX IDX_Data_Processed_RecId ON dbo.Data(Processed, RecId). Even though SQL Server rarely uses indexes on bit columns due to their low selectivity, such a scenario might be an exception if there are just a handful of unprocessed rows. SQL Server can use that index to select them; however, the index will never be used for selection of processed rows if a Key Lookup is required.

As a result, all index rows with a key value of Processed=1 would be useless. They will increase the index’s size, waste storage space, and introduce additional overhead during index maintenance.

Filtered indexes solve that problem by allowing you to index just unprocessed rows, making the index small and efficient as it is shown below.

create nonclustered index IDX_Data_Unprocessed_Filtered
on dbo.Data(RecId)
where Processed = 0;

Obviously, there are several catches to remember. First, Query Optimizer has a design limitation, which can lead to suboptimal execution plans when columns from the filter are not present in leaf-level index rows. Always add all columns from the filter to the index, either as key or included columns. In our example, Processed column is added as an included column.

Second problem is filtered index statistics. SQL Server does not count the changes in the filtered columns towards statistics update threshold, which can lead to very inaccurate statistics. You should factor this behavior to statistics maintenance and, perhaps, update statistics manually on the regular basis.

Finally, SQL Server is very conservative when to use filtered indexes in case of plan caching. SQL Server would not generate and cache the plan with filtered index if there is the possibility that this plan would be invalid for some parameter values. For example, our filtered index would not be used for the case below:

select top 1000 RecId, /* Other Columns */
from dbo.Data
where Processed = @Processed
order by RecId;

As you can guess, auto-parameterization would make the matter worse. The bottom line, you should carefully test your system after implementing filtered indexes making sure that there is no plan regressions.

Finally, if you are using XML indexes in SQL Server 2012 and above, you can reduce their size (which, by the way, could be gigantic) by implementing Selective XML Indexes , which index just subset of the data. Pretty much the same approach with as filtered indexes.

6. Using Appropriate Data Types

All approaches we have already discussed could be implemented transparently to the client applications. Obviously, transparency here is very misleading – index tuning and partitioning require careful regression testing to be performed. However, all those changes are located in the database tier and you do not need to change any application code.

Now, it is the time to talk about several approaches that require such code changes. I would like to start with the general design principle that is often ignored during database design stage. The principle is simple – you should choose appropriate data types for the job.

Let’s consider the system that collects GPS location information from the multiple devices as the example. The main transaction entity in such system is Positions. One of the approaches to define such a table is the following (just a few columns from there):

create table dbo.Positions
    ATime datetime not null, -- 8 bytes 
    Latitude float not null, -- 8 bytes 
    Longitude float not null, -- 8 bytes 
    IsPrecise int not null,	-- 4 bytes
    IsAssistanceUsed int not null, -- 4 bytes
    -- Total: 32 bytes

Alternatively, you can define the same table a little bit differently:

    ATime datetime2(0) not null, -- 6 bytes 
    Latitude decimal(9,6) not null, -- 5 bytes 
    Longitude decimal(9,6) not null, -- 5 bytes 
    IsPrecise bit not null,	-- 1 bytes
    IsAssistanceUsed bit not null, -- 0 bytes
    -- Total: 17 bytes

As you see, even in the scope of those 6 columns you can save 15 bytes per row. It does not sound as significant saving; however, it quickly adds up as amount of data growth. For example, if such system collects 1M rows per day and stores it for a year, 15 bytes per row would become ~5.4GB of data on the leaf level of the index without counting any fragmentation overhead. And trust me, 1M positions per day is very small number for such systems.

While row compression can help to address some overhead, it would not help much when data types store excessive information. For example, row compression would cut an extra space from the boolean data stored in int columns; however, it would not help much with datetime in case if it has more precision that needed. Moreover, compression is the Enterprise Edition feature, which would not help you with the other editions.

One of the questions you should answer during database design stage is how precise the information should be. This could help you to choose correct data type for the column. As the example, consider the OrderDate column in Order Entry/Shopping cart system. Do you really need to store the time when order was placed with up to 3-millisecond precision provided by datetime column (8 bytes)? If this was not the case, you could use 1-second precision of datetime2(0) type (6 bytes). Or, for 1 minute precision, you can end up with 4-byte smalldatetime data.

You should also remember that smaller data rows help with the performance during the scans. The table with smaller rows would have more rows stored on the data page and, therefore, would have less data pages stored in the index. Queries that need to perform scans (including range scans) will be faster due to the less I/O operations involved. Last but not least, such indexes will use less memory in the buffer pool, which allows to cache more data and reduces the number of physical I/O in the system.

Finally, you should remember that table alteration never ever decreases size of the data row. You should rebuild the indexes that reference altered columns in order to see the space saving.

7. Storing LOB data outside of the database

So far, we have discussed how to reduce the size of IN_ROW data. Let’s talk a bit about LOB data. First, let’s discuss the situation with the binary data, which does not require any in-database processing. For example, images, binary documents and other similar entities. With such entities, you always have the question of how to store them. Either within the database or externally, keeping just a reference (perhaps file name) in the database.

There was the rule of thumb introduced by Microsoft at time of FILESTREAM release. The binary data greater than 1MB would benefit from external storage. The data smaller than 200KB should live within the database. Well, everything in between is in the grey area. While I do not want to charge the numbers, there are usually more factors involved that just a size. Obviously, I am not talking about huge binary objects when the choice is obvious, but in general, you should make the decision on case-by-case basis.

In-database storage of binary data is usually the simplest solution to implement. The obvious downside of this approach is the increase of the database size. However, you can mitigate it up to degree with proper architecture. For example, you can put binary data to the separate filegroup(s) that reside on the slower disk arrays. You can also implement partial backup and exclude static binary data from the dayly backup files. In Enterprise Edition of SQL Server, you can utilize piecemeal restore and achieve strict RTO requirements even with the binary data in the database; however, in non-Enterprise Edition, RTO requirements could become the deal breaker. Binary data could significantly slow down restore time (due to the database and backup size), which can prevent you from meeting RTO requirements.

In case, if you decided to store binary data outside of the database, there are several questions to answer. The first, and perhaps the most important one, is how to handle redundancy and high availability of external data. For example, if you decided to store binary data as the files and reference them in the database, you need to make sure that such schema is compatible with SQL Server High Availability solution and file storage itself is redundant. Redundancy question mainly relies on storage administrators; however, High Availability aspect could be tricky in this scenario. Especially, if you have geo-redundancy and/or hybrid solutions in place. Obviously, you can implement something based on SAN replication moving files across multiple data centers; however, it requires significant investments into the hardware and software as well as incur the implementation cost.

Consistency of the data is another important question. If binary data needs to be transactionally consistent, you have a little choice but using FILESTREAM. While it is technically possible to implement the consistency in the code without FILESTREAM – for example, if transaction modifies the data, application generates another file and replace the reference to this file in the database; it would be extremely hard to support disaster recovery in this scenario. As you can guess, you can easily have “out of sync” situation when restoring data from the backup.

FILESTREAM could help you here; however, it has a few caveats. It is incompatible with some of SQL Server features, for example with database mirroring and, in some cases, it is complicated to implement. Performance-wise, you should use Streaming API on the client side to get the most from it.

As I already said, there is no right or wrong solutions. You should consider pros and cons of all approaches and consider other requirements in the system. With Enterprise Edition, I personally prefer to store relatively small (up to a few MB) data in the database carefully architecting filegroup layout and backup/restore strategy in case if I am using Enterprise Edition of SQL Server. With Standard Edition, the choices are much more limited.

8. Compressing LOB data in the database

As you already know, data compression is Enterprise Edition feature that compresses IN_ROW data only. However, it is entirely possible, that large amount of space in the database is consumed by LOB data. Do not forget, that there are plenty of data types that are, in the nutshell, LOBs. Think about XML as the example – it is not uncommon to see that XML-centric systems with XML data that consume large amount of space in the system.

One of the approaches to address such an overhead is manually compress LOB data in the code. It is very easy to create the methods to compress and decompress data utilizing one of the classes from System.IO.Compression namespace, for example using GZipStream class. Moreover, that method could be implemented in CLR stored procedures and used directly in T-SQL code.

I am not going to provide the examples of how to code that; you should be able to find quite a few of them searching in Internet. I would like, however, to discuss a couple implementation-related questions.

First, compression is CPU intensive. It is better to run such code on the client whenever it is possible. I would still, however, suggest implementing CLR routines in the database and have them available to T-SQL code. This could help to address some of the use-cases, when client needs to work with uncompressed data. Consider, for example, some external analytical or reporting tools that query the database directly. You can create the view that call CLR function and decompress the data on the fly providing it to the clients.

You should be careful with the version management in such scenario making sure that the code is the same on both, client side and in the database and that algorithms remain the same and data can be decompressed on either side.

The second important consideration is performance. Obviously, decompression adds an overhead, which you would like to avoid on the large scope. For example, it is the bad idea to have a query that scans large amount of data and performs decompression on every row to evaluate the predicate against one of compressed attributes. For example, query shown below would be highly inefficient.

from T
where convert(xml,dbo.DecompressData(CompressedXML)).value('..') = 1

One of the ways to address such an issue is creating persisted calculated columns for the attributes that are used in where clauses of the queries. The downside of this approach is that SQL Server would not be able to use parallel execution plans in such queries – this is one of the limitations of Query Optimizer when you are using columns calculated with scalar UDFs. However, it is often the small price to pay comparing to constant decompression overhead.

With all being said, compressing LOB data manually is definitely the option, which is worth considering. You can use sys.dm_index_physical_stats view to evaluate amount of such data on per-index basis. Obviously, Pareto 80/20 principle still applies – do not add extra complexity if benefits are not worth it.

UPDATE (2-15-04-07): More details about this method are here.

9. Storing Data in Clustered Columnstore Indexes

In case of Enterprise Edition of SQL Server 2014, you have another option to consider. You can store some of the data in columnstore format utilizing Clustered Columnstore Indexes. This format can provide significant space saving comparing to the regular B-Tree row-based storage. Moreover, you can also utilize Archival Columnstore Compression that applies gzip-like compression on columnstore data and reduces the size even further at cost of extra CPU load.

Just to give you an example, Figure 9 below shows you the difference in the storage space between row-based and column-based storage. The table in this example was generated based on FactSalesBig table from AdventureWorks2012DW database. Obviously, different data leads to the different results; however, in the most cases, clustered columnstore indexes would give you order of magnitude decrease on the storage space. It is also worth mentioning that nonclustered indexes on B-Tree tables would contribute to additional storage space, which is not the case with clustered columnstore indexes that are the single copy of the data in the table.

09. Storage Size based on different compression methods and storage formats

Obviously, clustered columnstore indexes are not for everyone. They are very beneficial for Data Warehouse workload that requires to scan and aggregate large amount of data. The same time they are the very bad choice for OLTP workload – they do not support point-lookups nor small range scans.

Same time, it is not uncommon to have different use-cases for the old and new data in OLTP systems. Customers can generate OLTP workload to support day-to-day operations with the new data; however, the old data can be used for analysis and reporting, which is mainly Data Warehouse workload. In such scenario, you can consider to partition your data into the multiple tables using columnstore format for the tables with the old data. You can abstract all those changes via partitioned views making the differences in the schema and storage format transparent to the clients.

This is by no means not the simplest thing to implement. However, such design could lead to significant performance improvements and storage space saving for the certain kind of workloads.

10. Reducing amount of free space in the database

Finally, let’s discuss what we can do when data files have large amount of free space.

As strange as it sounds, one of the best possible options in that case is leaving everything as is. Consider the situation when system implements sliding-window pattern keeping 1-year worth of data in the system. Typically, such systems purge the data based on some schedule. For example, it is possible that every 1st day of the month system purges the 13th month of data – the one with the data older than 1 year.

Let’s assume that system collects 500GB of data per month. In this scenario, if you measured amount of free space in the data files right after the purge, you would notice that files have more than 500GB of free space available. Obviously, you can shrink the files and release such space to OS; however, the database would reclaim it as data growth.

However, for the purpose of this discussion, let’s assume that we have legitimate case to decrease the size of the files. Unfortunately, it is not very easy to do. DBCC SHRINKFILE command is the terrible way to reduce the size of the database. That command works in the very simple matter – it starts to move the allocated extents from the end of the file to unallocated space in the beginning of the file. As you can guess, this leads to the terrible index fragmentation. Moreover, it generates excessive amount of log records, which can affect the system in the multiple ways.

Obviously, you can perform index maintenance after you are done with the shrink. However, there is the catch – index rebuild will grow the file again (it needs space to accommodate the new version of the index). Index reorg could be the better choice in this scenario even though it does not provide results in par with the index rebuild. In the end, everything depends on amount of the space you are clearing and the size of your data. For example, if you had terabytes of free space and your biggest index is just a couple hundred gigabytes, you could consider to shrink and run index rebuild afterwards. Files would grow; however, such growth is much smaller comparing to space saving after the shrink.

Another, and often the better way to accomplish the task is moving all the data to another filegroup dropping original filegroup afterwards. The actual implementation would vary based on the version and edition of SQL Server. In Enterprise Edition, you can perform online index rebuild to the different filegroup, which will keep system available during the process. In Standard Edition, you can rebuild indexes only offline.

There is another catch though – index rebuild does not move LOB data between filegroups by default. The only way to workaround it is by rebuilding index to the new partition schema instead of the filegroup. However, it requires Enterprise Edition, which supports partitioning. Unfortunately, in Standard Edition you are out of luck.

I am not providing the examples here; however, I would like to reference my book again where I have discussed it in details. Alternatively, you can download book demo scripts and see how data movement works in action.

Lastly, there is always the option of creating another table on another filegroup, copying data there and dropping original table and renaming the new table afterwards. This approach would work in either edition; however, in the most part of the cases it needs be done offline. Online implementation is, of course, possible but it is usually complicated if table has volatile data.

Wrapping Up

This blog post ended up being much bigger than I expected. Unfortunately, even with such size it was impossible to cover all the details for some of the methods. Anyway, I hope you found this information useful, at least as the high-level overview.

Please, do not take the order in which I outlined approaches as the guideline. Every system is unique and you need to design the solution targeted to particular system taking hardware, software and business requirements into consideration.

SQL Server Storage Engine: Heap Tables

Heap tables are tables without a clustered index. The data in heap tables is unsorted.  SQL Server does not guarantee nor maintain any sorting order of the data in the heap tables.
When we insert data into heap tables, SQL Server tries to fill pages as much as possible, although it does not analyze the actual free space available on a page. It uses the Page Free Space (PFS) allocation map instead. SQL Server errs on the side of caution, and it uses the low value from the PFS free space percentage tier during the estimation.

For example, if a data page stores 4,100 bytes of data, and, as result, has 3,960 bytes of free space available, PFS would indicate that the page is 51-80 percent full. SQL Server would not put a new row to the page if its size exceeds 20 percent (8,060 bytes * 0.2 = 1,612 bytes) of the page size.

Let’s look at the example and create the table with the code shown below.

create table dbo.Heap(Val varchar(8000) not null);

;with CTE(ID,Val) as 
    select 1, replicate('0',4089)
    union all
    select ID + 1, Val from CTE where ID < 20 
insert into dbo.Heap(Val)
   select Val from CTE;

select page_count, avg_record_size_in_bytes, avg_page_space_used_in_percent
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.Heap'),0,null,'DETAILED');

01. Page Count after initial insert

At this point, the table stores 20 rows of 4,100 bytes each. SQL Server allocates 20 data pages—one page per row—with 3,960 bytes available. PFS would indicate that pages are 51-80 percent full.

As the next step, let’s inserts the small 111-byte row, which is about 1.4 percent of the page size. As a result, SQL Server knows that the row would fit into one of the existing pages (they all have at least 20 percent of free space available), and a new page should not be allocated.

insert into dbo.Heap(Val) values(replicate('1',100));

select page_count, avg_record_size_in_bytes, avg_page_space_used_in_percent 
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.Heap'),0,null,'DETAILED');

02. Page Count after insertion of small row

Lastly, the third insert statement needs 2,011 bytes for the row, which is about 25 percent of the page size. SQL Server does not know if any of the existing pages have enough free space to accommodate the row and, as a result, allocates the new page. You can see that SQL Server does not access existing pages by checking the actual free space and uses PFS data for the estimation.

insert into dbo.Heap(Val) values(replicate('2',2000));

select page_count, avg_record_size_in_bytes, avg_page_space_used_in_percent 
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.Heap'),0,null,'DETAILED');

03. Page Count after insertion of large row

That behavior leads to the situation where SQL Server unnecessarily allocates new data pages, leaving large amount of free space unused. It is not always a problem when the size of rows vary—in those cases, SQL Server eventually fills empty spaces with the smaller rows. However, especially in cases when all rows are relatively large, you can end up with large amounts of wasted space.

When selecting data from the heap table, SQL Server uses as Index Allocation Map (IAM) to find the pages and extents that need to be scanned. It analyzes what extents belong to the table and processes them based on their allocation order rather than on the order in which the data was inserted. You can see it in figure below.

04. IAM Scan

When you update the row in the heap table, SQL Server tries to accommodate it on the same page. If there is no free space available, SQL Server moves the new version of the row to another page and replaces the old row with a special 16-byte row called a forwarding pointer. The new version of the row is called forwarded row.

05. Forwarding Pointers

There are two main reasons why forwarding pointers are used. First, they prevent updates of nonclustered index keys, which referencing the row.

In addition, forwarding pointers helps minimize the number of duplicated reads – the situation when a single row is read multiple times during the table scan. As the example, let’s look at figure above and assume that SQL Server scans the pages in left-to-right order. Next, let’s assume that the row in page 3 was modified after the page was read at the time when SQL Server reads page 4. The new version of the row would be moved to page 5, which has yet to be processed. Without forwarding pointers, SQL Server would not know that the old version of the row had already been read, and it would read it again during the page 5 scan. With forwarding pointers, SQL Server would ignore the forwarded rows.

Forwarding pointers help minimize duplicated reads at cost of additional read operations. SQL Server follows the forwarding pointers and reads the new versions of the rows at the time it encounters them. That behavior can introduce an excessive number of I/O operations.

Let’s create the table and insert three rows there.

create table dbo.ForwardingPointers
   ID int not null,
   Val varchar(8000) null

insert into dbo.ForwardingPointers(ID,Val)

select page_count, avg_record_size_in_bytes, avg_page_space_used_in_percent, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.ForwardingPointers'),0,null,'DETAILED');

set statistics io on
select count(*) as [RowCnt] from dbo.ForwardingPointers
set statistics io off

You can see results below

06. Forwarding Pointers: I/O without forwarding pointers

All three rows fit into the single page, and SQL Server needs to read just that page when it scans the table.

07. Page Layout without Forwarding Pointers

Let’s update two of the table rows and increase their size. The new version of the rows would not fit into the page anymore, which introduces the allocation of the two new pages and two forwarding pointers.

update dbo.ForwardingPointers set Val = replicate('1',5000) where ID = 1;
update dbo.ForwardingPointers set Val = replicate('3',5000) where ID = 3;

select page_count, avg_record_size_in_bytes, avg_page_space_used_in_percent, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.ForwardingPointers'),0,null,'DETAILED');

set statistics io on
select count(*) as [RowCnt] from dbo.ForwardingPointers
set statistics io off

08. Forwarding Pointers: I/O with Forwarding Pointers

When SQL Server reads the forwarding pointer rows from page 1, it follows them and reads pages 2 and 3 immediately thereafter. After that, SQL Server reads those pages one more time during the regular IAM scan process. As a result, we have five read operations, even though our table has just three data pages.

09. Page Layout and I/O with Forwarding Pointers

It does not look as bad in case of the small table. Let’s look at the same issue in case, when table has more rows. Let’s insert 65,536 rows to our table.

truncate table dbo.ForwardingPointers

;with N1(C) as (select 0 UNION ALL select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select NULL)) from N5)
insert into dbo.ForwardingPointers(ID)
	select ID from IDs;

select page_count, avg_record_size_in_bytes, avg_page_space_used_in_percent, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.ForwardingPointers'),0,null,'DETAILED');

set statistics io on
select count(*) as [RowCnt] from dbo.ForwardingPointers
set statistics io off

10. Large table: I/O without Forwarding Pointers

As you see, there are 106 pages in the table and as result, SQL Server performs 106 reads during IAM scan. Let’s update our table and introduce forwarding pointers.

update dbo.ForwardingPointers set Val = replicate('a',500);

select page_count, avg_record_size_in_bytes, avg_page_space_used_in_percent, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.ForwardingPointers'),0,null,'DETAILED');

set statistics io on
select count(*) as [RowCnt] from dbo.ForwardingPointers
set statistics io off

11. Large Table: I/O with Forwarding Pointers

Now our table has 4,461 pages however it requires almost 70 thousand reads to perform a scan. As you see, the large number of the forwarding pointers leads to extra I/O operations and significantly reduces the performance of the queries accessing the data.

When the size of the forwarded row is reduced by another update and the data page with forwarding pointer has enough space to accommodate updated version of the row, SQL Server might move it back to original data page and remove the forwarding pointer. Nevertheless, the only reliable way to get rid of the all forwarding pointers is by rebuilding the heap table. You can do that by using an ALTER TABLE REBUILD statement or by creating and dropping a clustered index on the table.

Heap tables can be useful in staging environment where you want to import a large amount of data into the system as fast as possible. Inserting data into heap tables can often be faster than inserting it into tables with clustered indexes. Nevertheless, during a regular workload, tables with clustered indexes usually outperform heap tables due to their suboptimal space control and forwarding pointers.

Table of content

SQL Server Storage Engine: Allocation Maps

SQL Server logically groups eight pages into 64KB units called extents. There are two types of extents available: Mixed extents store data that belongs to different objects. Uniform extents store the data for the same object.

When a new object is created, SQL Server stores first eight object pages in mixed extents. After that, all subsequent space allocation for that object is done with uniform extents.

SQL Server uses special kind of pages, called Allocation Maps, to track extent and page usage in a file. There are several different types of allocation maps pages in SQL Server.

Global Allocation Map (GAM) pages track if extents have been allocated by any objects. The data is represented as bitmaps where each bit indicates the allocation status of an extent. Zero bits indicate that the corresponding extents are in use. The bits with a value of one indicate that the corresponding extents are free. Every GAM page covers about 64,000 extents, or almost 4GB of data. This means that every database file has one GAM page for about 4GB of file size.

Shared Global Allocation Map (SGAM) pages track information about mixed extents. Similar to GAM pages, it is a bitmap with one bit per extent. The bit has a value of one if the corresponding extent is a mixed extent and has at least one free page available. Otherwise, the bit is set to zero. Like a GAM page, SGAM page tracks about 64,000 extents, or almost 4GB of data.

SQL Server can determine the allocation status of the extent by looking at the corresponding bits in GAM and SGAM pages. Figure below shows the possible combinations of the bits.

01. GAM and SGAM bit statuses

When SQL Server needs to allocate a new uniform extent, it can use any extent where a bit in the GAM page has the value of one. When SQL Server needs to find a page in a mixed extent, it searches both allocation maps looking for the extent with a bit value of one in a SGAM page and the corresponding zero bit in a GAM page. If there are no such extents available, SQL Server allocates the new free extent based on the GAM page, and it sets the corresponding bit to one in the SGAM page.

Every database file has its own chain of GAM and SGAM pages. The first GAM page is always the third page in the data file (page number 2). The first SGAM page is always the fourth page in the data file (page number 3). The next GAM and SGAM pages appear every 511,230 pages in the data files which allows SQL Server to navigate through them quickly when needed.

SQL Server tracks the pages and extents used by the different types of pages (in-row, row-overflow, and LOB pages), that belong to the object with another set of the allocation map pages, called Index Allocation Map (IAM). Every table/index has its own set of IAM pages, which are combined into separate linked lists called IAM chains. Each IAM chain covers its own allocation unit – IN_ROW_DATA, ROW_OVERFLOW_DATA, and LOB_DATA.

Each IAM page in the chain covers a particular GAM interval and represents the bitmap where each bit indicates if a corresponding extent stores the data that belongs to a particular allocation unit for a particular object. In addition, the first IAM page for the object stores the actual page addresses for the first eight object pages, which are stored in mixed extents.

The figure below shows a simplified version of the allocation map pages bitmaps.

02. Allocation maps

Partitioned tables and indexes have separate IAM chains for every partition. In fact, you can consider non-partitioned table as the partitioned with just a single partition.

There is another type of allocation map page called Page Free Space (PFS). Despite the name, PFS pages track a few different things. We can call PFS as a byte-mask, where every byte stores information about a specific page, as shown below.

03. PFS byte structure

The first three bits in the byte indicate the percent of used space on the page. SQL Server tracks the used space for row-overflow and LOB data, as well as for in-row data in the heap tables. These are the only cases when amount of free space on the page matters.

When you delete a data row from the table, SQL Server does not remove it from the data page but rather marks the row as deleted. Bit 4 indicates if the page has logically deleted (ghosted) rows.

Bit 5 indicates if the page is an IAM page. Bit 6 indicates whether or not the page is in the mixed extent. Finally, bit 7 indicates if the page is allocated.

Every PFS page tracks 8,088 pages or about 64MB of data space. It is always the second page (page 1) in the file and every 8,088 pages thereafter.

There are two more types of allocation map pages. The seventh page (page 6) in the file is called a Differential Changed Map (DCM). These pages keep track of extents that have been modified since the last FULL database backup. SQL Server uses DCM pages when it performs DIFFERENTIAL backups.

The last allocation map is called Bulk Changed Map (BCM). It is eighth page (page 7) in the file, and it indicates what extents have been modified in minimally-logged operations since the last transaction log backup. BCM pages are used only with a BULK_LOGGED database recovery model.

Both, DCM and BCM pages are the bitmasks that cover 511,230 pages in the data file.

Next: HEAP tables

Table of Content

SQL Server Storage Engine: LOB Storage

As you already know, the fixed-length data and the internal attributes of a row must fit into a single page. Fortunately, SQL Server can store the variable-length data on different data pages. There are two different ways to store the data, depending on the data type and length.


SQL Server stores variable-length column data, which does not exceed 8,000 bytes, on special pages called ROW_OVERFLOW pages. Let’s create a table and populate it with the data shown in listing below.

create table dbo.RowOverflow 
    ID int not null, 
    Col1 varchar(8000) null, 
    Col2 varchar(8000) null 

insert into dbo.RowOverflow(ID, Col1, Col2) 
values (1,replicate('a',8000),replicate('b',8000));

SQL Server creates the table and inserts the data row without any errors, even though the data row size exceeds 8,060 bytes. Let’s look at the table page allocation using the DBCC IND command.

DBCC IND('SqlServerInternals','dbo.RowOverflow',-1)


01. ROW_OVERFLOW data: DBCC IND results

Now you can see two different sets of IAM and data pages. The data page with PageType=3 represents the data page that stores ROW_OVERFLOW data.

Let’s look at data page 214647, which is the in-row data page that stores main row data. The partial output of the DBCC PAGE command for the page (1:214647) is shown below.

Slot 0 Offset 0x60 Length 8041

Record Size = 8041 
Memory Dump @0x000000000FB7A060
0000000000000000:30000800 01000000 03000002 00511f69 9f616161 0............Q.iŸaaa
0000000000000014:61616161 61616161 61616161 61616161 61616161 aaaaaaaaaaaaaaaaaaaa
0000000000000028:61616161 61616161 61616161 61616161 61616161 aaaaaaaaaaaaaaaaaaaa
000000000000003C:61616161 61616161 61616161 61616161 61616161 aaaaaaaaaaaaaaaaaaaa
0000000000000050:61616161 61616161 61616161 61616161 61616161 aaaaaaaaaaaaaaaaaaaa
0000000000001F2C:61616161 61616161 61616161 61616161 61616161 aaaaaaaaaaaaaaaaaaaa
0000000000001F40:61616161 61616161 61616161 61616161 61020000 aaaaaaaaaaaaaaaaa...
0000000000001F54:00010000 00290000 00401f00 00754603 00010000 .....)...@...uF.....

As you see, SQL Server stores Col1 data in-row. Col2 data, however, has been replaced with a 24-byte value. The first 16 bytes are used to store off-row storage metadata attributes, such as type, length of the data, and a few other attributes. The last 8 bytes is the actual pointer to the row on the row-overflow page, which is the file, page, and slot number. Figure below shows this in detail. Remember that all information is stored in byte-swapped order.

02. ROW_OVERFLOW page pointer

As you see, the slot number is 0, file number is 1, and page number is the hexadecimal value 0x00034675, which is decimal 214645. The page number matches the DBCC IND results shown earlier in the post.

The partial output of the DBCC PAGE command for the page (1:214645) is shown below.

Blob row at: Page (1:214645) Slot 0 Length: 8014 Type: 3 (DATA)
Blob Id:2686976

0000000008E0A06E: 62626262 62626262 62626262 62626262 bbbbbbbbbbbbbbbb
0000000008E0A07E: 62626262 62626262 62626262 62626262 bbbbbbbbbbbbbbbb
0000000008E0A08E: 62626262 62626262 62626262 62626262 bbbbbbbbbbbbbbbb

Col2 data is stored in the first slot on the page.

LOB Storage

For the text, ntext, or image columns, SQL Server stores the data off-row by default. It uses another kind of page called LOB data pages. You can control this behavior by using the “text in row” table option. For example, exec sp_table_option dbo.MyTable, ‘text in row’, 200 forces SQL Server to store LOB data less or equal to 200 bytes in-row. LOB data greater than 200 bytes would be stored in LOB pages.

The logical LOB data structure is shown below.

03: LOB data structure

Like ROW_OVERFLOW data, there is a pointer to another piece of information called the LOB root structure, which contains a set of the pointers to other data pages/rows. When LOB data is less than 32 KB and can fit into five data pages, the LOB root structure contains the pointers to the actual chunks of LOB data. Otherwise, the LOB tree starts to include an additional, intermediate level of pointers, similar to the index B-Tree.

Let’s create the table and insert one row of data there.

create table dbo.TextData
    ID int not null,
    Col1 text null

insert into dbo.TextData(ID, Col1) 
values (1, replicate(convert(varchar(max),'a'),16000));

The page allocation for the table is shown below.

04. LOB data: DBCC IND results

As you see, the table has one data page for in-row data and three data pages for LOB data. I am not going to examine the structure of the data row for in-row allocation; it is similar to the ROW_OVERFLOW allocation. However, with the LOB allocation, it stores less metadata information in the pointer and uses 16 bytes rather than the 24 bytes required by the ROW_OVERFLOW pointer.

The result of DBCC PAGE command for the page that stores the LOB root structure is shown below.

Blob row at: Page (1:3046835) Slot 0 Length: 84 Type: 5 (LARGE_ROOT_YUKON)
Blob Id: 131661824 Level: 0 MaxLinks: 5 CurLinks: 2
Child 0 at Page (1:3046834) Slot 0 Size: 8040 Offset: 8040 
Child 1 at Page (1:3046832) Slot 0 Size: 7960 Offset: 16000

As you see, there are two pointers to the other pages with LOB data blocks, which are similar to the blob data stored in ROW_OVERFLOW pages.

The format, in which SQL Server stores the data from the (MAX) columns, such as varchar(max), nvarchar(max), and varbinary(max), depends on the actual data size. SQL Server stores it in-row when possible. When in-row allocation is impossible, and data size is less or equal to 8,000 bytes, it stored as ROW_OVERFLOW data. The data that exceeds 8,000 bytes is stored as LOB data.

It is also worth mentioning that SQL Server always stores rows that fit into a single page using in-row allocations. When a page does not have enough free space to accommodate a row, SQL Server allocates a new page and places the row there rather than placing it on the half-full page and moving some of the data to ROW_OVERFLOW pages.

SELECT * and I/O

There are plenty of reasons why selecting all columns from a table with the select * operator is not a good idea. It increases network traffic by transmitting columns that the client application does not need. It also makes query performance tuning more complicated, and it introduces side effects when the table schema changes.

It is recommended that you avoid such a pattern and explicitly specify the list of columns needed by the client application. This is especially important with ROW_OVERFLOW and LOB storage, when one row can have data stored in multiple data pages. SQL Server needs to read all of those pages, which can significantly decrease the performance of queries.

As an example, let’s assume that we have table dbo.Employees with one column storing employee pictures.

create table dbo.Employees
    EmployeeId int not null,
    Name varchar(128) not null,
    Picture varbinary(max) null

,N2(C) AS (SELECT 0 FROM N1 AS T1 CROSS JOIN N1 AS T2) -- 4 rows
,N3(C) AS (SELECT 0 FROM N2 AS T1 CROSS JOIN N2 AS T2) -- 16 rows
,N4(C) AS (SELECT 0 FROM N3 AS T1 CROSS JOIN N3 AS T2) -- 256 rows
,N5(C) AS (SELECT 0 FROM N4 AS T1 CROSS JOIN N2 AS T2) -- 1,024 rows
insert into dbo.Employees(EmployeeId, Name, Picture)
        ,'Employee ' + convert(varchar(5),ID)
    from Ids;

The table has 1,024 rows with binary data of 120,000 bytes. Let’s assume that we have code in the client application that needs the EmployeeId and Name to populate a drop-down box. If a developer is not careful, he can write a select statement using the select * pattern, even though a picture is not needed for this particular use-case.

Let’s compare the performance of two selects; one selecting all data columns and another that selects only EmployeeId and Name

set statistics io on
set statistics time on

select * from dbo.Employees;
select EmployeeId, Name from dbo.Employees;

set statistics io off
set statistics time off
select EmployeeId, Name from dbo.Employee: 
Number of reads: 7;  Execution time (ms): 2

select * from dbo.Employee
Number of reads: 90,895; Execution time (ms): 343

As you see, the first select, which reads the LOB data and transmits it to the client, is a few orders of magnitude slower than the second select.

One case where this becomes extremely important is with client applications, which use Object Relational Mapping (ORM) frameworks. Developers tend to reuse the same entity objects in different parts of an application. As a result, an application may load all attributes/columns even though it does not need all of them in many cases.

It is better to define different entities with a minimum set of required attributes on an individual use-case basis. In our example, it would work best to create separate entities/classes, such as EmployeeList and EmployeeProperties. An EmployeeList entity would have two attributes: EmployeeId and Name. EmployeeProperties would include a Picture attribute in addition to the two mentioned.

This approach can significantly improve the performance of systems.

Next: Allocation Maps

Table of Content

SQL Server Storage Engine: Data Pages and Data Rows

The space in the database divided into logical 8KB pages. Those pages are continuously numbered starting with zero and can be referenced by specifying a file ID and a page number. The page numbering is always continuous – when SQL Server grows the database file, the new pages would have the numbers starting from the last highest page number in the file plus one. Similar, when SQL Server shrinks the file, it removes the highest number pages from the file.

Let’s look at the structure of a data page. All images are clickable.

01. Data Page Structure

96-bytes page header contains the various information about a page , such as the  object page belongs; number of rows and amount of free space available on the page; links to the previous and next pages if the page is in an index page chain, and so on.

Following the page header is the area where actual data is stored. It is followed by the free space. Finally, there is the slot array, which is the block of 2-byte entries indicating the offset at which the corresponding data rows begin on the page.

Slot array indicates the logical order of the data rows on the page. In case, if data on the page needs to be sorted in the order of the index key, SQL Server does not physically sort the data rows on the page but rather populates slot array based on the index sort order. The slot 0 (right-most) stores the offset for the data row with the lowest key value on the page, slot 1 – to the second lowest key value and so forth.

SQL Server system data types can be logically separated into two different groups, such as fixed-length and variable-length types. Fixed-length data types, such as int, datetime, char and others always use the same storage space regardless of the value even when it is NULL. For example, int column always uses 4 bytes and nchar(10) column always uses 20 bytes to store the information.

As the opposite, variable-length data types, such as varchar, varbinary and a few others, use as much storage space as required to store the data plus two extra bytes. For example nvarchar(4000) column would use only 12 bytes to store five characters string and, in most part of the cases, 2 bytes to store NULL value. We will discuss the case when variable-length columns do not use storage space for NULL values later.

Let’s look at the structure of the data row

02. Data Row Structure

The first 2 bytes of the row, called Status Bits A and Status Bits B, are the bitmaps containing the information about the row, such as row type; if the row has been logically deleted (ghosted); if the row has NULL values, variable-length columns and versioning tag.

The next two bytes in the row are used to store the length of the fixed-length portion of the data. They are followed by fixed-length data itself.

After the fixed-length data portion, there is the null bitmap, which includes two different data elements. The first 2-byte element is the number of columns in the row. It is followed by null bitmap array. That array is using one bit per every column from the table regardless if it is nullable or not.

The null bitmap is always present in the data rows in heap tables or clustered index leaf rows even when table does not have nullable columns. Although, the null bitmap is not present in non-leaf index rows nor leaf level rows of nonclustered indexes when there are no nullable columns in the index.

Following the null bitmap, there is the variable-length data portion of the row. It starts with two-byte number of variable-length columns in the row followed by variable-length column offset array. SQL Server stores two-byte offset value per each variable-length column in the row even when value is null. It followed by the actual variable-length portion of the data.

Finally, there is optional 14-bytes versioning tag at the end of the row. That tag is used during the operations, which require row-versioning, such as online index rebuild, optimistic isolation level and others.

Let’s look at the example. First, let’s create the table, populate it with some data and look at the actual row data.

use tempdb

create table dbo.DataRows
    ID int not null,
    Col1 varchar(255) null,
    Col2 varchar(255) null,
    Col3 varchar(255) null

insert into dbo.DataRows(ID, Col1, Col3)  values (1,replicate('a',255),replicate('c',255));
insert into dbo.DataRows(ID, Col2) values (2,replicate('b',255));

dbcc ind
    'tempdb' -- Database name
    ,'dbo.DataRows' -- Table Name
    ,-1 -- Display info about all pages from the table

Undocumented but well-known DBCC IND command returns us the information about table page allocations.

03. Page Allocation (DBCC IND results)

There are two pages that belong to the table. The first one with PageType=10 is the special type of the page called IAM allocation map. This page tracks the pages that belong to particular object. Let’s not focus on it now – we will cover allocation map pages in one of the following blog posts.

The page with PageType=1 is the actual data page that contains the data rows. PageFID and PagePID column shows the actual file and page numbers for the page. You can use another undocumented command DBCC PAGE to examine its content

-- Redirecting DBCC PAGE output to console rather than error log
dbcc traceon(3604);
dbcc page
    'tempdb' -- Database name
    ,1 -- File ID
    ,214643 -- Page ID
    ,3 -- Output mode: 3 - display page header and row details

You can see the output of DBCC PAGE that corresponds to the first data row below. SQL Server stores the data in byte-swapped order. For example, two-byte value of 0001 would be stored as 0100.

Slot 0 Offset 0x60 Length 39
Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS
Record Size = 39                    
Memory Dump @0x000000000EABA060

0000000000000000:30000800 01000000 04000403 001d001d 00270061 0................'.a
0000000000000014:61616161 61616161 61636363 63636363 636363   aaaaaaaaacccccccccc

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4
ID = 1                              

Slot 0 Column 2 Offset 0x13 Length 10 Length (physical) 10
Col1 = aaaaaaaaaa                   

Slot 0 Column 3 Offset 0x0 Length 0 Length (physical) 0
Col2 = [NULL]                       

Slot 0 Column 4 Offset 0x1d Length 10 Length (physical) 10
Col3 = cccccccccc

Let’s look at the data row structure

04. Row Structure with the data (Row 1)

As you see, the row starts with the two status bits bytes following by two-byte value of 0800. This is byte-swapped value of 0008, which is the offset for the number of columns attribute in the row. This offset tells SQL Server where fixed-length data part of the row ends.

Next four bytes are used to store fixed-length data, which is ID column in our case. After that, there is the two-byte value that shows that data row has four columns followed by one-byte NULL bitmap. With just four columns one byte in the bitmap is enough. It stores the value of 04, which is 00000100 in the binary format. It indicates that the third column in the row contains NULL value.

The next two bytes stores the number of variable-length columns in the row, which is 3 (0300 in byte-swapped order). It follows by offset array, each two bytes there stores the offset where variable-length column data ends. As you see, even though Col2 is NULL, it still uses the slot in the offset-array. Finally, there is the actual data from variable-length columns.

Now let’s look at the second data row.

Slot 1 Offset 0x87 Length 27
Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS
Record Size = 27       
Memory Dump @0x000000000EABA087

0000000000000000: 30000800 02000000 04000a02 0011001b 00626262 0................bbb
0000000000000014: 62626262 626262                              bbbbbbb

Slot 1 Column 1 Offset 0x4 Length 4 Length (physical) 4
ID = 2                              

Slot 1 Column 2 Offset 0x0 Length 0 Length (physical) 0
Col1 = [NULL]

Slot 1 Column 3 Offset 0x11 Length 10 Length (physical) 10
Col2 = bbbbbbbbbb

Slot 1 Column 4 Offset 0x0 Length 0 Length (physical) 0
Col3 = [NULL]


05. Row Structure with the data (Row 2)

The NULL bitmap in the second row represents binary value of 00001010, which shows that Col1 and Col3 are NULL. Even though the table has three variable-length columns, number of variable-length columns in the row indicates, that there are just two columns/slots in the offset-array. SQL Server does not maintain the information about the trailing NULL variable-length columns in the row.

You can reduce the size of data row by creating tables in the way, when variable-length columns that often have null values are the last ones in the table definition.

Let’s do the exercise and calculate the actual size of the data row in dbo.DataRows table. We will have:

2 bytes for Status Bits bytes + 2 bytes for fixed-length data length + 4 bytes for ID column storage + 2 bytes for number of column + 1 byte for null bitmap + 2 bytes for number of variable-length columns + 6 bytes (3 * 2 bytes) for variable-length offset array – (2 bytes * number of trailing variable-length columns with null value) + variable-length data + 2 bytes for slot array = 21 bytes to store fixed-length data and overhead + length of variable-length data – (2 bytes * number of trailing variable-length columns with null value).

This approach can help you to calculate the actual size of the data rows in the table. Do not forget, that non-clustered indexes do not have null bitmap array in case if there are no nullable columns in the index.

The fixed-length data and internal attributes must fit into 8,060 bytes available on the single data page. SQL Server does not allow you to create the table when this is not the case. For example, the code below produces an error.

create table dbo.BadTable
    Col1 char(4000),
    Col2 char(4060)

Msg 1701, Level 16, State 1, Line 1
Creating or altering table 'BadTable' failed because the minimum row size would be 8067, including 7 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.

Next: Large Objects Storage

Table of Content

SQL Server Storage Engine: Database Files and Filegroups

Recently I have received a few emails asking me to clarify a few things from the old blog posts I wrote way back in 2010. After I re-read those posts, I decided that it could make sense to refresh and rewrite some of them. I hope, it can be done better this time. 🙂

In the next a few months I will talk a bit about SQL Server Storage Engine covering how SQL Server stores the data; what is the format of data row and data page; what are the allocation maps; and so on. We will see how it goes and where to stop.

Today I will start writing a few words about SQL Server database files and filegroups in general.

SQL Server database is a collection of the objects that allow us to store and manipulate the data. In theory, SQL Server supports 32,767 databases per instance although the typical installation usually has just several databases. Obviously, the number of the databases SQL Server can handle depends on the load and hardware. It is not unusual to see the servers hosting dozens or even hundreds of small databases.

Every database consists of one or more transaction log and one or more data files. Transaction log stores the information about database transactions and all data modifications made by each session. Every time the data has been modified, SQL Server stores enough information in the transaction log to undo (rollback) or redo (replay) the action.

Every database has one primary data file, which, by default, has .mdf extension. In addition, every database can have secondary database files. Those files, by default, have .ndf extension.

All database files are grouped into the filegroups. Filegroup is the logical unit, which simplifies database administration. They allow the separation between logical object placement and physical database files. When you create the database objects-tables, for example-you specify in what filegroup they should be placed without worrying about underlying data files configuration.

The script shown below creates the database with name OrderEntryDb. That database consists of three filegroups. The primary filegroup has one data file stored on M: drive. Second filegroup- Entities– has one data file on N: drive. Last filegroup- Orders– has two data files stored on O: and P: drives. Finally, there is the transaction log file on L: drive.

create database [OrderEntryDb] on 
(name = N'OrderEntryDb', filename = N'm:\OEDb.mdf'),
filegroup [Entities] 
(name = N'OrderEntry_Entities_F1', filename = N'n:\OEEntities_F1.ndf'),
filegroup [Orders] 
(name = N'OrderEntry_Orders_F1', filename = N'o:\OEOrders_F1.ndf'),
(name = N'OrderEntry_Orders_F2', filename = N'p:\OEOrders_F2.ndf') 
log on
(name = N'OrderEntryDb_log', filename = N'l:\OrderEntryDb_log.ldf')

You can see the physical layout of the database and data files below. There are five disks with four data- and one transaction- log files. Dashed rectangles represent the filegroups.

01. Files and Filegroups

Ability to put multiple data files inside the filegroup allows us to spread the load across different storage devices, which would help to improve I/O performance of the system. Transaction log, on the other hand, does not benefit from the multiple files. SQL Server works with transaction log in sequential matter and multple log files just stay idle.

Let’s create a few tables in the database we created. The tables Clients and Articles are placed into Entities filegroup. The table Orders resides in Orders filegroup.

create table dbo.Customers
    -- Table columns
) on [Entities];

create table dbo.Articles
    -- Table columns
) on [Entities];

create table dbo.Orders
    -- Table columns
) on [Orders];

The physical layout of the tables in the database and disks is shown below.

02. Tables and Filegroups

The separation between logical object placement in the filegroups and physical database files allow us to fine-tune the database file layout getting the most from the storage subsystem. For example, independent software vendors (ISV), who are deploying their products to different customers, can adjust the number of database files based on underlying I/O configuration and expected amount of the data during deployment stage. Those changes would be transparent to the developers, who are placing the database objects to the filegroups rather than database files.

It is generally recommended to avoid using PRIMARY filegroup for anything but system objects. Creating separate filegroup or set of the filegroups for the user objects simplifies database administration and disaster recovery especially in case of the large databases.

You can specify initial file size and auto-growth parameters at time when you create the database or add new files to existing database. SQL Server uses proportional fill algorithm when choosing in what data file it should write data to. It writes an amount of data proportionally to the free space available in the files – more free space are in the file, more writes it would handle.

I would recommend that all files in the single filegroup would have the same initial size and auto-growth parameters with grow size defined in megabytes rather than percent. This would help proportional fill algorithm evenly balance write activities across data files.

Every time SQL Server grows the files, it fills newly allocated space in the files with zeros. That process blocks all sessions that need to write to the corresponding file or, in case of transaction log growth, generate transaction log records.

SQL Server always zeroing out transaction log and that behavior cannot be changed. Although, you can control if data files are zeroing out or not by enabling or disabling Instant File Initialization. Enabling Instant File Initialization helps to speed up data file growth and reduces the time required to create or restore the database.

There is the small security risk associated with Instant File Initialization. When this option is enabled, unallocated part of the data file can contain the information from the previously deleted OS files. Database administrators will be able to examine such data.

You can enable Instant File Initialization by adding SA_MANAGE_VOLUME_NAME permission also known as “Perform Volume Maintenance Task” to SQL Server startup account. This can be done under Local Security Policy management application (secpol.msc) as shown below. You need to open properties for “Perform volume maintenance task” permission and add SQL Server startup account to the list of users there.

03. Instant File Initialization: Local Security Policy

SQL Server checks if it has Instant File Initialization enabled on startup. You need to restart SQL Server service after you add corresponding permission. 

In order to check if permission is enabled, you can use the code from the listing below. This code sets two trace flags that forces SQL Server to put  additional information to the error log, creates the small database and reads the content of the log.

-- add more output to error log
dbcc traceon(3004,3605,-1)
create database Dummy
exec sp_readerrorlog
drop database Dummy
dbcc traceoff(3004,3605,-1)

In case, if Instant File Initialization is not enabled, SQL Server error log shows that SQL Server zeroing mdf data file in addition to zeroing log .ldf file as shown below. When Instant File Initialization is enabled, it would only mention zeroing of the log .ldf file.

04. Instant File Initialization: Checking if instant file initialization is enabled

Another important database option that controls the database file sizes is Auto Shrink. When this option is enabled, SQL Server regularly shrinks the database files, reduces their size and release space to operating system. This operation is very resource intensive and rarely useful – the database files grow up again after some time when new data comes to the system. Auto Shrink must never be enabled on the database. Moreover, Microsoft would remove that option in the future versions of the SQL Server.

Next: Data Pages and Data Rows

Table of Content