
Compressing LOB (XML) Data in the Database

Some time ago we discussed several techniques that can help reduce the database size. I received quite a few responses to that post, and several people asked if I could provide more details and examples. Today I will follow up on one of those methods: reducing the size of LOB data (in particular XML) in the database.

As you know, SQL Server stores the data of regular B-Tree indexes in three different sets of data pages called allocation units. The main data row structure and fixed-length data are stored in IN-ROW data pages. Variable-length data greater than 8,000 bytes in size is stored in LOB (large object) pages. Such data includes (max) columns, XML, CLR UDTs and a few other data types. Finally, variable-length data that does not exceed 8,000 bytes is stored either in IN-ROW data pages, when it fits into the page, or in ROW-OVERFLOW data pages. You can read more about it in the previous post and here.

The Enterprise Edition of SQL Server allows you to reduce the size of the data by implementing data compression. However, data compression is applied to IN-ROW data only; it does not compress ROW-OVERFLOW and LOB data. Any large objects that do not fit into IN-ROW data pages remain uncompressed.

Unfortunately, there is very little we can do to reduce the size of LOB data. SQL Server does not provide any built-in functionality to address it. The only remaining option is compressing it manually – either on the client or with CLR routines. Obviously, compression and decompression add overhead and, from the SQL Server load standpoint, it is preferable to do it on the client side. However, in a large number of cases you will need to access compressed data from T-SQL, and CLR integration is the only choice. An ideal implementation in that case would include compression and decompression code in both tiers and use CLR only when necessary.

Fortunately, the .Net implementation of the compression code is very simple and can be done with the DeflateStream or GZipStream classes. Below you can see the code of the CLR functions that perform compression and decompression. You can also download the entire project with the link at the end of the post.

/// <summary>
/// Compressing the data
/// </summary>
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true,
        DataAccess = DataAccessKind.None)]
public static SqlBytes BinaryCompress(SqlBytes input)
{
    if (input.IsNull)
        return SqlBytes.Null;

    using (MemoryStream result = new MemoryStream())
    {
        using (DeflateStream deflateStream = 
            new DeflateStream(result, CompressionMode.Compress, true))
        {
            deflateStream.Write(input.Buffer, 0, input.Buffer.Length);
            deflateStream.Flush();
            deflateStream.Close();
        }
        return new SqlBytes(result.ToArray());
    } 
}

/// <summary>
/// Decompressing the data
/// </summary>
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true,
        DataAccess = DataAccessKind.None)]
public static SqlBytes BinaryDecompress(SqlBytes input)
{
    if (input.IsNull)
        return SqlBytes.Null;
    
    int batchSize = 32768;
    byte[] buf = new byte[batchSize];

    using (MemoryStream result = new MemoryStream())
    {
        using (DeflateStream deflateStream = 
            new DeflateStream(input.Stream, CompressionMode.Decompress, true))
        {
            int bytesRead;
            while ((bytesRead = deflateStream.Read(buf, 0, batchSize)) > 0)
                result.Write(buf, 0, bytesRead);
        }
        return new SqlBytes(result.ToArray());
    } 
}

You can define the functions in the database with the following code (you either need to get the byte sequence of the compiled assembly from the demo script or compile the CLR project).

create assembly LOBCompress
authorization dbo
from /*..*/
go

create function dbo.BinaryCompress(@input varbinary (max))
returns varbinary (max)
as external name [LOBCompress].[Compress].[BinaryCompress];
go

create function dbo.BinaryDecompress(@input varbinary (max))
returns varbinary (max)
as external name [LOBCompress].[Compress].[BinaryDecompress];
go

Now let's see the process in action. Let's create the test table and populate it with some data.

create table dbo.DataWithXML
(
    ID int not null,
    Data xml not null,

    constraint PK_DataWithXML
    primary key clustered(ID)
)
go

declare
    @X xml

select @X = 
    (
        select *
        from master.sys.objects
        for xml raw, root('Data')
    )

;with n1(c) as (select 0 union all select 0) -- 2 rows
,n2(c) as (select 0 from n1 as t1 cross join n1 as t2) -- 4 rows
,n3(c) as (select 0 from n2 as t1 cross join n2 as t2) -- 16 rows
,n4(c) as (select 0 from n3 as t1 cross join n3 as t2) -- 256 rows
,n5(c) as (select 0 from n4 as t1 cross join n3 as t2) -- 4,096 rows
,ids(id) as (select row_number() over (order by (select null)) from n5)
insert into dbo.DataWithXML(ID,Data)
    select id, @X
    from Ids;

update dbo.DataWithXML
set Data.modify('replace value of (/Data/row/@object_id)[1]
with sql:column("ID")');

Let’s examine the size of the table and average size of XML there using the following queries:

select 
    index_id, partition_number, alloc_unit_type_desc
    ,index_level
    ,page_count
    ,page_count * 8 / 1024 as [Size MB]
from 
    sys.dm_db_index_physical_stats
    (
        db_id() /*Database */
        ,object_id(N'dbo.DataWithXML') /* Table (Object_ID) */
        ,1 /* Index ID */
        ,null /* Partition ID – NULL – all partitions */
        ,'detailed' /* Mode */
    )
go

select avg(datalength(Data)) as [Avg XML Size]
from dbo.DataWithXML;

As you can see in Figure 1, the LOB allocation units use 96MB of space to store the data. The average size of the XML data is 20,897 bytes per row. It is also worth mentioning that the actual storage size for the XML is three data pages per row, or 24,576 bytes.

1. Uncompressed data: Storage space and XML Size

Let's see what we can achieve by using compression. The code below creates another table and copies the data there, compressing it on the fly. As you see, the data is stored in a varbinary(max) column.

create table dbo.DataWithCompressedXML
(
    ID int not null,
    Data varbinary(max) not null,

    constraint PK_DataWithCompressedXML
    primary key clustered(ID)
)
go

insert into dbo.DataWithCompressedXML(ID,Data)
    select ID, dbo.BinaryCompress(convert(varbinary(max),Data))
    from dbo.DataWithXML

Let’s check the size of the data in compressed table:

select avg(datalength(Data)) as [Uncompressed] 
from dbo.DataWithXML;

select avg(datalength(Data)) as [Compressed] 
from dbo.DataWithCompressedXML
go

select 
    index_id, partition_number, alloc_unit_type_desc
    ,index_level
    ,page_count
    ,page_count * 8 / 1024 as [Size MB]
from 
    sys.dm_db_index_physical_stats
    (
        db_id() /*Database */
        ,object_id(N'dbo.DataWithCompressedXML') /* Table (Object_ID) */
        ,1 /* Index ID */
        ,null /* Partition ID – NULL – all partitions */
        ,'detailed' /* Mode */
    )

As you can see in Figure 2, we were able to reduce the size of the table from 96MB to 10MB, decreasing the size of the XML in every row from 20,897 to 2,674 bytes.

2. Compressed data: Storage size and compressed XML size

It is also worth noting that in our case the size of the compressed data is less than 8,000 bytes, and SQL Server was able to accommodate all of it using IN-ROW allocation units. Even though the new table is almost 10 times smaller than the old one, it has significantly more IN-ROW data pages in the index. This could introduce performance side effects in some cases – for example, when the system has poorly optimized queries that perform clustered index scans. Again, it becomes an issue only if the compressed data is less than 8,000 bytes; otherwise, SQL Server will still store it using LOB data pages.

Obviously, this implementation requires schema and code changes. We can mitigate that a little bit by abstracting the table with a view, as shown below.

create view dbo.vDataWithXML(ID, Data)
as
    select ID, convert(xml,dbo.BinaryDecompress(Data))
    from dbo.DataWithCompressedXML

We can even create an INSTEAD OF trigger (yuck!) on the view to minimize the changes, if absolutely needed.
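A minimal sketch of such a trigger, handling inserts only and reusing the objects defined above (the trigger name is arbitrary):

create trigger dbo.trgvDataWithXML_Insert on dbo.vDataWithXML
instead of insert
as
begin
    set nocount on
    -- compress the incoming XML and store it in the underlying table
    insert into dbo.DataWithCompressedXML(ID, Data)
        select ID, dbo.BinaryCompress(convert(varbinary(max), Data))
        from inserted;
end
go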

Speaking of the overhead: compression and decompression are CPU intensive, and there is a performance penalty for calling CLR functions. It is not that noticeable when you need to decompress a single row or very few rows; however, it could be very significant on large data sets – for example, when you need to decompress and shred XML and use some of its elements in the where clause of a query. The code below shows such an example.

set statistics time on
select count(*) 
from dbo.DataWithXML 
where Data.value('(/Data/row/@object_id)[1]','int') = 3;

select count(*) 
from dbo.vDataWithXML 
where Data.value('(/Data/row/@object_id)[1]','int') = 3;
set statistics time off

The second statement, which accesses the compressed table, has to decompress the XML for every row in the table. The execution times of the statements on my laptop are 116ms and 6,899ms respectively. As you see, the compression definitely comes at a cost.

One of the ways to reduce that overhead is storing the XML elements that are used in the queries in separate table columns. Unfortunately, the by-the-book approach with persisted calculated columns does not always work. It is possible to create and persist such a column using user-defined functions; however, SQL Server would ignore it in some cases. The code below shows an example that creates a calculated column containing the attribute we use in our queries.

create function dbo.fnGetCompressedObjectId(@Compressed varbinary(max))
returns int
with schemabinding
as
begin
    return (convert(xml,dbo.BinaryDecompress(@Compressed))
                .value('(/Data/row/@object_id)[1]','int'))
end
go	 

alter table dbo.DataWithCompressedXML
add
    ObjId as dbo.fnGetCompressedObjectId(Data)
        persisted
go

-- It is a good practice to rebuild index after alteration
alter index PK_DataWithCompressedXML 
on dbo.DataWithCompressedXML rebuild
go

alter view dbo.vDataWithXML(ID, Data, ObjId)
as
    select ID, convert(xml,dbo.BinaryDecompress(Data)), ObjId
    from dbo.DataWithCompressedXML
go

Unfortunately, if you run the following query: select count(*) from dbo.vDataWithXML where ObjId = 3, you'll notice that SQL Server recalculates the value of the calculated column even though it is persisted and the functions are defined as deterministic and precise. This is simply a limitation of the Query Optimizer. Figure 3 shows that ObjId is recalculated.

3. Execution plan of the query

There is still a way to use persisted calculated columns. For example, you can define a CLR function that decompresses and parses the XML and returns ObjId as an integer. Something like this:

.Net CLR:
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true,
            DataAccess = DataAccessKind.None)]
public static SqlInt32 GetObjId(SqlBytes input)
{
    if (input.IsNull)
        return SqlInt32.Null;
    /* Parsing XML with XmlReader and return ObjId attribute */
}

SQL: 
create function dbo.GetObjId(@input varbinary (max))
returns int
as external name [LOBCompress].[Compress].[GetObjId]
go

alter table dbo.DataWithCompressedXML drop column ObjId
go

alter table dbo.DataWithCompressedXML
add
    ObjId as dbo.GetObjId(Data)
        persisted
go

alter index PK_DataWithCompressedXML 
on dbo.DataWithCompressedXML rebuild
go

Now, if you run the previous query: select count(*) from dbo.vDataWithXML where ObjId = 3, it works just fine and does not recalculate the column value. Figure 4 shows the execution plan in this case.

4. Execution plan that utilizes calculated column

It is worth mentioning that the execution time of this query on my laptop is just 3 milliseconds compared to 116 milliseconds for the query against the dbo.DataWithXML table. The query is significantly faster because it does not need to shred the XML to obtain the ObjId value.

Obviously, creating a separate CLR method for each calculated column can lead to some coding overhead. You can consider using regular columns instead and populating them in the code or even in triggers. Each approach has its own set of benefits and downsides based on the use cases implemented in the system.

Compressing LOB data in the database can help you significantly reduce the database size in a large number of cases. However, it adds the overhead of compressing and decompressing the data. In some cases that overhead is easily offset by the smaller data size and lower I/O and buffer pool usage; in any case, you should be careful and take all other factors into consideration.

Source code is available for download.

Optimizing SQL Server Spatial Queries with Bounding Box

SQL Server supports two data types to store spatial information – geometry and geography. Geometry supports planar, or Euclidean, flat-earth data. Geography supports an ellipsoidal round-earth surface. Both data types can be used to store location information, such as GPS latitude and longitude coordinates. The geography data type accounts for the roundness of the Earth and provides slightly better accuracy, although it has stricter requirements for the data. For example, the data must fit into a single hemisphere and polygons must be defined with a specific ring orientation.

Storing location information in the geometry data type introduces its own class of problems. It works fine, and often performs better than the geography data type; however, we cannot meaningfully calculate the distance between points – the unit of measure for the result is decimal degrees, which is useless on a non-flat surface.

Let's take a look at spatial data type performance in one specific use case: distance calculation between two points. A typical scenario is the search for points of interest (POI) close to a specific location. First, let's create three different tables storing POI information in different formats and populate them with some data.

01. Test tables

02. Populating test tables with data

The first table, dbo.Locations, stores the coordinates using the decimal(9,6) data type. The two other tables use the geography data type. Finally, the table dbo.LocationsGeoIndexed has the Location column indexed with a special type of index called a spatial index. Such indexes help improve the performance of some operations, for example distance calculation or checking whether objects intersect.

It is worth mentioning that the first table uses the decimal(9,6) data type rather than float. This decision saves us six bytes per pair of values and still provides accuracy that exceeds that of commercial-grade GPS receivers.
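A rough sketch of what the three tables could look like (the exact definitions are in the downloadable project; the name dbo.LocationsGeo for the middle table and the column layout are assumptions):

create table dbo.Locations
(
    Id int not null identity(1,1),
    Latitude decimal(9,6) not null,
    Longitude decimal(9,6) not null,

    constraint PK_Locations
    primary key clustered(Id)
)
go

create table dbo.LocationsGeo
(
    Id int not null identity(1,1),
    Location geography not null,

    constraint PK_LocationsGeo
    primary key clustered(Id)
)
go

create table dbo.LocationsGeoIndexed
(
    Id int not null identity(1,1),
    Location geography not null,

    constraint PK_LocationsGeoIndexed
    primary key clustered(Id)
)
go

create spatial index IDX_LocationsGeoIndexed_Location
on dbo.LocationsGeoIndexed(Location)
go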

Let's run the tests that measure the performance of queries that calculate the number of locations within one mile of Seattle city center. For the dbo.Locations table we will use the dbo.CalcDistanceCLR function, which we defined earlier. For the two other tables we will call the spatial method STDistance.

03. Test queries (table-wide lookup)
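A rough sketch of what those queries might look like (the Seattle coordinates are approximate, and the parameter order and unit of measure of dbo.CalcDistanceCLR are assumptions; STDistance on the geography type returns meters):

declare
    @Lat decimal(9,6) = 47.6062            -- approximate Seattle city center
    ,@Lon decimal(9,6) = -122.3321
    ,@Center geography = geography::Point(47.6062, -122.3321, 4326)
    ,@Distance int = 1609                   -- one mile in meters
    ,@MaxDistance float = 1609              -- threshold for the CLR function (unit depends on its implementation)

select count(*)
from dbo.Locations
where dbo.CalcDistanceCLR(Latitude, Longitude, @Lat, @Lon) < @MaxDistance;

select count(*)
from dbo.LocationsGeo
where Location.STDistance(@Center) < @Distance;

select count(*)
from dbo.LocationsGeoIndexed
where Location.STDistance(@Center) < @Distance;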

04. Execution plans (table-wide lookup)

The first and second queries perform a clustered index scan and calculate the distance for every row in the table. The last query uses the spatial index to look up the rows. We can see the execution times of the queries in the table below.

05. Execution time (Table-wide lookup)

As we see, the spatial index greatly benefits the query. It is also worth mentioning that, without the index, the CalcDistanceCLR method performs better than STDistance.

Although the spatial index greatly improves performance, it has its own limitations. It works in the scope of the entire table, and all other predicates are evaluated after the spatial index operations. That can lead to suboptimal plans in some cases.

As an example, let's look at the use case where we store POI information on a company-by-company basis.

06. Test table creation (company-wide lookup)
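A sketch of such a table (the name dbo.CompanyLocations and the column layout are assumptions):

create table dbo.CompanyLocations
(
    CompanyId int not null,
    LocationId int not null,
    Location geography not null,

    constraint PK_CompanyLocations
    primary key clustered(CompanyId, LocationId)
)
go

create spatial index IDX_CompanyLocations_Location
on dbo.CompanyLocations(Location)
go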

When we perform a POI lookup for a specific company, the CompanyId column must be included as a predicate in the queries. SQL Server has two choices for how to proceed. The first is a clustered index seek based on the CompanyId value, calling the STDistance method for every POI that belongs to the company. The other is using the spatial index to find all POIs within the specified distance, regardless of which company they belong to, and finally joining the result with the clustered index data. Let's run those queries.

07. Test queries (company-wide lookup)
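A sketch of the two variants, reusing the hypothetical table and index from the sketch above (the index hint forces the spatial index):

declare
    @CompanyId int = 1
    ,@Center geography = geography::Point(47.6062, -122.3321, 4326)
    ,@Distance int = 1609

-- clustered index seek: STDistance is called for every POI of the company
select count(*)
from dbo.CompanyLocations
where CompanyId = @CompanyId and Location.STDistance(@Center) < @Distance;

-- spatial index: all POIs in the area are found first, CompanyId is evaluated afterwards
select count(*)
from dbo.CompanyLocations with (index = IDX_CompanyLocations_Location)
where CompanyId = @CompanyId and Location.STDistance(@Center) < @Distance;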

Neither method is efficient when the table stores a large amount of data for a large number of companies. The execution plan of the first query, utilizing the clustered index seek, shows that it called STDistance 241,402 times – once for every POI that belongs to the company.

08. Execution plan (clustered index seek approach)

The execution plan of the second query shows that the spatial index lookup returned 550 rows – all POIs in the area, regardless of which company they belong to. SQL Server had to join those rows with the clustered index before evaluating the CompanyId predicate.

09. Execution plan (Spatial index approach)

One of the ways to solve this problem is called the bounding box approach. This method allows us to minimize the number of calculations by filtering out POIs that are outside the area of interest.

10. Bounding box

All the points we need to select reside in a circle with the location as its center and a radius equal to the specified distance. The only points we need to evaluate are those residing within the box that bounds that circle.

We can calculate the coordinates of the corner points of the box, persist them in the table, and use a regular nonclustered index to pre-filter the data. This allows us to minimize the number of expensive distance calculations we need to perform.

The calculation of the bounding box corner points can be done with the CLR table-valued function shown below.

11. Calculating bounding box corner points
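The original project implements this as a CLR table-valued function; purely as an illustration of the underlying math, a T-SQL approximation could look like the sketch below (one degree of latitude is roughly 69 miles, and a degree of longitude shrinks with the cosine of the latitude; the function name and parameters are assumptions):

create function dbo.CalcBoundingBox(@Lat float, @Lon float, @DistanceInMiles float)
returns table
as
return
(
    select
        @Lat - @DistanceInMiles / 69.0 as MinLat            -- ~69 miles per degree of latitude
        ,@Lat + @DistanceInMiles / 69.0 as MaxLat
        ,@Lon - @DistanceInMiles / (69.0 * cos(radians(@Lat))) as MinLon
        ,@Lon + @DistanceInMiles / (69.0 * cos(radians(@Lat))) as MaxLon
)
go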

Let's alter our table and add the bounding box points. We also need to create a nonclustered index to support our query.

12. Table alteration (adding bounding box corner points)
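A sketch of one possible alteration (the column and index names are assumptions; for point data the minimum and maximum values are simply the POI coordinates):

alter table dbo.CompanyLocations
add
    MinLat decimal(9,6) null
    ,MaxLat decimal(9,6) null
    ,MinLon decimal(9,6) null
    ,MaxLon decimal(9,6) null
go

-- for point data the corner values are just the POI coordinates
update dbo.CompanyLocations
set
    MinLat = Location.Lat, MaxLat = Location.Lat
    ,MinLon = Location.Long, MaxLon = Location.Long
go

create nonclustered index IDX_CompanyLocations_CompanyId_BoundingBox
on dbo.CompanyLocations(CompanyId, MinLat, MinLon)
include (MaxLat, MaxLon)
go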

Now, we can change the query to utilize the bounding box.

13. Query utilizing bounding box
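A sketch of such a query, reusing the hypothetical objects from the sketches above; the corner points are calculated once for the search center and used as a cheap range pre-filter, while STDistance is evaluated only for the rows that pass it:

declare
    @CompanyId int = 1
    ,@Lat float = 47.6062
    ,@Lon float = -122.3321
    ,@Center geography = geography::Point(47.6062, -122.3321, 4326)
    ,@Distance int = 1609
    ,@MinLat decimal(9,6), @MaxLat decimal(9,6)
    ,@MinLon decimal(9,6), @MaxLon decimal(9,6)

-- corner points of the box around the search center (one mile radius)
select @MinLat = MinLat, @MaxLat = MaxLat, @MinLon = MinLon, @MaxLon = MaxLon
from dbo.CalcBoundingBox(@Lat, @Lon, 1);

select count(*)
from dbo.CompanyLocations
where
    CompanyId = @CompanyId
    and MinLat between @MinLat and @MaxLat          -- cheap pre-filter via the nonclustered index
    and MinLon between @MinLon and @MaxLon
    and Location.STDistance(@Center) < @Distance;   -- expensive check only for the remaining rows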

14. Query utilizing bounding box (Execution plan)

As we see, the last query calculated the distance only 15 times. This is a significant improvement compared to the 241,402 calculations of the original query. The execution times are shown below:

15. Execution time (Company-wide lookup)

As we see, the bounding box outperforms both the clustered index seek and the spatial index lookup. Obviously, this is the case only when the bounding box reduces the number of calculations to a degree that offsets the overhead of the nonclustered index seek and key lookup operations. It is also worth mentioning that we do not need a spatial index with this approach.

We can use the bounding box for other use cases – for example, when we are checking whether a position belongs to an area defined by a polygon. The bounding box corner coordinates would store the minimum and maximum latitude/longitude of the polygon corner points. Similarly to the distance calculation, we would filter out locations outside the box before performing the expensive spatial method call that validates whether the point is within the polygon area.

Source code is available for download

CLR: Security considerations

Last time we compared the performance of Common Language Runtime (CLR) and T-SQL code in a few different areas. Today, I would like to discuss a few questions related to CLR integration and security. As before, I am not trying to argue whether we should or should not use CLR, but rather to outline a few things we must take into consideration.

SQL Server has CLR integration disabled by default. Although that does not prevent us from deploying a database with assemblies and CLR objects, we are not able to call CLR routines until CLR is enabled at the server level. That can create roadblocks for independent software vendors (ISVs) who are trying to deploy their systems in enterprise environments. Database and security administrators often oppose such a requirement, especially if the assemblies require a higher than SAFE permission set.
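For reference, enabling it is a server-level configuration change:

exec sp_configure 'clr enabled', 1
go
reconfigure
go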

It is worth mentioning that system CLR code is always enabled. We can use the system CLR types, such as HierarchyId, Geometry and Geography, regardless of this configuration setting.

The second important factor is that we cannot run code from assemblies with the EXTERNAL_ACCESS or UNSAFE permission sets unless they are signed with a key that has been granted such permissions. One option to "work around" this is marking the database as TRUSTWORTHY. However, that violates security best practices and, again, would not necessarily be acceptable in an enterprise environment.

Let's look at the right way of dealing with this requirement. As the first step, we will create a small CLR library with the EXTERNAL_ACCESS permission set. You can download the source code if needed.

Let's create the database and try to register the CLR library there. We will get an error, as shown below.

01. Registering assembly with EXTERNAL_ACCESS permission set

As we see, we cannot register the assembly as is. I am not going to demonstrate the approach of marking the database as TRUSTWORTHY; instead, I will show how we can sign the assembly.

As the first step, we need to generate a key pair file. We will use the sn.exe utility, which is part of Visual Studio and the Windows SDK, to do that.

02. Generating key pair file with sn.exe

Next, we need to add the generated key file to the CLR project. We can do it in the project properties window. Different versions of Visual Studio have slightly different UI there. The VS.Net 2012 UI looks like this:

03. Signing the project. VS.Net 2012 UI

When we click the Signing button, we will see another popup window where we can specify the key file.

04. Signing the project: VS.Net 2012 UI

After we compile the signed version of the CLR dll, we need to create a master key in the master database, then create an asymmetric key from the key pair file, create a login from that key, and grant that login the corresponding permission. After that we will be able to register the assembly.

05. Registering assembly
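The overall sequence looks like this (the file paths and the key, login, database and assembly names below are placeholders):

use master
go
-- the database master key in master is required to protect the asymmetric key
create master key encryption by password = 'Str0ngP@ssw0rd1'
go
create asymmetric key ExternalAccessKey
from executable file = 'C:\Assemblies\ExternalAccessCLR.dll'
go
create login ExternalAccessLogin
from asymmetric key ExternalAccessKey
go
grant external access assembly to ExternalAccessLogin
go
use TestDB -- the user database (name assumed)
go
create assembly ExternalAccessCLR
from 'C:\Assemblies\ExternalAccessCLR.dll'
with permission_set = external_access
go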

Now, everything would work just fine.

Finally, CLR objects that access data break ownership chaining, similarly to dynamic SQL. Let's create two small test tables and populate them with data:

06. Ownership chaining: Test tables
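A sketch of what the test tables could look like (only dbo.OrderLineItems is named in the text; the other names, columns and sample values are assumptions):

create table dbo.Orders
(
    OrderId int not null,
    OrderDate datetime not null,

    constraint PK_Orders
    primary key clustered(OrderId)
)
go

create table dbo.OrderLineItems
(
    OrderId int not null,
    OrderLineItemId int not null,
    Quantity decimal(9,3) not null,
    Price money not null,

    constraint PK_OrderLineItems
    primary key clustered(OrderId, OrderLineItemId)
)
go

insert into dbo.Orders(OrderId, OrderDate) values(1, getdate());
insert into dbo.OrderLineItems(OrderId, OrderLineItemId, Quantity, Price)
values(1, 1, 2, 10.00), (1, 2, 1, 25.00);
go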

Now let's create CLR and T-SQL functions that calculate the total amount for a specific OrderId provided as a parameter. After that, let's create the views that use those functions.

07. Ownership chaining: CLR function

08. Ownership chaining: T-SQL function and views
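A sketch of the T-SQL side (the function and view names are assumptions; dbo.CLRTotalAmount stands for the function registered from the CLR assembly):

create function dbo.TSQLTotalAmount(@OrderId int)
returns money
with schemabinding
as
begin
    return
    (
        select sum(Quantity * Price)
        from dbo.OrderLineItems
        where OrderId = @OrderId
    )
end
go

create view dbo.vOrdersTSQL(OrderId, OrderDate, TotalAmount)
as
    select OrderId, OrderDate, dbo.TSQLTotalAmount(OrderId)
    from dbo.Orders
go

-- dbo.CLRTotalAmount is the CLR function with the same signature
create view dbo.vOrdersCLR(OrderId, OrderDate, TotalAmount)
as
    select OrderId, OrderDate, dbo.CLRTotalAmount(OrderId)
    from dbo.Orders
go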

Finally, let’s create the database user and give him SELECT permissions on both views.

09. Ownership chaining: Creating database user
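A sketch, reusing the view names from the sketch above:

create user Bob without login
go
grant select on dbo.vOrdersTSQL to Bob;
grant select on dbo.vOrdersCLR to Bob;
go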

Now, if user Bob selects data from the view that uses the T-SQL function, everything works fine. Bob has SELECT permission on the view and, because of ownership chaining, does not need SELECT permissions on the underlying tables as long as the owner of the tables and the view is the same.

However, if user Bob tries to query the view that utilizes the CLR function, it does not work.

10. Ownership chaining: CLR breaks ownership chaining

User Bob needs to have SELECT permission on the OrderLineItems table for the CLR function to work.

11. Ownership chaining: Permission needs to be granted on the table

That behavior makes permission management more complicated when CLR is involved.

As I mentioned earlier, I am not trying to make a case against CLR. SQL Server Common Language Runtime integration is a very powerful technology. However, we need to keep in mind the performance and security implications that come with it.

Source code is available for download


CLR vs. T-SQL: Performance considerations

I am pretty sure all of us have read, or even participated in, quite a few heated discussions about Common Language Runtime (CLR) code in Microsoft SQL Server. Some people state that CLR code works faster than T-SQL; others oppose that view. As with other SQL Server technologies, there is no simple answer to that question. The two technologies are different in nature and should be used for different tasks. T-SQL is an interpreted language that is optimized for set-based logic and data access. CLR, on the other hand, produces compiled code that works best for imperative, procedural-style code.

Even with imperative code, we need to decide whether we want to implement it in CLR or as client-side code, perhaps running on application servers. CLR runs within the SQL Server process. While, on one hand, it eliminates network traffic and can provide the best performance due to its "closeness" to the data, it also adds load to SQL Server. It is usually easier and cheaper to scale out application servers than to upgrade the SQL Server box.

There are some cases when we must use CLR code, though. For example, think about queries that perform RegEx evaluations as part of the where clause. It would be inefficient to move such evaluations to the client code, and there is no native regular expression support in SQL Server, so CLR is the only choice we have. In other cases, when procedural-style logic can be moved to the application servers, we should consider that option – especially when the application servers reside close to SQL Server and network latency and throughput are not an issue.

Today we will compare the performance of CLR and T-SQL in a few different areas. I am not trying to answer the question "which technology is better" – as usual, it fits into the "it depends" category. What I want to do is look at how the technologies behave in similar tasks where they can be interchanged.

Before we begin, let’s create the table and populate it with some data.

01. Test table
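A sketch of such a table, populated with the same row-generator pattern used earlier in this series (the name dbo.Numbers appears later in the text; the 65,536-row count is an arbitrary choice):

create table dbo.Numbers
(
    Num int not null,

    constraint PK_Numbers
    primary key clustered(Num)
)
go

;with n1(c) as (select 0 union all select 0) -- 2 rows
,n2(c) as (select 0 from n1 as t1 cross join n1 as t2) -- 4 rows
,n3(c) as (select 0 from n2 as t1 cross join n2 as t2) -- 16 rows
,n4(c) as (select 0 from n3 as t1 cross join n3 as t2) -- 256 rows
,n5(c) as (select 0 from n4 as t1 cross join n4 as t2) -- 65,536 rows
,nums(num) as (select row_number() over (order by (select null)) from n5)
insert into dbo.Numbers(Num)
    select num from nums;
go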

As the first step, let's compare user-defined function invocation cost. We will use a simple function that accepts an integer value as a parameter and returns 1 if that value is even. We can see the CLR C# implementation below.

02. Invocation overhead: CLR code

As we can see, there are attributes specified for each function. Those attributes describe different aspects of the UDF behavior and can help the Query Optimizer generate more efficient execution plans. I recommend specifying them explicitly rather than relying on the default values.

One of the attributes – DataAccess – indicates whether the function performs any data access. When this is the case, SQL Server calls the function in a different context that allows access to the data. Setting up that context introduces additional overhead during the function call, which we will see in a few minutes.

The T-SQL implementation of those functions would look like this:

03. Invocation overhead: T-SQL code
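A sketch of the two T-SQL variants – a scalar UDF and an inline table-valued function (the function names are assumptions):

create function dbo.ScalarEven(@Num int)
returns int
with schemabinding
as
begin
    return (case when @Num % 2 = 0 then 1 else 0 end)
end
go

create function dbo.InlineEven(@Num int)
returns table
as
return
(
    select case when @Num % 2 = 0 then 1 else 0 end as IsEven
)
go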

Let's measure the average execution time of the statements shown below. Obviously, different hardware leads to different execution times, although the trends would be the same.

04. Invocation overhead: Test script

Each statement performs a clustered index scan of the dbo.Numbers table and checks whether the Num column is even for every row in the table. For the CLR and T-SQL scalar user-defined functions, that introduces an actual function call. The inline table-valued function, on the other hand, performs the calculation inline, without the function call overhead.

05. Invocation overhead: Avg. Execution Time

As we can see, the CLR UDF without the data access context performs about four times faster than the T-SQL scalar function. Even though establishing the data access context introduces additional overhead and increases execution time, it is still faster than the T-SQL scalar UDF implementation.

The key point here, though, is that in this particular example the best performance is achieved by not using functions at all rather than by converting the T-SQL implementation to a CLR UDF. Even with the CLR UDF, the overhead of the function call is much higher than that of inline calculations.

Unfortunately, this is not always possible. While we should always think about code refactoring as an option, there are cases when a CLR implementation can outperform inline calculations even with all the overhead it introduces. We are talking about mathematical calculations, string manipulations, XML parsing and serialization – to name just a few. Let's test the performance of functions that calculate the distance between two points defined by latitude and longitude.

06. Distance Calculation: CLR

07: Calculating Distance: Scalar T-SQL UDF

08. Calculating Distance: Inline function
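As an illustration only, an inline table-valued version of the distance calculation could look like the sketch below, using the haversine formula with the Earth radius expressed in feet (the original code may use a different formula or unit of measure):

create function dbo.CalcDistanceInline(@Lat1 float, @Lon1 float, @Lat2 float, @Lon2 float)
returns table
as
return
(
    select
        2.0 * 20902231.0 *  -- approximate Earth radius in feet (assumption)
        asin(sqrt(
            power(sin(radians(@Lat2 - @Lat1) / 2.0), 2) +
            cos(radians(@Lat1)) * cos(radians(@Lat2)) *
            power(sin(radians(@Lon2 - @Lon1) / 2.0), 2)
        )) as Distance
)
go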

09. Calculating Distance: Test script

10. Calculating Distance: Execution Time

We can see that the CLR UDF runs almost two times faster than the inline table-valued function and more than five times faster than the T-SQL scalar UDF, even with all the calling overhead involved.

Now let's look at data access performance. The first test compares the performance of individual DML statements issued from T-SQL and CLR stored procedures. In that test we will create procedures that calculate the number of rows in the dbo.Numbers table for a specific Num interval provided as parameters. We can see the implementation below.

11. Data Access: CLR

12. Data Access: T-SQL
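A sketch of the T-SQL procedure, issuing one SELECT per value in the interval (the procedure name and the exact per-row logic are assumptions):

create proc dbo.TSQLDataAccess(@MinNum int, @MaxNum int)
as
begin
    set nocount on

    declare
        @Num int = @MinNum
        ,@Cnt int

    -- one individual SELECT statement per value in the interval
    while @Num <= @MaxNum
    begin
        select @Cnt = count(*)
        from dbo.Numbers
        where Num = @Num;

        set @Num += 1;
    end
end
go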

The table below shows the average execution time of the stored procedures with parameters that lead to 50,000 individual SELECT statements. As we can see, data access from CLR code is much less efficient, working about five times slower than data access from T-SQL.

13. Data Access: Individual Statements – execution time

Now let's compare the performance of row-by-row processing using a T-SQL cursor and the .Net SqlDataReader class.

14. SqlDataReader vs. cursor: CLR

15. SqlDataReader vs. cursor: T-SQL
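A sketch of the cursor-based T-SQL loop (the per-row work – here a running sum – is an assumption):

declare
    @Num int
    ,@Sum bigint = 0

declare curNums cursor fast_forward for
    select Num from dbo.Numbers;

open curNums;
fetch next from curNums into @Num;

while @@fetch_status = 0
begin
    set @Sum += @Num;   -- some trivial row-by-row processing
    fetch next from curNums into @Num;
end

close curNums;
deallocate curNums;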

16. SqlDataReader vs. Cursor: Execution time

As we can see, the SqlDataReader implementation is faster.

Finally, let's look at the performance of CLR aggregates. We will use a standard implementation of an aggregate that concatenates values into a comma-separated string.

17. Building CSV list: Aggregate

As with user-defined functions, it is extremely important to set the attributes that tell the Query Optimizer about the CLR aggregate's behavior and implementation. This helps generate more efficient execution plans and prevents incorrect results due to optimization. It is also important to specify the MaxByteSize attribute, which defines the maximum size of the aggregate output. In our case we set it to -1, which means the aggregate can hold up to 2GB of data.

Speaking of the T-SQL implementation, let's look first at the approach that uses a SQL variable to hold the intermediate result. That approach implements imperative row-by-row processing under the hood.

As another option, let's use the FOR XML PATH technique. It is worth mentioning that this technique can produce different results by replacing XML special characters with character entities. For example, if our values contain the < character, it will be replaced with the &lt; string.

Our test code would look like this:

18. Building CSV list: T-SQL
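Sketches of the two T-SQL approaches, run here against the Num column of dbo.Numbers (the original test may use a different rowset):

-- SQL variable approach
declare @CSV varchar(max) = '';

select @CSV += ',' + convert(varchar(11), Num)
from dbo.Numbers;

set @CSV = stuff(@CSV, 1, 1, '');

-- FOR XML PATH approach
select stuff(
    (
        select ',' + convert(varchar(11), Num)
        from dbo.Numbers
        order by Num
        for xml path('')
    ), 1, 1, '') as CSV;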

When we compare the performance on different rowset sizes, we see the results below.

19. Building CSV list: Execution time

As we can see, the CLR aggregate has a slightly higher startup cost than the T-SQL variable approach, although that difference quickly disappears on larger rowsets. The performance of both the CLR aggregate and the FOR XML PATH method depends linearly on the number of rows to aggregate, while the performance of the SQL variable approach degrades much faster than linearly. SQL Server needs to instantiate a new string every time it concatenates a new value, which is inefficient, especially when the string becomes large.

The key point I would like to make with this example is that we should always look for opportunities to replace imperative code with declarative set-based logic. While CLR usually outperforms procedural-style T-SQL code, set-based logic can outperform both of them.

While there are cases when the choice between the technologies is obvious, there are also cases when it is not clear. Think about a scalar UDF that needs to perform some data access: the lower invocation cost of a CLR function can be offset by the higher cost of data access from within it. Similarly, inline mathematical calculations in T-SQL could be slower than in CLR even with all the invocation overhead involved. In those cases, we must test the different approaches and find the one that works best in each particular case.

Source code is available for download
