The requirement of searching data by part of the value is very common in business applications. All of us are familiar with it – users want to be able to search by entering just a few letters from the client or article name; locate postal address by typing just a part of the street; or do something similar in dozens of the other cases.
Obviously, there are many ways to skin the cat and implement such a search. In some complex and performance-critical cases we can use external to SQL Server solutions, for example Apache Lucene. In others, we can use Full-Text Search or even do the brute force approach with LIKE operator. Today, I would like to talk about the latter one. After all, even though LIKE is not necessarily the fastest solution, its performance could often be acceptable especially with relatively small tables. Last but not least, it comes with very little implementation cost.
Unfortunately, LIKE operator cannot use Index Seek unless you are performing the prefix search. In that case, when you are searching by the beginning of the string – for example, LastName LIKE ‘Smit%’ condition – SQL Server is able to locate subset of the data where predicate needs to be evaluated. In our example, the predicate, in the nutshell, is the following condition: LastName >= ‘Smit’ and LastName < ‘Smiu’, which is perfectly SARGable and suitable for the fast Index Seek.
This is not the case, however, when LIKE expression requires SQL Server to find patterns in the middle of the string – for example, in LastName LIKE ‘%Smit%’ situation. The only option for SQL Server is evaluating expression against every row from the index, which leads to the Index Scan.
As strange as it sounds, you can often improve search performance by challenging business requirements. Even though customers want to be able to search by substring, in the very large number of cases prefix search would do. For example, when you are calling customer service somewhere and asking them to look up your account, you’d usually provide them the first few letters of your name rather than some letters from the middle of it.
Unfortunately, changing the business requirements is not always possible. In some cases, we do not have any choices but implementing substring search. In those occasions, there are two ways to improve the performance – reduce the number of rows where LIKE predicate must be evaluated and reduce predicate evaluation time.
Reducing the number of rows for predicate evaluation greatly depends on the indexes. While you cannot do much when LIKE is the only predicate in the query, such condition is usually an exception rather than the rule. In case when query has multiple predicates, the right composite indexes would help. The key here is adding evaluation column as the key or included column of the index and avoiding post-Key Lookup predicate evaluation.
For example, consider multi-tenant shopping cart system and the query that need to return the list of the articles that belong to particular tenant. The query could be implemented as follows:
select .. from dbo.Articles where TenantID = @TenantID and Name LIKE '%' + @paramName + '%' order by Name
Such query would benefit from the following index, which will limit LIKE predicate evaluation to the single tenant scope. As the side note, adding Name as the key rather than included column would help to avoid SORT operator in the execution plan – data in the index would be sorted according to order by clause of the query.
create index IDX_Articles_TenantID_Name on dbo.Articles(TenantID,Name)
Reducing the predicate evaluation time is the trickier subject. Fortunately, you can often achieve significant performance increase by utilizing binary collations during such an evaluation.
Let’s take a look at the example. As the first step, we will create a table and populate it with some random data. Col1, Col2 and Col3 columns are populated with randomly generated GUIDs and VarCol and NVarCol store concatenated values from them using SQL_Latin1_General_CP1_CI_AS collations. Finally, I am creating nonclustered indexes on VarCol and NVarCol columns to minimize amount of data pages SQL Server need to read during our tests and make those tests consistent.
create table dbo.Data ( ID int not null, Col1 uniqueidentifier not null default NEWID(), Col2 uniqueidentifier not null default NEWID(), Col3 uniqueidentifier not null default NEWID(), VarCol varchar(108) null, NVarCol nvarchar(108) null, constraint PK_Data primary key clustered(ID) ) go ;with N1(C) as (select 0 union all select 0) -- 2 rows ,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows ,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows ,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows ,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows ,IDs(ID) as (select row_number() over (order by (select NULL)) from N5) insert into dbo.Data(ID) select ID from IDs; update dbo.Data set VarCol = convert(varchar(36),Col1) + convert(varchar(36),Col2) + convert(varchar(36),Col3) ,NVarCol = convert(nvarchar(36),Col1) + convert(nvarchar(36),Col2) + convert(nvarchar(36),Col3) go create nonclustered index IDX_Data_VarCol on dbo.Data(VarCol); create nonclustered index IDX_Data_NVarCol on dbo.Data(NVarCol);
Next, let’s randomly choose substring to search using one of the rows from the table. You would obviously have different data in your case/
select * from dbo.Data where ID = 10000
Now, let’s run SELECT statements that perform substring search against both columns and measure the execution time of the selects. I am disabling parallelism with MAXDOP 1 hint to avoid any parallelism overhead during queries execution.
declare @V varchar(32) = '9D81AB12' ,@NV nvarchar(32) = N'9D81AB12' set statistics time on select count(*) from dbo.Data where VarCol like '%' + @V + '%' option (maxdop 1); select count(*) from dbo.Data where NVarCol like '%' + @NV + '%' option (maxdop 1); set statistics time off
In my environment, CPU times of the first and second statements are 203 and 844 milliseconds respectively. Obviously, you would get the different results in your system and performance would greatly depend on the data.
It is also worth mentioning, that index on NVarCol is about two times larger than index on VarCol column due to the fact, that unicode data uses 2-bytes per character to store the data as the opposite to 1-byte per character with non-unicode varchars. However, the overhead of the extra logical reads is minimal.
Now let’s measure execution time using binary collations. First, we will alter the table adding two calculated columns that represent our strings in binary collation and creating nonclustered indexes afterwards.
alter table dbo.Data add VarColBin as upper(VarCol) collate Latin1_General_100_Bin2 persisted; alter table dbo.Data add NVarColBin as upper(NVarCol) collate Latin1_General_100_Bin2 persisted; create nonclustered index IDX_Data_VarColBin on dbo.Data(VarColBin); create nonclustered index IDX_Data_NVarColBin on dbo.Data(NVarColBin);
PERSISTED keyword tells SQL Server to materialize calculated columns and store them in the data row. Technically speaking, you do not need to persist calculated columns in our case – you can index them even when they are not persisted, which helps to avoid clustered index row size increase. However, you need to be careful making sure that SQL Server always uses nonclustered index for the search. Otherwise, search performance could be even slower than with nonbinary collation – SQL Server will need to calculate column values on the fly before evaluating LIKE predicate.
Another very important factor to remember is case sensitivity of the binary collation. You need to convert your data to upper or lower case if your system uses case-insensitive collation. Obviously, you need to use the same conversion rule for the search predicate.
declare @V varchar(32) = '9D81AB12' ,@NV nvarchar(32) = N'9D81AB12' set statistics time on select count(*) from dbo.Data where VarColBin like '%' + Upper(@V) + '%' collate Latin1_General_100_Bin2 option (maxdop 1); select count(*) from dbo.Data where NVarColBin like '%' + Upper(@NV) + '%' collate Latin1_General_100_Bin2 option (maxdop 1); set statistics time off
The execution times in my environment are 125 and 62 milliseconds respectively. You can see all the results in Figure 2 below.
As you can see, we got more than 13 times performance improvement in case of the unicode data. Performance improvement with non-unicode strings are less drastic; however, it still ran about 40% faster than before. It is also worth mentioning that with the binary collation, predicate evaluation against unicode data is faster than with varchar data. At least with my test data.
Lastly, the word of caution. While that technique can be help to improve performance of substring search and reduce CPU load in the system, you should not treat it as the replacement of the proper indexing. After all, you can get much better ROI by investing your time into query optimization. However, it is the great technique to use in conjunction with query optimization and index tuning when you need to get the most from your queries.