
Data Partitioning – When to partition the data

Last week we discussed why we want to partition the data. I hope at this point we all agree that data partitioning can help us with large tables. Which leads to the question – when does a table become “big enough” for partitioning?

This is kind of a theoretical discussion. I have heard a lot of different opinions, varying from hundreds of MBs to TBs. We can spend a lot of time discussing that and will never end up with the right answer. Or, better said, all answers would be right ones. Obviously your mileage may vary – a lot depends on your system, environment and hardware. But my personal take is very simple. If I expect my system to collect a large amount of data, the right time (for me) to partition the data is from the very beginning – during initial development and testing, before the system goes to production.

Even if data partitioning introduces some overhead on development and testing, there are two particular reasons why I’d suggest doing it early. Most importantly – data partitioning changes execution plans. As a result, when you finally implement it you need to re-test the entire system. And that could be very expensive and time consuming, not to mention the possibility of another round of performance tuning. It’s much easier to deal with potential issues during initial development and testing, especially when you don’t have a lot of data and the impact of sub-optimal execution plans is smaller.

The second reason is that it’s very hard (if possible at all) to partition the data while keeping the system online. In most cases we are going to deal with exclusive schema modification (SCH-M) locks that prevent any access to the table, even in the read uncommitted transaction isolation level. And if you think about how long the physical movement of hundreds of GBs or TBs of data could take... well, you get the idea.

Again, everything depends on the system and requirements. There is no need to spend time and effort on data partitioning if you don’t expect to have a lot of data. But remember that the situation and requirements can change. There is always the possibility that the project will be more successful than we expect it to be 🙂

Next: Scaling out by separating operational and historical data

SQL Server MCM – It was a long journey

It was the day when I did 2 presentations at TechEd Russia in Moscow. I came back from that event completely exhausted and found the email saying that I had passed the Microsoft Certified Master lab exam. I had obviously expected the email – it was about a month after I took the lab in Seattle during the PASS Summit. What I did not expect was that I passed. I had already started to think about how to adjust my home lab and practice for a retake. But perhaps I’m getting a bit ahead of myself with the story.

I passed my first MCP exam in 2007. Originally I started to obtain the certificates with a simple purpose – to improve my resume. But that changed quickly. It’s strange – you don’t appreciate learning much when you are young – there are always way too many other things to be excited about. And, as with many other things, you miss those challenges when you get older. I really enjoyed the process of learning and was happy and proud of myself after each test I passed. Last but not least, it also helped with my day-to-day job. Obviously it’s impossible to retain all the information from the study materials, but you at least know what is available and where to look when needed. No need to reinvent the wheel. Anyway, to make a long story short – in 6 months I got my MCPD and MCITP certificates.

And then I was stuck. Not in terms of learning, of course, but speaking of certifications – there was nowhere to go. As much as I wanted to attend the MCM training in Redmond, I knew that my company would never approve it – neither from a cost nor from a time perspective. So when Joe Sack announced during PASS Summit 2010 that the training was no longer required, I was quite excited. And quite lazy, of course. There were other priorities I had to take care of, so I lost almost a year before I even started to prepare for the exams.

I took the Knowledge exam in July 2012. That was a very interesting experience. Needless to say, the exam content was much more challenging than the MCITP ones. The main difference, from my point of view, is that the questions require you to think, not just to know the answer. And of course, there were a lot of questions and not much time. I particularly enjoyed that there were very few questions from the “it depends” category without enough information to make a decision. There were still some, but far fewer compared to the MCITP/MCTS exams.

I left the Prometric center with very mixed impressions – gave myself 50/50 chances. But a month later I found out that I had passed with a relatively high score – higher than I expected. And I started to prepare for the lab. I took it in Seattle during the PASS Summit last month. boB Taylor put some clarifications around the format. I, of course, cannot comment on the content, but to give you a tip – your main enemy is not the complexity of the scenarios but time. You are out of time, you are under stress. And you are making mistakes. I’d probably disagree with boB’s advice about reading all the scenarios first, though – I spent 20 minutes on that – time I desperately needed later. What I would do is scan through the scenarios and immediately take care of those you feel comfortable with. Reduce the scope as fast as you can. And don’t kill yourself on testing. Do the basic tests, of course, but don’t spend any more time than absolutely needed.

How to prepare yourself for the exams? First, and most importantly, you need to understand how SQL Server works internally. It’s not enough to know, for example, how to restore a database from a backup. You also need to understand what happens under the hood when you run the RESTORE command. Next, practice, practice and practice. One of my challenges, for example, is that I rarely do low-level day-to-day DBA work. For example, I usually design the backup strategy, but I rarely implement it myself. Of course, I know how to do that, but sometimes it took me a couple of extra minutes to find the exact syntax in BOL. And those minutes are precious.

Speaking of specific resources – my first advice would be to start with Kalen Delaney’s books about SQL Server internals. After that, I’d focus on the MCM training videos (huge thanks to the entire SQLskills crew). And, of course, work through the MCM reading list. It is huge, but the material clicks together over time – the further you are along the way, the easier it gets.

In any case, good luck! As for me – there is still a long way to go. Once you start, you won’t be able to stop. 🙂

Data Partitioning – Why do we want to partition the data?

Even though we spoke about table partitioning a long time ago, I’d like to circle back to this question from a different, architectural standpoint. In the next few posts we will discuss why we want to partition data and what options we have to do that.

Data partitioning is a complex and time-consuming process, so as the first step I’d like to explain why we want to go through all the effort to do it.

Reason #1 – Without partitioning everything is in the same place.

Quite obvious, isn’t it? And it’s not necessarily bad. One advantage of having all data in the same place is that it makes development and optimization very easy. Execution plans are predictable, but that’s about it. The biggest problem is that when everything is in the same table (or, better say, partition), it is stored on the same filegroup, the same set of files and the same disk array. Well, technically speaking, we can separate our clustered and non-clustered indexes between different filegroups, but in the end that introduces its own set of issues, especially in disaster recovery situations.

The problem here is that in every system we have data that belongs to different “groups” or types – operational and historical. The duration of the operational period (which is basically the customer-facing OLTP portion of the system) varies – it could be a day, a week, a month, maybe even year(s) – but in a large system there is always additional historical data, and often there is a lot of it.

Obviously we care more about operational (customer-facing) performance and would like to put the operational data on the fastest disk array possible. But when everything is in the same place we don’t have a lot of choices – we either need to buy a disk array big enough to store the historical data, which could be terribly expensive and, generally speaking, a waste of money, or compromise and go with slower disks due to budget constraints.

Reason #2 – The same schema, indexes and compression

Operational and historical data usually have different access patterns. As I mentioned above, operational data is usually OLTP-related. So we have a lot of short, optimized queries. Historical data is often used for analysis and reporting – that is, we are talking about data warehouse-type access. Different access patterns require different index structures. Again, when we have everything in one table we don’t have a lot of choices. We either create additional indexes to support queries against the historical data and hurt the performance of the OLTP part of the system or, alternatively, skip those indexes and, as a result, get bad performance for queries against the historical portion of the data.

Another thing is that in some cases we want to have slightly different schemas for those portions of the data. For example, operational data can have some additional columns used for processing. There is a good chance that the historical data does not need them, and when we have a lot of records every byte counts.

Lastly, we cannot partially compress our data. Compression could help a lot with historical data, which is relatively static – we improve performance by reducing the number of pages and, as a result, the amount of I/O. On the other hand, for operational data compression usually hurts performance because the data changes quite often.

Reason #3 – Index maintenance

Same thing as above. Generally speaking, we don’t need to rebuild or reorganize the indexes for historical data, which is static. But without partitioning there is no way to rebuild/reorganize just part of the data.

Reason #4 – Backup strategy

Same thing again. All the data is in the same filegroup. We cannot exclude the historical portion from the backups. As a result, it increases the backup time and the size of the backup file, and introduces additional overhead during the process.

Reason #5 – Disaster recovery

One of the biggest benefits of the Enterprise Edition of SQL Server is piecemeal restore. It allows us to bring the system partially online on a filegroup-by-filegroup basis. If we had operational and historical data separated into different filegroups, we could restore the operational part first and make the system available to the customers while we are working on the historical part. With everything in the same non-partitioned table that’s impossible. We need to restore everything before the system becomes available and online.

Reason #6 – Statistics

SQL Server needs to estimate the number of rows at every step of the execution plan to make it efficient. In order to do so SQL Server uses statistics, and the histogram in particular. The histogram contains some values from the key as well as information about the data distribution in the intervals between those values. The problem is that the histogram contains at most 200 steps/values. So the more data we have in the table, the bigger the intervals are. Estimates are approximated within each interval and, as a result, become less accurate.

Another thing worth mentioning is that by default SQL Server updates statistics only after 20% of the key values have been modified. So if we have 1 million rows in the table, we can insert 200,000 new rows, delete 200,000 rows or update them before a statistics update is triggered. Obviously, the more rows we have, the less often statistics are updated.
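As a quick, hedged illustration (the table and index names below are hypothetical), you can look at the histogram and the modification counter yourself:

-- Histogram of a hypothetical index; it never has more than 200 steps,
-- no matter how large the table is.
DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_OrderDate') WITH HISTOGRAM;

-- On SQL Server 2008 R2 SP2 / 2012 SP1 and above you can also check how many
-- modifications have accumulated since the last statistics update.
SELECT s.name, sp.last_updated, sp.rows, sp.modification_counter
FROM sys.stats s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE s.object_id = OBJECT_ID(N'dbo.Orders');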

That list is not complete by any means. But each of those reasons is big enough by itself to start thinking about data partitioning.

Next: When to partition the data

Sliding window scenario – when data spilled out to the right-most partition

After my last post I received a couple of emails asking how I fixed the problem. Just to remind you – data spilled over to the right-most partition of the table and, as a result, the sliding window scenario did not work anymore. When the process tried to split the right-most, non-empty partition, it obtained a long-held schema modification (SCH-M) lock and then started to scan/move data while all access to the table was blocked. Today we are going to discuss how to fix that.

The most important question is whether we can take the system offline during that operation. If not, well, our life becomes much more complicated and our options for online recovery are rather limited. Let’s take a look. First, I would create a partition function, scheme and table, and populate it with some data with IDs between 1 and 1250.
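The original script is not included in this archive, so here is a minimal sketch that reproduces the situation; all object names, data types and boundary values are assumptions.

-- The "broken" state: the right-most partition (ID > 1100) was supposed to stay
-- empty, but rows with ID up to 1250 have spilled into it.
CREATE PARTITION FUNCTION pfData(INT) AS RANGE LEFT
FOR VALUES (100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100);

CREATE PARTITION SCHEME psData AS PARTITION pfData ALL TO ([PRIMARY]);

CREATE TABLE dbo.Data
(
    ID INT NOT NULL,
    Value VARCHAR(32) NOT NULL
);

CREATE UNIQUE CLUSTERED INDEX IDX_Data_ID ON dbo.Data(ID) ON psData(ID);
CREATE NONCLUSTERED INDEX IX_Data_Value ON dbo.Data(Value) ON psData(ID);

;WITH N(ID) AS
(
    SELECT 1 UNION ALL SELECT ID + 1 FROM N WHERE ID < 1250
)
INSERT INTO dbo.Data(ID, Value)
    SELECT ID, 'Row #' + CONVERT(VARCHAR(8), ID)
    FROM N
    OPTION (MAXRECURSION 0);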

As you can see, I duplicated the case we had – some data spilled over to the right-most partition (ID > 1100), which should be empty.

Let’s create another, correct partition function and scheme:
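Something along these lines, with enough empty partitions pre-allocated on the right (names are, again, assumptions):

CREATE PARTITION FUNCTION pfData2(INT) AS RANGE LEFT
FOR VALUES (100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100,
    1200, 1300, 1400, 1500);

CREATE PARTITION SCHEME psData2 AS PARTITION pfData2 ALL TO ([PRIMARY]);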

For online recovery we can re-create our indexes online, moving them to the new partition scheme. I’m doing it as part of a transaction just to roll everything back to the initial “invalid” stage.
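A sketch of that approach, using CREATE INDEX with the DROP_EXISTING and ONLINE options (Enterprise Edition):

BEGIN TRAN;

    -- Recreate both indexes on the corrected partition scheme. The transaction
    -- is here only to roll everything back to the "invalid" state afterwards.
    CREATE UNIQUE CLUSTERED INDEX IDX_Data_ID ON dbo.Data(ID)
    WITH (DROP_EXISTING = ON, ONLINE = ON)
    ON psData2(ID);

    CREATE NONCLUSTERED INDEX IX_Data_Value ON dbo.Data(Value)
    WITH (DROP_EXISTING = ON, ONLINE = ON)
    ON psData2(ID);

ROLLBACK;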

This operation runs with minimal locking and achieves our goal. There are three things worth mentioning, though. First – online index rebuild uses row versioning and, as a result, you introduce quite heavy tempdb usage during the operation. Next, we are “rebuilding” much more data than actually needed – think about it – we have just one (right-most) partition that needs to be recreated, but we are rebuilding the entire table. But more critical for us is that this approach does not work if you have a primary key constraint defined on the table. Unfortunately, there is no way to recreate a constraint while moving it to another partition scheme. Basically it means that you would not be able to resolve the issue without taking your system offline. Well, to be exact, if you have a non-clustered primary key you can replace it with a unique non-clustered index, if that is acceptable. But still..

So what can we do with the primary key constraint? First, let’s prepare our table. As a side note – I’m disabling the non-clustered index to prevent an extra rebuild when we drop the clustered index.
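A possible way to get the table into that configuration (a sketch with assumed names):

-- Disable the non-clustered index so it is not rebuilt twice.
ALTER INDEX IX_Data_Value ON dbo.Data DISABLE;

-- Replace the unique clustered index with a clustered primary key constraint.
DROP INDEX IDX_Data_ID ON dbo.Data;

ALTER TABLE dbo.Data
ADD CONSTRAINT PK_Data PRIMARY KEY CLUSTERED (ID) ON psData(ID);

-- Re-enable the non-clustered index.
ALTER INDEX IX_Data_Value ON dbo.Data REBUILD;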

As I mentioned before, we are not able to do the operation without taking the system offline. The goal, though, is to reduce the offline time and to process/move as little data as possible (our right-most partition only).

First of all, let’s create another table with indexes using the new (corrected) partition scheme.
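For example (the new table is hypothetically named Data2):

CREATE TABLE dbo.Data2
(
    ID INT NOT NULL,
    Value VARCHAR(32) NOT NULL,
    CONSTRAINT PK_Data2 PRIMARY KEY CLUSTERED (ID) ON psData2(ID)
);

CREATE NONCLUSTERED INDEX IX_Data2_Value ON dbo.Data2(Value) ON psData2(ID);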

The problem here is that the new table should have exactly the same physical row structure as the old one. And if you dropped or altered any column in the old table, its physical structure would not match the table definition. If this is the case, you’ll need to create and alter the new table in the same way. The simplest way to compare the two is to run the following statement after the new table is created.
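The exact query from the original post is not reproduced here; the sketch below compares the physical column layout of both tables through the undocumented sys.system_internals_partition_columns view. Any difference in offsets or lengths (for example, a remnant of a dropped column in the old table) means the new table has to be created and altered the same way.

SELECT p.object_id, pc.partition_column_id, pc.leaf_offset,
    pc.max_inrow_length, pc.is_dropped
FROM sys.partitions p
    JOIN sys.system_internals_partition_columns pc ON
        p.partition_id = pc.partition_id
WHERE p.object_id IN (OBJECT_ID(N'dbo.Data'), OBJECT_ID(N'dbo.Data2'))
    AND p.index_id = 1 AND p.partition_number = 1
ORDER BY p.object_id, pc.partition_column_id;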



I strongly suggest testing the whole process against the latest backup to make sure you did not miss anything. Otherwise you could end up in a nasty situation where your data is spread across multiple tables.

Anyway, assuming everything is fine, our action plan would be:

  1. Copy all data from the invalid partition(s) of Data to Data2. Let’s say this step starts at time T1.
  2. Apply all data changes that happened between T1 and now. Let’s say this step starts at time T2. This step is optional, but it reduces downtime.
  3. Put the system offline.
  4. Apply all changes in Data that happened between T2 and now. No new changes can be made to Data during this step because the system is offline.
  5. Switch all “normal” partitions from Data to Data2.
  6. At this point Data contains only the right-most, incorrect partition. Drop (or rename) the table.
  7. Rename Data2 to Data (including constraints).
  8. Bring the system online.

Obviously one of the questions is how to track what is changing in the table while we are running steps 1 and 2. Well, it depends, of course. If you have identity or sequence keys, you can use them to track inserts. For updates you can log the time of the update – for example, by creating a new column and populating it with a trigger. For deletions – a trigger again. Let’s do our prep work. I assume in this example that we are using sequences (or identity) for inserts. First – updates.
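A sketch of the update tracking (column, index and trigger names are assumptions):

ALTER TABLE dbo.Data ADD ModTime DATETIME NULL;
CREATE NONCLUSTERED INDEX IX_Data_ModTime ON dbo.Data(ModTime);
GO

CREATE TRIGGER trg_Data_Upd ON dbo.Data
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Stamp modified rows so they can be re-applied to the new table later.
    UPDATE d
    SET ModTime = GETUTCDATE()
    FROM dbo.Data d
        JOIN inserted i ON d.ID = i.ID;
END
GO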

Now deletions. I’m using a very simple approach here and am going to apply all deletions at once during step 4.
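A sketch for deletions – log the deleted keys so they can be replayed against the new table during step 4:

CREATE TABLE dbo.DeletedData
(
    ID INT NOT NULL,
    CONSTRAINT PK_DeletedData PRIMARY KEY CLUSTERED (ID)
);
GO

CREATE TRIGGER trg_Data_Del ON dbo.Data
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.DeletedData(ID)
        SELECT d.ID
        FROM deleted d
        WHERE NOT EXISTS (SELECT * FROM dbo.DeletedData dd WHERE dd.ID = d.ID);
END
GO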

Now we are ready to go. At the beginning of step 1 we log the current maximum ID and the current time to use later.
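A sketch of step 1; the checkpoint table is my own addition to keep the logged values available between batches:

-- Helper table (an assumption, not from the original post) to persist the
-- logged ID/time between the steps.
CREATE TABLE dbo.SyncCheckpoint
(
    Step INT NOT NULL PRIMARY KEY,
    MaxID INT NOT NULL,
    StartTime DATETIME NOT NULL
);

-- Step 1 checkpoint: remember the highest existing ID and the start time (T1).
INSERT INTO dbo.SyncCheckpoint(Step, MaxID, StartTime)
    SELECT 1, MAX(ID), GETUTCDATE() FROM dbo.Data;

-- Step 1: copy the rows that sit in the invalid right-most partition (ID > 1100).
INSERT INTO dbo.Data2(ID, Value)
    SELECT ID, Value
    FROM dbo.Data
    WHERE ID > 1100
        AND ID <= (SELECT MaxID FROM dbo.SyncCheckpoint WHERE Step = 1);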

This step can be quite time consuming, depending on the amount of data we are copying. I’m also going to emulate some update activity in Data.

Now we need to apply the changes that were made to the original table during step 1. Generally speaking, we could skip this step because we are going to do it again later, when the system is offline. But remember – our system is still online, and this step reduces downtime. In fact, if you have heavy activity, you’d probably want to run this step a few times.

As I mentioned before, I’m going to cut a corner here and not deal with deletions during this step. Again, we need to log the ID and current time at the beginning of this step.
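A sketch of step 2 – log the new checkpoint (T2), copy the rows inserted since T1 and re-apply the rows updated since T1:

INSERT INTO dbo.SyncCheckpoint(Step, MaxID, StartTime)
    SELECT 2, MAX(ID), GETUTCDATE() FROM dbo.Data;

DECLARE @MaxID1 INT, @MaxID2 INT, @T1 DATETIME;
SELECT @MaxID1 = MaxID, @T1 = StartTime FROM dbo.SyncCheckpoint WHERE Step = 1;
SELECT @MaxID2 = MaxID FROM dbo.SyncCheckpoint WHERE Step = 2;

-- Rows inserted after step 1 started (IDs are monotonically increasing).
INSERT INTO dbo.Data2(ID, Value)
    SELECT ID, Value
    FROM dbo.Data
    WHERE ID > @MaxID1 AND ID <= @MaxID2;

-- Rows updated after step 1 started.
UPDATE d2
SET d2.Value = d.Value
FROM dbo.Data2 d2
    JOIN dbo.Data d ON d.ID = d2.ID
WHERE d.ModTime >= @T1;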

Again, I’m emulating some update activity in the system.

Now it’s time to put the system offline. And if your system is truly offline at this point, it’s also a good idea to create a database snapshot to be able to “roll back” the changes.
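For example (database and file names are assumptions and must match your environment):

CREATE DATABASE PartitionDemo_Snapshot ON
(
    NAME = N'PartitionDemo',                          -- logical file name of the source database
    FILENAME = N'C:\DB\PartitionDemo_Snapshot.ss'     -- sparse file for the snapshot
)
AS SNAPSHOT OF PartitionDemo;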

So let’s apply all the updates again, this time including the deletion activity.
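A sketch of that final catch-up (step 4), run while the system is offline:

DECLARE @MaxID2 INT, @T2 DATETIME;
SELECT @MaxID2 = MaxID, @T2 = StartTime FROM dbo.SyncCheckpoint WHERE Step = 2;

-- Rows inserted after step 2 started.
INSERT INTO dbo.Data2(ID, Value)
    SELECT ID, Value FROM dbo.Data WHERE ID > @MaxID2;

-- Rows updated after step 2 started.
UPDATE d2
SET d2.Value = d.Value
FROM dbo.Data2 d2
    JOIN dbo.Data d ON d.ID = d2.ID
WHERE d.ModTime >= @T2;

-- Replay all logged deletions at once.
DELETE d2
FROM dbo.Data2 d2
    JOIN dbo.DeletedData dd ON dd.ID = d2.ID;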

Now we are almost ready to switch partitions. The only thing we need to do first is drop the ModTime column, its index and the triggers we created.
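For example:

-- Both sides of ALTER TABLE ... SWITCH must have identical structure, so remove
-- the helper objects from dbo.Data first.
DROP TRIGGER trg_Data_Upd;
DROP TRIGGER trg_Data_Del;
DROP INDEX IX_Data_ModTime ON dbo.Data;
ALTER TABLE dbo.Data DROP COLUMN ModTime;

-- If the physical row structures have to match exactly (see the note about
-- dropped columns above), the same add/drop can be applied to dbo.Data2 too.
ALTER TABLE dbo.Data2 ADD ModTime DATETIME NULL;
ALTER TABLE dbo.Data2 DROP COLUMN ModTime;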

And now the key part – switch and rename.
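A sketch of that step, using the boundary values from the setup above (partitions 1 through 11 are the “normal” ones):

DECLARE @P INT = 1;

WHILE @P <= 11
BEGIN
    ALTER TABLE dbo.Data SWITCH PARTITION @P TO dbo.Data2 PARTITION @P;
    SET @P += 1;
END;

-- dbo.Data now holds only the incorrect right-most partition, whose rows have
-- already been copied to dbo.Data2.
DROP TABLE dbo.Data;

EXEC sp_rename N'dbo.Data2', N'Data';
EXEC sp_rename N'dbo.PK_Data2', N'PK_Data', N'OBJECT';
EXEC sp_rename N'dbo.Data.IX_Data2_Value', N'IX_Data_Value', N'INDEX';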

Now we can bring the system online and do the final clean up.
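For example:

-- Final clean-up of the helper objects and the old partition function/scheme.
DROP TABLE dbo.DeletedData;
DROP TABLE dbo.SyncCheckpoint;
DROP PARTITION SCHEME psData;
DROP PARTITION FUNCTION pfData;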

As you can see, the process is a bit complex. But in the end it minimizes the downtime. One thing to keep in mind – the partition scheme and function have been recreated under new names, so you need to take care of any code that references the old ones. SQL Search is your friend here.

While the problem is now fixed, the best way to deal with this kind of problem is not to have it at all. As I mentioned before, all of that could have been avoided if the size of the partition had been re-evaluated from time to time.

The source code is available for download

The pile of mistakes that lead to disaster

Bad things happen. Very bad things happen too. And usually when they happen there is no such thing as a single root cause – there is a set of mistakes that piled up and led to the final disaster. Today I’d like to tell you a story about one of those cases.

We collect data. Not as much as Google, perhaps, but still a very decent amount – hundreds of new records coming into the system every second. The database backend is not very complex – our challenges are in a different area – how to collect and process the data fast enough. We are not very big either, and as a result there are strict budget constraints. While we have decent development and QA environments, we cannot afford a good similar-to-production staging environment that would allow us to fully emulate the production load. We still do performance testing, but there are limitations we have to work around. Sounds bad, but realistically speaking, how many of us have the luxury of a production-like testing infrastructure?

Mistake #1 – There is no staging / testing environment that allows testing performance under a similar-to-production load.

We collect data from various devices. We can control how often the devices send us the data. But there are always new devices on the market. And there are always exceptions. One of the device types is configurable only through the vendor. Customers submit configuration change requests to the vendor directly, keeping us out of the loop. And people make mistakes. Nothing to be ashamed of – that’s life. So a bad thing happened – the vendor changed the wrong parameter in the configuration file, and the devices started to send one record every two seconds instead of every two minutes.

Mistake #2 – Expecting that you’ll always have control and will be able to avoid simple mistakes

The service that collects the data from the devices is multithreaded and scales through the number of active threads. Obviously we tested that aspect of the system. And what is the typical development/QA configuration for a multithreaded application? Yes, you are correct – two threads. It worked just fine. We even did performance testing and found that two threads can handle three times the load we have in production. Guess how many threads we had in the production config? Yes, you are correct – still two. Well, strictly speaking we had two servers and four threads, but it hardly matters. While we could handle six times more load than usual, we were not able to handle sixty times more. The collectors built up a backlog.

Mistake #3 – Config files were not adjusted accordingly when moving to production

Generally speaking, when everything works “as usual”, there is very little chance that data will come out of order. We have a cache, though, that can handle that situation. The cache was not big enough – just a few records per device. At the same time, it should have been more than enough under normal circumstances – what was the point of over-architecting the solution up front when we could refactor it later? Thank you, Agile development! Needless to say, the potential problem was never addressed – we always had more important things to take care of.

Mistake #4 – Under-architected temporary solution became permanent.

As I mentioned, the collectors were backlogged. Data came out of order on a large scale and was not processed correctly. And that particular aspect of the system’s behavior was not monitored. Again, what is the point of spending time on monitoring and alerting code for a use case that should never happen?

Mistake #5 – System monitoring was not extensive enough.

Finally, after a few days, customers started to complain about data quality. We found the problem, changed the config files (20 threads magically eliminated the backlog) and contacted the vendor to fix the incorrect parameter. So far, so good. The last thing remaining was data reprocessing. We already had the processing framework, so we basically needed to create a small utility that reuses the code and reprocesses the data. Piece of cake? I wish. Well, the code was reused, including the part that should have been excluded. That led to a “split brain” situation with the main processing routine. Well, another 4-hour outage to fix the problem. Strictly speaking (and because this is supposed to be a SQL Server blog), that situation could have been avoided if we had created a separate SQL login for the reprocessing utility and assigned a very strict set of permissions to it. But that had not been done either – the utility used the same login as the main processing service.

Mistake #6 – Excessive code reuse under stress without appropriate peer code review and testing.

Finally the data was reprocessed. Everything went back to normal. Except one last thing. In our system we have 2 databases – one stores the current production data, the other serves as an archive. There is a sliding window pattern implemented in the production one – data is purged on a daily basis. So we stopped the purge in order to update the (already copied) data in the archive. But the way we partition our data is a bit unusual. We don’t use a date as the partitioning column but rather an identity column, which increments in step with the date. Every day we reserve a chunk of IDs for the next day’s partition. There were a couple of reasons why we did it that way. A long time ago we were storage-bound and decided not to partition by date because we did not want to add another column to every non-clustered index. We could not change our clustered index either (at least the left-most columns) because it did not play well with our most common queries. Needless to say, neither of those reasons still matters, but what is the point of changing something that works?

Mistake #7 – Following “If it ain’t broke don’t fix it” principle

Needless to say, the size of the ID chunk we reserve had not been re-evaluated recently. We were not ready for the situation when the purge was stopped for a few days and records spilled over to the right-most partition that was supposed to stay empty. Obviously we did not notice it either. As a result, when we turned the process back on and the system tried to split the right-most partition, it was not a metadata-only operation anymore. ALTER PARTITION FUNCTION acquired a schema modification (SCH-M) lock which blocked any queries against the table. Another four-hour outage.

Mistake #8 – Poorly documented design decisions without any formal re-evaluation policies.

What do we have in the end? Hundreds of hours and thousands of dollars spent fixing the issues. Unhappy customers. An exhausted team. If any of those mistakes had been avoided, we would have been fine. Or, perhaps, not in such bad shape.

That story does not have any moral, though. It’s just a tale.

Cautionary tale about triggers, version store and fragmentation

A while ago triggers were slow. Very slow. In versions prior to SQL Server 2005, triggers scanned the transaction log in order to build the internal inserted and deleted tables. That approach changed in SQL Server 2005, when triggers started to use the version store – similar to what optimistic locking uses. That helps with performance, but at the same time it introduces a few interesting issues. Let’s take a look.

First of all, let’s create a table and populate it with some data. I would like to point out that I’m using tempdb in this example mainly to show that this behavior is completely independent of transaction isolation levels and optimistic locking database options.
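The original script is not included in this archive; a minimal sketch that reproduces 65-byte fixed-width rows could look like this (names and column sizes are assumptions):

USE tempdb;
GO

CREATE TABLE dbo.Data
(
    ID INT NOT NULL,
    Value INT NOT NULL,
    Padding CHAR(50) NOT NULL CONSTRAINT DEF_Data_Padding DEFAULT ''
);

CREATE UNIQUE CLUSTERED INDEX IDX_Data_ID ON dbo.Data(ID);

;WITH N(ID) AS
(
    SELECT 1 UNION ALL SELECT ID + 1 FROM N WHERE ID < 1000
)
INSERT INTO dbo.Data(ID, Value)
    SELECT ID, ID FROM N
    OPTION (MAXRECURSION 0);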

Now let’s take a look at what we have in the index.
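The screenshot is not available in this archive; the numbers came from a query along these lines:

SELECT index_level, page_count, avg_record_size_in_bytes, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.Data'), 1, NULL, 'DETAILED');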

As you can see, we have 65-byte rows without any fragmentation. So far so good. Now let’s update the Value column.
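The statement is trivial and is reused in the subsequent steps of this post:

UPDATE dbo.Data SET Value = Value + 1;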

We updated a fixed-width column. The row size has not changed. No fragmentation here. As the next step, let’s create a trigger.
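For example:

CREATE TRIGGER trg_Data_AU ON dbo.Data
AFTER UPDATE
AS
    -- Intentionally empty: the body does not even touch inserted/deleted.
    RETURN;
GO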

As you can see, the trigger itself is empty – we don’t even access the inserted/deleted tables there. Now let’s update our table again.

Still the same thing. But if we look at the performance monitor counters, we can see that the update now generates version store activity.

As the next step, I’d like to add a LOB column to the table. I’m using varchar(max), but you can replace it with nvarchar(max), varbinary(max), xml or a CLR data type. It would be the same.
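For example:

ALTER TABLE dbo.Data ADD LobColumn VARCHAR(MAX) NULL;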


The column is nullable, so the alteration here is a pure metadata operation. And now let’s run the update statement again.

Oops – now we have a 14-byte version store pointer added to every row. That introduces heavy fragmentation (the new versions are larger, so they don’t fit on the pages) and doubles our storage space. Well, a little bit of an unexpected development. Generally speaking, this is very similar to the problem we have with data modifications when we use optimistic isolation levels.

OK, it happens with LOB columns. Now let’s check the case when row overflow is possible. Let’s drop the LOB column and add two varchar(5000) columns instead.
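A sketch of that change, rebuilding the index afterwards to get rid of the version store pointers:

ALTER TABLE dbo.Data DROP COLUMN LobColumn;
-- Two varchar(5000) columns make row overflow possible: the maximum row size
-- now exceeds 8,060 bytes.
ALTER TABLE dbo.Data ADD Value1 VARCHAR(5000) NULL, Value2 VARCHAR(5000) NULL;

ALTER INDEX IDX_Data_ID ON dbo.Data REBUILD;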


Again, we rebuilt the index – the 14-byte pointers went away. Now let’s run the update statement again.

Same thing – an extra 14 bytes and fragmentation.

And let’s prove that the trigger is the problem. First, let’s drop the trigger and rebuild the index.
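For example:

DROP TRIGGER trg_Data_AU;
ALTER INDEX IDX_Data_ID ON dbo.Data REBUILD;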

And next – run update again.

As you can see, there is no 14-byte version store pointer, nor any fragmentation.

Finally, let’s take a look at a very interesting thing – what happens when we have an ON DELETE trigger. Generally speaking, when we delete a row, the only thing SQL Server initially does is mark the row as a “ghost” row – just a change in the row header. But with the trigger, the situation can be different.
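A sketch of that experiment – an empty AFTER DELETE trigger, a single-row delete and a fragmentation check afterwards:

CREATE TRIGGER trg_Data_AD ON dbo.Data
AFTER DELETE
AS
    RETURN;
GO

DELETE FROM dbo.Data WHERE ID = 100;

SELECT index_level, page_count, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.Data'), 1, NULL, 'DETAILED');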

As you can see, in our case we ended up with a page split on delete!

This behavior adds extra arguments to the discussion of why triggers are bad in general. Triggers add extra load to tempdb. And if there is the possibility of row overflow or LOB allocations, you’ll end up with 14-byte version store pointers and extra fragmentation.

Source code is available for download

P.S. I’d like to thank Paul White for his help with that post.

Locking in Microsoft SQL Server (Part 14) – Deadlocks during DDL operations (alteration, partition switch, etc)

Today I’d like us to talk about one practical problem related to locking – how to deal with deadlocks during DDL operations, for example table and partition function alterations, partition switches and others. This problem is very common in OLTP environments where the table is constantly being queried by multiple sessions. One of the common examples is the sliding window pattern, where the table is partitioned based on some time interval (daily, weekly, monthly, etc.) and data is purged on a regular basis by switching a partition out to a temporary table. Let’s try to understand why we have deadlocks here and what we can do to solve the problem.

First of all, as a prerequisite, let’s assume that we implemented our partitioning correctly. The most important factor is that our operation does not require any data movement or scans. For example, we don’t split ranges in the partition function in a way that requires physically moving data to another partition (i.e. we pre-allocate an empty partition rather than splitting a range with a value in the middle of a partition that contains data). In that case partition operations are purely metadata operations and only schema locks are involved.

Another thing to avoid is mixing DML and DDL statements in one transaction. Let’s think about the following example (ignoring page locks and schema stability (SCH-S) locks for simplicity’s sake):

  1. Session 1 starts a transaction and updates 1 row in the table. At this stage we have an (X) lock on the row and an (IX) lock on the table.
  2. Session 2 tries to select data from the table and is blocked by Session 1. So now Session 2 is waiting for an (S) lock on the row and has also acquired an (IS) lock on the table.
  3. Now Session 1 tries to perform a partition switch. For that operation a (SCH-M) lock needs to be acquired on the table, and it is blocked by the (IS) lock from Session 2.

Classic deadlock. But let’s assume that this is not the case. What happens when we have multiple lock requests on the same object? Generally speaking, lock requests are serialized. In some cases existing locks can even be re-used. Let’s see that. First, let’s create a table and add a few records there. Next, let’s have session 1 read one row in repeatable read mode. As we remember, it holds the (S) lock on the resource. Again, for simplicity’s sake let’s look at the row (key) level locks only.
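The scripts from the original post are not included here; a minimal sketch of the setup and of session 1 could look like this (object names are assumptions):

CREATE TABLE dbo.Data
(
    ID INT NOT NULL CONSTRAINT PK_Data PRIMARY KEY CLUSTERED,
    Value INT NOT NULL
);

INSERT INTO dbo.Data(ID, Value)
VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5);
GO

-- Session 1: read one row in REPEATABLE READ and keep the transaction open;
-- the (S) lock on the key is held until the end of the transaction.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRAN;
    SELECT Value FROM dbo.Data WHERE ID = 1;
    -- transaction intentionally left open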

Now let’s try to update the row in another session. This session tries to acquire an (X) lock on the resource and is blocked because the (X) lock is not compatible with the (S) lock.
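For example, in a second session:

-- Session 2: the (X) lock request is queued behind the (S) lock from session 1.
UPDATE dbo.Data SET Value = -1 WHERE ID = 1;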

Now let’s try to read the same row in read committed mode from a 3rd session. As we can see, everything works without blocking.
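For example, in a third session:

-- Session 3: READ COMMITTED read of the same row; it is not blocked.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT Value FROM dbo.Data WHERE ID = 1;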

Let’s try to look at what locks were acquired using SQL Profiler.

As you can see, the session acquired intent (IS) locks but did not acquire a shared (S) lock on the row. There is already an (S) lock from session 1, which guarantees that the row has not been modified by uncommitted transactions. There is no reason to acquire another (S) lock because in the read committed isolation level (S) locks on rows are released immediately.

But what will happen if we try to read the row in repeatable read mode again? In that isolation level the session needs to keep the (S) lock until the end of the transaction, and that’s why it tries to acquire the lock.

As you can see – it is blocked because there is an (X) lock request in the queue. In order to be granted, a lock needs to be compatible with all the locks on that resource – granted or not.

It’s worth mentioning that the first scenario, when session 3 ran in read committed mode and did not acquire the lock on the resource, can be considered an internal optimization. In some cases SQL Server still acquires another shared (S) lock there, even if another shared (S) lock is already held. In such cases the query would be blocked, similarly to the repeatable read transaction isolation level.

Keeping all this in mind we can ask – why would we have deadlocks if locks are serialized? Wouldn’t the session with the schema modification (SCH-M) lock wait till the other locks are released and proceed from there? Yes and no. Let’s put one other thing into the mix – lock partitioning. This is a quite confusing term and does not have anything to do with table partitioning. In a nutshell, when the server has 16 or more CPUs (think about a dual quad-core with hyperthreading enabled), SQL Server starts to partition intent (IS, IX, IU) and schema stability (SCH-S) object locks on a per-CPU basis. Those (intent and schema stability) locks are acquired and stored within one (or a few) lock partitions only. And when another session needs to acquire an S, X, U or SCH-M lock on the object, that lock has to be acquired across all lock partitions. As I mentioned, this behavior kicks in when the system has 16 or more CPUs and cannot be disabled. So let’s take a look at one of the scenarios:

  1. Session 1 executes a complex query that acquires an (IS) and/or (SCH-S) lock on lock partition 5 (just an example).
  2. Session 2 tries to alter the table and starts to acquire (SCH-M) locks on each individual lock partition. It successfully acquires locks on partitions 1-4 and is blocked on partition 5 by the (IS) or (SCH-S) lock held by Session 1.
  3. Session 1 tries to acquire another (IS) and/or (SCH-S) lock on lock partition 2. It is blocked by the (SCH-M) lock held by Session 2.

Classic deadlock again.

Unfortunately there is very little we can do about it. Lock partitioning cannot be disabled with documented approaches. There is the undocumented trace flag 1229 that does the trick, although using undocumented trace flags is dangerous and not recommended in production.

In case you have a dedicated data access tier (which is always a great idea), a good option would be serializing access to the table via application locks. The main idea of this method is that any code accessing the object acquires an application lock – DDL code acquires an exclusive lock and DML code acquires a shared lock. Application locks are not partitioned and, as a result, the requests are serialized. Let’s take a look. First, let’s define the stored procedure that alters the metadata.
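A sketch of such a procedure (procedure, resource and object names are assumptions):

CREATE PROCEDURE dbo.PurgeOldestPartition
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    BEGIN TRAN;
        -- Exclusive application lock serializes access with the data access code.
        EXEC sp_getapplock
            @Resource = 'Data_TableAccess',
            @LockMode = 'Exclusive',
            @LockOwner = 'Transaction',
            @LockTimeout = 60000;  -- milliseconds

        -- The actual DDL goes here, for example:
        -- ALTER TABLE dbo.Data SWITCH PARTITION 1 TO dbo.DataTmp;
        -- ALTER PARTITION FUNCTION pfData() MERGE RANGE (/* oldest boundary */);
    COMMIT;
END
GO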

This is just an example. Obviously you need to think about timeouts, exception handling, retry logic and other things – but hopefully it gives you the idea.

Now let’s try to create the stored procedure that reads the data from the table. But first, we need to define what we should do in case we are blocked and time out. Of course, we can return an error code to the client, but another method that can be useful in some cases is to return an empty result set. In order to do that, let’s create another (empty) table with the same structure as the original one.
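For example:

-- An empty table with the same structure as dbo.Data; it is used only to return
-- an empty result set when the application lock cannot be acquired in time.
CREATE TABLE dbo.DataEmpty
(
    ID INT NOT NULL,
    Value INT NOT NULL
);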

Now let’s define the stored procedure.
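A sketch of the reader procedure, using a 3-second lock timeout:

CREATE PROCEDURE dbo.ReadData
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @ResCode INT;

    BEGIN TRAN;
        EXEC @ResCode = sp_getapplock
            @Resource = 'Data_TableAccess',
            @LockMode = 'Shared',
            @LockOwner = 'Transaction',
            @LockTimeout = 3000;  -- 3 seconds

        IF @ResCode >= 0          -- 0 = granted, 1 = granted after waiting
            SELECT ID, Value FROM dbo.Data;
        ELSE                      -- timeout or deadlock: return an empty result set
            SELECT ID, Value FROM dbo.DataEmpty;
    COMMIT;
END
GO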

As you can see, if the SP cannot obtain the application lock within 3 seconds, it reads data from the empty table and basically returns an empty result set to the client. Otherwise it reads the main table. Obviously the biggest “downside” of this method is that it would not work well if there is no dedicated data access tier in place and the object is accessed from various places. Well, in such cases the old classic approach with try/catch and retry logic is your friend. And, of course, we can use SET DEADLOCK_PRIORITY to reduce the chance that the session with the DDL statement is chosen as the deadlock victim.

Source code is available for download 

Next: When Transaction Start

Table of contents