The pile of mistakes that lead to disaster

Bad things happen. Very bad things happen too. And usually when it happened there is no such thing as the single root cause – there were the set of mistakes that piled up and led to the final disaster. Today I’d like to tell you the story about one of those cases.

We are collecting data. Not as much as Google, perhaps, but still very decent amount – hundreds of new records coming to the system every second. Database backend is not very complex – our challenges are in the different area – how to collect and process data fast enough. We are not very big either and as result there are strict budget constraints. While we have decent development and QA environments, we cannot afford good similar-to-production staging environment that allows us to fully emulate production load. We still do performance testing but there are the limitations we have to work-around. Sounds bad but realistically speaking how many of us have the luxury of production-like testing infrastructure?

Mistake #1 – There is no staging / testing environment that allows to test performance under similar-to-production load.

We are collecting data from the various devices. We can control how often devices send us the data. But there are always the new devices on the market. And there are always the exceptions. One of device types is configurable only through the vendor. Customers submit configuration change requests to him directly keeping us out of the loop. And people make mistakes. Nothing shame about it – that’s the life. So bad thing happened – vendor changed the wrong parameter in the configuration file and devices started to send one record every two seconds instead of two minutes.

Mistake #2 – Expectation that you’ll always have control and would be able to avoid simple mistakes

The service that collects the data from devices is multitheaded and scalable through the number of active threads. Obviously we tested that aspect of the system. And what is the typical development/qa configuration with multithreaded application? Yes, you are correct – two threads. It worked just fine. We even did performance testing and found that two threads can handle three times bigger load than we have in production. Guess how many threads did we have in production config? Yes, you are correct – still two. Well, generally speaking we had two servers and four threads but it hardly matters. While we could handle six times more load than usual, we were not able to handle sixty times more. Collectors built the backlog.

Mistake #3 – Config files have not been adjusted accordingly when moved to production

Generally speaking, when everything works “as usual”, there is very little possibility that data would come out of order. We have some cache though that can address that situation. The cache was not big enough – just a few records per device. Same time it should be more than enough in the normal circumstances – what was the purpose of over-architecting the solution up front – we could refactor it later. Thank you, Agile development! Never need to say that potential problem has never been addressed – we always had more important things to take care of.

Mistake #4 – Under-architected temporary solution became permanent.

As I mentioned, collectors were backlogged. Data came out of order on the large scale and has not been processed correctly. And that particular aspect of the system behavior had not been monitored. Again, what is the purpose to spend time on monitoring and alerting code for the use-case that should never happen?

Mistake #5 – System monitoring was not extensive enough.

Finally, after a few days, customers started to complain about data quality. We found the problem, changed config files (20 threads magically eliminated backlog) and contacted vendor to fix incorrect parameter. So far, so good. Last thing remaining was data reprocessing. We already had the processing framework, so we basically needed to create the small utility that reuses the code and reprocesses the data. Piece of cake? I wish. Well, code had been reused including the part that should be excluded from that. That led to “split brain” situation with main processing routine. Well, another 4 hours outage to fix the problem. Strictly speaking (and because it supposed to be SQL Server blog), that situation could be avoided if we created separate SQL login for reprocessing utility and assigned very strict set of permissions there. But it had not been done either – utility used the same login with main processing service.

Mistake #6 – Excessive code reuse under stress without appropriate peer code review and testing.

Finally data had been reprocessed. Everything went back to normal. Except one last thing. In our system we have 2 databases – one stores the current production data, another one as archive. There is sliding window pattern implemented in production one – data has been purged on the daily basis. So we stopped purge in order to update (already copied) data in archive. But the way how we partition our data is a bit unusual. We don’t use date as partition column but rather identity column which is incrementing in par with the date. Every day we reserve the chunk of IDs for the next day partition. There were a couple reasons why we did it that way. Long time ago we were storage bound and decided not to partition by date because we did not want to add another column to every non clustered index. We could not change our clustered index either (at least left-most
columns) because it did not play well with our most common queries. Never need to say that neither of those reasons are still important but what is the purpose of changing working things?

Mistake #7 – Following “If it ain’t broke don’t fix it” principle

Never need to say the size of ID chunk we reserve has not been recently re-evaluated. We were not ready to the situation when purge has been stopped for a few days and records spilled out to the right-most partition that supposed to be empty. Obviously we did not notice it either. As result, when we turned process back on and system tried to split right-most partition it was not metadata operation anymore. Alter partition function acquired schema modification (SCH-M) lock which blocked any queries against the table. Another four hours outage.

Mistake #8 – Poorly documented design decisions without any formal re-evaluation policies.

What do we have at the end? Hundreds of hours and thousands of dollars we spent to fix the issues. Unhappy customers. Exhausted team. If either of those mistakes were avoided we would be fine. Or, perhaps, not in such bad shape.

That story does not have any morality though. It’s just a tale.