South California is great! I’ve never seen seven-stuck-lanes-in-each-direction highways 🙂 And thanks to the crew! They did the wonderful job!
Again, thank you for attending! It was the great event!
South California is great! I’ve never seen seven-stuck-lanes-in-each-direction highways 🙂 And thanks to the crew! They did the wonderful job!
Again, thank you for attending! It was the great event!
..are available for download
Thank you very much for attending the sessions!
Of course, there are plenty other ways to scale-out the system in addition to horizontal partitioning and data sharding. Although in every case we have the same set of factors we need to consider before making the final decision of separating the data between the physical servers. Let’s talk about them today.
Scaling-out would increase development cost. The system becomes more complicated. Often there are additional tiers (application server, data warehouse, etc.) that need to be developed, tested and supported. You also need to have different kind of people in the team – while you can possibly save something by having good rather than exceptional DBA, you’d need to have very good architects and backend developers there. With the current trends when hardware prices are decreasing and development cost is rising, you’d need to consider that aspect very carefully from the beginning.
And of course, there is the question about legacy software. Neither of us want to touch old, often obsolete, code. In some cases we could be lucky and keep everything as is (again, legacy is accessing only operational data and/or we have multiple desktop shortcuts). In other cases we could face very painful and expensive process of refactoring.
Support and maintenance cost:
Well, you’ll have more servers to maintain. And everything depends on how much manual work you are doing. There is not much difference between supporting ten or one hundred servers when everything is automated but it would be completely different story when manual work is involved. PowerShell, policy-based management, data collectors and other tools and technologies – learn them – they are your friends.
Hardware and software cost:
At bare minimum you’ll need to buy OS and SQL Server licenses for each server. A lot of things depend on SQL Server Edition – in some cases scaling out would allow you to use Standard editions on the smaller servers rather than stuck with the Enterprise when you don’t scale out. You can be forced to use Enterprise edition even if you don’t need edition features simply because of 64GB RAM limitation Standard edition has. In some cases you can live with lower end storage too. So you need to carefully plan everything.
And there are the Clouds especially when we talk about SQL Server within VM (Infrastructure As A Service approach). With Clouds the picture changes dramatically. We are paying by the hour – for what we are using. There is no (or minimal) upfront cost for hardware and software. And there are physical limitations in VM configuration (RAM, # of Cores, etc.) as well as slow IO performance. Those limitations could force us to scale out regardless of the cost.
This one is interesting. On the one hand, with multiple servers we have to implement HA strategy multiple times (think about automation again). It introduces additional work but on the other hand gives us better (or, perhaps, different) type of protection. When we use horizontal partitioning, we obviously need to keep our main (operational period) server online if we want the system to be operational. But if the server with historical data is down, it would not affect operational activity. Same with data sharding. If we have one out of ten servers down, only ten percent of the customers would be affected. It still bad but trust me, ten percent of complaining customers are better than one hundred percent.
Of course, everything is in “It depends” category. There is no such thing as strict set of rules when scaling out is appropriate. We need to look at the systems on case by case basis and do our homework all the time.
Last time we discussed how to scale-out our system with horizontal partitioning. Another approach that can be helpful is data sharding. With such approach we are creating multiple databases of the same (or very similar) structure and distribute (“shard”) our data between them. The typical use-case for this situation would be the system that collects data from the multiple customers where every customer works with his/her own subset of the data.
When customer logs in to the system, we redirect him/her to specific database. After that customer works within that database only. One of the key differences with horizontal partitioning is that customer-facing UI does not necessarily need to have application server. We still need to implement some sort of “redirector” – the part that knows where customer’s data is located and redirects the application to the right shard database. Although, technically speaking, as soon as it’s done, even regular 2-tier client/server system could work just fine. Don’t take me wrong – I’m not trying to suggest not implementing application server but in some cases legacy systems can work just fine with the minimum modifications.
What are the challenges there? From the customer-facing functional the biggest one is how to deal with the shared data. One of the examples is shared Article list in the point-of-sale system. We need to either replicate the data across all databases or create separate database to store that list. Obviously there are plenty of ways to replicate the data. Perhaps the simplest one would be to use snapshot replication (assuming we can have centralized publisher and articles are generally small). Although it still requires additional efforts to support it. Approach with the separate database introduces its own set of issues. It becomes single point of failure – when this database is down – all shards are down. And of course, there is the question how to access this database from the separate shards. Again, most likely linked servers will be in the game. While it’s not necessarily the big issue performance wise – those shared entities are usually small – it still introduces some overhead from development and
Second interesting question is the legacy software. Well, again, it’s not necessarily a problem. The database schema remains the same across all shards so legacy software technically can work “As Is”. Assuming, of course, it can connect to and work within the single shard and we don’t mind to have large number of shortcuts on desktop.
Cross-shard database access is challenging. Similarly to horizontal partitioning, we need to design the application server that works across multiple shards. With all development challenges this architecture can be beneficial from performance standpoint – we can query shards in parallel. On the other hand cross-shard access is not always needed. Some of the use-cases (for example, CRM) could be done within the single shard. Others, such as analysis, reporting and accounting, could be implemented in the separate Data Warehouse type database with ETL processes that get the data from the shards. And we rarely need raw transaction data in the data warehouse so size should not be an issue.
One very important question is how to shard the data. In some cases we can have natural criteria – like geographic regions in the picture above. Although we need to be careful when criteria is artificial. Ideally we are looking for uniform data distribution though it could be complicated in some cases. It’s not the good idea to distribute customers data based on ID ranges when, for example, first shard stores IDs between 1 and 1,000; second one – between 1,001 and 2,000 and so on. Old customers (lowest IDs) tend to leave and we will end up with set of underutilized shards. In those cases it’s better to use modulus division: Shard_ID = ID mod Total_Number_Of_Shards.
Another thing to keep in mind is uneven amount of data across for different customers. Let’s think about GPS Tracking as the example when customers are tracking their assets. One customer can have just a few assets, another one hundreds or even thousands of them. Obviously amount of data for those customers would vary greatly. In case, if we have shard servers powerful enough and store large set of the customers per shard, we would be fine. At least should be from statistics standpoint. But it still makes sense to implement some sort of the monitoring solutions to avoid extreme situations.
Before we start talking about Data Partitioning within the same database, I’d like us to discuss a couple methods of partitioning by scaling-out our data and servers. So, first of all, let’s think about the situation when we have all our data in the same database/table.
This is quite general multi-tier client architecture nowadays. Obviously we can have client software (especially legacy one) which is connecting to the database directly.
As we already discussed, almost every system stores two different kinds of data – operational and historical. So what if we do something like that:
In this schema, we keep all operational data and catalog entities in the main database and move historical data to the separate SQL Server(s). We define the linked servers which will allow us to join historical data with catalog entities as well as union the data from the different servers in some cases. Data placement is transparent to the clients which access the data through application server – “the must have” in such architecture. Application server knows how data is stored, queries the data from the multiple databases, merges the streams and returns the final dataset to the clients. Technically speaking, we can even have performance advantages by querying the data in parallel – each database connection could be done from the separate thread.
Legacy software that queries database directly is usually OK too – as long as software does not need access to historical data. Even if historical data is needed, it could be sometimes workarounded by creating distributed partitioned views.
The biggest drawback of this design is additional complexity and overhead introduced by developing the code that combines data from the multiple sources. E.g. application server and, sometimes, database code. Everything would depend on what do we need to do with the data. Some cases are very simple (think about code like that running in the single database/table solution):
With the new multiple servers we will need to open multiple database connections and run those statements against multiple servers:
And finally merge the data for the client. We can even have Customers list cached on application server somewhere and avoid cross-server joins. Simple enough. Although what if we look for something like that – find most expensive orders, page the results and return the second page to the client? (Small disclaimer – it’s not the best way of doing paging. Check this post for more information)
As you can imagine, there are no easy ways to implement it. We need to either write the query that collects the data from the multiple databases via linked servers and sort/page it after that. Alternatively, we can do it on application server side which could lead to the complex code that does not perform well enough. It’s also worth to mention that linked servers could introduce the own set of the issues from both, performance and development standpoints.
The biggest problem I see with such design is that requirements are rarely static – things tend to change. I have this architecture implemented in one of my systems and while things were relatively simple at the beginning, newer requirements led to the huge development overhead. But everything, of course, is system-specific and in “It depends” category. That design has its own purpose and use-cases. And we will talk about them later.
Last week we were discussing why do we want to partition the data. I hope at this point we all agreed that data partitioning could help us with the large table. Which leads to the question – when the table becomes “big enough” for partitioning.
This is kind of theoretical discussion. I heard a lot of different opinions varied from hundreds MBs to TBs. We can spend a lot of time discussing that and will never end up with the right answer. Or, better say, all answers would be the right ones. Obviously your mileage may vary – a lot of things depend on your system, environment and hardware. But my personal take is very simple. If I expect my system to collect large amount of data, the right time (for me) to partition the data is:
Even if Data Partitioning introduces some overhead on development and testing there are two particular reasons why I’d suggest to do that. Most importantly – data partitioning changes execution plans. As result, when you finally implemented it you’d need to re-test entire system. And this could be very expensive and time consuming. Not even mention the possibility of another round of performance tuning. It’s much easier to deal with potential issues during initial development and testing especially when you don’t have a lot of data and impact of sub-optimal execution plans is smaller.
Second reason is that it’s very hard (if even possible) to partition the data keeping the system online. In most part of the cases we are going to deal with exclusive schema modification locks that prevent any access to the table even in read uncommitted transaction isolation level. And if you think how long physical data movement of 100s GBs or TBs of data could take.. Well, you get an idea.
Again, everything depends on the system and requirements. There is no need to spend time and efforts on data partitioning if you don’t expect to have a lot of data. But remember that situation and requirements can change. There is always the possibility that project would be more successful than we expect it to be 🙂
It was the day when I did 2 presentations at Teched Russia in Moscow. I came from that event completely exhausted and found the email that I passed Microsoft Certified Master Lab exam. I’ve obviously expected the email – it was about a month after I took the lab in Seattle during PASS Summit. What I did not expect is that I passed. I already started to think how to adjust my home lab and practice for retake. But, perhaps, I’m going a bit ahead of myself with the story.
I passed my first MCP exam in 2007. Originally I started to obtain the certificates with simple purpose – improve my resume. But it changed quickly. It’s strange – you don’t appreciate learning much when you are young – there are always way too many other things to be excited about. And, as many other things, you miss those challenges when you are getting older. I really enjoyed the process of learning and was happy and proud of myself after each test I passed. Last but not least, it also helped with my day-to-day job. Obviously it’s impossible to retain all information from study materials but you at least know what is available and where to look when needed. No needs to re-invent the wheels. Anyway, to make the long story short – in 6 months I got MCPD and MCITP certificates.
And then I was stuck. Of course, not in terms of learning but speaking of certifications – there were no ways to go. As much as I wanted to attend MCM training in Redmond, I knew that my company would never approve it. Neither from cost nor from time prospective. So when Joe Sack announced that training is no longer required during PASS Summit 2010, I was quite excited. And quite lazy, of course. There were other priorities I had to take care of so I lost almost an year before even started to prepare to the exams.
I took the Knowledge exam in July 2012. This was very interesting experience. Never need to say that exam content was much more challenging than MCITP ones. The main difference from my point of view is that questions require you to think, not to know the answer. And of course, there were a lot of questions and not much time. I particularly enjoyed that there were very few questions from “It depends” category without enough information to make the decision. There were still some, but far less comparing to MCITP/MCTS ones.
I left Prometrics center with very mixed impressions – gave myself 50/50 chances. But in a month I found that I passed with relatively high score – higher than I expected. And I started to prepare to the lab. I took it in Seattle during PASS Summit last month. boB Taylor put some clarifications around the format. I, of course, cannot comment the content but to give you a tip – your main enemy is not the complexity of scenarios but time. You are out of time, you are under stress. And you are making mistakes. I’d probably disagree with boB’s advice about reading all scenarios first though – I spent 20 minutes on that – time I desperately needed later. What I would do is to scan through the scenarios and immediately take care of those you feel yourself comfortable with.
Reduce the scope as fast you can. And don’t kill yourself on testing. Do the basic tests, of course, but don’t spend any more time than absolutely needed.
How to prepare yourself to the exams? First, and most importantly, you need to understand how SQL Server works internally. It’s not enough to know, for example, how to restore the database from the backup. You’d also need to understand what happen under-the-hood when you run RESTORE command. Next practice, practice and practice. One of my challenges, for example, is that I rarely do low level day-to-day DBA stuff. For example, I usually design backup strategy but I rarely implement it by myself. Of course, I know how to do that, but sometimes it took me a couple extra minutes to find exact syntax in BOL. And those minutes are precious.
Speaking of specific resources – my first advice would be to start with Kalen Delaney books about SQL Server internals. After that, I’d focus on MCM training videos – (Huge thanks to all SQLSkills crew). And, of course, focus on MCM reading list. This is huge but stuff clicks together over time – further you are on the way, easier it is for you.
In any case, good luck! As for me – there is still a long way to go. Once you start, you would not be able to stop. 🙂
A few new slide decks are available for download from Presentations page
Again, thank you very much for the interest!
.. are available for download.
Bad things happen. Very bad things happen too. And usually when it happened there is no such thing as the single root cause – there were the set of mistakes that piled up and led to the final disaster. Today I’d like to tell you the story about one of those cases.
We are collecting data. Not as much as Google, perhaps, but still very decent amount – hundreds of new records coming to the system every second. Database backend is not very complex – our challenges are in the different area – how to collect and process data fast enough. We are not very big either and as result there are strict budget constraints. While we have decent development and QA environments, we cannot afford good similar-to-production staging environment that allows us to fully emulate production load. We still do performance testing but there are the limitations we have to work-around. Sounds bad but realistically speaking how many of us have the luxury of production-like testing infrastructure?
Mistake #1 – There is no staging / testing environment that allows to test performance under similar-to-production load.
We are collecting data from the various devices. We can control how often devices send us the data. But there are always the new devices on the market. And there are always the exceptions. One of device types is configurable only through the vendor. Customers submit configuration change requests to him directly keeping us out of the loop. And people make mistakes. Nothing shame about it – that’s the life. So bad thing happened – vendor changed the wrong parameter in the configuration file and devices started to send one record every two seconds instead of two minutes.
Mistake #2 – Expectation that you’ll always have control and would be able to avoid simple mistakes
The service that collects the data from devices is multitheaded and scalable through the number of active threads. Obviously we tested that aspect of the system. And what is the typical development/qa configuration with multithreaded application? Yes, you are correct – two threads. It worked just fine. We even did performance testing and found that two threads can handle three times bigger load than we have in production. Guess how many threads did we have in production config? Yes, you are correct – still two. Well, generally speaking we had two servers and four threads but it hardly matters. While we could handle six times more load than usual, we were not able to handle sixty times more. Collectors built the backlog.
Mistake #3 – Config files have not been adjusted accordingly when moved to production
Generally speaking, when everything works “as usual”, there is very little possibility that data would come out of order. We have some cache though that can address that situation. The cache was not big enough – just a few records per device. Same time it should be more than enough in the normal circumstances – what was the purpose of over-architecting the solution up front – we could refactor it later. Thank you, Agile development! Never need to say that potential problem has never been addressed – we always had more important things to take care of.
Mistake #4 – Under-architected temporary solution became permanent.
As I mentioned, collectors were backlogged. Data came out of order on the large scale and has not been processed correctly. And that particular aspect of the system behavior had not been monitored. Again, what is the purpose to spend time on monitoring and alerting code for the use-case that should never happen?
Mistake #5 – System monitoring was not extensive enough.
Finally, after a few days, customers started to complain about data quality. We found the problem, changed config files (20 threads magically eliminated backlog) and contacted vendor to fix incorrect parameter. So far, so good. Last thing remaining was data reprocessing. We already had the processing framework, so we basically needed to create the small utility that reuses the code and reprocesses the data. Piece of cake? I wish. Well, code had been reused including the part that should be excluded from that. That led to “split brain” situation with main processing routine. Well, another 4 hours outage to fix the problem. Strictly speaking (and because it supposed to be SQL Server blog), that situation could be avoided if we created separate SQL login for reprocessing utility and assigned very strict set of permissions there. But it had not been done either – utility used the same login with main processing service.
Mistake #6 – Excessive code reuse under stress without appropriate peer code review and testing.
Finally data had been reprocessed. Everything went back to normal. Except one last thing. In our system we have 2 databases – one stores the current production data, another one as archive. There is sliding window pattern implemented in production one – data has been purged on the daily basis. So we stopped purge in order to update (already copied) data in archive. But the way how we partition our data is a bit unusual. We don’t use date as partition column but rather identity column which is incrementing in par with the date. Every day we reserve the chunk of IDs for the next day partition. There were a couple reasons why we did it that way. Long time ago we were storage bound and decided not to partition by date because we did not want to add another column to every non clustered index. We could not change our clustered index either (at least left-most
columns) because it did not play well with our most common queries. Never need to say that neither of those reasons are still important but what is the purpose of changing working things?
Mistake #7 – Following “If it ain’t broke don’t fix it” principle
Never need to say the size of ID chunk we reserve has not been recently re-evaluated. We were not ready to the situation when purge has been stopped for a few days and records spilled out to the right-most partition that supposed to be empty. Obviously we did not notice it either. As result, when we turned process back on and system tried to split right-most partition it was not metadata operation anymore. Alter partition function acquired schema modification (SCH-M) lock which blocked any queries against the table. Another four hours outage.
Mistake #8 – Poorly documented design decisions without any formal re-evaluation policies.
What do we have at the end? Hundreds of hours and thousands of dollars we spent to fix the issues. Unhappy customers. Exhausted team. If either of those mistakes were avoided we would be fine. Or, perhaps, not in such bad shape.
That story does not have any morality though. It’s just a tale.