It was the day before our app was going to be featured on the Today Show. I was sitting in the conference room with our CEO, CMO, and COO. We had a few senior members of our development and design teams from our other office connected via Skype. The question was very simple: “Are we ready for tomorrow?” Our operations and design teams had all the content updated and ready to go. Marketing was ready to piggyback on the coverage. Our DevOps team had deployed additional servers, turned up throughput capacity, and configured new auto-scaling rules to handle the increased load we all anticipated. On the surface, we looked like a well-coordinated team of professionals that had spent months preparing for this moment.
But there was something lurking in the shadows. Something that we all knew about. Something we had discussed at length many times. Something that haunted our database developer day in and day out, knowing that one day, under the right circumstances, this design flaw was going to crash the system. And tomorrow, the circumstances were going to be perfect.
A Small Piece of Technical Debt
The design flaw was very common: we used a single MySQL table to store all of our user accounts. The rest of the system was highly scalable with sharded databases for product data storage, DynamoDB for structured product collections, and several layers of caching with Memcached and Varnish. Under “normal” circumstances, even heavy load, our single MySQL table performed very well. Our load testing always passed with flying colors, simulating tens of thousands of concurrent users. The table was replicated to a slave server, so reads weren’t a problem. The issue was with our write throughput, and thousands of people signing up at the same time was going to be a major problem.
An Epic Fail
It was too late to do anything about it. We had dismissed the problem so many times that mentioning it had almost become taboo. We thought that many signups at once was so unlikely that fixing it would be a complete waste of resources. Then this amazing opportunity fell into our lap. We all watched with an overwhelming sense of joy and pride as Natali Morris gave the hosts a quick demo of our app. And then the signups started.
Within minutes we had over 10,000 new accounts! It was amazing to watch hockey stick growth in real time. But then signups started to slow down. After a minute or two they stopped completely. The CloudWatch alarms started blaring. Database connections spiked, the gateways were throwing 500 errors, and the support emails started coming in. Not only did signups stop, but the flooded database connections blocked existing accounts from logging in or renewing auth tokens. The system was completely unavailable.
After analyzing the logs, it appeared that we could have had well over 100,000 new accounts that morning. Had those users signed up over the course of a few hours, the system would have held up just fine. Instead, there was a perfect storm that we never took the time to prepare for, even though we knew it was a possibility. We lost thousands of potential users from that initial push, plus the thousands of potential users they could have recommended us to. We even started trending in the App Store, but much of that traffic was lost due to the time it took to fully recover from the crash. We not only wasted this huge opportunity, but also soured thousands of users on our service.
Hindsight is always 20/20, so there was plenty of blame to go around. But ultimately, the team made a calculated decision that resulted in our worst-case scenario. This could have been a major launching pad for our product, but because we failed to properly address some technical debt, it turned out to be one of the worst days the company had to face.
Dealing with Technical Debt
Technical debt is a concept that reflects the implied cost of additional rework caused by choosing an easy solution now instead of using a better approach that would take longer. ~ Wikipedia
Technical debt can (and should) give software developers nightmares. Knowing that there is a piece of code or a system that will fail under the right circumstances should be of major concern to the entire company. As my story above (hopefully) makes clear, an application may only be as scalable as its weakest link. As a product manager, you need to understand the implications of technical debt within your product and create a plan for dealing with it.
Here are some tips for dealing with technical debt within your product:
Technical debt needs to be identified and documented
As a product manager, you need to be aware of any limitations your product might have. There are times when shortcuts are necessary, but you need to know the impact that those shortcuts have on future performance. Even if you are not a technical product manager, you still need to understand what parts of your product have potential gotchas. Work with your developers, document these issues, and keep an updated list so everyone is aware.
Technical debt is often cumulative; address the root first
As soon as technical debt is introduced, it’s often the case that any additional service that needs to interact with it will contain technical debt itself. If a piece of code only performs under certain conditions, then code written to deal with its output most likely will only handle those conditions. This can have a cascading effect that can introduce multiple layers of bugs and potential failures. If you are properly documenting technical debt, then prioritize addressing root systems first. This will help to harden the underlying systems, making new systems that rely on them more stable as well.
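One way to make "root first" concrete is to record which debt items build on which, and then sort them so that every item appears after the things it depends on. The sketch below is only an illustration: the item names and the `payoff_order` helper are hypothetical, and it assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical debt register: each item maps to the items it builds on.
# These names are illustrative, not a real inventory.
debt_depends_on = {
    "single users table": set(),
    "auth token renewal": {"single users table"},
    "signup API": {"single users table"},
    "welcome email job": {"signup API"},
}


def payoff_order(depends_on):
    """Return debt items ordered so that root systems come first."""
    # static_order() yields each item only after everything it depends on.
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, item in enumerate(payoff_order(debt_depends_on), start=1):
        print(position, item)
```

The ordering among items that do not depend on each other is arbitrary, so a real list would still need a judgment call on risk and cost for ties.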
Schedule time to work down technical debt
I’ve known a lot of engineers who would happily refactor code all day, chasing a more elegant and efficient solution. This is clearly an unproductive use of an engineer’s time. However, if technical debt is documented and the risks are clear, schedule time for the engineering team to address it. This will not only give the team peace of mind, but will often lead to new innovations and reduced work for additional features.
Embrace Test Driven Development (TDD)
Test Driven Development (TDD) is a great way to help manage technical debt. Not only does it provide you with baked-in regression testing, but it can also help to identify technical debt by writing scenarios that your code doesn’t handle. For example, a series of tests can be written that deal with totalling an ecommerce order. Those tests can include orders that need to apply VAT or another type of local tax. If your code doesn’t currently support those options, you can flag the tests as UNSUPPORTED features, giving you a constant reminder that the code is missing support for certain scenarios.
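As a rough illustration of the order-total example, here is a minimal sketch using Python's built-in `unittest` module. The `order_total` function, the 20% VAT rate, and the test names are all hypothetical; the UNSUPPORTED flag is modeled here with `unittest.skip`, which is one possible way to do it, not the only one.

```python
import unittest
from decimal import Decimal


def order_total(line_items, vat_rate=None):
    """Sum (unit_price, quantity) pairs. VAT is deliberately not implemented."""
    if vat_rate is not None:
        # Known gap: this is the documented technical debt.
        raise NotImplementedError("VAT is not supported yet")
    return sum((price * qty for price, qty in line_items), Decimal("0"))


class OrderTotalTests(unittest.TestCase):
    def test_total_without_tax(self):
        items = [(Decimal("10.00"), 2), (Decimal("5.50"), 1)]
        self.assertEqual(order_total(items), Decimal("25.50"))

    @unittest.skip("UNSUPPORTED: VAT calculation (documented technical debt)")
    def test_total_with_vat(self):
        # Describes the behavior we want once VAT support exists.
        items = [(Decimal("100.00"), 1)]
        total = order_total(items, vat_rate=Decimal("0.20"))
        self.assertEqual(total, Decimal("120.00"))


if __name__ == "__main__":
    unittest.main()
```

Every test run then reports the VAT test as skipped along with its reason, so the gap stays visible until someone implements the feature and removes the decorator.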
Not all technical debt is bad
As I mentioned before, sometimes taking on technical debt is necessary. There are several reasons why this is the case. You may be building sample components for other teams or working in a Lean Startup environment. Expediency in the software world does have its place, but as a product manager, be sure to limit these shortcuts as much as possible. If any of this code goes into production, technical debt will need to be addressed.
I hope my experience helps you realize the importance of dealing with technical debt and how it can lead to better, more stable products.