Untangling Dependencies at Amazon
As companies scale, they slow down. Teams spend more time coordinating and less time building. Every new feature overlaps and intertwine with features … Read Article
For digital companies, speed matters disproportionately. Many organizations are trying to improve their time-to-value: how long it takes to put an idea into customers’ hands. However, the product teams supporting these efforts often need help maintaining the technical excellence that modern architectures require to enable frequent feature releases. This challenge exists primarily because product teams lack a structured approach to making architecture decisions and operational trade-offs.
Architectural tenets help coordinate development that spans teams. Mature digital companies use such mechanisms to align teams’ decisions with organizational best practices. This alignment creates consistency across services in distributed architectures, easing the onboarding and mobility of developers between teams while driving technical and operational excellence.
Our Reliability Manifesto is a succinct collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable system.
Amazon Simple Storage Service launched with ten design tenets. The service teams grew their system from eight to more than two hundred fifty distributed services over the last fifteen years, with hundreds of developers constantly launching new features while providing 99.999999999% (11 9’s) of data durability.
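To make the eleven-nines figure concrete, here is a minimal arithmetic sketch. It assumes the durability number is an annual, per-object figure and that object losses are independent; the function names and the ten-million-object example are illustrative and not taken from the service itself.

```python
from fractions import Fraction

# "11 nines" of annual durability, held as an exact fraction so the
# subtraction below is not distorted by floating-point rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int,
                           durability: Fraction = DURABILITY) -> Fraction:
    """Expected number of objects lost per year (assumes independent losses)."""
    return object_count * (1 - durability)


def years_per_expected_loss(object_count: int,
                            durability: Fraction = DURABILITY) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count, durability)


if __name__ == "__main__":
    # Hypothetical workload: ten million stored objects.
    print(years_per_expected_loss(10_000_000))  # prints 10000
```

Under these assumptions, storing ten million objects corresponds to an expected loss of a single object roughly once every 10,000 years.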
Similarly, Twilio followed a set of architectural design principles that helped them to sustain growth and to minimize the impact of occasional but inevitable issues in underlying infrastructure.
In the paper On Designing and Deploying Internet-Scale Services, James Hamilton describes the tenets and overall application design behind the Windows Live Services Platform.
Establishing clear architectural guidance and operational guardrails for teams improves the overall system design quality and reduces the time teams need to make decisions.
Many organizations are adopting the “you build it, you run it” principle to increase teams’ autonomy. However, teams need a certain level of maturity before they can operate services successfully. A production readiness review is a helpful mechanism to support teams preparing new services. Implemented as a questionnaire or checklist, it gives teams guidance on what to consider before bringing a new service into production.
Production readiness reviews guide teams on which categories of service levels to consider, which organizational standards to comply with, and what documentation is required. Many organizations use production readiness reviews as part of their go-live process. Examples include Grafana Labs and GitLab, which have made their production readiness review playbooks publicly available, and Google, which popularized this approach as part of the pager hand-off process in its site reliability engineering model.
For organizations concerned that a review process could negatively impact a team’s ability to go live, having a definition of production readiness can at least provide some guidance and document the agreed-upon criteria for the organization.
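A checklist-style production readiness definition could be captured in code roughly as follows. This is only a sketch: the class names, categories, questions, and the `orders-api` service are hypothetical illustrations, not taken from any of the playbooks mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CheckItem:
    category: str          # e.g. "service levels", "documentation"
    question: str          # what the team must answer or confirm
    required: bool = True  # optional items never block go-live
    done: bool = False


@dataclass
class ReadinessReview:
    service: str
    items: list[CheckItem] = field(default_factory=list)

    def blocking(self) -> list[CheckItem]:
        """Required items that are still open."""
        return [i for i in self.items if i.required and not i.done]

    def is_ready(self) -> bool:
        """A service is ready once no required item is open."""
        return not self.blocking()

    def summary(self) -> dict[str, tuple[int, int]]:
        """Per category: (items done, items total)."""
        result: dict[str, tuple[int, int]] = {}
        for item in self.items:
            done, total = result.get(item.category, (0, 0))
            result[item.category] = (done + int(item.done), total + 1)
        return result


if __name__ == "__main__":
    # Hypothetical review for an imaginary service.
    review = ReadinessReview("orders-api", [
        CheckItem("service levels", "Are SLOs defined and monitored?", done=True),
        CheckItem("organizational standards", "Does logging follow the standard?"),
        CheckItem("documentation", "Is there an on-call runbook?", done=True),
        CheckItem("documentation", "Is a diagram published?", required=False),
    ])
    print(review.is_ready())
    for item in review.blocking():
        print(f"open: [{item.category}] {item.question}")
```

Marking items as required or optional is one way to address the concern about slowing teams down: the agreed-upon criteria stay documented, while only a small required set gates the go-live decision.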
It’s easy for teams to lose sight of how to make decisions and trade-offs and get distracted by the nuances of everyday software delivery challenges. Organizations should define mechanisms to align teams, make decisions, and sustain technical excellence. A Well-Architected framework helps teams understand the pros and cons of their choices (including security, reliability, operational excellence, performance efficiency, cost optimization, and sustainability).
Cloud providers have published well-architected frameworks that describe how best to architect solutions on their platforms. Teams should use them as a starting point for developing their own frameworks. In a series of articles, I am starting to explore how to create, adopt, and scale up architectural and operational principles across the board using well-architected frameworks.
Failures are a given, and everything will eventually fail over time
If you have embarked on a similar journey, I would love to hear about it. Please reach out.