I am currently on a short sabbatical from my role with KPMG UK, which gives me plenty of time to get back to technical blogging. Today, from Airlie Beach, Australia (the gateway to the Whitsundays), I’m writing a short, opinionated article about my three golden rules for delivering cloud-based technology.
Cloud-based technology in this context can range from ad-hoc scripts that retrieve some metadata from a cloud environment, through new functionality on an existing product, to completely greenfield applications.
Regardless of the size of the solution, the three basic principles remain the same and form the basis of my design philosophy: maintainability, security, and reliability.
Simply put, maintainability is the ease with which a tool and its features can be kept working over time.
Aim for “No Ops” – I love the term “No Ops”. It means that the solution runs with little or no manual intervention. In many cases, reaching this nirvana requires forward thinking and investment. Done right, your team can focus on new features without worrying about keeping old solutions running. Avoid manual intervention wherever possible and automate as if your life depends on it!
Complexity is the number one enemy of maintainability – I always try to keep my solutions as simple as possible. The old saying “Don’t reinvent the wheel” applies strongly here. Use PaaS/SaaS services if your requirements allow, and choose a well-maintained open source library instead of starting from scratch.
Low barriers to entry – Junior and new team members should be able to contribute to the product with minimal friction. This is achieved through solid documentation, contribution guidelines, and a backlog of issues tagged “good first issue”, ready for new engineers to gain experience. A solution that only one person can maintain is a single point of failure and will certainly hurt your team’s velocity.
Everything as Code – The best documentation is clean code. Make sure all infrastructure is described using Terraform or Bicep, so that engineers can easily refer to the topology in a language they understand. Other good candidates include machine images (Packer/Ansible), policies (YAML/JSON), K8s (Helm), and of course the source code itself. Ideally, the solution should be immutable, meaning you can easily recreate it from scratch. If you can’t, identify the manual steps and open tickets to automate them.
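To make the "recreate it from scratch" idea concrete, here is a minimal sketch of drift detection: comparing what the code declares against what is actually running. It is only an illustration of the concept, loosely in the spirit of what a plan step in an infrastructure-as-code tool reports. The setting names (`sku`, `https_only`, `min_tls`, `public_access`) and the `find_drift` helper are hypothetical, not taken from any real provider API.

```python
# Marker used when a setting exists on one side only (hypothetical convention).
ABSENT = "<absent>"


def find_drift(declared: dict, observed: dict) -> dict:
    """Return {setting: (declared_value, observed_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, ABSENT)
        have = observed.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # What the code in the repository says the resource should look like.
    declared = {"sku": "Standard_LRS", "https_only": True, "min_tls": "1.2"}
    # What someone changed by hand in the portal (assumed, for illustration).
    observed = {
        "sku": "Standard_LRS",
        "https_only": False,
        "min_tls": "1.2",
        "public_access": True,
    }
    for setting, (want, have) in find_drift(declared, observed).items():
        print(f"{setting}: declared={want!r} observed={have!r}")
```

Every entry in the result is either a manual change to undo or a manual step that has not yet been captured as code, which makes it a natural source for those automation tickets.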
Almost every day, our LinkedIn and Medium feeds are filled with news of companies falling victim to cloud-based data breaches, usually through social engineering or accidental misconfigurations. Whatever your solution in the cloud, keeping it secure is money well spent!
Invest early in guardrails – All major cloud providers have rich built-in policies that can protect your organization from critical cloud misconfigurations. Many of them are available out of the box, so configure them as early as possible as a baseline. As your organization matures, it’s worth providing a way to deploy these cloud policies as code, making it easier to adapt to new standards. See the Azure-based example below.
Security is everyone’s responsibility – A yearly privileged access e-learning course is not enough (if your organization even has one!). The threat landscape is constantly changing and cybercriminals are getting more sophisticated every day. Security should be part of every engineer’s objectives. Encourage cloud security certifications, pair your threat intelligence function with your engineers, and encourage regular reading of threat reports such as the NCSC ones linked below. Learn from where others have failed and address the gaps.
Beware of overly permissive accounts – The principle of least privilege is gospel for everyone involved in cybersecurity, but I’ve seen some questionable constructs in my career. Only assign the permissions a tool or solution actually needs. If you do need high-level permissions, see if you can combine them with mitigating controls such as Conditional Access policies, which can, for example, restrict the use of credentials to trusted IP ranges or devices. Below I’ve linked a very cool preview feature from Microsoft.
Mind your secrets – If your solution relies on shared service accounts, make sure the keys are rotated regularly. A better alternative is credential-less access using AWS IAM Roles or Azure Managed Identities. Finally, you should implement robust secret scanning in your SCM toolset. Access keys accidentally committed to git repositories are an easy win for malicious users.
Secure enough – Be aware that security features can come with additional cost and complexity. There is value in having a quick, standardized way to risk-assess a piece of technology and apply a reasonable level of control. Don’t overdo security, otherwise maintainability and possibly reliability are sacrificed.
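As a rough illustration of the secret scanning point above, the sketch below looks for one well-known pattern: AWS access key IDs for long-lived IAM user keys, which start with `AKIA` followed by 16 uppercase letters or digits. The `find_aws_key_ids` helper is my own hypothetical name, and the sample key is the placeholder AWS uses in its documentation. A dedicated scanner in your SCM platform covers far more credential types, so treat this as a way to understand the idea rather than a replacement for one.

```python
import re

# Long-lived AWS access key IDs: "AKIA" plus 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_aws_key_ids(text: str) -> list:
    """Return (line_number, key_id) pairs for every match in the text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            findings.append((line_number, match.group(1)))
    return findings


if __name__ == "__main__":
    # Stand-in for the contents of a file about to be committed.
    sample = 'region = "eu-west-2"\naws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
    for line_number, key_id in find_aws_key_ids(sample):
        # Print only a prefix so the scanner output does not leak the key.
        print(f"line {line_number}: possible AWS access key ID {key_id[:8]}...")
```

A check like this is most useful before code leaves the engineer's machine, for example in a pre-commit hook, because a key that reaches the remote repository should be treated as compromised and rotated.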
Reliability is how dependably a solution keeps working. When it stops, it ruins the user’s day and the on-call engineer’s night.
Design around entropy – Entropy is a scientific measure of disorder or uncertainty. Every change you make to your system increases entropy, so run a battery of tests to make sure you know when a change breaks or regresses it. Combine this with your deployment pipeline to make it easy to roll back problematic changes.
Monitor and respond to key symptoms – Unfortunately, most systems contain unreliable components, so make sure you’re in a position to watch for the major signs of failure. Ideally, these are combined with automated runbooks that remediate symptoms before they lead to outages.
Choose trusted components – This may be stating the obvious, but some components and services are simply more reliable than others, so consult their documentation to ensure the stated availability meets your business requirements.
Vital alerts – After all, it’s better that you find system failures before your end users do. Use health checks and critical job failures to detect and alert on critical system failures. Link these to your chosen alerting mechanism and make sure engineers know how to respond, to maximize uptime.
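The health check and alerting idea above can be sketched in a few lines. This is a minimal example under my own assumptions: the `HealthMonitor` class, its `threshold` parameter, and the "alert once after N consecutive failures" rule are hypothetical choices for illustration, and the `check` and `alert` callables stand in for whatever probe and alerting mechanism you actually use.

```python
from typing import Callable


class HealthMonitor:
    """Raise one alert when a health check fails several times in a row."""

    def __init__(self, check: Callable[[], bool],
                 alert: Callable[[str], None], threshold: int = 3):
        self.check = check          # returns True when the system is healthy
        self.alert = alert          # called with a message when we give up
        self.threshold = threshold  # consecutive failures before alerting
        self.consecutive_failures = 0

    def poll(self) -> bool:
        """Run the check once and return whether the system looked healthy."""
        try:
            healthy = bool(self.check())
        except Exception:
            # A probe that crashes is treated the same as a failed probe.
            healthy = False
        if healthy:
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        # Alert exactly once per outage, when the threshold is first reached.
        if self.consecutive_failures == self.threshold:
            self.alert(f"health check failed {self.threshold} times in a row")
        return False


if __name__ == "__main__":
    # Simulated probe results; a real check might call an HTTP health endpoint.
    results = iter([True, False, False, False, True])
    monitor = HealthMonitor(check=lambda: next(results), alert=print)
    for _ in range(5):
        monitor.poll()
```

Requiring several consecutive failures is a deliberate trade-off: it delays the alert slightly but avoids paging the on-call engineer for a single transient blip.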
Both business criticality and data sensitivity naturally dictate how much to invest in each of these areas. Based on my experience, systems become more critical and sensitive over time, so start thinking about these areas from the beginning and iterate over time.
That’s a wrap, people! I hope you enjoyed reading this article. As with everything, this is not an exhaustive list, but these rules have guided my cloud design decisions over the years, regardless of the size of the solution.