The rise in popularity of cloud providers such as AWS, Google Cloud, and Microsoft Azure over the last decade has brought enormous change to the way we build and operate web services. With many of the labor-intensive tasks of the past, such as provisioning new server capacity, abstracted away behind convenient APIs, companies now have unprecedented levels of flexibility when it comes to how they design and operate the infrastructure needed to power their ambitious applications.
As a result, our production environments have become larger and more complex—and it’s increasingly difficult for a typical security team to safeguard them manually. As security teams struggle to keep pace using old best practices, automation is a key lever that can enable teams to perform their work effectively at scale. Aided by a growing ecosystem of powerful tools built to take advantage of the comprehensive APIs offered by cloud providers, this automation makes it possible to build highly secure solutions while keeping your team moving fast.
Automation and security
Before we explore the relationship between automation and security in detail, we should first take a moment to identify the properties of automation that are most important to us. From a security perspective, automation provides two primary benefits: reproducibility and least-privilege isolation.
The fact that automated tasks produce identical, reproducible results matters because it yields a homogeneous and orderly production environment that is significantly easier to manage and secure. Rather than having to give unique care and attention to each server, we can work with them as a collection of disposable, easily replaceable entities where no one server is essential. Although many companies start out manually provisioning and configuring their servers, this is less than ideal because it creates a constant risk of introducing inconsistencies or misconfigurations through human error.
While this might not sound very serious, such misconfigurations are so commonly the root cause of security incidents that they were ranked sixth on the 2017 OWASP list of the top ten application security risks. By properly applying automation, you can adequately control for potential sources of noise when evaluating security risks and minimize potential incidents caused by updates not being applied uniformly.
A second benefit of automation is that it allows for easier isolation of the agents that execute tasks, using what is known as the “principle of least privilege.” In this context, an agent could be anything: an engineer on your team performing a manual change as part of their daily responsibilities; a third-party software service like GitHub to which you’ve provided API credentials to power an integration; or even a program that you’ve created to perform a specific task.
The principle of least privilege states that each agent should be granted only the bare minimum set of permissions required to perform their respective task. While this may not feel familiar at first, its applications are commonplace in real life. For instance, it’s unlikely that every employee in your company is authorized to make wire transfers from the company bank account, and with very good reason.
When applied properly, this approach greatly reduces the potential fallout in the event that an agent is compromised or executed maliciously. This is because an attacker in this scenario would have their potential scope for mischief curtailed by the strict permissions granted to the compromised agent. By sufficiently isolating each agent, we can make it much more difficult for an attacker who has successfully compromised a component of your production environment to pivot to a more powerful position where they might gain more sensitive permissions or access to a protected data set.
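As a rough illustration of the principle, the sketch below models each agent as a name plus an explicit allow-list of actions and denies anything that was not granted. The agent names and action strings are hypothetical and are not tied to any particular provider's permission model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """An actor (engineer, integration, or program) with an explicit allow-list."""
    name: str
    allowed_actions: frozenset = frozenset()


def is_allowed(agent: Agent, action: str) -> bool:
    """Deny by default: an action is permitted only if it was explicitly granted."""
    return action in agent.allowed_actions


# Hypothetical agents: each is granted only what its task requires.
deploy_bot = Agent("deploy-bot", frozenset({"servers:launch", "servers:terminate"}))
ci_integration = Agent("ci-integration", frozenset({"artifacts:read"}))

print(is_allowed(deploy_bot, "servers:launch"))      # True
print(is_allowed(ci_integration, "servers:launch"))  # False
```

If the hypothetical ci-integration credentials were stolen, the attacker could read build artifacts but could not launch servers, which is the reduced scope for mischief described above.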
A journey of a thousand commits starts with a single bash script
If you’re eager to roll up your sleeves and start automating your company’s most important tasks but aren’t sure where to start, you should first consider sets of common manual tasks that teams in your company frequently carry out. For example, if your company has a manual process to provision additional server capacity on EC2 ahead of deploying a new service, automation might be a helpful change.
With any luck, your company has documentation for these scenarios in the form of checklists that guide engineers through the necessary steps. These checklists are the perfect starting point for someone hoping to introduce automation into their company’s processes, and can serve as recipes for you to automate using simple bash scripts.
Most cloud providers offer a set of command line utilities that enable you to easily perform actions using their APIs from your local terminal. We can take advantage of these tools by quickly piecing together bash scripts that perform the necessary steps in sequence without requiring much setup or many external dependencies. As a result, the scripts you produce here can be surprisingly short and sweet. A lengthy checklist that took an engineer an hour or two to work through manually might be reduced to a 30-line bash script that executes in a few seconds and produces the exact same result every time.
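To make the idea concrete, here is a minimal sketch of one checklist step ("launch servers for a new service") turned into a repeatable script. It is written in Python rather than bash so that it can be dry-run safely, and it simply shells out to the AWS command line utility. The AMI ID, instance type, and tag value are placeholders, and the step list is an assumption about what such a checklist might contain.

```python
import subprocess

# Placeholder values; substitute your own AMI, instance type, and service name.
IMAGE_ID = "ami-EXAMPLE"
INSTANCE_TYPE = "t3.micro"
SERVICE_TAG = "example-service"


def build_launch_command(image_id: str, instance_type: str, count: int, service: str) -> list:
    """Build the AWS CLI invocation for the 'launch servers' checklist step."""
    return [
        "aws", "ec2", "run-instances",
        "--image-id", image_id,
        "--instance-type", instance_type,
        "--count", str(count),
        "--tag-specifications",
        f"ResourceType=instance,Tags=[{{Key=service,Value={service}}}]",
    ]


def run_steps(steps, dry_run: bool = True) -> list:
    """Run each step in order, stopping at the first failure.

    With dry_run=True the commands are only printed, never executed.
    """
    executed = []
    for command in steps:
        print("+", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # raises on a non-zero exit code
        executed.append(command)
    return executed


steps = [build_launch_command(IMAGE_ID, INSTANCE_TYPE, 2, SERVICE_TAG)]
run_steps(steps, dry_run=True)
```

Because the commands are built by a function rather than typed by hand, every engineer who runs the script issues exactly the same request, and the dry-run default lets a reviewer see what would happen before anything is provisioned.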
Converting these process checklists into an internal repository of bash scripts for your engineering team provides an obvious productivity benefit for those following in your footsteps. It also eliminates a whole class of human error and misconfiguration in production, because every engineer who uses the scripts gets identical results.
In addition, this collected repository of scripts makes things much easier for the security team when they need to introduce new services or make configuration changes in production. Instead of updating the relevant documentation and relying on engineers to carry out the work on their behalf, they can update the scripts in place and be confident that the changes will go into effect the next time they’re executed.
Automating away your permissions
Creating this internal repository of scripts is a boon to both your security program, in the form of a more uniform and clean production environment, and to your fellow engineers in terms of time saved. However, we’re still left with some hard problems to tackle since these scripts must be executed manually. Chief among them is the need to provide the necessary API permissions to everyone in your engineering organization in order to ensure that they can fully utilize the scripts without encountering access exceptions.
The primary concern here is that a large number of engineers retain the ability to perform very sensitive changes, even if the need for them to do so arises only rarely. By generously providing your teams with access like this, you make each engineer a very high-value target for prospective attackers, who might try to compromise their development machines in the hope of stealing their API credentials.
While you could turn your energy and attention toward adequately securing your development laptops, it’s possible to achieve a much better outcome by improving your existing body of scripts and creating complementary infrastructure to execute them automatically. Doing so also dispenses with the need to provide your developers with API access in the first place.
It’s possible, depending on the scale and complexity of your environment, that you might need more advanced tooling than just your corpus of bash scripts to completely automate all of the work needed to operate your service effectively. A common approach for companies in this position is to adopt “infrastructure as code” practices that are designed to help with this problem.
Infrastructure as code refers to the use of programs that automatically provision cloud resources from machine-readable configuration files describing the desired state of your infrastructure, rather than relying on interactive administrative interfaces like the AWS Console. This enables you to make changes to your environment very quickly by editing the appropriate configuration files to include definitions for the resources you wish to create. These tools then compare your updated configuration with an up-to-date snapshot of your environment, compute the changes that need to be made, and execute them for you.
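The "compute the changes" step can be sketched as a comparison between two dictionaries: the desired state from your configuration files and a snapshot of what currently exists. This is a simplified illustration under assumed resource names and attributes, not the algorithm used by any specific tool.

```python
def plan_changes(desired: dict, current: dict) -> dict:
    """Compare desired and current state, both keyed by resource name."""
    return {
        # In the configuration but not yet in the environment.
        "create": sorted(desired.keys() - current.keys()),
        # In the environment but no longer in the configuration.
        "delete": sorted(current.keys() - desired.keys()),
        # Present in both, but with different attributes.
        "update": sorted(
            name for name in desired.keys() & current.keys()
            if desired[name] != current[name]
        ),
    }


# Hypothetical desired state (from configuration files).
desired = {
    "web-1": {"type": "server", "size": "small"},
    "web-2": {"type": "server", "size": "small"},
    "db-1": {"type": "database", "size": "large"},
}
# Hypothetical snapshot of the live environment.
current = {
    "web-1": {"type": "server", "size": "small"},
    "db-1": {"type": "database", "size": "medium"},
    "old-worker": {"type": "server", "size": "small"},
}

print(plan_changes(desired, current))
# {'create': ['web-2'], 'delete': ['old-worker'], 'update': ['db-1']}
```

Real tools also have to order changes by dependency and handle partial failures, but the core idea is the same: the configuration is the source of truth, and the plan is whatever it takes to make the environment match it.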
On top of the obvious productivity boost, one of the additional benefits of this approach is that it allows you to manage your infrastructure with the same level of rigor as your software development practices by treating your infrastructure configuration files in a similar manner to your application source code. You can use the same mechanisms that you’re likely already familiar with, such as version control, pull requests that introduce changes and collect the necessary approvals, test suites that evaluate each branch to ensure that changes are both syntactically and logically correct, and even a continuous deployment pipeline that applies changes once they’ve been merged.
By leveraging automation and borrowing best practices from your software development process, you arrive at a much more secure and effective solution. Your engineers retain the ability to move quickly and to easily introduce infrastructure changes by updating the appropriate configuration in a controlled manner, and the security team eradicates a considerable area of risk by revoking the set of sensitive permissions previously given to each engineer. You will have automated away the engineers’ need for those permissions.
Who watches the robots?
With your fully automated pipeline in place, the next step is to introduce a thorough auditing component to your security program to help you continually monitor for anomalous activity or policy exceptions.
One of the most effective ways to do this is to utilize the comprehensive activity audit functionality offered by most of the major cloud service providers, such as AWS CloudTrail or Google’s Cloud Audit Logging. These audit logs contain a complete, detailed history of the actions performed in your account, along with important metadata such as the identity of the actors and the source IP address for the corresponding API request. Typically, these logs are made available via an easy-to-use API or a similar mechanism to facilitate their programmatic consumption.
By replacing your engineers’ API permissions with mandatory automation, you’ll drastically reduce the noise in your API traffic, resulting in a much more valuable signal for you to monitor as part of your security program. This strategically avoids the need to audit human API activity, while enabling you to write much stricter rules that you can use to evaluate whether a given API action that you observe in the log was authorized or whether it represents a security risk.
If you then introduce an additional requirement that pull requests made in your configuration repositories must be approved by a separate engineer, you can safely discard any events in the audit log belonging to the corresponding pipeline agent because you’ll know that the changes were properly reviewed and authorized. Similarly, any actions that you observe being performed by anyone other than the authorized agents can be immediately sent to your security team, allowing you to respond in near real time to any incidents. You might even wish to implement some automated remediation, depending on the event in question. In the example of an unauthorized server launch, you might wish to immediately deprovision the server to prevent the attacker from utilizing it for any further attacks.
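A rule of this kind might look like the following sketch. The event records are simplified and only loosely modeled on the fields found in AWS CloudTrail entries (userIdentity, eventName, sourceIPAddress). The pipeline ARN, account number, IP addresses, and the choice of which events trigger remediation are all hypothetical.

```python
# Hypothetical identity of the one agent allowed to change production.
AUTHORIZED_AGENTS = {"arn:aws:iam::123456789012:user/deploy-pipeline"}

# Assumed policy: events that warrant automatic remediation when unauthorized.
REMEDIATE_EVENTS = {"RunInstances"}


def classify_event(event: dict) -> str:
    """Return 'ignore', 'alert', or 'alert-and-remediate' for one audit record."""
    actor = event.get("userIdentity", {}).get("arn", "")
    if actor in AUTHORIZED_AGENTS:
        # Changes made by the pipeline were already reviewed via pull request.
        return "ignore"
    if event.get("eventName") in REMEDIATE_EVENTS:
        # For example, an unauthorized server launch that should be deprovisioned.
        return "alert-and-remediate"
    return "alert"


events = [
    {"eventName": "RunInstances",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/deploy-pipeline"},
     "sourceIPAddress": "198.51.100.10"},
    {"eventName": "RunInstances",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/unknown"},
     "sourceIPAddress": "203.0.113.7"},
    {"eventName": "DescribeInstances",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/unknown"},
     "sourceIPAddress": "203.0.113.7"},
]

for event in events:
    print(event["eventName"], "->", classify_event(event))
```

The rule can be this strict only because humans no longer hold API permissions: anything not attributable to the pipeline agent is, by construction, worth a look.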
Security + robots: Better together
I hope that this article leaves you with a good sense of how automation can help your company tackle hard security problems and stay on top of the challenges of managing an ever-changing production environment. Leveraging automation raises your security bar substantially and makes the secure path the path of least resistance for your engineers.