Editor's note: This post originally appeared on DuploCloud's blog.
As organizations move to cloud-based infrastructure, more and more teams are looking to adopt some kind of Infrastructure-as-Code solution. This lets them keep all changes under a version control system and have infrastructure changes reviewed by a senior DevOps engineer before they are actually deployed.
While Infrastructure-as-Code has its advantages, it also comes with its own set of challenges. Based on conversations with VPs of engineering, CTOs, and DevOps leaders, we have grouped their problems into a few categories:
Two Difficult Skills (Programming and Operations) Needed in One Person: Your DevOps team needs to be good at both programming and operations, so now you need to hire for two skills in one person. In many cases, people are good at one and not so great at the other. If a person is primarily a developer who has turned into a DevOps engineer, they are good at programming but may deploy less-secure or sub-optimal infrastructure. On the other hand, if existing ops experts or system administrators learn to code, they often write suboptimal code with poor organization, which makes code management and future changes harder. A lot of code gets copy-pasted instead of being organized into proper modules.
Scaling Adds Complexity: As the team scales, if the code is not structured properly, it gets harder for team members to make changes without stepping on each other. One needs to understand the overall code structure, layout, and split across files to know where to add more resources. If there is churn in the team, new members may prefer a different layout and structure, creating extra work. Doing this well requires following best practices. Here is a detailed article from HashiCorp on how to organize your code for scale. It is a very interesting read, but its guidelines require proper processes and learning to be in place for all team members.
Changes Take Time: Small changes or enhancements take much longer because one has to go through the code, review, testing, check-in, and run process for every minor change. As infrastructure grows, one needs to look at code organization, restructuring, and adding modules to keep the code DRY, and the role of an expert reviewer becomes even more critical. To test changes, one also needs to write test code and manage that as well. Between coding, reviews, and testing, managing Infrastructure-as-Code itself becomes a complete software development project over time. For an example of how to write unit tests for Terraform code, see this article. One read-through of it will show why Infrastructure-as-Code is not suitable for all teams and requires a certain amount of expertise to do well.
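To give a flavor of what such test code can look like, here is a minimal Python sketch (not from the linked article) that scans the JSON output of `terraform show -json plan.out` for S3 buckets planned without server-side encryption. The plan fragment and the attribute checked are illustrative assumptions, not a complete policy:

```python
def find_unencrypted_buckets(plan: dict) -> list:
    """Scan a Terraform plan (shaped like `terraform show -json` output)
    for aws_s3_bucket resources with no server-side encryption block."""
    violations = []
    resources = (plan.get("planned_values", {})
                     .get("root_module", {})
                     .get("resources", []))
    for res in resources:
        if res.get("type") != "aws_s3_bucket":
            continue
        values = res.get("values", {})
        # An empty/missing encryption configuration is flagged as a violation.
        if not values.get("server_side_encryption_configuration"):
            violations.append(res.get("address"))
    return violations

# Illustrative plan fragment; a real test would json.load() the plan file.
sample_plan = {
    "planned_values": {
        "root_module": {
            "resources": [
                {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
                 "values": {"server_side_encryption_configuration": []}},
                {"address": "aws_s3_bucket.assets", "type": "aws_s3_bucket",
                 "values": {"server_side_encryption_configuration": [{"rule": []}]}},
            ]
        }
    }
}

print(find_unencrypted_buckets(sample_plan))  # → ['aws_s3_bucket.logs']
```

Even a toy check like this has to be versioned, reviewed, and kept in sync with the plan format, which is exactly the maintenance burden described above.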
Security and Compliance Can’t Be an Afterthought: Your team needs to know how to deploy infrastructure securely. Many security and compliance controls need to be met at provisioning time. If mistakes are made during provisioning and the problems are detected later in an audit, the fixes can take weeks to months. For example, if the VPCs or subnets are not properly configured, redoing that work will require careful planning and execution so as not to lose IP addresses or connectivity to machines during the transition. Applying security and compliance as an afterthought after initial provisioning is simply not a scalable long-term approach.
Code to No-Code
Given the advances in machine-based automation and learning, it shouldn’t be very hard to have an intelligent program or bot take care of the code for us.
Ultimately, much of the expertise around infrastructure deployment can be captured as rules or a knowledge graph that a program can use. The bot should be able to understand a high-level requirement or specification for the application deployment, provision all the underlying infrastructure in a fully secure and compliant manner, and ultimately generate the Terraform or other Infrastructure-as-Code output for the team to keep.
Once the bot produces the right output, it should be fairly easy to change a few variables and deploy that infrastructure to create a dev or staging environment that looks identical to production. Essentially, the code still gets written, just not necessarily by humans.
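As a sketch of what "change a few variables" can mean in practice, here is a small Python example that renders per-environment `.tfvars` bodies from one shared table of overrides. The variable names, instance sizes, and counts are hypothetical:

```python
# Per-environment overrides; everything else stays shared (hypothetical values).
ENVIRONMENTS = {
    "prod":    {"instance_type": "m5.xlarge", "instance_count": 6, "multi_az": True},
    "staging": {"instance_type": "m5.large",  "instance_count": 2, "multi_az": True},
    "dev":     {"instance_type": "t3.medium", "instance_count": 1, "multi_az": False},
}

def render_tfvars(env: str) -> str:
    """Render the body of a .tfvars file for one environment."""
    settings = ENVIRONMENTS[env]
    lines = [f'environment = "{env}"']
    for key, value in settings.items():
        # Check bool before int: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            lines.append(f"{key} = {str(value).lower()}")
        elif isinstance(value, int):
            lines.append(f"{key} = {value}")
        else:
            lines.append(f'{key} = "{value}"')
    return "\n".join(lines) + "\n"

print(render_tfvars("dev"))
```

The same generated infrastructure definition is then applied with `dev.tfvars`, `staging.tfvars`, or `prod.tfvars`, so every environment is a faithful, smaller copy of production.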
Well Architected Design
Public cloud vendors like AWS publish well-architected design guidelines. These best practices can be learned, measured against a deployment, and implemented where they are not being followed. AWS and many of its partners offer this as a free service; ultimately, poorly designed infrastructure on AWS also makes AWS look bad. If customers on a cloud get hacked often, it tarnishes the cloud vendor's reputation through no fault of its own. In fact, AWS offers many tools to help with the security audit of a customer's infrastructure. Some of the common things they check are network ACLs, security groups, open ports, public access to databases, public access to S3 buckets, and so on.
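To make one of those checks concrete, here is a minimal Python sketch that flags security-group ingress rules exposing sensitive ports to the whole internet. The rule dicts mirror the `IpPermissions` shape returned by the AWS `DescribeSecurityGroups` API, and the sensitive-port list is an illustrative assumption:

```python
# Ports that should never be reachable from 0.0.0.0/0 (illustrative list:
# SSH, RDP, Postgres, MySQL, MongoDB).
SENSITIVE_PORTS = {22, 3389, 5432, 3306, 27017}

def open_to_world(rule: dict) -> bool:
    """True if an ingress rule allows any sensitive port from 0.0.0.0/0.
    `rule` mirrors one IpPermissions entry from DescribeSecurityGroups."""
    ranges = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
    if "0.0.0.0/0" not in ranges:
        return False
    lo = rule.get("FromPort", 0)
    hi = rule.get("ToPort", 65535)
    return any(lo <= port <= hi for port in SENSITIVE_PORTS)

rules = [
    {"FromPort": 443,  "ToPort": 443,  "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},    # HTTPS: fine
    {"FromPort": 22,   "ToPort": 22,   "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},    # SSH open: flag
    {"FromPort": 5432, "ToPort": 5432, "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},  # internal: fine
]

print([open_to_world(r) for r in rules])  # → [False, True, False]
```

A periodic audit runs dozens of checks like this against a read-only snapshot of the account; a provisioning-time system would simply refuse to create the second rule in the first place.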
It is quite challenging for any human to learn all these guidelines and keep up with the new services, and their requirements, that come out every month. As a result, errors during the deployment phase are common. In most cases, secops teams use a set of tools that periodically sweep the complete infrastructure through a read-only account and check it against a set of guidelines. These tools then flag a list of violations at different severity levels, which the team has to go through and analyze one by one.
What a waste of time and resources for everyone.
Instead, why not have a tool that does the deployment while meeting all the well-architected design principles from provisioning time onward? You should never get into a bad state in the first place and then have to run continuous analysis tools to detect and fix problems.
Prevention Instead of (Detection + Remediation)
Instead of writing a lot of code to build infrastructure, making sure the code is correct, analyzing the deployment with a separate set of tools, fixing the findings in the code, re-deploying, and repeating this whole process over and over again, we need an automated bot that takes care of infrastructure provisioning, application deployment, and adding security controls to the underlying VMs or containers, and that alerts when something goes wrong.
We need to move from a detection mindset to a prevention mindset. That is the only way to a healthy lifestyle, as a doctor would say!
Figure A shows the current state of the art in terms of deploying applications in a cloud and Figure B shows what should happen.
Figure A: Lifecycle of Application Deployment using Infrastructure-as-code
Figure B: Lifecycle of Application Deployment using an intelligent bot
DuploCloud has built a solution, as shown in Figure B, that gives you the benefits of Infrastructure-as-Code without its drawbacks. We believe this is how applications and infrastructure should be deployed in the cloud. If you agree with this new approach and think this is a journey you would like to be on, please reach out to us.
We can’t promise how many users your application will have, but we can promise that your developers and operations team will not have to worry about whether security, compliance, and infrastructure follow the right set of principles and guidelines. They will get back a lot more time to spend on activities that really matter to the business instead of on security, compliance, and infrastructure provisioning.
As tools like Terraform, Ansible, and CloudFormation have made Infrastructure-as-Code a dominant design pattern, we are also realizing the limitations and challenges that come with using such tools, specifically in areas such as the mix of skill sets required, scalability as the code base grows, the time it takes to make changes, and security being treated as an afterthought. Furthermore, these tools lead teams to use even more tools to check their work for security issues, compliance, and well-architected design principles.
We are trying to create a world where all infrastructure provisioning and application deployment tasks can be done by an intelligent bot that does them the right way from day 1 and does not require continuous checking. More than two dozen of our customers use DuploCloud every day, managing more than a million dollars of AWS spend every year while doing 5,000+ deployments per week.
Always looking to connect with others in the cloud space — feel free to drop us a note right here.
Finally, there is an old saying for both personal and cloud infrastructure health: Prevention is better than cure!
About the Author: Venkat Thiruvengadam is the CEO and founder of DuploCloud and a pioneer in today’s public cloud technology. He was an early engineer at Microsoft Azure and the first developer and founding member of Azure’s networking team. He wrote significant parts of Azure’s compute and network controller stack and saw Azure grow from a hundred-odd servers to millions of nodes in just a few years. After leaving Microsoft, he realized that such hyper-scale automation techniques had not spread outside companies like AWS, Microsoft, and Google. This led Venkat to found DuploCloud and bring those hyper-scale automation techniques to Main Street IT.