Overview

There are many examples of how to create an OpenShift cluster in AWS. Most of these examples use CloudFormation for orchestrating the creation of infrastructure and deploying the cluster. This post walks through how to do it using Terraform.

Note: The code shown uses Terraform 0.12.x, but the concepts should still apply to Terraform 0.11.x.

This example also creates a few items outside of OpenShift that are required to get up and running in an AWS VPC, specifically:

  • VPC
  • Client VPN endpoints
  • Certificates for the endpoints

If these items are not required, you can skip the vpc module and continue on to the openshift module.

Terraform Deployment

The full code for this post can be found here. There are two branches of note: master, which contains an OpenShift deployment that uses the aws-iam-authenticator for AutoScaling (more on that later), and feature/terraform-standard-install. For now we will focus on the latter. To get started, clone the code and check out the right branch:

git clone https://gitlab.com/kjanania/openshift-terraform.git
cd openshift-terraform
git checkout feature/terraform-standard-install

You can check the README for the full list of steps and some convenience scripts.

Setting Up Prerequisites

Terraform maintains a state file to keep track of the resources it has created, so it can determine what needs to change to match the desired configuration. This state file can be stored locally on the file system or remotely in locations such as Amazon S3. This demo stores the state file in Amazon S3 and uses DynamoDB for concurrency locking.

We’ll start by creating the S3 bucket and DynamoDB tables:

aws dynamodb create-table --table-name terraform-vpc-lock --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
aws dynamodb create-table --table-name terraform-openshift-lock --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

aws s3 mb s3://<insert s3 bucket name here>
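
Table creation takes a few seconds, so here is an optional sanity check that the backend resources are ready before initializing Terraform:

# Both tables should report ACTIVE, and the bucket listing should succeed (it will be empty)
aws dynamodb describe-table --table-name terraform-vpc-lock --query 'Table.TableStatus'
aws dynamodb describe-table --table-name terraform-openshift-lock --query 'Table.TableStatus'
aws s3 ls s3://<insert s3 bucket name here>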

Next, we’ll need to upload certificates for use with the VPN. Certificates are provided in the repo for convenience, but they are not secure!

Create your own certificates as described in this document. The ones provided are for testing purposes only!

Upload the provided certificates or ones you have generated on your own:

SERVER_ARN=$(aws acm import-certificate --certificate file://terraform/vpc/certificates/server.crt --certificate-chain file://terraform/vpc/certificates/ca.crt --private-key file://terraform/vpc/certificates/server.key | jq -r '.CertificateArn')

aws acm add-tags-to-certificate --certificate-arn $SERVER_ARN --tags Key=Name,Value=vpn-server-cert

CLIENT_ARN=$(aws acm import-certificate --certificate file://terraform/vpc/certificates/client1.domain.tld.crt --certificate-chain file://terraform/vpc/certificates/ca.crt --private-key file://terraform/vpc/certificates/client1.domain.tld.key | jq -r '.CertificateArn')

aws acm add-tags-to-certificate --certificate-arn $CLIENT_ARN --tags Key=Name,Value=vpn-client-cert

The tags are important: they are what the Terraform code uses to find the certificates and attach them to the Client VPN endpoints.
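
You can confirm the certificates are discoverable by their tags; this is the same lookup the cleanup steps at the end of this post use:

# Each command should print the ARN of the corresponding imported certificate
aws resourcegroupstaggingapi get-resources --tag-filters Key=Name,Values=vpn-server-cert --resource-type-filters acm:certificate | jq -r '.ResourceTagMappingList[0].ResourceARN'
aws resourcegroupstaggingapi get-resources --tag-filters Key=Name,Values=vpn-client-cert --resource-type-filters acm:certificate | jq -r '.ResourceTagMappingList[0].ResourceARN'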

If you do not have a key pair for EC2 instances, go ahead and create one now.
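
For example (the key name and path here are placeholders chosen to match the .tfvars example below):

# Creates a new key pair and saves the private key locally
aws ec2 create-key-pair --key-name my-ec2-key --query 'KeyMaterial' --output text > ~/.ssh/my-ec2-key.pem
chmod 400 ~/.ssh/my-ec2-key.pem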

Now we can initialize both the vpc module and the openshift module:

cd terraform/vpc
terraform init -backend-config="bucket=<insert s3 bucket name here>"

cd ../openshift
terraform init -backend-config="bucket=<insert s3 bucket name here>"

After this, we’ll need to create a .tfvars file (my-vars.tfvars in the commands below) that sets these three variables:

openshift-cluster-name = "my-cluster"
ec2-key-location = "~/.ssh/my-ec2-key.pem"
ec2-key-name = "my-ec2-key"
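
With the variables file in place, you can optionally dry-run the vpc module to confirm the backend and variables are wired up correctly before creating anything:

# From terraform/openshift, switch back to the vpc module and preview its plan
cd ../vpc
terraform plan -var-file=my-vars.tfvars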

Creating the VPC

With the state backend in place, let’s create the VPC.

If you already have a VPC you can skip this step, but you’ll need to tag your existing resources. Add the tag kubernetes.io/cluster/my-cluster = shared (where my-cluster matches your openshift-cluster-name value) to these resources, as shown in the example command after the list:

  • the VPC in which the cluster will be deployed
  • the Subnets in which the cluster AutoScaling Groups will create instances
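
A quick way to apply that tag to existing resources (the VPC and subnet IDs below are placeholders for your own):

aws ec2 create-tags --resources vpc-0123456789abcdef0 subnet-0123456789abcdef0 --tags 'Key=kubernetes.io/cluster/my-cluster,Value=shared'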

From the terraform/vpc directory, run the following command to apply the Terraform plan:

terraform apply -var-file=my-vars.tfvars

This will create a VPC that you can connect to over VPN. If you’re on Fedora or a similar operating system, you can use the OpenVPN CLI to connect:

sudo openvpn --config client.config --cert certificates/client1.domain.tld.crt --key certificates/client1.domain.tld.key

Otherwise, connect using your preferred compatible client.

Creating the OpenShift Cluster

Once you’ve connected over VPN (the connection is needed to start the install on the bastion), you can deploy the OpenShift cluster. Move to the openshift directory and apply the Terraform plan:

terraform apply -var-file=my-vars.tfvars

This will create all the infrastructure required for the OpenShift cluster:

  • 3 t2.medium master-infra nodes
  • 3 t2.medium compute nodes
  • 1 t2.medium bastion node

Once the infrastructure is created, the Terraform plan will wait for the nodes to become ready using the trick described in this post.

The deployment runs from the bastion node using Ansible. Since we’re using smaller instance types, some of the minimum requirement checks are disabled. If you update the variables to use larger nodes, you can re-enable those checks by commenting out these lines in the inventory file:

        openshift_disable_check:
          - memory_availability

Under the Hood

So what’s going on during the deployment? To make things easier to troubleshoot and manage, the deployment is broken up into several phases:

  • Infrastructure deployment
    • Bastion
    • Masters
    • Workers
  • OpenShift deployment
    • Wait for nodes to become ready
    • Prepare Ansible inventory file
    • Run OpenShift cluster install

During each infrastructure phase, the corresponding IAM roles, instance profiles, and security groups are created as well. Each component is its own module, and the cluster install is a separate module too. During the cluster install, an Ansible inventory file is dynamically generated with some reasonable defaults; the addresses of the nodes are fed into the template and used for the cluster installation.

It Works!

After the Terraform plan is complete, you should be able to SSH to one of the master nodes and begin using the cluster.
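
For example, while still connected to the VPN (the key path matches the .tfvars example above, and the address is the first master from the session below; hop through the bastion instead if your security groups only allow SSH from there):

ssh -i ~/.ssh/my-ec2-key.pem ec2-user@10.0.1.8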

[ec2-user@ip-10-0-1-8 ~]$ oc get nodes
NAME                         STATUS    ROLES          AGE       VERSION
ip-10-0-1-8.ec2.internal     Ready     infra,master   12m       v1.11.0+d4cacc0
ip-10-0-2-39.ec2.internal    Ready     compute        9m        v1.11.0+d4cacc0
ip-10-0-3-136.ec2.internal   Ready     infra,master   12m       v1.11.0+d4cacc0
ip-10-0-3-236.ec2.internal   Ready     compute        9m        v1.11.0+d4cacc0
ip-10-0-4-215.ec2.internal   Ready     compute        9m        v1.11.0+d4cacc0
ip-10-0-4-54.ec2.internal    Ready     infra,master   12m       v1.11.0+d4cacc0

[ec2-user@ip-10-0-1-8 ~]$ oc new-app https://github.com/openshift/ruby-hello-world.git
--> Found Docker image e42d0dc (16 months old) from Docker Hub for "centos/ruby-22-centos7"

    Ruby 2.2 
    -------- 
    Ruby 2.2 available as container is a base platform for building and running various Ruby 2.2 applications and frameworks. Ruby is the interpreted scripting language for quick and easy object-oriented programming. It has many features to process text files and to do system management tasks (as in Perl). It is simple, straight-forward, and extensible.

    Tags: builder, ruby, ruby22

    * An image stream tag will be created as "ruby-22-centos7:latest" that will track the source image
    * A Docker build using source code from https://github.com/openshift/ruby-hello-world.git will be created
      * The resulting image will be pushed to image stream tag "ruby-hello-world:latest"
      * Every time "ruby-22-centos7:latest" changes a new build will be triggered
    * This image will be deployed in deployment config "ruby-hello-world"
    * Port 8080/tcp will be load balanced by service "ruby-hello-world"
      * Other containers can access this service through the hostname "ruby-hello-world"

--> Creating resources ...
    imagestream.image.openshift.io "ruby-22-centos7" created
    imagestream.image.openshift.io "ruby-hello-world" created
    buildconfig.build.openshift.io "ruby-hello-world" created
    deploymentconfig.apps.openshift.io "ruby-hello-world" created
    service "ruby-hello-world" created
--> Success
    Build scheduled, use 'oc logs -f bc/ruby-hello-world' to track its progress.
    Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
     'oc expose svc/ruby-hello-world' 
    Run 'oc status' to view your app.
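
Following the suggestion in that output, you can expose the service and check the generated route:

oc expose svc/ruby-hello-world
oc get route ruby-hello-world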

Cleaning it Up

When it comes time to clean up, we’ll essentially run the process in reverse.

If you do not disconnect from your VPN session before destroying the vpc module, the destroy may get stuck, requiring you to manually fix the state lock.
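
If that happens, Terraform prints the lock ID in the error message; you can release the lock from the affected module directory with force-unlock:

terraform force-unlock <LOCK_ID>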

First we’ll clean up the OpenShift cluster:

# CWD is <repo path>/terraform/openshift
terraform destroy -var-file=my-vars.tfvars

Next, we’ll disconnect from the VPN:

Thu Oct 10 15:11:50 2019 /sbin/ip route add 0.0.0.0/1 via 10.0.44.161
Thu Oct 10 15:11:50 2019 /sbin/ip route add 128.0.0.0/1 via 10.0.44.161
Thu Oct 10 15:11:50 2019 WARNING: this configuration may cache passwords in memory -- use the auth-nocache option to prevent this
Thu Oct 10 15:11:50 2019 Initialization Sequence Completed
# Hit Ctrl+C to stop the connection
^CThu Oct 10 22:30:42 2019 event_wait : Interrupted system call (code=4)
Thu Oct 10 22:30:42 2019 /sbin/ip route del 18.211.133.7/32
Thu Oct 10 22:30:42 2019 /sbin/ip route del 0.0.0.0/1
Thu Oct 10 22:30:42 2019 /sbin/ip route del 128.0.0.0/1
Thu Oct 10 22:30:42 2019 Closing TUN/TAP interface
Thu Oct 10 22:30:42 2019 /sbin/ip addr del dev tun0 10.0.44.162/27
Thu Oct 10 22:30:42 2019 SIGINT[hard,] received, process exiting

Then we’ll destroy the vpc module:

# CWD is <repo path>/terraform/vpc
terraform destroy -var-file=my-vars.tfvars

And finally, we’ll clean up our state backends, locks, and certificates:

# Backends
aws s3 rb --force s3://<insert s3 bucket name here>
aws dynamodb delete-table --table-name terraform-vpc-lock
aws dynamodb delete-table --table-name terraform-openshift-lock
# Certificates
SERVER_ARN=$(aws resourcegroupstaggingapi get-resources --tag-filters Key=Name,Values=vpn-server-cert --resource-type-filters acm:certificate | jq -r '.ResourceTagMappingList[0].ResourceARN')
CLIENT_ARN=$(aws resourcegroupstaggingapi get-resources --tag-filters Key=Name,Values=vpn-client-cert --resource-type-filters acm:certificate | jq -r '.ResourceTagMappingList[0].ResourceARN')

aws acm delete-certificate --certificate-arn $SERVER_ARN
aws acm delete-certificate --certificate-arn $CLIENT_ARN

And that’s that!