Persisting state between AWS EC2 spot instances

Spot instances on AWS are up to 90% cheaper than on-demand instances.

For example, an on-demand m4.large instance (2 vCPUs, 8 GB of RAM) costs $0.12/h, or about $90/month, while the spot price is typically below $0.02/h, or about $15/month.
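The monthly figures follow directly from the hourly prices. Here is a quick sketch of the arithmetic, assuming a 31-day month (744 hours). The function names are made up for this illustration.

```shell
# Monthly cost in dollars for a given hourly price.
# Assumes a 31-day month (744 hours); adjust HOURS for other months.
HOURS=744

monthly_cost() {
  awk -v price="$1" -v hours="$HOURS" 'BEGIN { printf "%.2f\n", price * hours }'
}

# Savings of spot over on-demand, as a whole percentage.
# $1: on-demand hourly price, $2: spot hourly price
savings_percent() {
  awk -v ondemand="$1" -v spot="$2" 'BEGIN { printf "%.0f\n", (1 - spot / ondemand) * 100 }'
}
```

With the prices above, monthly_cost 0.12 prints 89.28, monthly_cost 0.02 prints 14.88, and savings_percent 0.12 0.02 prints 83.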

The catch is that they can be terminated at any time.

If you can tolerate your web app or service being offline for a while, the cost savings can be very tempting.

In my case, there was an internal service that I wanted to run only during work hours.

If your app has state, you will probably want to preserve it when a spot instance is killed and then recreated. On AWS, you can store data on EBS volumes which are like external disks that you can attach to any instance. So you could simply reuse that volume when you launch a new spot instance, right? Unfortunately, it's not that simple.

Solutions

Not possible to choose root disk

When you launch a new instance (be it on-demand or spot), you can't select an existing volume as the root disk. Instead you have to choose an AMI (Amazon Machine Image). An AMI can be created from a snapshot, which can in turn be created from a volume.

Converting volumes to AMIs is a slow process and not something we want to do.

Stop, replace root disk, start

If you use on-demand instances, you can stop an instance, detach the root volume that was created with it, attach your existing volume in its place and start the instance back up. But this doesn't work for spot instances. While you can reboot them, stopping them is the same as terminating them.

So we'd like to get that reusable volume to be the root of our system without having to stop our spot instances.

Mount, chroot

It is possible to attach an existing volume to a running instance and then mount it.

One problem with this is that services may fail to start when they expect to live under / rather than /mnt/disk. There's chroot /mnt/disk, but it won't always work.

ec2-spotter is a utility that does this. It looks like it uses user data to run a script at boot time to replace the root disk and chroot into it.

Termination behavior: stop

Since September 18, 2017, if you use persistent spot requests, you can change the termination behavior from Terminate to Stop. An interrupted instance is then stopped and started again once capacity is available. All volumes and even the instance ID are preserved. Read more about it in the announcement.

But this won't work with one-off spot instances like in my case.
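For reference, a persistent request with the Stop behavior could be made from the CLI roughly like this. It is only a sketch: the function name and the launch specification file are hypothetical, and the specification itself (AMI, instance type, key pair and so on) is not shown.

```shell
# Request one persistent spot instance that is stopped, not terminated,
# when it is interrupted.
# $1: region
# $2: path to a launch specification JSON file (hypothetical, not shown)
request_persistent_spot() {
  aws ec2 request-spot-instances \
    --region "$1" \
    --instance-count 1 \
    --type persistent \
    --instance-interruption-behavior stop \
    --launch-specification "file://$2"
}
```

It would be called as request_persistent_spot us-east-1 spec.json, where both arguments are examples.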

Docker containers

I propose running everything in Docker containers and storing the Docker data on an external volume. When a new spot instance is created, all we have to do is attach the volume and mount it at /var/lib/docker.

That's what I'm going to talk about in this blog post.

Docker

Make sure to start the spot instance in the same availability zone as your volume. If you need to launch a spot instance in another availability zone, you'll have to create a snapshot of your volume and then create a new volume from that snapshot in the other availability zone.
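The snapshot route for moving a volume to another availability zone can be sketched as a small shell function. The function name is made up, and gp2 is assumed as the volume type because it is the type used elsewhere in this post.

```shell
# Copy an EBS volume into another availability zone of the same region
# by way of a snapshot. Prints the ID of the new volume.
# $1: source volume ID, $2: region, $3: target availability zone
copy_volume_to_az() (
  volume_id=$1 region=$2 target_az=$3

  snapshot_id=$(aws ec2 create-snapshot --region "$region" \
    --volume-id "$volume_id" --query SnapshotId --output text) || exit 1

  # Block until the snapshot is complete; this can take several minutes.
  aws ec2 wait snapshot-completed --region "$region" \
    --snapshot-ids "$snapshot_id" || exit 1

  aws ec2 create-volume --region "$region" \
    --availability-zone "$target_az" \
    --snapshot-id "$snapshot_id" --volume-type gp2 \
    --query VolumeId --output text
)
```

For example, NEW_VOLUME_ID=$(copy_volume_to_az "$VOLUME_ID" us-east-1 us-east-1b), where the region and zone are placeholders. The snapshot is left in place afterwards, so delete it if you don't need it.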

We are going to use Ubuntu 16.04 for this. We need AWS CLI tools and jq.

sudo apt install -y awscli jq

First, create a volume that you want to reuse.

You can do it in the AWS Console or with this command:

VOLUME_ID=$(
  aws ec2 create-volume --volume-type gp2 --size 20 \
   --region $(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region) \
   --availability-zone $(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone) | jq -r .VolumeId
)
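The commands in this post read the instance metadata with plain GET requests (IMDSv1). On instances that are configured to require IMDSv2, those requests are rejected unless they carry a session token. A minimal sketch of a helper that fetches a token first; the name imds is made up, and the 60-second token lifetime is an arbitrary choice.

```shell
# Read a path below http://169.254.169.254/latest/ using an IMDSv2
# session token.
# $1: path, e.g. meta-data/instance-id
imds() (
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/$1"
)
```

The lookups used in this post then become imds meta-data/instance-id, imds meta-data/placement/availability-zone and imds dynamic/instance-identity/document | jq -r .region.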

Then attach this volume to your spot instance.

aws ec2 attach-volume --device /dev/xvdf \
  --volume-id $VOLUME_ID \
  --instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id) \
  --region $(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
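The attach-volume call returns before the device is actually usable, so a script that continues straight to the next step may fail. Here is a small sketch that polls until the device node shows up. The function name is made up, and it only checks that the path exists, not that it is a block device.

```shell
# Wait until a path (for example a device node) exists.
# $1: path, $2: timeout in seconds (default 60)
# Returns 0 once the path exists, 1 if the timeout expires first.
wait_for_path() (
  path=$1 timeout=${2:-60} waited=0
  while [ ! -e "$path" ]; do
    [ "$waited" -ge "$timeout" ] && exit 1
    sleep 1
    waited=$((waited + 1))
  done
  exit 0
)
```

For example, wait_for_path /dev/xvdf 60 before creating or mounting the file system. On some newer instance types the volume shows up under an NVMe name such as /dev/nvme1n1 instead, so check lsblk if /dev/xvdf never appears.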

Create a file system.

sudo mkfs.ext4 /dev/xvdf
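Keep in mind that mkfs.ext4 wipes whatever is on the device, so it should only run the first time the volume is used. If you script these steps, a guard like the following may help. It is a sketch with made-up function names, and it assumes that blkid prints nothing for a volume that has no file system yet.

```shell
# Print the file system type found on a device, or nothing if blkid
# does not recognise one.
fs_type() {
  sudo blkid -o value -s TYPE "$1" 2>/dev/null
}

# Create an ext4 file system only when the device has none yet.
format_if_empty() {
  existing=$(fs_type "$1")
  if [ -z "$existing" ]; then
    sudo mkfs.ext4 "$1"
  else
    echo "keeping existing $existing filesystem on $1"
  fi
}
```

Calling format_if_empty /dev/xvdf on the first instance formats the volume, while on later instances it leaves the data alone.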

Mount it at /var/lib/docker.

sudo mkdir -p /var/lib/docker
sudo mount /dev/xvdf /var/lib/docker

Install docker.

curl -fsSL https://get.docker.com | sudo bash

That's it.

To demonstrate that it works, let's run a service. We are going to use GitLab as an example.

sudo docker run --detach \
  --hostname `hostname` \
  --publish 80:80 --publish 443:443 --publish 2222:22 \
  --name gitlab \
  --restart always \
  --volume gitlab-etc:/etc/gitlab \
  --volume gitlab-log:/var/log/gitlab \
  --volume gitlab-opt:/var/opt/gitlab \
  gitlab/gitlab-ce:latest

We can see that it's running:

$ sudo docker ps
CONTAINER ID   IMAGE                     STATUS                             PORTS                     NAMES
787df39b279b   gitlab/gitlab-ce:latest   Up 27 seconds (health: starting)   0.0.0.0:80->80/tcp, ...   gitlab

It takes a minute to start. Then open the IP address in your browser, set a password, log in, create a test project in GitLab. You can see the logs with sudo docker logs gitlab -f.
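Instead of watching the logs, you could poll the health status that docker ps shows in the STATUS column. A sketch, assuming the image defines a health check (the "health: starting" output above suggests it does); the function name is made up.

```shell
# Poll a container's health status until it reports "healthy".
# $1: container name
# $2: maximum number of attempts (default 60)
# $3: seconds to sleep between attempts (default 5)
# Returns 0 when healthy, 1 if all attempts are used up.
wait_until_healthy() (
  name=$1 attempts=${2:-60} delay=${3:-5} i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(sudo docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    [ "$status" = "healthy" ] && exit 0
    i=$((i + 1))
    sleep "$delay"
  done
  exit 1
)
```

For example, wait_until_healthy gitlab && echo "GitLab is up".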

Finally, terminate the spot instance with

sudo halt

Alright, it's time to see if it worked.

Launch a new spot instance and run this, replacing the volume ID with your own:

sudo apt install -y awscli jq
export VOLUME_ID=vol-0d77dd43c8a00576b

Attach the volume exactly as we did before.

aws ec2 attach-volume --device /dev/xvdf \
  --volume-id $VOLUME_ID \
  --instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id) \
  --region $(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)

Mount it.

sudo mkdir -p /var/lib/docker
sudo mount /dev/xvdf /var/lib/docker

Install Docker.

curl -fsSL https://get.docker.com | sudo bash

GitLab should be running:

$ sudo docker ps
CONTAINER ID   IMAGE                     STATUS                             PORTS                     NAMES
787df39b279b   gitlab/gitlab-ce:latest   Up 13 seconds (health: starting)   0.0.0.0:80->80/tcp, ...   gitlab

It takes a minute to start, but you should be able to access GitLab again and see the test project that you created.

CoreOS

If all you're going to do is run Docker containers, you might as well use a Linux distribution that only has Docker installed on it. One such distribution is CoreOS.

You can find the list of CoreOS AMIs here: https://coreos.com/os/docs/latest/booting-on-ec2.html

Note that the SSH user is core.

Attach the volume with

docker run --rm \
  -e AWS_ACCESS_KEY_ID=xxx \
  -e AWS_SECRET_ACCESS_KEY=xxx \
  xueshanf/awscli \
  aws ec2 attach-volume --device /dev/xvdf \
  --volume-id $VOLUME_ID \
  --instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id) \
  --region $(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)

Because Docker is already installed and started, you need to stop it before mounting the volume and then start it again.

sudo systemctl stop docker
sudo rm -rf /var/lib/docker
sudo mkdir -p /var/lib/docker
sudo mount /dev/xvdf /var/lib/docker
sudo systemctl start docker

Tips & tricks

Storage drivers

If you followed the instructions on Ubuntu first and then tried to attach the volume on CoreOS, you may find that you don't see the containers you created.

It's probably because CoreOS uses the overlay2 storage driver while Ubuntu uses a different one, such as aufs. See https://docs.docker.com/engine/userguide/storagedriver/selectadriver/#supported-storage-drivers-per-linux-distribution

You can configure the overlay2 driver on Ubuntu before installing Docker, but in my experiments it did not always work.

sudo mkdir -p /etc/docker
echo '{ "storage-driver": "overlay2" }' | sudo tee /etc/docker/daemon.json
curl -fsSL https://get.docker.com | sudo bash
sudo usermod -aG docker $USER
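To find out which storage driver wrote the data on the volume, you can look at the directories in the Docker data directory. This is a sketch that assumes Docker keeps each driver's data in a directory named after the driver; the function name and the list of driver names to check are my own choices.

```shell
# List the storage drivers that appear to have data in a Docker data
# directory, one per line.
# $1: data directory (default /var/lib/docker)
drivers_on_volume() (
  root=${1:-/var/lib/docker}
  for driver in aufs overlay overlay2 devicemapper btrfs zfs; do
    if [ -d "$root/$driver" ]; then
      echo "$driver"
    fi
  done
)
```

Compare its output with the driver the running daemon uses, which sudo docker info --format '{{.Driver}}' prints. If the two differ, the daemon won't see the containers on the volume.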

AWS CLI in Docker

Running docker run --rm xueshanf/awscli aws ec2 ... with your AWS access keys can be dangerous. What if the maintainer of this Docker image is stealing your access keys?

You can easily create your own image with this Dockerfile

FROM python:3.6-alpine
RUN pip install awscli
ENTRYPOINT ["aws"]

Then build it

docker build -t aws .

And then run it like this

docker run --rm aws --version
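To use this image for real commands, the access keys still have to get into the container. One way is a small wrapper function. The name aws_in_docker is made up, and it assumes the keys are already exported in your shell.

```shell
# Run the AWS CLI from the locally built "aws" image.
# "-e NAME" without a value copies the variable from the calling shell,
# so the keys do not appear on the command line.
aws_in_docker() {
  docker run --rm \
    -e AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY \
    -e AWS_DEFAULT_REGION \
    aws "$@"
}
```

For example, aws_in_docker ec2 describe-volumes --volume-ids "$VOLUME_ID".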