Posted 2015-12-26 | AWS | 17 minutes read (About 2542 words)

Switching Elastic IPs with Pacemaker + Corosync in an AWS Multi-AZ Setup

Overview

This post summarizes how I implemented failover on AWS
by moving an Elastic IP (EIP) between instances in different Availability Zones (Multi-AZ),
using Pacemaker & Corosync.

The idea is illustrated below.

Normal state

[Figure: normal state]

When a failure occurs on Instance A in Availability Zone A,
the EIP is reassigned to Instance B in Availability Zone B.

[Figure: state after a failure has occurred]

ToDo

Configure VPC and Subnets
Install / configure Pacemaker & Corosync
Build the cluster
Create the EIP reassignment script
Run the failover test

Environment

CentOS 7 (x86_64) with Updates HVM (t2.micro)

Since this is for verification, I am using t2.micro.

Building the VPC and Subnets

The following article does an excellent job of summarizing this, so please use it as a reference;
I will reuse these settings as-is from here on.

0から始めるAWS入門①：VPC編 (Getting Started with AWS from Zero, Part 1: VPC)

Just in case, here are the VPC and Subnet settings.

VPC settings

Item	Value
Name tag	Any
CIDR	10.0.0.0/16
tenancy	Default

Subnet settings

Item	Subnet 1	Subnet 2
Name tag	Any (easier to manage if associated with the VPC’s tag name)	Any (easier to manage if associated with the VPC’s tag name)
VPC	Select the VPC created above	Select the VPC created above
Availability Zone	ap-northeast-1a	ap-northeast-1c
CIDR	10.0.0.0/24	10.0.1.0/24
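If you prefer the command line to the console, the VPC and Subnets above could also be created with aws-cli. The following is only a minimal sketch: it assumes aws-cli is already installed and configured with credentials that are allowed to create VPC resources (more than the policy created later in this post grants), and the function name is made up for illustration.

```shell
# Sketch: create the VPC and the two Subnets from the tables above with aws-cli.
# Assumption: credentials allowed to create VPC resources are already configured.
create_vpc_and_subnets() {
    vpc_id=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
        --query 'Vpc.VpcId' --output text) || return 1

    # Subnet 1 in ap-northeast-1a
    aws ec2 create-subnet --vpc-id "$vpc_id" \
        --cidr-block 10.0.0.0/24 --availability-zone ap-northeast-1a \
        --query 'Subnet.SubnetId' --output text || return 1

    # Subnet 2 in ap-northeast-1c
    aws ec2 create-subnet --vpc-id "$vpc_id" \
        --cidr-block 10.0.1.0/24 --availability-zone ap-northeast-1c \
        --query 'Subnet.SubnetId' --output text || return 1
}
```

Calling create_vpc_and_subnets prints the two new Subnet IDs. The internet gateway and route table described in the referenced article still have to be set up separately.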

Based on the VPC and Subnet settings above, we will build the setup described in the rest of this post.

Creating the Security Group

Before launching the instances, create the security group that will be attached to both of them.

Allow SSH login from My IP

Item	Value
Security group name	VPC-for-EIP (any)
Description	VPC-for-EIP (any)
VPC	Select the VPC created above

Editing the created security group

Search with the filter

* Adjust the following according to your environment.

Set the source to the security group ID you created, then add and save the following

Type	Protocol	Port Range	Source	Purpose
All TCP	TCP	0 - 65535	The security group ID you created	Fully open since this is for verification. Adjust the settings as appropriate.
All ICMP	ICMP	0 - 65535	The security group ID you created	For checking ping connectivity. Fully open since this is for verification. Adjust the settings as appropriate.
All UDP	UDP	0 - 65535	The security group ID you created	The ports required by corosync are 5404 - 5405 by default. Be careful if you change the settings depending on your environment. Fully open since this is for verification. Adjust the settings as appropriate.
SSH	TCP	22	My IP	For SSH login from your own PC. There is no need to set this in a real environment.
HTTP	TCP	80	My IP	For failover verification. There is no need to set this in a real environment.

That completes creating the security group to apply to the instances.
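The table above opens everything between the two instances because this is a verification setup. If you want to tighten it, a starting point could look like the sketch below. It only opens the corosync ports mentioned above plus TCP 2224, which I am assuming is the port pcsd listens on by default; the function name and the group ID in the usage line are hypothetical, and I have not verified this reduced rule set against the exact setup in this post.

```shell
# Sketch: allow only cluster traffic between members of the same security group.
# Usage (hypothetical ID): allow_cluster_traffic sg-0123456789abcdef0
allow_cluster_traffic() {
    sg_id=$1

    # corosync (default UDP 5404-5405, as noted in the table above)
    aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
        --protocol udp --port 5404-5405 --source-group "$sg_id" || return 1

    # pcsd (assumed default: TCP 2224)
    aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
        --protocol tcp --port 2224 --source-group "$sg_id" || return 1
}
```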

Creating the Policy

This time, we need to run the following commands.

Command	Purpose
aws ec2 associate-address	Associate an Elastic IP with an instance
aws ec2 disassociate-address	Disassociate an Elastic IP from an instance
aws ec2 describe-addresses	Get details about IP addresses
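To get a feel for the read-only command of the three, here is a small sketch that asks which instance an Elastic IP currently points to. The function name is made up for illustration, and it assumes aws-cli is installed and configured as described later in this post.

```shell
# Sketch: print which instance and private IP an Elastic IP is currently mapped to.
# Usage: show_eip_owner 52.192.203.215
show_eip_owner() {
    aws ec2 describe-addresses --public-ips "$1" \
        --query 'Addresses[0].[PublicIp,InstanceId,PrivateIpAddress]' \
        --output text
}
```

Running it before and after the failover test is a quick way to confirm the reassignment without opening the console.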

Access the Identity & Access Management page

Click “Create Policy”

Create a custom policy

Enter the custom policy details

Policy name (any)

floatingElasticIP

Policy document

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*",
                "ec2:DisassociateAddress",
                "ec2:AssociateAddress"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Confirm

Creating the IAM Role

Create a role that has permission to reassign the Elastic IP.

Click “Create New Role”

Set the role name

Select the role type

Click the “Select” button for Amazon EC2

Attach the policy

Review the registered details and create the role

Confirm it was created

That completes creating the IAM role to apply to the instances.

Creating the User

Access the Identity & Access Management page

Click Users in the menu & click the Create New Users button

Enter the user name and click the Create button

Click Download Credentials

A CSV containing the Access Key Id and Secret Access Key will be downloaded.
Store it carefully.

Access the created user

Start attaching the policy

Check the policy and attach it

With that, the floatingIP user with the AmazonEC2FullAccess permission has been created.

The credentials for this user will be used in the Installing aws-cli step.
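AmazonEC2FullAccess is broader than what the three commands need. If you would rather give the user only the floatingElasticIP policy created earlier, something like the following sketch should work, assuming you run it with credentials that are allowed to manage IAM. The function name and the account ID in the usage line are placeholders.

```shell
# Sketch: attach the custom policy to the user instead of a broad managed policy.
# Usage (hypothetical account ID):
#   attach_eip_policy floatingIP arn:aws:iam::123456789012:policy/floatingElasticIP
attach_eip_policy() {
    aws iam attach-user-policy --user-name "$1" --policy-arn "$2"
}
```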

Creating the Instances

Create an instance (hereafter Instance A) in the Subnet (ap-northeast-1a) of the VPC created above.

Click “Launch Instance”

Select the machine image

This time we select CentOS 7 (x86_64) with Updates HVM.

Select the instance type

Since I want to use the free tier for verification this time, I select t2.micro.

Configure instance details

Set the primary IP of Instance A, created in ap-northeast-1a,
to 10.0.0.20.

Add storage

Proceed to the next step without changing anything in particular

Tag the instance

Set Instance A for the Name tag.

This is arbitrary, so any easy-to-understand text is fine.

Configure the security group

Select the security group created in advance

Confirm the instance creation

That completes creating Instance A.

Create `Instance B` in the same way

Main differences from `Instance A`

Select Subnet 10.0.1.0/24
Set the instance tag to Instance B

Notes when configuring `Instance B`

For Instance B, select the same security group that is attached to Instance A.

Disable the Source/Destination check

For both Instance A and B above, you need to set Source/Destination Check (Networking > Change Source/Dest. Check) to Disabled.
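The same setting can also be changed from the command line. This is a minimal sketch with a made-up function name; note that it needs the ec2:ModifyInstanceAttribute permission, which is not part of the floatingElasticIP policy shown earlier.

```shell
# Sketch: disable the Source/Destination check from the CLI.
# Usage (hypothetical instance ID): disable_src_dst_check i-0123456789abcdef0
disable_src_dst_check() {
    aws ec2 modify-instance-attribute --instance-id "$1" --no-source-dest-check
}
```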

First things to do after SSH login to the instances

Install the minimum required modules

git is required when installing the shell script used to reassign the Elastic IP.

[Instance A & B ]# yum install -y git

[Instance A & B ]# git --version

git version 1.8.3.1

Install httpd and php for failover verification

These are installed and started purely to observe the behavior during failover.
* This is not a required step.

[Instance A & B ]# yum --disableexcludes=main install -y gcc
[Instance A & B ]# yum install -y gmp gmp-devel
[Instance A & B ]# yum install -y php php-mysql httpd libxml2-devel net-snmp net-snmp-devel curl-devel gettext
[Instance A & B ]# echo '<?php print_r($_SERVER["SERVER_ADDR"]); ?>' > /var/www/html/index.php
[Instance A & B ]# systemctl start httpd
[Instance A & B ]# systemctl enable httpd

Adjust the system clock to JST

If the time inside the OS is out of sync with the actual time,
aws-cli may not work correctly,
so let’s adjust it just in case.

# Take a backup
[Instance A & B ]# cp /etc/sysconfig/clock /etc/sysconfig/clock.org

# Make the setting persist even after a reboot.
[Instance A & B ]# echo -e 'ZONE="Asia/Tokyo"\nUTC=false' > /etc/sysconfig/clock

# Take a backup
[Instance A & B ]# cp /etc/localtime /etc/localtime.org

# Set Asia/Tokyo as localtime
[Instance A & B ]# ln -sf  /usr/share/zoneinfo/Asia/Tokyo /etc/localtime

[Instance A & B ]# date
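On CentOS 7 the time zone can also be set with systemd's timedatectl, which avoids editing the files by hand. The sketch below is an alternative to the steps above, not something I ran for this post, and the function name is made up. Keep in mind that what aws-cli actually depends on is the clock itself being accurate, not the time zone, so the NTP status printed at the end is worth a look.

```shell
# Sketch: set the time zone with timedatectl (part of systemd on CentOS 7).
set_tokyo_timezone() {
    timedatectl set-timezone Asia/Tokyo || return 1
    # Print the resulting settings, including the NTP synchronization state.
    timedatectl status
}
```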

Creating the Elastic IP

Create an Elastic IP and associate it with Instance A.

Allocate a new address

Click “Associate” in the confirmation popup

Confirm success

Associate with an instance

Select the instance to associate

Confirm

With that, the Elastic IP has been associated with Instance A.

SSH login to Instance A & B

SSH login to Instance A

[Local PC]# ssh -i aws.pem centos@<Instance A's Public IP>

SSH login to Instance B

[Local PC]# ssh -i aws.pem centos@<Instance B's Public IP>

Configuring /etc/hosts

[Instance A ]# uname -n
ip-10-0-0-20.ap-northeast-1.compute.internal

[Instance B ]# uname -n
ip-10-0-1-20.ap-northeast-1.compute.internal

[Instance A & B ]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# Add the following
10.0.0.20 ip-10-0-0-20.ap-northeast-1.compute.internal
10.0.1.20 ip-10-0-1-20.ap-northeast-1.compute.internal
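The host names added above follow the pattern ip-A-B-C-D.<region>.compute.internal, so the entries can be generated instead of typed. The helper names below are made up for illustration, and the sketch assumes the instances keep their default internal host names.

```shell
# Sketch: build the default EC2 internal host name for a private IP.
# $1 = private IP, $2 = region
ec2_internal_name() {
    printf 'ip-%s.%s.compute.internal\n' "$(printf '%s' "$1" | tr . -)" "$2"
}

# Append "IP hostname" to a hosts file unless the IP is already listed.
# $1 = private IP, $2 = region, $3 = hosts file (defaults to /etc/hosts)
add_hosts_entry() {
    ip=$1
    region=$2
    hosts_file=${3:-/etc/hosts}
    awk -v ip="$ip" '$1 == ip { found = 1 } END { exit !found }' \
        "$hosts_file" 2>/dev/null && return 0
    printf '%s %s\n' "$ip" "$(ec2_internal_name "$ip" "$region")" >> "$hosts_file"
}
```

For example, running add_hosts_entry 10.0.0.20 ap-northeast-1 and add_hosts_entry 10.0.1.20 ap-northeast-1 as root on both instances adds the two lines shown above, and running them a second time changes nothing.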

Installing Pacemaker & Corosync

pcs is the Pacemaker cluster management tool that replaces the legacy crmsh, and using pcs is recommended on RHEL/CentOS 7.

[Instance A & B ]# yum -y install pcs fence-agents-all

Check the versions

[Instance A & B ]# pcs --version
0.9.143

[Instance A & B ]# pacemakerd --version
Pacemaker 1.1.13-10.el7
Written by Andrew Beekhof

[Instance A & B ]# corosync -v
Corosync Cluster Engine, version '2.3.4'
Copyright (c) 2006-2009 Red Hat, Inc.

Setting the hacluster password

When the corosync package is installed, a hacluster user is automatically added.
Set the password for that hacluster user.

[Instance A & B ]# passwd hacluster
Changing password for user hacluster.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

Starting pcsd

Start pcsd (the pcs daemon) so that the cluster can be configured and monitored with pcs.

[Instance A & B ]# systemctl start pcsd
[Instance A & B ]# systemctl enable pcsd
[Instance A & B ]# systemctl status pcsd

Cluster authentication

Authenticate to each host that will form the cluster.

Run this from either one of the instances.
The following is run from Instance A.

[Instance A ]# pcs cluster auth ip-10-0-0-20.ap-northeast-1.compute.internal ip-10-0-1-20.ap-northeast-1.compute.internal
Username: hacluster
Password:
ip-10-0-1-20.ap-northeast-1.compute.internal: Authorized
ip-10-0-0-20.ap-northeast-1.compute.internal: Authorized

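If every node reports Authorized as above, authentication succeeded.
If you see an error such as Unable to communicate like the one below,
review the settings on each instance (for example, whether pcsd is running and whether the security group allows traffic between the instances).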

Example of an authentication error

[Instance A ]# pcs cluster auth ip-10-0-0-20.ap-northeast-1.compute.internal ip-10-0-1-20.ap-northeast-1.compute.internal -u hacluster -p ruby2015
Error: Unable to communicate with ip-10-0-0-20.ap-northeast-1.compute.internal
Error: Unable to communicate with ip-10-0-1-20.ap-northeast-1.compute.internal
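When this error shows up, one quick thing to rule out is whether the pcsd port is reachable at all. The sketch below assumes pcsd listens on its default port, TCP 2224, and that bash and the timeout command are available; the function names are made up for illustration.

```shell
# Sketch: test whether a TCP port accepts connections (uses bash's /dev/tcp).
# $1 = host, $2 = port
check_tcp_port() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Report pcsd reachability for each node name given as an argument.
# Assumption: pcsd listens on TCP 2224.
check_pcsd() {
    for node in "$@"; do
        if check_tcp_port "$node" 2224; then
            echo "$node: pcsd reachable"
        else
            echo "$node: pcsd NOT reachable"
        fi
    done
}
```

Run check_pcsd with both node names from Instance A. If a node is not reachable, check that pcsd is started on it and that the security group allows the traffic.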

Cluster configuration

Configure the cluster.

[Instance A ]# pcs cluster setup --name aws-cluster ip-10-0-0-20.ap-northeast-1.compute.internal ip-10-0-1-20.ap-northeast-1.compute.internal --force

Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop  pacemaker.service
Redirecting to /bin/systemctl stop  corosync.service
Killing any remaining services...
Removing all cluster configuration files...
ip-10-0-0-20.ap-northeast-1.compute.internal: Succeeded
ip-10-0-1-20.ap-northeast-1.compute.internal: Succeeded
Synchronizing pcsd certificates on nodes ip-10-0-0-20.ap-northeast-1.compute.internal, ip-10-0-1-20.ap-northeast-1.compute.internal...
ip-10-0-0-20.ap-northeast-1.compute.internal: Success
ip-10-0-1-20.ap-northeast-1.compute.internal: Success

Restaring pcsd on the nodes in order to reload the certificates...
ip-10-0-0-20.ap-northeast-1.compute.internal: Success
ip-10-0-1-20.ap-northeast-1.compute.internal: Success

Starting the cluster

Start the cluster across all hosts.

[Instance A ]# pcs cluster start --all

ip-10-0-1-20.ap-northeast-1.compute.internal: Starting Cluster...
ip-10-0-0-20.ap-northeast-1.compute.internal: Starting Cluster...
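Before going further it is worth confirming that both nodes actually joined the cluster. A minimal sketch using pcs itself (the function name is made up):

```shell
# Sketch: basic health checks after starting the cluster.
check_cluster() {
    # Membership as seen by corosync
    pcs status corosync || return 1
    # Nodes, resources, and daemon status as seen by Pacemaker
    pcs status
}
```

Both nodes should be listed as members and shown as Online. It can take a few seconds after the start command for this to settle.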

Installing aws-cli

Use the Access Key Id and Secret Access Key written in the credentials.csv
that you downloaded in the Creating the User step.

[Instance A & B ]# rpm -iUvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
[Instance A & B ]# yum -y install python-pip
[Instance A & B ]# pip --version
pip 7.1.0 from /usr/lib/python2.7/site-packages (python 2.7)

[Instance A & B ]# pip install awscli
[Instance A & B ]# aws configure
AWS Access Key ID [None]: *********************
AWS Secret Access Key [None]: **************************************
Default region name [None]: ap-northeast-1
Default output format [None]: json

Creating the EIP reassignment resource

The script refers to ${OCF_ROOT}, but that variable is not defined in this environment, so replace it with /usr/lib/ocf when installing the script.

[Instance A & B ]# cd /tmp
[Instance A & B ]# git clone https://github.com/moomindani/aws-eip-resource-agent.git
[Instance A & B ]# cd aws-eip-resource-agent
[Instance A & B ]# sed -i 's/\${OCF_ROOT}/\/usr\/lib\/ocf/' eip
[Instance A & B ]# mv eip /usr/lib/ocf/resource.d/heartbeat/
[Instance A & B ]# chown root:root /usr/lib/ocf/resource.d/heartbeat/eip
[Instance A & B ]# chmod 0755 /usr/lib/ocf/resource.d/heartbeat/eip
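For orientation: Pacemaker calls an OCF resource agent with an action name (start, stop, monitor, and so on) and reads the result from the exit code. The sketch below is not the code of the eip agent installed above. It only illustrates, under my own assumptions, how the three aws-cli commands could map onto those actions. The function names and the way the node's instance ID is passed in are hypothetical.

```shell
# OCF exit codes used by this sketch
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Read one field of the Elastic IP. $1 = Elastic IP, $2 = field (e.g. InstanceId)
eip_field() {
    aws ec2 describe-addresses --public-ips "$1" \
        --query "Addresses[0].$2" --output text
}

# Run one action. $1 = action, $2 = Elastic IP, $3 = this node's instance ID
eip_action() {
    case "$1" in
        monitor)
            # "Running" here means: the EIP points at this node.
            [ "$(eip_field "$2" InstanceId)" = "$3" ] && return $OCF_SUCCESS
            return $OCF_NOT_RUNNING ;;
        start)
            aws ec2 associate-address --instance-id "$3" \
                --allocation-id "$(eip_field "$2" AllocationId)" \
                --allow-reassociation || return $OCF_ERR_GENERIC
            return $OCF_SUCCESS ;;
        stop)
            # Stopping a resource that is not running here must also succeed.
            [ "$(eip_field "$2" InstanceId)" = "$3" ] || return $OCF_SUCCESS
            aws ec2 disassociate-address \
                --association-id "$(eip_field "$2" AssociationId)" \
                || return $OCF_ERR_GENERIC
            return $OCF_SUCCESS ;;
        *)
            return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}
```

A real agent also has to provide meta-data and validation actions, which is why this post uses an existing agent rather than writing one from scratch.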

Configuring pacemaker

Disable STONITH

[Instance A ]# pcs property set stonith-enabled=false

Configure quorum so that it does not take any special action even if `split-brain` occurs

[Instance A ]# pcs property set no-quorum-policy=ignore

What is split-brain?
When the network used for heartbeat communication fails (for example, it is disconnected), each host mistakenly assumes that the other host has failed,
and the standby host, which should not become active, becomes active as well.
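Why does a two-node cluster need no-quorum-policy=ignore at all? Under plain majority voting, quorum needs more than half of the votes, and the default policy when quorum is lost is to stop resources. The small sketch below just works through that arithmetic; the function names are made up, and it deliberately ignores corosync's dedicated two-node option, which is not covered in this post.

```shell
# Votes needed for quorum in an N-node cluster: a strict majority, floor(N/2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Does the cluster still have quorum? $1 = total nodes, $2 = nodes still alive
has_quorum() {
    [ "$2" -ge "$(quorum_votes "$1")" ]
}
```

With two nodes, quorum_votes 2 prints 2, so has_quorum 2 1 fails: the single surviving node is below the majority and would not be allowed to take over the Elastic IP unless the loss of quorum is ignored. With three nodes, has_quorum 3 2 succeeds.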

Set the wait time on attribute value updates ( `crmd-transition-delay` ) to 0s (seconds)

[Instance A ]# pcs property set crmd-transition-delay="0s"

Pacemaker-1.0.11 がリリースされました (Pacemaker-1.0.11 has been released)

No automatic failback; set the number of attempts to restart the resource on the same server to 1

[Instance A ]# pcs resource defaults resource-stickiness="INFINITY" migration-threshold="1"

EIP switching configuration

The Elastic IP we created and associated with Instance A this time is 52.192.203.215.
Reflect it in the following configuration.

[Instance A ]# pcs resource create eip ocf:heartbeat:eip \
    elastic_ip="52.192.203.215" \
    op start   timeout="60s" interval="0s"  on-fail="stop" \
    op monitor timeout="60s" interval="10s" on-fail="restart" \
    op stop    timeout="60s" interval="0s"  on-fail="block"

Checking the cluster configuration

[Instance A ]# pcs config

Cluster Name: aws-cluster
Corosync Nodes:
 ip-10-0-0-20.ap-northeast-1.compute.internal ip-10-0-1-20.ap-northeast-1.compute.internal
Pacemaker Nodes:
 ip-10-0-0-20.ap-northeast-1.compute.internal ip-10-0-1-20.ap-northeast-1.compute.internal

Resources:
 Resource: eip (class=ocf provider=heartbeat type=eip)
  Attributes: elastic_ip=52.192.203.215
  Operations: start interval=0s timeout=60s on-fail=stop (eip-start-interval-0s)
              monitor interval=10s timeout=60s on-fail=restart (eip-monitor-interval-10s)
              stop interval=0s timeout=60s on-fail=block (eip-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 resource-stickiness: INFINITY
 migration-threshold: 1
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: aws-cluster
 crmd-transition-delay: 0s
 dc-version: 1.1.13-10.el7-44eb2dd
 have-watchdog: false
 no-quorum-policy: ignore
 stonith-enabled: false

Verifying the failover

In the Install httpd and php for failover verification step,
we placed an index.php file in the DocumentRoot (/var/www/html/)
that displays the Private IP ($_SERVER["SERVER_ADDR"]).

From the browser, you can tell, based on the Private IP, whether you are
accessing Instance A or Instance B.

Access the Elastic IP from the browser

When you access the Elastic IP 52.192.203.215,
you can see that the Private IP 10.0.0.20 is displayed.

You can tell that the Elastic IP is currently associated with Instance A.

Stop corosync on Instance A

[Instance A]# systemctl stop corosync

Access the Elastic IP from the browser again

When you reload the page in the browser a few times,
you can see that the Private IP 10.0.1.20 is displayed.

This shows that the Elastic IP has been disassociated from Instance A and is now associated with Instance B.

You can also confirm this on the console page.
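Instead of reloading the browser by hand, the switch can also be watched from a terminal on your local PC. The sketch below polls the test page once per second and prints a timestamp next to the private IP it returns, so you can see roughly how long the failover takes. The function name and the POLL_INTERVAL variable are made up for illustration.

```shell
# Sketch: poll a URL and print a timestamp with the response body,
# so the moment the private IP changes is visible.
# Usage: watch_failover http://52.192.203.215/ 60
watch_failover() {
    url=$1
    count=${2:-30}
    i=0
    while [ "$i" -lt "$count" ]; do
        body=$(curl -s --max-time 2 "$url") || body="(no response)"
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$body"
        i=$((i + 1))
        sleep "${POLL_INTERVAL:-1}"
    done
}
```

Start it before stopping corosync on Instance A. The output should change from 10.0.0.20 to 10.0.1.20, possibly with a few "(no response)" lines in between.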

With that, although this is a simple example,
we have implemented the Floating IP pattern from the Cloud Design Patterns using an Elastic IP.

That’s all.


kenzo0107
