mysql MHA 记录
chenzuoqing Lv3

MySQL MHA配置记录

MHA介绍

MHA是一套 MySQL 高可用管理软件 ,除了检测 Master 宕机后,提升候选 Slave 为 New Master 之外,还会自动让其他 Slave 与 New Master 建立复制关系。切换通过浮动 ip 的方式,浮动 ip 将绑定在任何时期的主节点上(若切换则 ip 也漂移),通过浮动 ip 提供服务。

MHA Manager可以 单独部署在一台独立的机器上,并管理多个 master-slave 集群,只需要指定不通的配置文件。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-- 
+------------------------------------+ (app1.conf) +----------+
| VIP | heartbeat | MHA |
| +----------+ rpl +--------+ <-------------+ Manager |
| | Slave(M) +----------> Master | | rpl manager +----------+
| +----------+ +-^----^-+ | |
| rpl | | | |
| +------------------+ |rpl | |
| | | | (app2.conf) |
| +---------+ +-------+-+ | +----V----+
| | Slave 1 | ...... | Slave N | | | Other |
| +---------+ +---------+ | | MySQL |
| | | rpl |
| MySQL Replication | +---------+
+------------------------------------+

集群节点描述

  1. MHA Manager: MHA的管理节点,负责检测MySQL的状态,以及管理MySQL的复制关系
  2. Master: MySQL对外提供的服务( 通过 VIP )的主节点
  3. **Slave(M)**: MySQL候选主节点,本质上为一个Slave节点,只是在Master宕机后,会被提升为 New Master
  4. Slave N: MySQL从机节点,对外可以提供只读服务

故障检测过程

  当Master宕机后,MHA Manager会让 Slave(M) 提升为 New Master,然后修改 Slave(N) 的复制关系(CHANGE MASTER),指向 New Master

  1. MHA Manager检测Master是否宕机,不会简单的只检查自身与Master之间的直连状态
  2. MHA Manager会 透过其他Slave节点去检测Master,进行二次探测,最大限度的避免脑裂的发生
  3. 若检测到Master宕机了, MHA Manager会 透过Slave(M) 检测是否为宕机;如果检测后仍为宕机状态,会继续 透过Slave1…SlaveN进行探测,只有在透过所有节点检测到Master宕机了,才真的认为Master宕机了。
  • 以上操作步骤的前提是,**MHA Manager 到所有 MySQL 节点的 SSH 免密码登录要打通**。
    最后将提供对外服务的 VIP 漂移到 New Master 上,继续对外提供服务。

MHA Failover过程

  1. 从宕机崩溃的 Master 节点 保存二进制日志事件(binlog events
    • 此种情况为 MySQL 挂了,但是服务器没挂 ,还是可以通过 SSH 连接过去
    • 如果服务器彻底宕机了,该步骤略过
  2. 识别 含有最新的更新 的 Slave 节点
  3. 应用差异的中继日志 ( relay-log)到其他的 Slave
  4. 应用从 Master 保存的二进制日志事件(binlog events
    • 如果之前 Master 彻底宕机了,就没有保存的 binlog,该步骤略过
  5. 提升一个 Slave 为新的 Master( New)
  6. 使其他的 Slave 连接新的 Master( New) 进行复制

注意

  从上面的步骤看, MHA本身无法完全保证数据不会丢

  1. MySQL 5.7 之前数据不丢的前提是 Master 服务器还可以被 MHA Manager 进行 SSH 连接 ,通过 应用保存的binlog的方式来保证。
  2. MySQL 5.7 之后通过无损复制,仅仅是减少了丢数据的可能性,假如此时的状态为切成异步的状态 ,那就和之前一样了(可以设置超时的时间很大)。
  3. Master 恢复的时候,最后一部分数据是否需要 Flashback, MHA 不负责这个操作,需要人工确认。

准备工作

会用到 mysql utilities 中的工具,请安装。

测试集群拓扑图

  注意: slave 中的 /etc/my.cnf 中需要配置 report-host=slave ip_addr ,否则命令执行失败。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
shell> mysqlrplshow --master=root:111111@"192.168.122.66:3307" --discover-slaves-login=root:111111 -r -v
WARNING: Using a password on the command line interface can be insecure.
# master on 192.168.122.66: ... connected.
# Finding slaves for master: 192.168.122.66:3307
# master on 192.168.122.70: ... connected.
# Finding slaves for master: 192.168.122.70:3307

WARNING: Cannot connect to some slaves:
- 192.168.122.70:3306: Can't connect to MySQL server on '192.168.122.70:3306' (111 Connection refused)
# master on 192.168.122.80: ... connected.
# Finding slaves for master: 192.168.122.80:3307

# Replication Topology Graph
192.168.122.66:3307 (MASTER) [由于机器数量问题,此节点会跑MHA manager和node]
|
+--- 192.168.122.70:3307 [IO: Yes, SQL: Yes] - (SLAVE) [做MHA node]
|
+--- 192.168.122.80:3307 [IO: Yes, SQL: Yes] - (SLAVE) [做MHA node]

  另外如果使用 mysqlreplicate 创建复制关系,则他会帮你在你的 Master上创建 [email protected][email protected] 用户,需要注意这个 rpl 用户需要赋予 replicationslave 权限,可以手动确认一下。

MHA软件角色分布

  上述架构,MHA 中 Manager 节点最好不要在集群中(这里空闲的测试机有限,so..,需要打通 Manager 节点到所有 mysql 节点的 ssh 密钥连接,mysql 节点相互也需要打通,方便无密码登陆。操作不演示。

安装MHA软件

示例基于1主2从,需要先搭建好,打通 ssh 认证。
MHA Node 在所有节点上都要安装

获取MHA源码

  获取MHA源码,github中tar包最新的貌似是 0.56 下载地址,rpm 包有 0.57 下载地址,需要源码版可以往百度云找,已收藏…
PS:有梯子可以打开 此处下载0.57

如上准备主从,mysqlreplicate搭建,不演示。

安装依赖包

安装 perl 依赖包

1
2
3
4
5
6
7
-- redhat系安装依赖,若装完执行makefile报错可以尝试重新登陆下
yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes \
perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker perl-ExtUtils-Embed cpan perl-CPAN

-- debian系安装依赖
apt install libdbd-mysql-perl libconfig-tiny-perl liblog-dispatch-perl libparallel-forkmanager-perl

安装MHA Manager

mysql master 节点安装 mha node 和 manager,先解压好(这里环境的 MHA Manager 放在了 mysql 节点中)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
shell> cd mha4mysql-manager-0.57
shell> perl Makefile.PL
shell> make && make install
shell> ll bin/
total 40
-rwxr-xr-x 1 git mysql 1995 May 31 2015 masterha_check_repl # 检查MySQL复制状态
-rwxr-xr-x 1 git mysql 1779 May 31 2015 masterha_check_ssh # 检查MHA的SSH状态
-rwxr-xr-x 1 git mysql 1865 May 31 2015 masterha_check_status # 检查当前MHA的状态
-rwxr-xr-x 1 git mysql 3201 May 31 2015 masterha_conf_host # 添加或删除配置的server信息
-rwxr-xr-x 1 git mysql 2517 May 31 2015 masterha_manager # 启动MHA
-rwxr-xr-x 1 git mysql 2165 May 31 2015 masterha_master_monitor # 检测master是否宕机
-rwxr-xr-x 1 git mysql 2373 May 31 2015 masterha_master_switch # 控制故障转移(手动或者自动)
-rwxr-xr-x 1 git mysql 5171 May 31 2015 masterha_secondary_check # 通过slave检测master是否宕机(二次探测)
-rwxr-xr-x 1 git mysql 1739 May 31 2015 masterha_stop # MHA停止
shell> mkdir -p /etc/masterha # MHA配置文件存放目录
shell> mkdir -p /var/log/masterha/app1 # 建议和MHA配置文件保持一致,若管理多套mysql集群可以再新建appxxx
shell> mkdir -p /usr/local/mha/ # 准备存放MHA可执行程序和lib
shell> cp -rf bin/ lib/ /usr/local/mha/

shell> cd ../mha4mysql-node-0.57/ # manager也需要安装node工具
shell> perl Makefile.PL
shell> make && make install
shell> cp bin/* /usr/local/mha/bin/
shell> cp lib/* /usr/local/mha/lib/

安装MHA node

为所有节点安装 MHA node,这里两个 slave 只安装 mha node

1
2
3
4
5
6
7
8
9
10
shell> yum install perl-DBI perl-DBD-MySQL -y   # node节点依赖较少
shell> cd mha4mysql-node-0.57/
shell> perl Makefile.PL
shell> make && make install
shell> ll bin/
total 44
-rwxr-xr-x 1 1001 mysql 16381 May 31 2015 apply_diff_relay_logs # 识别差异的中继日志事件并将其差异的事件应用于其他的slave
-rwxr-xr-x 1 1001 mysql 4807 May 31 2015 filter_mysqlbinlog # 去除不必要的ROLLBACK事件(脚本中指出该脚本已经过时,应该是被废弃了)
-rwxr-xr-x 1 1001 mysql 8261 May 31 2015 purge_relay_logs # 清除中继日志(不会阻塞SQL线程)
-rwxr-xr-x 1 1001 mysql 7525 May 31 2015 save_binary_logs # 保存和复制master的二进制日志

MHA自带配置文件

MHA-manager 编译完在 samples 目录下有配置文件模板和 failover 脚本供参考,这里用另外写的

1
2
3
4
5
6
7
8
9
10
shell> tree samples/
samples/
├── conf
│ ├── app1.cnf
│ └── masterha_default.cnf
└── scripts
├── master_ip_failover
├── master_ip_online_change
├── power_manager
└── send_report

MHA配置文件和脚本

配置文件统一放在 /etc/masterha/ 中,针对当前这个 MySQL 复制组,我们定义为 app1.conf ,如果还有 其他 MySQL 复制组 ,可以 继续定义 app2.conf 、app3.conf等等

示例配置文件细述

/etc/masterha/app1.conf 定义一个 mysql 集群如下内容

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
#
# MHA Manager 端 /etc/masterha/app1.conf
#
[server default]
# 这两个参数需要根据不同的集群进行修改
manager_workdir=/var/log/masterha/app1
manager_log=/var/log/masterha/app1/manager.log
# 按照master服务器存放binlog的实际路径进行修改,主要为了让MHA拉取binlog
master_binlog_dir=/data/mysql_data/5.7.11/
# 设置自动failover的脚本,下面会给出
master_ip_failover_script= /usr/local/mha/bin/master_ip_failover
# 设置手动切换时候的脚本 (供(masterha_master_switch使用)
master_ip_online_change_script= /usr/local/mha/bin/master_ip_online_change
log_level=debug
# 监控的用户
user=root
# 监控用户的密码
password=111111
# 监控主库的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行failover
ping_interval=3
# 检测方式是insert, MHA-0.56开始支持insert
# 会在Master中生成一个 infra 数据库
ping_type=INSERT
# 设置远端mysql在发生切换时binlog的保存位置
remote_workdir=/tmp
# 复制用的密码
repl_password=rpl
# 复制的用户
repl_user=rpl
# 告警脚本,可自行修改,这里没有使用
#report_script=/usr/local/mha/bin/send_report
# 通过从机进行二次探测的脚本, IP地址按照实际的情况进行修改
secondary_check_script=/usr/local/mha/bin/masterha_secondary_check -s 192.168.122.70 -s 192.168.122.80 --user=root --master_host=192.168.122.66 --master_port=3307
# 设置故障发生后关闭故障主机的脚本(主要作用是关闭主机防止发生脑裂,这里没有使用,类似Fence功能)
#shutdown_script="/usr/local/mha/bin/power_manager --command=stopssh2 --host=test-1 --ssh_user=root"
# 定义ssh的用户
ssh_user=root
[server1]
# 这个hostname也可以配置成IP地址,同 ip 参数一样
# 如果这里写名字,需要DNS配合,或者使用 /etc/hosts
hostname=master
ip=192.168.122.66
port=3307
# candidate_master参数的意思为:设置为候选Master,如果发生主从切换,该主机会被提升为Master,即使这个服务器上的数据不是最新的(会用relay-log补全)
candidate_master=1
[server2]
hostname=slave1
ip=192.168.122.70
port=3307
candidate_master=1
# check_repl_delay参数的意思为:默认情况下如果一个slave落后master 100M的relay logs的话, MHA将不会选择该slave作为一个新的master;
# 因为对于这个slave的恢复需要花费很长时间;
# 通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时;
# 这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master
check_repl_delay=0
[server3]
hostname=slave2
ip=192.168.122.80
port=3307
# no_master 表示该主机不会被提升为Master
no_master=1

failover脚本

failover 脚本,新建 /usr/local/mha/bin/master_ip_failover 文件,记得加执行权限

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
$command, $ssh_user, $orig_master_host, $orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip, $new_master_port
);

# 注意修改系统对应的接口名称(CentOS比较变态,若多节点可以升级成master,记得接口名称需要一样,比如都为eth0...)
my $vip = '192.168.122.99/24';
my $eth = 'eth0';
my $key = '88';
my $ssh_start_vip = "/sbin/ifconfig $eth:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig $eth:$key down";

GetOptions(
'command=s' => \$command,
'ssh_user=s' => \$ssh_user,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
);

exit &main();

sub main {

print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

if ( $command eq "stop" || $command eq "stopssh" ) {

my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
exit 0;
}
else {
&usage();
exit 1;
}
}

sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

sub stop_vip() {
return 0 unless ($ssh_user);
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

在线切换master的脚本

创建在线切换 master 的脚本 /usr/local/mha/bin/master_ip_online_change,同样加执行权限

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

#my (
# $command, $ssh_user, $orig_master_host, $orig_master_ip,
# $orig_master_port, $new_master_host, $new_master_ip, $new_master_port
#);

my (
$command, $orig_master_is_new_slave, $orig_master_host,
$orig_master_ip, $orig_master_port, $orig_master_user,
$orig_master_password, $orig_master_ssh_user, $new_master_host,
$new_master_ip, $new_master_port, $new_master_user,
$new_master_password, $new_master_ssh_user,
);

# 同样注意修改系统对应的接口名称,和对应的VIP
my $vip = '192.168.122.99/24';
my $eth = 'eth0';
my $key = '88';
my $ssh_start_vip = "/sbin/ifconfig $eth:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig $eth:$key down";
my $ssh_user = "root";

GetOptions(
'command=s' => \$command,
#'ssh_user=s' => \$ssh_user,
#'orig_master_host=s' => \$orig_master_host,
#'orig_master_ip=s' => \$orig_master_ip,
#'orig_master_port=i' => \$orig_master_port,
#'new_master_host=s' => \$new_master_host,
#'new_master_ip=s' => \$new_master_ip,
#'new_master_port=i' => \$new_master_port,
'orig_master_is_new_slave' => \$orig_master_is_new_slave,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'orig_master_user=s' => \$orig_master_user,
'orig_master_password=s' => \$orig_master_password,
'orig_master_ssh_user=s' => \$orig_master_ssh_user,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
'new_master_user=s' => \$new_master_user,
'new_master_password=s' => \$new_master_password,
'new_master_ssh_user=s' => \$new_master_ssh_user,
);

exit &main();

sub main {

print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

if ( $command eq "stop" || $command eq "stopssh" ) {
my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
exit 0;
}
else {
&usage();
exit 1;
}
}

sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

sub stop_vip() {
return 0 unless ($ssh_user);
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --ssh-user=user --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

Slave上的relay_log配置

  从 MHA Failover 的过程中可以了解到, MHA Manager 在恢复( 补齐)其他 Slave 数据时会用到 relay-log ,因此这些 relay-log 需要被保留。

  而默认情况下, SQL 线程在回放完毕后, MySQL 会 主动删除 relay-log ,需要 禁用 该功能,确保 relay-log 不被自动删除。

所以在所有 Slave 节点 中配置如下参数,然后重启mysql服务。

1
2
3
4
# 所有的Slave节点
[mysqld]
# 关闭relay-log主动删除的功能
relay_log_purge = 0

但是这样做了以后又带来另外一个问题,relay-log 会大量堆积导致磁盘空间紧张,所以需要定时清空过时的relay-logMHA Node 的安装包中有一个 pure_relay_logs 工具,提供删除大量 relay-log 的功能。

pure_relay_logs工具

该工具原理:在Linux下删除体积较大的文件需要一定的时间,且消耗一定的资源,所以 pure_relay_logs 在删除relay-log 之前会做一次 硬连接 ,然后删除对应的 relay-log (只删除指向 relay-log 的 指针 ),这样不会造成系统资源消耗,从而影响复制( 造成复制延时),最后再 删除硬连接的relay-log(_hardlink_) 。

参数

1
2
3
4
5
6
7
8
9
--user : 用户名
--password : 密码
--port : 端口号
--workdir : 指定 创建relay-log的硬链接 的位置
默认是 /var/tmp ,由于 硬连接不能跨分区 ,所以请 确保这个目录和你的relay-log在同一个分区 ;
当脚本执行成功后,硬链接会被删除
--disable_relay_log_purge : 默认情况下,如果 relay_log_purge=1 ,脚本会什么都不清理,自动退出
通过设定这个参数,当 relay_log_purge=1 的情况下,该参数会将relay_log_purge设置为0。清理relay log之后,最后再设置 relay_log_purge=0 。
但是还是要在 /etc/my.cnf 中显示配置 relay_log_purge=0 ,避免重启服务后被还原。

可以增加如下定时清理脚本到定时任务中,大概几个小时执行一次

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#!/bin/bash

user=root
passwd=123
port=3306
# 目录可以创建
log_dir='/mysql/purge_relay_logs'
work_dir='/mysql/relay_log_hardlink'
purge='/usr/local/mha/bin/purge_relay_logs'

if [[ ! -d ${work_dir} ]];then
mkdir ${work_dir} -p
fi

if [[ ! -d ${log_dir} ]];then
mkdir ${log_dir} -p
fi

${purge} --user=${user} --password=${passwd} --port=${port} --workdir=${work_dir} \
--disable_relay_log_purge >> ${log_dir}/purge_relay_logs.log 2>&1

Manager检测配置中主机状态

到次配置几乎完成,但是还未启动,可以在 MAH manager 机器运行 ssh 检测和复制关系检测

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# ssh检测
shell> masterha_check_ssh --conf=/etc/masterha/app1.conf # 下面是检测正常
Wed Mar 14 09:55:45 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Mar 14 09:55:45 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Wed Mar 14 09:55:45 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
Wed Mar 14 09:55:45 2018 - [info] Starting SSH connection tests..
Wed Mar 14 09:55:46 2018 - [debug]
Wed Mar 14 09:55:45 2018 - [debug] Connecting via SSH from root@master(192.168.122.66:22) to root@slave1(192.168.122.70:22)..
Wed Mar 14 09:55:45 2018 - [debug] ok.
Wed Mar 14 09:55:45 2018 - [debug] Connecting via SSH from root@master(192.168.122.66:22) to root@slave2(192.168.122.80:22)..
Wed Mar 14 09:55:46 2018 - [debug] ok.
Wed Mar 14 09:55:47 2018 - [debug]
Wed Mar 14 09:55:46 2018 - [debug] Connecting via SSH from root@slave2(192.168.122.80:22) to root@master(192.168.122.66:22)..
Wed Mar 14 09:55:47 2018 - [debug] ok.
Wed Mar 14 09:55:47 2018 - [debug] Connecting via SSH from root@slave2(192.168.122.80:22) to root@slave1(192.168.122.70:22)..
Wed Mar 14 09:55:47 2018 - [debug] ok.
Wed Mar 14 09:55:47 2018 - [debug]
Wed Mar 14 09:55:45 2018 - [debug] Connecting via SSH from root@slave1(192.168.122.70:22) to root@master(192.168.122.66:22)..
Wed Mar 14 09:55:46 2018 - [debug] ok.
Wed Mar 14 09:55:46 2018 - [debug] Connecting via SSH from root@slave1(192.168.122.70:22) to root@slave2(192.168.122.80:22)..
Wed Mar 14 09:55:47 2018 - [debug] ok.
Wed Mar 14 09:55:47 2018 - [info] All SSH connection tests passed successfully.

# 复制关系检测
shell> masterha_check_repl --conf=/etc/masterha/app1.conf
Wed Mar 14 09:57:00 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Mar 14 09:57:00 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Wed Mar 14 09:57:00 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
Wed Mar 14 09:57:00 2018 - [info] MHA::MasterMonitor version 0.57.
Wed Mar 14 09:57:00 2018 - [debug] Connecting to servers..
Wed Mar 14 09:57:01 2018 - [debug] Connected to: master(192.168.122.66:3307), user=root
Wed Mar 14 09:57:01 2018 - [debug] Number of slave worker threads on host master(192.168.122.66:3307): 0
Wed Mar 14 09:57:01 2018 - [debug] Connected to: slave1(192.168.122.70:3307), user=root
Wed Mar 14 09:57:01 2018 - [debug] Number of slave worker threads on host slave1(192.168.122.70:3307): 8
Wed Mar 14 09:57:01 2018 - [debug] Connected to: slave2(192.168.122.80:3307), user=root
Wed Mar 14 09:57:01 2018 - [debug] Number of slave worker threads on host slave2(192.168.122.80:3307): 16
Wed Mar 14 09:57:01 2018 - [debug] Comparing MySQL versions..
Wed Mar 14 09:57:01 2018 - [debug] Comparing MySQL versions done.
Wed Mar 14 09:57:01 2018 - [debug] Connecting to servers done.
Wed Mar 14 09:57:01 2018 - [info] GTID failover mode = 1
Wed Mar 14 09:57:01 2018 - [info] Dead Servers:
Wed Mar 14 09:57:01 2018 - [info] Alive Servers:
Wed Mar 14 09:57:01 2018 - [info] master(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [info] slave1(192.168.122.70:3307)
Wed Mar 14 09:57:01 2018 - [info] slave2(192.168.122.80:3307)
Wed Mar 14 09:57:01 2018 - [info] Alive Slaves:
Wed Mar 14 09:57:01 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Wed Mar 14 09:57:01 2018 - [info] GTID ON
Wed Mar 14 09:57:01 2018 - [debug] Relay log info repository: TABLE
Wed Mar 14 09:57:01 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Mar 14 09:57:01 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Wed Mar 14 09:57:01 2018 - [info] GTID ON
Wed Mar 14 09:57:01 2018 - [debug] Relay log info repository: TABLE
Wed Mar 14 09:57:01 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [info] Not candidate for the new Master (no_master is set)
Wed Mar 14 09:57:01 2018 - [info] Current Alive Master: master(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [info] Checking slave configurations..
Wed Mar 14 09:57:01 2018 - [info] read_only=1 is not set on slave slave1(192.168.122.70:3307).
Wed Mar 14 09:57:01 2018 - [info] read_only=1 is not set on slave slave2(192.168.122.80:3307).
Wed Mar 14 09:57:01 2018 - [info] Checking replication filtering settings..
Wed Mar 14 09:57:01 2018 - [info] binlog_do_db= , binlog_ignore_db=
Wed Mar 14 09:57:01 2018 - [info] Replication filtering check ok.
Wed Mar 14 09:57:01 2018 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Wed Mar 14 09:57:01 2018 - [info] Checking SSH publickey authentication settings on the current master..
Wed Mar 14 09:57:01 2018 - [debug] SSH connection test to master, option -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5, timeout 5
Wed Mar 14 09:57:01 2018 - [info] HealthCheck: SSH to master is reachable.
Wed Mar 14 09:57:01 2018 - [info]
master(192.168.122.66:3307) (current master) # -- 有输出拓扑图和slave健康状况的信息
+--slave1(192.168.122.70:3307)
+--slave2(192.168.122.80:3307)

Wed Mar 14 09:57:01 2018 - [info] Checking replication health on slave1..
Wed Mar 14 09:57:01 2018 - [info] ok.
Wed Mar 14 09:57:01 2018 - [info] Checking replication health on slave2..
Wed Mar 14 09:57:01 2018 - [info] ok.
Wed Mar 14 09:57:01 2018 - [info] Checking master_ip_failover_script status:
Wed Mar 14 09:57:01 2018 - [info] /usr/local/mha/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=master --orig_master_ip=192.168.122.66 --orig_master_port=3307

IN SCRIPT TEST====/sbin/ifconfig eth0:88 down==/sbin/ifconfig eth0:88 192.168.122.99/24===

Checking the Status of the script.. OK
Wed Mar 14 09:57:01 2018 - [info] OK.
Wed Mar 14 09:57:01 2018 - [warning] shutdown_script is not defined.
Wed Mar 14 09:57:01 2018 - [debug] Disconnected from master(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [debug] Disconnected from slave1(192.168.122.70:3307)
Wed Mar 14 09:57:01 2018 - [debug] Disconnected from slave2(192.168.122.80:3307)
Wed Mar 14 09:57:01 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

启动MHA

为master添加浮动ip

  添加 VIP,因为这里使用的是 masterha_ip_failover 脚本切换 VIP,由于这个脚本的 start/stop 操作只有在 Failover (切换)期间才会执行,所以即便设置了该脚本的 vip,MHA 也不会在masterha_manager一开始启动的时候,在 Master 上设置 VIP初次在 Master 上设置 VIP 需要人工操作,后期如果有 Failover 操作, MHA 会执行脚本,自动切换到新选取的主机器上。

1
2
3
4
5
##
## Master 端
##
# 在我的测试机中,网卡名称为eth0,两台可以提升成主的设备都是eth0网卡
shell> ifconfig eth0:88 192.168.122.99/24

启动masterha_manager

一些参数

  • --conf :当前 MySQL 集群的配置文件,可以有多个,应用于不同的集群
  • --remove_dead_master_conf :当发生切换操作 Failover 后,需要把之前的 Dead Master 从配置文件中删除。
    • 如果不删除,且没有恢复的话,此时 masterha_manager 重启后,会报错 “there is a dead slave”
  • --ignore_last_failover :如果前一次 Failover 失败了, MHA 不会去再一次去做 Failover 操作,除非人为的删除 (manager_workdir)/(app_name).failover.error ,或者增加此参数, MHA 会继续进行 Failover 操作。

启动 MHA 的管理服务

1
2
3
4
5
6
# 启动MHA manager
shell> nohup masterha_manager --conf=/etc/masterha/app1.conf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &

shell> masterha_check_status --conf=/etc/masterha/app1.conf
app1 (pid:18702) is running(0:PING_OK), master:Master
# 状态正常,且当前主机是 Master

  日志可以查看 /var/log/masterha/app1/manager.log,在 failover 成功后会把旧 master 的配置段删掉,如切换 VIP 后我们配置文件中的 server1(192.168.122.66:3307) 配置段将被删。

测试MHA自动Failover

在做 Failover 测试之前,把 MHA Manager 停掉,把 MySQL 节点也都停掉( 注意停止顺序 ),把数据库进行一次冷备份,方便以后测试。若没测试成功可以移回冷备文件继续测。

  1. 停掉 Slave的 IO 线程 ,模拟复制延时( Slave1保持不变)
  2. 使用 sysbench 对 Master 进行测试,生成测试数据(可以产生大量的 binlog)
  3. 等待步骤2完成 后,开启 Slave 上的 IO 线程,让它去追 Master 的 binlog,同时立即操作第四步
  4. 关闭 Master 上的 MySQL,让 MHA 产生 Failover 操作
  5. 观察最终状态

附发生failover时的MHA Manager日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
Tue Mar 13 17:10:50 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Mar 13 17:10:50 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:10:50 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:12:23 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Mar 13 17:12:23 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:12:23 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
tory. Check for details, and consider setting --workdir separately.
Tue Mar 13 17:10:50 2018 - [debug] Connecting to servers..
Tue Mar 13 17:10:51 2018 - [debug] Connected to: master(192.168.122.66:3307), user=root
Tue Mar 13 17:10:51 2018 - [debug] Number of slave worker threads on host master(192.168.122.66:3307): 0
Tue Mar 13 17:10:51 2018 - [debug] Connected to: slave1(192.168.122.70:3307), user=root
Tue Mar 13 17:10:51 2018 - [debug] Number of slave worker threads on host slave1(192.168.122.70:3307): 8
Tue Mar 13 17:10:51 2018 - [debug] Connected to: slave2(192.168.122.80:3307), user=root
Tue Mar 13 17:10:51 2018 - [debug] Number of slave worker threads on host slave2(192.168.122.80:3307): 16
Tue Mar 13 17:10:51 2018 - [debug] Comparing MySQL versions..
Tue Mar 13 17:10:51 2018 - [debug] Comparing MySQL versions done.
Tue Mar 13 17:10:51 2018 - [debug] Connecting to servers done.
Tue Mar 13 17:10:51 2018 - [info] GTID failover mode = 1
Tue Mar 13 17:10:51 2018 - [info] Dead Servers:
Tue Mar 13 17:10:51 2018 - [info] Alive Servers:
Tue Mar 13 17:10:51 2018 - [info] master(192.168.122.66:3307)
Tue Mar 13 17:10:51 2018 - [info] slave1(192.168.122.70:3307)
Tue Mar 13 17:10:51 2018 - [info] slave2(192.168.122.80:3307)
Tue Mar 13 17:10:51 2018 - [info] Alive Slaves:
Tue Mar 13 17:10:51 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:10:51 2018 - [info] GTID ON
Tue Mar 13 17:10:51 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:10:51 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:10:51 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:10:51 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:10:51 2018 - [info] GTID ON
Tue Mar 13 17:10:51 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:10:51 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:10:51 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:10:51 2018 - [info] Current Alive Master: master(192.168.122.66:3307)
Tue Mar 13 17:10:51 2018 - [info] Checking slave configurations..
Tue Mar 13 17:10:51 2018 - [info] read_only=1 is not set on slave slave1(192.168.122.70:3307).
Tue Mar 13 17:10:51 2018 - [info] read_only=1 is not set on slave slave2(192.168.122.80:3307).
Tue Mar 13 17:10:51 2018 - [info] Checking replication filtering settings..
Tue Mar 13 17:10:51 2018 - [info] binlog_do_db= , binlog_ignore_db=
Tue Mar 13 17:10:51 2018 - [info] Replication filtering check ok.
Tue Mar 13 17:10:51 2018 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Tue Mar 13 17:10:51 2018 - [info] Checking SSH publickey authentication settings on the current master..
Tue Mar 13 17:10:51 2018 - [debug] SSH connection test to master, option -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5, timeout 5
Tue Mar 13 17:10:52 2018 - [info] HealthCheck: SSH to master is reachable.
Tue Mar 13 17:10:52 2018 - [info]
master(192.168.122.66:3307) (current master)
+--slave1(192.168.122.70:3307)
+--slave2(192.168.122.80:3307)

Tue Mar 13 17:10:52 2018 - [info] Checking master_ip_failover_script status:
Tue Mar 13 17:10:52 2018 - [info] /usr/local/mha/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=master --orig_master_ip=192.168.122.66 --orig_master_port=3307

IN SCRIPT TEST====/sbin/ifconfig eth0:88 down==/sbin/ifconfig eth0:88 192.168.122.99/24===
# 还没故障转移的正常启动日志
Checking the Status of the script.. OK
Tue Mar 13 17:10:52 2018 - [info] OK.
Tue Mar 13 17:10:52 2018 - [warning] shutdown_script is not defined.
Tue Mar 13 17:10:52 2018 - [debug] Disconnected from master(192.168.122.66:3307)
Tue Mar 13 17:10:52 2018 - [debug] Disconnected from slave1(192.168.122.70:3307)
Tue Mar 13 17:10:52 2018 - [debug] Disconnected from slave2(192.168.122.80:3307)
Tue Mar 13 17:10:52 2018 - [debug] SSH check command: exit 0
Tue Mar 13 17:10:52 2018 - [info] Set master ping interval 3 seconds.
Tue Mar 13 17:10:52 2018 - [info] Set secondary check script: /usr/local/mha/bin/masterha_secondary_check -s 192.168.122.70 -s 192.168.122.80 --user=root --master_host=192.168.122.66 --master_port=3307
Tue Mar 13 17:10:52 2018 - [info] Starting ping health check on master(192.168.122.66:3307)..
Tue Mar 13 17:10:52 2018 - [debug] Connected on master.
Tue Mar 13 17:10:52 2018 - [debug] Set short wait_timeout on master: 6 seconds
Tue Mar 13 17:10:52 2018 - [debug] Trying to get advisory lock..
Tue Mar 13 17:10:52 2018 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond.. # 开始等待mysql不响应
Tue Mar 13 17:12:13 2018 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away) # 此时发现master实例ping不通了
Tue Mar 13 17:12:13 2018 - [info] Executing secondary network check script: /usr/local/mha/bin/masterha_secondary_check -s 192.168.122.70 -s 192.168.122.80 --user=root --master_host=192.168.122.66 --master_port=3307 --user=root --master_host=master --master_ip=192.168.122.66 --master_port=3307 --master_user=root --master_password=111111 --ping_type=INSERT
Tue Mar 13 17:12:13 2018 - [info] Executing SSH check script: exit 0
Tue Mar 13 17:12:13 2018 - [debug] SSH connection test to master, option -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5, timeout 5
Tue Mar 13 17:12:13 2018 - [info] HealthCheck: SSH to master is reachable. # 用两个slave实例检测ssh是否能到master
Monitoring server 192.168.122.70 is reachable, Master is not reachable from 192.168.122.70. OK.
Monitoring server 192.168.122.80 is reachable, Master is not reachable from 192.168.122.80. OK.
Tue Mar 13 17:12:14 2018 - [info] Master is not reachable from all other monitoring servers. Failover should start. # 两个slave都不能连到master的mysql实例,重试3次
Tue Mar 13 17:12:16 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.122.66' (111))
Tue Mar 13 17:12:16 2018 - [warning] Connection failed 2 time(s)..
Tue Mar 13 17:12:19 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.122.66' (111))
Tue Mar 13 17:12:19 2018 - [warning] Connection failed 3 time(s)..
Tue Mar 13 17:12:22 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.122.66' (111))
Tue Mar 13 17:12:22 2018 - [warning] Connection failed 4 time(s)..
Tue Mar 13 17:12:22 2018 - [warning] Master is not reachable from health checker!
Tue Mar 13 17:12:22 2018 - [warning] Master master(192.168.122.66:3307) is not reachable!
Tue Mar 13 17:12:22 2018 - [warning] SSH is reachable.
Tue Mar 13 17:12:22 2018 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.conf again, and trying to connect to all servers to check server status..
Tue Mar 13 17:12:22 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Mar 13 17:12:22 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:12:22 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:12:22 2018 - [debug] Skipping connecting to dead master master(192.168.122.66:3307).
Tue Mar 13 17:12:22 2018 - [debug] Connecting to servers..
Tue Mar 13 17:12:23 2018 - [debug] Connected to: slave1(192.168.122.70:3307), user=root
Tue Mar 13 17:12:23 2018 - [debug] Number of slave worker threads on host slave1(192.168.122.70:3307): 8
Tue Mar 13 17:12:23 2018 - [debug] Connected to: slave2(192.168.122.80:3307), user=root
Tue Mar 13 17:12:23 2018 - [debug] Number of slave worker threads on host slave2(192.168.122.80:3307): 16
Tue Mar 13 17:12:23 2018 - [debug] Comparing MySQL versions..
Tue Mar 13 17:12:23 2018 - [debug] Comparing MySQL versions done.
Tue Mar 13 17:12:23 2018 - [debug] Connecting to servers done.
Tue Mar 13 17:12:23 2018 - [info] GTID failover mode = 1
Tue Mar 13 17:12:23 2018 - [info] Dead Servers: # the master is already dead
Tue Mar 13 17:12:23 2018 - [info] master(192.168.122.66:3307)
Tue Mar 13 17:12:23 2018 - [info] Alive Servers:
Tue Mar 13 17:12:23 2018 - [info] slave1(192.168.122.70:3307)
Tue Mar 13 17:12:23 2018 - [info] slave2(192.168.122.80:3307)
Tue Mar 13 17:12:23 2018 - [info] Alive Slaves:
Tue Mar 13 17:12:23 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:23 2018 - [info] GTID ON
Tue Mar 13 17:12:23 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:23 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:23 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:12:23 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:23 2018 - [info] GTID ON
Tue Mar 13 17:12:23 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:23 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:23 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:12:23 2018 - [info] Checking slave configurations..
Tue Mar 13 17:12:23 2018 - [info] read_only=1 is not set on slave slave1(192.168.122.70:3307).
Tue Mar 13 17:12:23 2018 - [info] read_only=1 is not set on slave slave2(192.168.122.80:3307).
Tue Mar 13 17:12:23 2018 - [info] Checking replication filtering settings..
Tue Mar 13 17:12:23 2018 - [info] Replication filtering check ok.
Tue Mar 13 17:12:23 2018 - [info] Master is down!
Tue Mar 13 17:12:23 2018 - [info] Terminating monitoring script.
Tue Mar 13 17:12:23 2018 - [debug] Disconnected from slave1(192.168.122.70:3307)
Tue Mar 13 17:12:23 2018 - [debug] Disconnected from slave2(192.168.122.80:3307)
Tue Mar 13 17:12:23 2018 - [info] Got exit code 20 (Master dead).
Tue Mar 13 17:12:23 2018 - [info] MHA::MasterFailover version 0.57.
Tue Mar 13 17:12:23 2018 - [info] Starting master failover.
Tue Mar 13 17:12:23 2018 - [info]
Tue Mar 13 17:12:23 2018 - [info] * Phase 1: Configuration Check Phase..
Tue Mar 13 17:12:23 2018 - [info]
Tue Mar 13 17:12:23 2018 - [debug] Skipping connecting to dead master master.
Tue Mar 13 17:12:23 2018 - [debug] Connecting to servers..
Tue Mar 13 17:12:24 2018 - [debug] Connected to: slave1(192.168.122.70:3307), user=root
Tue Mar 13 17:12:24 2018 - [debug] Number of slave worker threads on host slave1(192.168.122.70:3307): 8
Tue Mar 13 17:12:24 2018 - [debug] Connected to: slave2(192.168.122.80:3307), user=root
Tue Mar 13 17:12:24 2018 - [debug] Number of slave worker threads on host slave2(192.168.122.80:3307): 16
Tue Mar 13 17:12:24 2018 - [debug] Comparing MySQL versions..
Tue Mar 13 17:12:24 2018 - [debug] Comparing MySQL versions done.
Tue Mar 13 17:12:24 2018 - [debug] Connecting to servers done.
Tue Mar 13 17:12:24 2018 - [info] GTID failover mode = 1
Tue Mar 13 17:12:24 2018 - [info] Dead Servers:
Tue Mar 13 17:12:24 2018 - [info] master(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Alive Servers:
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307)
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307)
Tue Mar 13 17:12:24 2018 - [info] Alive Slaves:
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Primary candidate for the new Master (candidate_master is set) # among the surviving slaves, slave1 is configured as promotable to master
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Not candidate for the new Master (no_master is set) # slave2's configuration does not allow promotion to master
Tue Mar 13 17:12:24 2018 - [info] Starting GTID based failover.
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [debug] Stopping IO thread on slave1(192.168.122.70:3307).. # stop slave1's IO thread
Tue Mar 13 17:12:24 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Mar 13 17:12:24 2018 - [info] Executing master IP deactivation script:
Tue Mar 13 17:12:24 2018 - [info] /usr/local/mha/bin/master_ip_failover --orig_master_host=master --orig_master_ip=192.168.122.66 --orig_master_port=3307 --command=stopssh --ssh_user=root
Tue Mar 13 17:12:24 2018 - [debug] Stopping IO thread on slave2(192.168.122.80:3307)..
Tue Mar 13 17:12:24 2018 - [debug] Stop IO thread on slave2(192.168.122.80:3307) done.

IN SCRIPT TEST====/sbin/ifconfig eth0:88 down==/sbin/ifconfig eth0:88 192.168.122.99/24===

Disabling the VIP on old master: master
Tue Mar 13 17:12:24 2018 - [debug] Stop IO thread on slave1(192.168.122.70:3307) done.
Tue Mar 13 17:12:24 2018 - [info] done.
Tue Mar 13 17:12:24 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Mar 13 17:12:24 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 3: Master Recovery Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [debug] Fetching current slave status..
Tue Mar 13 17:12:24 2018 - [debug] Fetching current slave status done.
Tue Mar 13 17:12:24 2018 - [info] The latest binary log file/position on all slaves is bin.000004:20816
Tue Mar 13 17:12:24 2018 - [info] Retrieved Gtid Set: 0c154ad5-2699-11e8-94a1-525400eac085:309-394
Tue Mar 13 17:12:24 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:12:24 2018 - [info] The oldest binary log file/position on all slaves is bin.000004:20816
Tue Mar 13 17:12:24 2018 - [info] Retrieved Gtid Set: 0c154ad5-2699-11e8-94a1-525400eac085:309-394
Tue Mar 13 17:12:24 2018 - [info] Oldest slaves:
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 3.3: Determining New Master Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] Searching new master from slaves..
Tue Mar 13 17:12:24 2018 - [info] Candidate masters from the configuration file:
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:12:24 2018 - [info] Non-candidate masters:
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:12:24 2018 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Tue Mar 13 17:12:24 2018 - [info] New master is slave1(192.168.122.70:3307) # slave1 is chosen for promotion to new master
Tue Mar 13 17:12:24 2018 - [info] Starting master failover.. # failover begins
Tue Mar 13 17:12:24 2018 - [info]
From: # original topology
master(192.168.122.66:3307) (current master)
+--slave1(192.168.122.70:3307)
+--slave2(192.168.122.80:3307)

To: # topology after failover
slave1(192.168.122.70:3307) (new master)
+--slave2(192.168.122.80:3307)
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 3.3: New Master Recovery Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] Waiting all logs to be applied..
Tue Mar 13 17:12:24 2018 - [info] done.
Tue Mar 13 17:12:24 2018 - [debug] Stopping slave IO/SQL thread on slave1(192.168.122.70:3307)..
Tue Mar 13 17:12:24 2018 - [debug] done.
Tue Mar 13 17:12:24 2018 - [info] Getting new master's binlog name and position..
Tue Mar 13 17:12:24 2018 - [info] bin.000003:94211
Tue Mar 13 17:12:24 2018 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='slave1 or 192.168.122.70', MASTER_PORT=3307, MASTER_AUTO_POSITION=1, MASTER_USER='rpl', MASTER_PASSWORD='xxx';
Tue Mar 13 17:12:24 2018 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: bin.000003, 94211, 0c154ad5-2699-11e8-94a1-525400eac085:1-394,
c08d09b5-2698-11e8-9ec0-5254004dae68:1-2
Tue Mar 13 17:12:24 2018 - [info] Executing master IP activate script:
Tue Mar 13 17:12:24 2018 - [info] /usr/local/mha/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=master --orig_master_ip=192.168.122.66 --orig_master_port=3307 --new_master_host=slave1 --new_master_ip=192.168.122.70 --new_master_port=3307 --new_master_user='root' --new_master_password=xxx
Unknown option: new_master_user
Unknown option: new_master_password

IN SCRIPT TEST====/sbin/ifconfig eth0:88 down==/sbin/ifconfig eth0:88 192.168.122.99/24===

Enabling the VIP - 192.168.122.99/24 on the new master - slave1
Tue Mar 13 17:12:25 2018 - [info] OK.
Tue Mar 13 17:12:25 2018 - [info] ** Finished master recovery successfully.
Tue Mar 13 17:12:25 2018 - [info] * Phase 3: Master Recovery Phase completed.
Tue Mar 13 17:12:25 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info] * Phase 4: Slaves Recovery Phase..
Tue Mar 13 17:12:25 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info] * Phase 4.1: Starting Slaves in parallel..
Tue Mar 13 17:12:25 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info] -- Slave recovery on host slave2(192.168.122.80:3307) started, pid: 10212. Check tmp log /var/log/masterha/app1/slave2_3307_20180313171223.log if it takes time..
Tue Mar 13 17:12:27 2018 - [info]
Tue Mar 13 17:12:27 2018 - [info] Log messages from slave2 ...
Tue Mar 13 17:12:27 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info] Resetting slave slave2(192.168.122.80:3307) and starting replication from the new master slave1(192.168.122.70:3307)..
Tue Mar 13 17:12:25 2018 - [debug] Stopping slave IO/SQL thread on slave2(192.168.122.80:3307)..
Tue Mar 13 17:12:25 2018 - [debug] done.
Tue Mar 13 17:12:25 2018 - [info] Executed CHANGE MASTER.
Tue Mar 13 17:12:25 2018 - [debug] Starting slave IO/SQL thread on slave2(192.168.122.80:3307)..
Tue Mar 13 17:12:26 2018 - [debug] done.
Tue Mar 13 17:12:26 2018 - [info] Slave started.
Tue Mar 13 17:12:26 2018 - [info] gtid_wait(0c154ad5-2699-11e8-94a1-525400eac085:1-394,
c08d09b5-2698-11e8-9ec0-5254004dae68:1-2) completed on slave2(192.168.122.80:3307). Executed 3 events.
Tue Mar 13 17:12:27 2018 - [info] End of log messages from slave2.
Tue Mar 13 17:12:27 2018 - [info] -- Slave on host slave2(192.168.122.80:3307) started.
Tue Mar 13 17:12:27 2018 - [info] All new slave servers recovered successfully.
Tue Mar 13 17:12:27 2018 - [info]
Tue Mar 13 17:12:27 2018 - [info] * Phase 5: New master cleanup phase..
Tue Mar 13 17:12:27 2018 - [info]
Tue Mar 13 17:12:27 2018 - [info] Resetting slave info on the new master..
Tue Mar 13 17:12:27 2018 - [debug] Clearing slave info..
Tue Mar 13 17:12:27 2018 - [debug] Stopping slave IO/SQL thread on slave1(192.168.122.70:3307)..
Tue Mar 13 17:12:27 2018 - [debug] done.
Tue Mar 13 17:12:27 2018 - [debug] SHOW SLAVE STATUS shows new master does not replicate from anywhere. OK.
Tue Mar 13 17:12:27 2018 - [info] slave1: Resetting slave info succeeded.
Tue Mar 13 17:12:27 2018 - [info] Master failover to slave1(192.168.122.70:3307) completed successfully.
Tue Mar 13 17:12:27 2018 - [info] Deleted server1 entry from /etc/masterha/app1.conf .
Tue Mar 13 17:12:27 2018 - [debug] Disconnected from slave1(192.168.122.70:3307)
Tue Mar 13 17:12:27 2018 - [debug] Disconnected from slave2(192.168.122.80:3307)
Tue Mar 13 17:12:27 2018 - [info]

----- Failover Report -----
# Below is the failover report: the cluster's master role switched from 66:3307 to 70:3307
app1: MySQL Master failover master(192.168.122.66:3307) to slave1(192.168.122.70:3307) succeeded

Master master(192.168.122.66:3307) is down!

Check MHA Manager logs at centos-66:/var/log/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on master(192.168.122.66:3307)
Selected slave1(192.168.122.70:3307) as a new master.
slave1(192.168.122.70:3307): OK: Applying all logs succeeded.
slave1(192.168.122.70:3307): OK: Activated master IP address.
slave2(192.168.122.80:3307): OK: Slave started, replicating from slave1(192.168.122.70:3307)
slave1(192.168.122.70:3307): Resetting slave info succeeded.
Master failover to slave1(192.168.122.70:3307) completed successfully. # failover succeeded
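
The `IN SCRIPT TEST====...` lines above come from a customized `master_ip_failover` script: during "Dead Master Shutdown" MHA calls it with `--command=stopssh` to take the VIP down on the old master, and during "Master IP activate" with `--command=start` to bring it up on the new master. The real sample script shipped with MHA is Perl; below is a minimal shell sketch of just the VIP-switching logic, with the interface name and VIP taken from the log. The `vip_action`/`run`/`DRY_RUN` names are hypothetical, for illustration only:

```shell
#!/bin/bash
# Hypothetical sketch of the VIP portion of a master_ip_failover script.
# VIP and interface mirror the log above; DRY_RUN=1 prints commands instead
# of executing them, so the logic can be tested on any machine.
VIP="192.168.122.99/24"
IFACE="eth0:88"
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"        # print the command instead of running it
    else
        "$@"
    fi
}

vip_action() {
    case "$1" in
        start)           # on the new master: bring the VIP up
            run /sbin/ifconfig "$IFACE" "$VIP"
            ;;
        stop|stopssh)    # on the old/dead master: take the VIP down
            run /sbin/ifconfig "$IFACE" down
            ;;
        status)
            run /sbin/ifconfig "$IFACE"
            ;;
        *)
            echo "Usage: $0 {start|stop|stopssh|status}" >&2
            return 1
            ;;
    esac
}

# MHA first deactivates the VIP on the dead master, then activates it
# on the newly promoted one -- the same order seen in the log.
vip_action stopssh   # -> /sbin/ifconfig eth0:88 down
vip_action start     # -> /sbin/ifconfig eth0:88 192.168.122.99/24
```

A production script would additionally send a gratuitous ARP (e.g. `arping`) after raising the VIP so that clients and switches learn the new MAC address quickly.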