mysql MHA 记录
chenzuoqing Lv3

MySQL MHA配置记录

MHA介绍

MHA是一套 MySQL 高可用管理软件 ,除了检测 Master 宕机后,提升候选 Slave 为 New Master 之外,还会自动让其他 Slave 与 New Master 建立复制关系。切换通过浮动 ip 的方式,浮动 ip 将绑定在任何时期的主节点上(若切换则 ip 也漂移),通过浮动 ip 提供服务。

MHA Manager可以 单独部署在一台独立的机器上,并管理多个 master-slave 集群,只需要指定不通的配置文件。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-- 
+------------------------------------+ (app1.conf) +----------+
| VIP | heartbeat | MHA |
| +----------+ rpl +--------+ <-------------+ Manager |
| | Slave(M) +----------> Master | | rpl manager +----------+
| +----------+ +-^----^-+ | |
| rpl | | | |
| +------------------+ |rpl | |
| | | | (app2.conf) |
| +---------+ +-------+-+ | +----V----+
| | Slave 1 | ...... | Slave N | | | Other |
| +---------+ +---------+ | | MySQL |
| | | rpl |
| MySQL Replication | +---------+
+------------------------------------+

集群节点描述

  1. MHA Manager: MHA的管理节点,负责检测MySQL的状态,以及管理MySQL的复制关系
  2. Master: MySQL对外提供的服务( 通过 VIP )的主节点
  3. **Slave(M)**: MySQL候选主节点,本质上为一个Slave节点,只是在Master宕机后,会被提升为 New Master
  4. Slave N: MySQL从机节点,对外可以提供只读服务

故障检测过程

  当Master宕机后,MHA Manager会让 Slave(M) 提升为 New Master,然后修改 Slave(N) 的复制关系(CHANGE MASTER),指向 New Master

  1. MHA Manager检测Master是否宕机,不会简单的只检查自身与Master之间的直连状态
  2. MHA Manager会 透过其他Slave节点去检测Master,进行二次探测,最大限度的避免脑裂的发生
  3. 若检测到Master宕机了, MHA Manager会 透过Slave(M) 检测是否为宕机;如果检测后仍为宕机状态,会继续 透过Slave1…SlaveN进行探测,只有在透过所有节点检测到Master宕机了,才真的认为Master宕机了。
  • 以上操作步骤的前提是,**MHA Manager 到所有 MySQL 节点的 SSH 免密码登录要打通**。
    最后将提供对外服务的 VIP 漂移到 New Master 上,继续对外提供服务。

MHA Failover过程

  1. 从宕机崩溃的 Master 节点 保存二进制日志事件(binlog events
    • 此种情况为 MySQL 挂了,但是服务器没挂 ,还是可以通过 SSH 连接过去
    • 如果服务器彻底宕机了,该步骤略过
  2. 识别 含有最新的更新 的 Slave 节点
  3. 应用差异的中继日志 ( relay-log)到其他的 Slave
  4. 应用从 Master 保存的二进制日志事件(binlog events
    • 如果之前 Master 彻底宕机了,就没有保存的 binlog,该步骤略过
  5. 提升一个 Slave 为新的 Master( New)
  6. 使其他的 Slave 连接新的 Master( New) 进行复制

注意

  从上面的步骤看, MHA本身无法完全保证数据不会丢

  1. MySQL 5.7 之前数据不丢的前提是 Master 服务器还可以被 MHA Manager 进行 SSH 连接 ,通过 应用保存的binlog的方式来保证。
  2. MySQL 5.7 之后通过无损复制,仅仅是减少了丢数据的可能性,假如此时的状态为切成异步的状态 ,那就和之前一样了(可以设置超时的时间很大)。
  3. Master 恢复的时候,最后一部分数据是否需要 Flashback, MHA 不负责这个操作,需要人工确认。

准备工作

会用到 mysql utilities 中的工具,请安装。

测试集群拓扑图

  注意: slave 中的 /etc/my.cnf 中需要配置 report-host=slave ip_addr ,否则命令执行失败。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
shell> mysqlrplshow --master=root:111111@"192.168.122.66:3307" --discover-slaves-login=root:111111 -r -v
WARNING: Using a password on the command line interface can be insecure.
# master on 192.168.122.66: ... connected.
# Finding slaves for master: 192.168.122.66:3307
# master on 192.168.122.70: ... connected.
# Finding slaves for master: 192.168.122.70:3307

WARNING: Cannot connect to some slaves:
- 192.168.122.70:3306: Can't connect to MySQL server on '192.168.122.70:3306' (111 Connection refused)
# master on 192.168.122.80: ... connected.
# Finding slaves for master: 192.168.122.80:3307

# Replication Topology Graph
192.168.122.66:3307 (MASTER) [由于机器数量问题,此节点会跑MHA manager和node]
|
+--- 192.168.122.70:3307 [IO: Yes, SQL: Yes] - (SLAVE) [做MHA node]
|
+--- 192.168.122.80:3307 [IO: Yes, SQL: Yes] - (SLAVE) [做MHA node]

  另外如果使用 mysqlreplicate 创建复制关系,则他会帮你在你的 Master上创建 [email protected][email protected] 用户,需要注意这个 rpl 用户需要赋予 replicationslave 权限,可以手动确认一下。

MHA软件角色分布

  上述架构,MHA 中 Manager 节点最好不要在集群中(这里空闲的测试机有限,so..,需要打通 Manager 节点到所有 mysql 节点的 ssh 密钥连接,mysql 节点相互也需要打通,方便无密码登陆。操作不演示。

安装MHA软件

示例基于1主2从,需要先搭建好,打通 ssh 认证。
MHA Node 在所有节点上都要安装

获取MHA源码

  获取MHA源码,github中tar包最新的貌似是 0.56 下载地址,rpm 包有 0.57 下载地址,需要源码版可以往百度云找,已收藏…
PS:有梯子可以打开 此处下载0.57

如上准备主从,mysqlreplicate搭建,不演示。

安装依赖包

安装 perl 依赖包

1
2
3
4
5
6
7
-- redhat系安装依赖,若装完执行makefile报错可以尝试重新登陆下
yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes \
perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker perl-ExtUtils-Embed cpan perl-CPAN

-- debian系安装依赖
apt install libdbd-mysql-perl libconfig-tiny-perl liblog-dispatch-perl libparallel-forkmanager-perl

安装MHA Manager

mysql master 节点安装 mha node 和 manager,先解压好(这里环境的 MHA Manager 放在了 mysql 节点中)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
shell> cd mha4mysql-manager-0.57
shell> perl Makefile.PL
shell> make && make install
shell> ll bin/
total 40
-rwxr-xr-x 1 git mysql 1995 May 31 2015 masterha_check_repl # 检查MySQL复制状态
-rwxr-xr-x 1 git mysql 1779 May 31 2015 masterha_check_ssh # 检查MHA的SSH状态
-rwxr-xr-x 1 git mysql 1865 May 31 2015 masterha_check_status # 检查当前MHA的状态
-rwxr-xr-x 1 git mysql 3201 May 31 2015 masterha_conf_host # 添加或删除配置的server信息
-rwxr-xr-x 1 git mysql 2517 May 31 2015 masterha_manager # 启动MHA
-rwxr-xr-x 1 git mysql 2165 May 31 2015 masterha_master_monitor # 检测master是否宕机
-rwxr-xr-x 1 git mysql 2373 May 31 2015 masterha_master_switch # 控制故障转移(手动或者自动)
-rwxr-xr-x 1 git mysql 5171 May 31 2015 masterha_secondary_check # 通过slave检测master是否宕机(二次探测)
-rwxr-xr-x 1 git mysql 1739 May 31 2015 masterha_stop # MHA停止
shell> mkdir -p /etc/masterha # MHA配置文件存放目录
shell> mkdir -p /var/log/masterha/app1 # 建议和MHA配置文件保持一致,若管理多套mysql集群可以再新建appxxx
shell> mkdir -p /usr/local/mha/ # 准备存放MHA可执行程序和lib
shell> cp -rf bin/ lib/ /usr/local/mha/

shell> cd ../mha4mysql-node-0.57/ # manager也需要安装node工具
shell> perl Makefile.PL
shell> make && make install
shell> cp bin/* /usr/local/mha/bin/
shell> cp lib/* /usr/local/mha/lib/

安装MHA node

为所有节点安装 MHA node,这里两个 slave 只安装 mha node

1
2
3
4
5
6
7
8
9
10
shell> yum install perl-DBI perl-DBD-MySQL -y   # node节点依赖较少
shell> cd mha4mysql-node-0.57/
shell> perl Makefile.PL
shell> make && make install
shell> ll bin/
total 44
-rwxr-xr-x 1 1001 mysql 16381 May 31 2015 apply_diff_relay_logs # 识别差异的中继日志事件并将其差异的事件应用于其他的slave
-rwxr-xr-x 1 1001 mysql 4807 May 31 2015 filter_mysqlbinlog # 去除不必要的ROLLBACK事件(脚本中指出该脚本已经过时,应该是被废弃了)
-rwxr-xr-x 1 1001 mysql 8261 May 31 2015 purge_relay_logs # 清除中继日志(不会阻塞SQL线程)
-rwxr-xr-x 1 1001 mysql 7525 May 31 2015 save_binary_logs # 保存和复制master的二进制日志

MHA自带配置文件

MHA-manager 编译完在 samples 目录下有配置文件模板和 failover 脚本供参考,这里用另外写的

1
2
3
4
5
6
7
8
9
10
shell> tree samples/
samples/
├── conf
│ ├── app1.cnf
│ └── masterha_default.cnf
└── scripts
├── master_ip_failover
├── master_ip_online_change
├── power_manager
└── send_report

MHA配置文件和脚本

配置文件统一放在 /etc/masterha/ 中,针对当前这个 MySQL 复制组,我们定义为 app1.conf ,如果还有 其他 MySQL 复制组 ,可以 继续定义 app2.conf 、app3.conf等等

示例配置文件细述

/etc/masterha/app1.conf 定义一个 mysql 集群如下内容

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
#
# MHA Manager 端 /etc/masterha/app1.conf
#
[server default]
# 这两个参数需要根据不同的集群进行修改
manager_workdir=/var/log/masterha/app1
manager_log=/var/log/masterha/app1/manager.log
# 按照master服务器存放binlog的实际路径进行修改,主要为了让MHA拉取binlog
master_binlog_dir=/data/mysql_data/5.7.11/
# 设置自动failover的脚本,下面会给出
master_ip_failover_script= /usr/local/mha/bin/master_ip_failover
# 设置手动切换时候的脚本 (供(masterha_master_switch使用)
master_ip_online_change_script= /usr/local/mha/bin/master_ip_online_change
log_level=debug
# 监控的用户
user=root
# 监控用户的密码
password=111111
# 监控主库的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行failover
ping_interval=3
# 检测方式是insert, MHA-0.56开始支持insert
# 会在Master中生成一个 infra 数据库
ping_type=INSERT
# 设置远端mysql在发生切换时binlog的保存位置
remote_workdir=/tmp
# 复制用的密码
repl_password=rpl
# 复制的用户
repl_user=rpl
# 告警脚本,可自行修改,这里没有使用
#report_script=/usr/local/mha/bin/send_report
# 通过从机进行二次探测的脚本, IP地址按照实际的情况进行修改
secondary_check_script=/usr/local/mha/bin/masterha_secondary_check -s 192.168.122.70 -s 192.168.122.80 --user=root --master_host=192.168.122.66 --master_port=3307
# 设置故障发生后关闭故障主机的脚本(主要作用是关闭主机防止发生脑裂,这里没有使用,类似Fence功能)
#shutdown_script="/usr/local/mha/bin/power_manager --command=stopssh2 --host=test-1 --ssh_user=root"
# 定义ssh的用户
ssh_user=root
[server1]
# 这个hostname也可以配置成IP地址,同 ip 参数一样
# 如果这里写名字,需要DNS配合,或者使用 /etc/hosts
hostname=master
ip=192.168.122.66
port=3307
# candidate_master参数的意思为:设置为候选Master,如果发生主从切换,该主机会被提升为Master,即使这个服务器上的数据不是最新的(会用relay-log补全)
candidate_master=1
[server2]
hostname=slave1
ip=192.168.122.70
port=3307
candidate_master=1
# check_repl_delay参数的意思为:默认情况下如果一个slave落后master 100M的relay logs的话, MHA将不会选择该slave作为一个新的master;
# 因为对于这个slave的恢复需要花费很长时间;
# 通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时;
# 这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master
check_repl_delay=0
[server3]
hostname=slave2
ip=192.168.122.80
port=3307
# no_master 表示该主机不会被提升为Master
no_master=1

failover脚本

failover 脚本,新建 /usr/local/mha/bin/master_ip_failover 文件,记得加执行权限

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
$command, $ssh_user, $orig_master_host, $orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip, $new_master_port
);

# 注意修改系统对应的接口名称(CentOS比较变态,若多节点可以升级成master,记得接口名称需要一样,比如都为eth0...)
my $vip = '192.168.122.99/24';
my $eth = 'eth0';
my $key = '88';
my $ssh_start_vip = "/sbin/ifconfig $eth:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig $eth:$key down";

GetOptions(
'command=s' => \$command,
'ssh_user=s' => \$ssh_user,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
);

exit &main();

sub main {

print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

if ( $command eq "stop" || $command eq "stopssh" ) {

my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
exit 0;
}
else {
&usage();
exit 1;
}
}

sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

sub stop_vip() {
return 0 unless ($ssh_user);
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

在线切换master的脚本

创建在线切换 master 的脚本 /usr/local/mha/bin/master_ip_online_change,同样加执行权限

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

#my (
# $command, $ssh_user, $orig_master_host, $orig_master_ip,
# $orig_master_port, $new_master_host, $new_master_ip, $new_master_port
#);

my (
$command, $orig_master_is_new_slave, $orig_master_host,
$orig_master_ip, $orig_master_port, $orig_master_user,
$orig_master_password, $orig_master_ssh_user, $new_master_host,
$new_master_ip, $new_master_port, $new_master_user,
$new_master_password, $new_master_ssh_user,
);

# 同样注意修改系统对应的接口名称,和对应的VIP
my $vip = '192.168.122.99/24';
my $eth = 'eth0';
my $key = '88';
my $ssh_start_vip = "/sbin/ifconfig $eth:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig $eth:$key down";
my $ssh_user = "root";

GetOptions(
'command=s' => \$command,
#'ssh_user=s' => \$ssh_user,
#'orig_master_host=s' => \$orig_master_host,
#'orig_master_ip=s' => \$orig_master_ip,
#'orig_master_port=i' => \$orig_master_port,
#'new_master_host=s' => \$new_master_host,
#'new_master_ip=s' => \$new_master_ip,
#'new_master_port=i' => \$new_master_port,
'orig_master_is_new_slave' => \$orig_master_is_new_slave,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'orig_master_user=s' => \$orig_master_user,
'orig_master_password=s' => \$orig_master_password,
'orig_master_ssh_user=s' => \$orig_master_ssh_user,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
'new_master_user=s' => \$new_master_user,
'new_master_password=s' => \$new_master_password,
'new_master_ssh_user=s' => \$new_master_ssh_user,
);

exit &main();

sub main {

print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

if ( $command eq "stop" || $command eq "stopssh" ) {
my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
exit 0;
}
else {
&usage();
exit 1;
}
}

sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

sub stop_vip() {
return 0 unless ($ssh_user);
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --ssh-user=user --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

Slave上的relay_log配置

  从 MHA Failover 的过程中可以了解到, MHA Manager 在恢复( 补齐)其他 Slave 数据时会用到 relay-log ,因此这些 relay-log 需要被保留。

  而默认情况下, SQL 线程在回放完毕后, MySQL 会 主动删除 relay-log ,需要 禁用 该功能,确保 relay-log 不被自动删除。

所以在所有 Slave 节点 中配置如下参数,然后重启mysql服务。

1
2
3
4
# 所有的Slave节点
[mysqld]
# 关闭relay-log主动删除的功能
relay_log_purge = 0

但是这样做了以后又带来另外一个问题,relay-log 会大量堆积导致磁盘空间紧张,所以需要定时清空过时的relay-logMHA Node 的安装包中有一个 pure_relay_logs 工具,提供删除大量 relay-log 的功能。

pure_relay_logs工具

该工具原理:在Linux下删除体积较大的文件需要一定的时间,且消耗一定的资源,所以 pure_relay_logs 在删除relay-log 之前会做一次 硬连接 ,然后删除对应的 relay-log (只删除指向 relay-log 的 指针 ),这样不会造成系统资源消耗,从而影响复制( 造成复制延时),最后再 删除硬连接的relay-log(_hardlink_) 。

参数

1
2
3
4
5
6
7
8
9
--user : 用户名
--password : 密码
--port : 端口号
--workdir : 指定 创建relay-log的硬链接 的位置
默认是 /var/tmp ,由于 硬连接不能跨分区 ,所以请 确保这个目录和你的relay-log在同一个分区 ;
当脚本执行成功后,硬链接会被删除
--disable_relay_log_purge : 默认情况下,如果 relay_log_purge=1 ,脚本会什么都不清理,自动退出
通过设定这个参数,当 relay_log_purge=1 的情况下,该参数会将relay_log_purge设置为0。清理relay log之后,最后再设置 relay_log_purge=0 。
但是还是要在 /etc/my.cnf 中显示配置 relay_log_purge=0 ,避免重启服务后被还原。

可以增加如下定时清理脚本到定时任务中,大概几个小时执行一次

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#!/bin/bash

user=root
passwd=123
port=3306
# 目录可以创建
log_dir='/mysql/purge_relay_logs'
work_dir='/mysql/relay_log_hardlink'
purge='/usr/local/mha/bin/purge_relay_logs'

if [[ ! -d ${work_dir} ]];then
mkdir ${work_dir} -p
fi

if [[ ! -d ${log_dir} ]];then
mkdir ${log_dir} -p
fi

${purge} --user=${user} --password=${passwd} --port=${port} --workdir=${work_dir} \
--disable_relay_log_purge >> ${log_dir}/purge_relay_logs.log 2>&1

Manager检测配置中主机状态

到次配置几乎完成,但是还未启动,可以在 MAH manager 机器运行 ssh 检测和复制关系检测

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# ssh检测
shell> masterha_check_ssh --conf=/etc/masterha/app1.conf # 下面是检测正常
Wed Mar 14 09:55:45 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Mar 14 09:55:45 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Wed Mar 14 09:55:45 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
Wed Mar 14 09:55:45 2018 - [info] Starting SSH connection tests..
Wed Mar 14 09:55:46 2018 - [debug]
Wed Mar 14 09:55:45 2018 - [debug] Connecting via SSH from root@master(192.168.122.66:22) to root@slave1(192.168.122.70:22)..
Wed Mar 14 09:55:45 2018 - [debug] ok.
Wed Mar 14 09:55:45 2018 - [debug] Connecting via SSH from root@master(192.168.122.66:22) to root@slave2(192.168.122.80:22)..
Wed Mar 14 09:55:46 2018 - [debug] ok.
Wed Mar 14 09:55:47 2018 - [debug]
Wed Mar 14 09:55:46 2018 - [debug] Connecting via SSH from root@slave2(192.168.122.80:22) to root@master(192.168.122.66:22)..
Wed Mar 14 09:55:47 2018 - [debug] ok.
Wed Mar 14 09:55:47 2018 - [debug] Connecting via SSH from root@slave2(192.168.122.80:22) to root@slave1(192.168.122.70:22)..
Wed Mar 14 09:55:47 2018 - [debug] ok.
Wed Mar 14 09:55:47 2018 - [debug]
Wed Mar 14 09:55:45 2018 - [debug] Connecting via SSH from root@slave1(192.168.122.70:22) to root@master(192.168.122.66:22)..
Wed Mar 14 09:55:46 2018 - [debug] ok.
Wed Mar 14 09:55:46 2018 - [debug] Connecting via SSH from root@slave1(192.168.122.70:22) to root@slave2(192.168.122.80:22)..
Wed Mar 14 09:55:47 2018 - [debug] ok.
Wed Mar 14 09:55:47 2018 - [info] All SSH connection tests passed successfully.

# 复制关系检测
shell> masterha_check_repl --conf=/etc/masterha/app1.conf
Wed Mar 14 09:57:00 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Mar 14 09:57:00 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Wed Mar 14 09:57:00 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
Wed Mar 14 09:57:00 2018 - [info] MHA::MasterMonitor version 0.57.
Wed Mar 14 09:57:00 2018 - [debug] Connecting to servers..
Wed Mar 14 09:57:01 2018 - [debug] Connected to: master(192.168.122.66:3307), user=root
Wed Mar 14 09:57:01 2018 - [debug] Number of slave worker threads on host master(192.168.122.66:3307): 0
Wed Mar 14 09:57:01 2018 - [debug] Connected to: slave1(192.168.122.70:3307), user=root
Wed Mar 14 09:57:01 2018 - [debug] Number of slave worker threads on host slave1(192.168.122.70:3307): 8
Wed Mar 14 09:57:01 2018 - [debug] Connected to: slave2(192.168.122.80:3307), user=root
Wed Mar 14 09:57:01 2018 - [debug] Number of slave worker threads on host slave2(192.168.122.80:3307): 16
Wed Mar 14 09:57:01 2018 - [debug] Comparing MySQL versions..
Wed Mar 14 09:57:01 2018 - [debug] Comparing MySQL versions done.
Wed Mar 14 09:57:01 2018 - [debug] Connecting to servers done.
Wed Mar 14 09:57:01 2018 - [info] GTID failover mode = 1
Wed Mar 14 09:57:01 2018 - [info] Dead Servers:
Wed Mar 14 09:57:01 2018 - [info] Alive Servers:
Wed Mar 14 09:57:01 2018 - [info] master(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [info] slave1(192.168.122.70:3307)
Wed Mar 14 09:57:01 2018 - [info] slave2(192.168.122.80:3307)
Wed Mar 14 09:57:01 2018 - [info] Alive Slaves:
Wed Mar 14 09:57:01 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Wed Mar 14 09:57:01 2018 - [info] GTID ON
Wed Mar 14 09:57:01 2018 - [debug] Relay log info repository: TABLE
Wed Mar 14 09:57:01 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Mar 14 09:57:01 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Wed Mar 14 09:57:01 2018 - [info] GTID ON
Wed Mar 14 09:57:01 2018 - [debug] Relay log info repository: TABLE
Wed Mar 14 09:57:01 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [info] Not candidate for the new Master (no_master is set)
Wed Mar 14 09:57:01 2018 - [info] Current Alive Master: master(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [info] Checking slave configurations..
Wed Mar 14 09:57:01 2018 - [info] read_only=1 is not set on slave slave1(192.168.122.70:3307).
Wed Mar 14 09:57:01 2018 - [info] read_only=1 is not set on slave slave2(192.168.122.80:3307).
Wed Mar 14 09:57:01 2018 - [info] Checking replication filtering settings..
Wed Mar 14 09:57:01 2018 - [info] binlog_do_db= , binlog_ignore_db=
Wed Mar 14 09:57:01 2018 - [info] Replication filtering check ok.
Wed Mar 14 09:57:01 2018 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Wed Mar 14 09:57:01 2018 - [info] Checking SSH publickey authentication settings on the current master..
Wed Mar 14 09:57:01 2018 - [debug] SSH connection test to master, option -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5, timeout 5
Wed Mar 14 09:57:01 2018 - [info] HealthCheck: SSH to master is reachable.
Wed Mar 14 09:57:01 2018 - [info]
master(192.168.122.66:3307) (current master) # -- 有输出拓扑图和slave健康状况的信息
+--slave1(192.168.122.70:3307)
+--slave2(192.168.122.80:3307)

Wed Mar 14 09:57:01 2018 - [info] Checking replication health on slave1..
Wed Mar 14 09:57:01 2018 - [info] ok.
Wed Mar 14 09:57:01 2018 - [info] Checking replication health on slave2..
Wed Mar 14 09:57:01 2018 - [info] ok.
Wed Mar 14 09:57:01 2018 - [info] Checking master_ip_failover_script status:
Wed Mar 14 09:57:01 2018 - [info] /usr/local/mha/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=master --orig_master_ip=192.168.122.66 --orig_master_port=3307

IN SCRIPT TEST====/sbin/ifconfig eth0:88 down==/sbin/ifconfig eth0:88 192.168.122.99/24===

Checking the Status of the script.. OK
Wed Mar 14 09:57:01 2018 - [info] OK.
Wed Mar 14 09:57:01 2018 - [warning] shutdown_script is not defined.
Wed Mar 14 09:57:01 2018 - [debug] Disconnected from master(192.168.122.66:3307)
Wed Mar 14 09:57:01 2018 - [debug] Disconnected from slave1(192.168.122.70:3307)
Wed Mar 14 09:57:01 2018 - [debug] Disconnected from slave2(192.168.122.80:3307)
Wed Mar 14 09:57:01 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

启动MHA

为master添加浮动ip

  添加 VIP,因为这里使用的是 masterha_ip_failover 脚本切换 VIP,由于这个脚本的 start/stop 操作只有在 Failover (切换)期间才会执行,所以即便设置了该脚本的 vip,MHA 也不会在masterha_manager一开始启动的时候,在 Master 上设置 VIP初次在 Master 上设置 VIP 需要人工操作,后期如果有 Failover 操作, MHA 会执行脚本,自动切换到新选取的主机器上。

1
2
3
4
5
##
## Master 端
##
# 在我的测试机中,网卡名称为eth0,两台可以提升成主的设备都是eth0网卡
shell> ifconfig eth0:88 192.168.122.99/24

启动masterha_manager

一些参数

  • --conf :当前 MySQL 集群的配置文件,可以有多个,应用于不同的集群
  • --remove_dead_master_conf :当发生切换操作 Failover 后,需要把之前的 Dead Master 从配置文件中删除。
    • 如果不删除,且没有恢复的话,此时 masterha_manager 重启后,会报错 “there is a dead slave”
  • --ignore_last_failover :如果前一次 Failover 失败了, MHA 不会去再一次去做 Failover 操作,除非人为的删除 (manager_workdir)/(app_name).failover.error ,或者增加此参数, MHA 会继续进行 Failover 操作。

启动 MHA 的管理服务

1
2
3
4
5
6
# 启动MHA manager
shell> nohup masterha_manager --conf=/etc/masterha/app1.conf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &

shell> masterha_check_status --conf=/etc/masterha/app1.conf
app1 (pid:18702) is running(0:PING_OK), master:Master
# 状态正常,且当前主机是 Master

  日志可以查看 /var/log/masterha/app1/manager.log,在 failover 成功后会把旧 master 的配置段删掉,如切换 VIP 后我们配置文件中的 server1(192.168.122.66:3307) 配置段将被删。

测试MHA自动Failover

在做 Failover 测试之前,把 MHA Manager 停掉,把 MySQL 节点也都停掉( 注意停止顺序 ),把数据库进行一次冷备份,方便以后测试。若没测试成功可以移回冷备文件继续测。

  1. 停掉 Slave的 IO 线程 ,模拟复制延时( Slave1保持不变)
  2. 使用 sysbench 对 Master 进行测试,生成测试数据(可以产生大量的 binlog)
  3. 等待步骤2完成 后,开启 Slave 上的 IO 线程,让它去追 Master 的 binlog,同时立即操作第四步
  4. 关闭 Master 上的 MySQL,让 MHA 产生 Failover 操作
  5. 观察最终状态

附发生failover时的MHA Manager日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
Tue Mar 13 17:10:50 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Mar 13 17:10:50 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:10:50 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:12:23 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Mar 13 17:12:23 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:12:23 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
tory. Check for details, and consider setting --workdir separately.
Tue Mar 13 17:10:50 2018 - [debug] Connecting to servers..
Tue Mar 13 17:10:51 2018 - [debug] Connected to: master(192.168.122.66:3307), user=root
Tue Mar 13 17:10:51 2018 - [debug] Number of slave worker threads on host master(192.168.122.66:3307): 0
Tue Mar 13 17:10:51 2018 - [debug] Connected to: slave1(192.168.122.70:3307), user=root
Tue Mar 13 17:10:51 2018 - [debug] Number of slave worker threads on host slave1(192.168.122.70:3307): 8
Tue Mar 13 17:10:51 2018 - [debug] Connected to: slave2(192.168.122.80:3307), user=root
Tue Mar 13 17:10:51 2018 - [debug] Number of slave worker threads on host slave2(192.168.122.80:3307): 16
Tue Mar 13 17:10:51 2018 - [debug] Comparing MySQL versions..
Tue Mar 13 17:10:51 2018 - [debug] Comparing MySQL versions done.
Tue Mar 13 17:10:51 2018 - [debug] Connecting to servers done.
Tue Mar 13 17:10:51 2018 - [info] GTID failover mode = 1
Tue Mar 13 17:10:51 2018 - [info] Dead Servers:
Tue Mar 13 17:10:51 2018 - [info] Alive Servers:
Tue Mar 13 17:10:51 2018 - [info] master(192.168.122.66:3307)
Tue Mar 13 17:10:51 2018 - [info] slave1(192.168.122.70:3307)
Tue Mar 13 17:10:51 2018 - [info] slave2(192.168.122.80:3307)
Tue Mar 13 17:10:51 2018 - [info] Alive Slaves:
Tue Mar 13 17:10:51 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:10:51 2018 - [info] GTID ON
Tue Mar 13 17:10:51 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:10:51 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:10:51 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:10:51 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:10:51 2018 - [info] GTID ON
Tue Mar 13 17:10:51 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:10:51 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:10:51 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:10:51 2018 - [info] Current Alive Master: master(192.168.122.66:3307)
Tue Mar 13 17:10:51 2018 - [info] Checking slave configurations..
Tue Mar 13 17:10:51 2018 - [info] read_only=1 is not set on slave slave1(192.168.122.70:3307).
Tue Mar 13 17:10:51 2018 - [info] read_only=1 is not set on slave slave2(192.168.122.80:3307).
Tue Mar 13 17:10:51 2018 - [info] Checking replication filtering settings..
Tue Mar 13 17:10:51 2018 - [info] binlog_do_db= , binlog_ignore_db=
Tue Mar 13 17:10:51 2018 - [info] Replication filtering check ok.
Tue Mar 13 17:10:51 2018 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Tue Mar 13 17:10:51 2018 - [info] Checking SSH publickey authentication settings on the current master..
Tue Mar 13 17:10:51 2018 - [debug] SSH connection test to master, option -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5, timeout 5
Tue Mar 13 17:10:52 2018 - [info] HealthCheck: SSH to master is reachable.
Tue Mar 13 17:10:52 2018 - [info]
master(192.168.122.66:3307) (current master)
+--slave1(192.168.122.70:3307)
+--slave2(192.168.122.80:3307)

Tue Mar 13 17:10:52 2018 - [info] Checking master_ip_failover_script status:
Tue Mar 13 17:10:52 2018 - [info] /usr/local/mha/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=master --orig_master_ip=192.168.122.66 --orig_master_port=3307

IN SCRIPT TEST====/sbin/ifconfig eth0:88 down==/sbin/ifconfig eth0:88 192.168.122.99/24===
# 还没故障转移的正常启动日志
Checking the Status of the script.. OK
Tue Mar 13 17:10:52 2018 - [info] OK.
Tue Mar 13 17:10:52 2018 - [warning] shutdown_script is not defined.
Tue Mar 13 17:10:52 2018 - [debug] Disconnected from master(192.168.122.66:3307)
Tue Mar 13 17:10:52 2018 - [debug] Disconnected from slave1(192.168.122.70:3307)
Tue Mar 13 17:10:52 2018 - [debug] Disconnected from slave2(192.168.122.80:3307)
Tue Mar 13 17:10:52 2018 - [debug] SSH check command: exit 0
Tue Mar 13 17:10:52 2018 - [info] Set master ping interval 3 seconds.
Tue Mar 13 17:10:52 2018 - [info] Set secondary check script: /usr/local/mha/bin/masterha_secondary_check -s 192.168.122.70 -s 192.168.122.80 --user=root --master_host=192.168.122.66 --master_port=3307
Tue Mar 13 17:10:52 2018 - [info] Starting ping health check on master(192.168.122.66:3307)..
Tue Mar 13 17:10:52 2018 - [debug] Connected on master.
Tue Mar 13 17:10:52 2018 - [debug] Set short wait_timeout on master: 6 seconds
Tue Mar 13 17:10:52 2018 - [debug] Trying to get advisory lock..
Tue Mar 13 17:10:52 2018 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond.. # 开始等待mysql不响应
Tue Mar 13 17:12:13 2018 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away) # 此时发现master实例ping不通了
Tue Mar 13 17:12:13 2018 - [info] Executing secondary network check script: /usr/local/mha/bin/masterha_secondary_check -s 192.168.122.70 -s 192.168.122.80 --user=root --master_host=192.168.122.66 --master_port=3307 --user=root --master_host=master --master_ip=192.168.122.66 --master_port=3307 --master_user=root --master_password=111111 --ping_type=INSERT
Tue Mar 13 17:12:13 2018 - [info] Executing SSH check script: exit 0
Tue Mar 13 17:12:13 2018 - [debug] SSH connection test to master, option -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5, timeout 5
Tue Mar 13 17:12:13 2018 - [info] HealthCheck: SSH to master is reachable. # 用两个slave实例检测ssh是否能到master
Monitoring server 192.168.122.70 is reachable, Master is not reachable from 192.168.122.70. OK.
Monitoring server 192.168.122.80 is reachable, Master is not reachable from 192.168.122.80. OK.
Tue Mar 13 17:12:14 2018 - [info] Master is not reachable from all other monitoring servers. Failover should start. # 两个slave都不能连到master的mysql实例,重试3次
Tue Mar 13 17:12:16 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.122.66' (111))
Tue Mar 13 17:12:16 2018 - [warning] Connection failed 2 time(s)..
Tue Mar 13 17:12:19 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.122.66' (111))
Tue Mar 13 17:12:19 2018 - [warning] Connection failed 3 time(s)..
Tue Mar 13 17:12:22 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.122.66' (111))
Tue Mar 13 17:12:22 2018 - [warning] Connection failed 4 time(s)..
Tue Mar 13 17:12:22 2018 - [warning] Master is not reachable from health checker!
Tue Mar 13 17:12:22 2018 - [warning] Master master(192.168.122.66:3307) is not reachable!
Tue Mar 13 17:12:22 2018 - [warning] SSH is reachable.
Tue Mar 13 17:12:22 2018 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.conf again, and trying to connect to all servers to check server status..
Tue Mar 13 17:12:22 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Mar 13 17:12:22 2018 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:12:22 2018 - [info] Reading server configuration from /etc/masterha/app1.conf..
Tue Mar 13 17:12:22 2018 - [debug] Skipping connecting to dead master master(192.168.122.66:3307).
Tue Mar 13 17:12:22 2018 - [debug] Connecting to servers..
Tue Mar 13 17:12:23 2018 - [debug] Connected to: slave1(192.168.122.70:3307), user=root
Tue Mar 13 17:12:23 2018 - [debug] Number of slave worker threads on host slave1(192.168.122.70:3307): 8
Tue Mar 13 17:12:23 2018 - [debug] Connected to: slave2(192.168.122.80:3307), user=root
Tue Mar 13 17:12:23 2018 - [debug] Number of slave worker threads on host slave2(192.168.122.80:3307): 16
Tue Mar 13 17:12:23 2018 - [debug] Comparing MySQL versions..
Tue Mar 13 17:12:23 2018 - [debug] Comparing MySQL versions done.
Tue Mar 13 17:12:23 2018 - [debug] Connecting to servers done.
Tue Mar 13 17:12:23 2018 - [info] GTID failover mode = 1
Tue Mar 13 17:12:23 2018 - [info] Dead Servers: # the master is already dead
Tue Mar 13 17:12:23 2018 - [info] master(192.168.122.66:3307)
Tue Mar 13 17:12:23 2018 - [info] Alive Servers:
Tue Mar 13 17:12:23 2018 - [info] slave1(192.168.122.70:3307)
Tue Mar 13 17:12:23 2018 - [info] slave2(192.168.122.80:3307)
Tue Mar 13 17:12:23 2018 - [info] Alive Slaves:
Tue Mar 13 17:12:23 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:23 2018 - [info] GTID ON
Tue Mar 13 17:12:23 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:23 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:23 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:12:23 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:23 2018 - [info] GTID ON
Tue Mar 13 17:12:23 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:23 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:23 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:12:23 2018 - [info] Checking slave configurations..
Tue Mar 13 17:12:23 2018 - [info] read_only=1 is not set on slave slave1(192.168.122.70:3307).
Tue Mar 13 17:12:23 2018 - [info] read_only=1 is not set on slave slave2(192.168.122.80:3307).
Tue Mar 13 17:12:23 2018 - [info] Checking replication filtering settings..
Tue Mar 13 17:12:23 2018 - [info] Replication filtering check ok.
Tue Mar 13 17:12:23 2018 - [info] Master is down!
Tue Mar 13 17:12:23 2018 - [info] Terminating monitoring script.
Tue Mar 13 17:12:23 2018 - [debug] Disconnected from slave1(192.168.122.70:3307)
Tue Mar 13 17:12:23 2018 - [debug] Disconnected from slave2(192.168.122.80:3307)
Tue Mar 13 17:12:23 2018 - [info] Got exit code 20 (Master dead).
Tue Mar 13 17:12:23 2018 - [info] MHA::MasterFailover version 0.57.
Tue Mar 13 17:12:23 2018 - [info] Starting master failover.
Tue Mar 13 17:12:23 2018 - [info]
Tue Mar 13 17:12:23 2018 - [info] * Phase 1: Configuration Check Phase..
Tue Mar 13 17:12:23 2018 - [info]
Tue Mar 13 17:12:23 2018 - [debug] Skipping connecting to dead master master.
Tue Mar 13 17:12:23 2018 - [debug] Connecting to servers..
Tue Mar 13 17:12:24 2018 - [debug] Connected to: slave1(192.168.122.70:3307), user=root
Tue Mar 13 17:12:24 2018 - [debug] Number of slave worker threads on host slave1(192.168.122.70:3307): 8
Tue Mar 13 17:12:24 2018 - [debug] Connected to: slave2(192.168.122.80:3307), user=root
Tue Mar 13 17:12:24 2018 - [debug] Number of slave worker threads on host slave2(192.168.122.80:3307): 16
Tue Mar 13 17:12:24 2018 - [debug] Comparing MySQL versions..
Tue Mar 13 17:12:24 2018 - [debug] Comparing MySQL versions done.
Tue Mar 13 17:12:24 2018 - [debug] Connecting to servers done.
Tue Mar 13 17:12:24 2018 - [info] GTID failover mode = 1
Tue Mar 13 17:12:24 2018 - [info] Dead Servers:
Tue Mar 13 17:12:24 2018 - [info] master(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Alive Servers:
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307)
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307)
Tue Mar 13 17:12:24 2018 - [info] Alive Slaves:
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Primary candidate for the new Master (candidate_master is set) # among the surviving slaves, slave1 is configured as promotable to master
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Not candidate for the new Master (no_master is set) # slave2's configuration does not allow promotion to master
Tue Mar 13 17:12:24 2018 - [info] Starting GTID based failover.
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [debug] Stopping IO thread on slave1(192.168.122.70:3307).. # stop slave1's IO thread
Tue Mar 13 17:12:24 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Mar 13 17:12:24 2018 - [info] Executing master IP deactivation script:
Tue Mar 13 17:12:24 2018 - [info] /usr/local/mha/bin/master_ip_failover --orig_master_host=master --orig_master_ip=192.168.122.66 --orig_master_port=3307 --command=stopssh --ssh_user=root
Tue Mar 13 17:12:24 2018 - [debug] Stopping IO thread on slave2(192.168.122.80:3307)..
Tue Mar 13 17:12:24 2018 - [debug] Stop IO thread on slave2(192.168.122.80:3307) done.

IN SCRIPT TEST====/sbin/ifconfig eth0:88 down==/sbin/ifconfig eth0:88 192.168.122.99/24===

Disabling the VIP on old master: master
Tue Mar 13 17:12:24 2018 - [debug] Stop IO thread on slave1(192.168.122.70:3307) done.
Tue Mar 13 17:12:24 2018 - [info] done.
Tue Mar 13 17:12:24 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Mar 13 17:12:24 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 3: Master Recovery Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [debug] Fetching current slave status..
Tue Mar 13 17:12:24 2018 - [debug] Fetching current slave status done.
Tue Mar 13 17:12:24 2018 - [info] The latest binary log file/position on all slaves is bin.000004:20816
Tue Mar 13 17:12:24 2018 - [info] Retrieved Gtid Set: 0c154ad5-2699-11e8-94a1-525400eac085:309-394
Tue Mar 13 17:12:24 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:12:24 2018 - [info] The oldest binary log file/position on all slaves is bin.000004:20816
Tue Mar 13 17:12:24 2018 - [info] Retrieved Gtid Set: 0c154ad5-2699-11e8-94a1-525400eac085:309-394
Tue Mar 13 17:12:24 2018 - [info] Oldest slaves:
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 3.3: Determining New Master Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] Searching new master from slaves..
Tue Mar 13 17:12:24 2018 - [info] Candidate masters from the configuration file:
Tue Mar 13 17:12:24 2018 - [info] slave1(192.168.122.70:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Mar 13 17:12:24 2018 - [info] Non-candidate masters:
Tue Mar 13 17:12:24 2018 - [info] slave2(192.168.122.80:3307) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Tue Mar 13 17:12:24 2018 - [info] GTID ON
Tue Mar 13 17:12:24 2018 - [debug] Relay log info repository: TABLE
Tue Mar 13 17:12:24 2018 - [info] Replicating from 192.168.122.66(192.168.122.66:3307)
Tue Mar 13 17:12:24 2018 - [info] Not candidate for the new Master (no_master is set)
Tue Mar 13 17:12:24 2018 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Tue Mar 13 17:12:24 2018 - [info] New master is slave1(192.168.122.70:3307) # slave1 is chosen for promotion to new master
Tue Mar 13 17:12:24 2018 - [info] Starting master failover.. # failover begins
Tue Mar 13 17:12:24 2018 - [info]
From: # original topology
master(192.168.122.66:3307) (current master)
+--slave1(192.168.122.70:3307)
+--slave2(192.168.122.80:3307)

To: # topology after failover
slave1(192.168.122.70:3307) (new master)
+--slave2(192.168.122.80:3307)
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] * Phase 3.3: New Master Recovery Phase..
Tue Mar 13 17:12:24 2018 - [info]
Tue Mar 13 17:12:24 2018 - [info] Waiting all logs to be applied..
Tue Mar 13 17:12:24 2018 - [info] done.
Tue Mar 13 17:12:24 2018 - [debug] Stopping slave IO/SQL thread on slave1(192.168.122.70:3307)..
Tue Mar 13 17:12:24 2018 - [debug] done.
Tue Mar 13 17:12:24 2018 - [info] Getting new master's binlog name and position..
Tue Mar 13 17:12:24 2018 - [info] bin.000003:94211
Tue Mar 13 17:12:24 2018 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='slave1 or 192.168.122.70', MASTER_PORT=3307, MASTER_AUTO_POSITION=1, MASTER_USER='rpl', MASTER_PASSWORD='xxx';
Tue Mar 13 17:12:24 2018 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: bin.000003, 94211, 0c154ad5-2699-11e8-94a1-525400eac085:1-394,
c08d09b5-2698-11e8-9ec0-5254004dae68:1-2
Tue Mar 13 17:12:24 2018 - [info] Executing master IP activate script:
Tue Mar 13 17:12:24 2018 - [info] /usr/local/mha/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=master --orig_master_ip=192.168.122.66 --orig_master_port=3307 --new_master_host=slave1 --new_master_ip=192.168.122.70 --new_master_port=3307 --new_master_user='root' --new_master_password=xxx
Unknown option: new_master_user
Unknown option: new_master_password

IN SCRIPT TEST====/sbin/ifconfig eth0:88 down==/sbin/ifconfig eth0:88 192.168.122.99/24===

Enabling the VIP - 192.168.122.99/24 on the new master - slave1
Tue Mar 13 17:12:25 2018 - [info] OK.
Tue Mar 13 17:12:25 2018 - [info] ** Finished master recovery successfully.
Tue Mar 13 17:12:25 2018 - [info] * Phase 3: Master Recovery Phase completed.
Tue Mar 13 17:12:25 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info] * Phase 4: Slaves Recovery Phase..
Tue Mar 13 17:12:25 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info] * Phase 4.1: Starting Slaves in parallel..
Tue Mar 13 17:12:25 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info] -- Slave recovery on host slave2(192.168.122.80:3307) started, pid: 10212. Check tmp log /var/log/masterha/app1/slave2_3307_20180313171223.log if it takes time..
Tue Mar 13 17:12:27 2018 - [info]
Tue Mar 13 17:12:27 2018 - [info] Log messages from slave2 ...
Tue Mar 13 17:12:27 2018 - [info]
Tue Mar 13 17:12:25 2018 - [info] Resetting slave slave2(192.168.122.80:3307) and starting replication from the new master slave1(192.168.122.70:3307)..
Tue Mar 13 17:12:25 2018 - [debug] Stopping slave IO/SQL thread on slave2(192.168.122.80:3307)..
Tue Mar 13 17:12:25 2018 - [debug] done.
Tue Mar 13 17:12:25 2018 - [info] Executed CHANGE MASTER.
Tue Mar 13 17:12:25 2018 - [debug] Starting slave IO/SQL thread on slave2(192.168.122.80:3307)..
Tue Mar 13 17:12:26 2018 - [debug] done.
Tue Mar 13 17:12:26 2018 - [info] Slave started.
Tue Mar 13 17:12:26 2018 - [info] gtid_wait(0c154ad5-2699-11e8-94a1-525400eac085:1-394,
c08d09b5-2698-11e8-9ec0-5254004dae68:1-2) completed on slave2(192.168.122.80:3307). Executed 3 events.
Tue Mar 13 17:12:27 2018 - [info] End of log messages from slave2.
Tue Mar 13 17:12:27 2018 - [info] -- Slave on host slave2(192.168.122.80:3307) started.
Tue Mar 13 17:12:27 2018 - [info] All new slave servers recovered successfully.
Tue Mar 13 17:12:27 2018 - [info]
Tue Mar 13 17:12:27 2018 - [info] * Phase 5: New master cleanup phase..
Tue Mar 13 17:12:27 2018 - [info]
Tue Mar 13 17:12:27 2018 - [info] Resetting slave info on the new master..
Tue Mar 13 17:12:27 2018 - [debug] Clearing slave info..
Tue Mar 13 17:12:27 2018 - [debug] Stopping slave IO/SQL thread on slave1(192.168.122.70:3307)..
Tue Mar 13 17:12:27 2018 - [debug] done.
Tue Mar 13 17:12:27 2018 - [debug] SHOW SLAVE STATUS shows new master does not replicate from anywhere. OK.
Tue Mar 13 17:12:27 2018 - [info] slave1: Resetting slave info succeeded.
Tue Mar 13 17:12:27 2018 - [info] Master failover to slave1(192.168.122.70:3307) completed successfully.
Tue Mar 13 17:12:27 2018 - [info] Deleted server1 entry from /etc/masterha/app1.conf .
Tue Mar 13 17:12:27 2018 - [debug] Disconnected from slave1(192.168.122.70:3307)
Tue Mar 13 17:12:27 2018 - [debug] Disconnected from slave2(192.168.122.80:3307)
Tue Mar 13 17:12:27 2018 - [info]

----- Failover Report -----
# Below is the failover report: the cluster's master role switched from 66:3307 to 70:3307
app1: MySQL Master failover master(192.168.122.66:3307) to slave1(192.168.122.70:3307) succeeded

Master master(192.168.122.66:3307) is down!

Check MHA Manager logs at centos-66:/var/log/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on master(192.168.122.66:3307)
Selected slave1(192.168.122.70:3307) as a new master.
slave1(192.168.122.70:3307): OK: Applying all logs succeeded.
slave1(192.168.122.70:3307): OK: Activated master IP address.
slave2(192.168.122.80:3307): OK: Slave started, replicating from slave1(192.168.122.70:3307)
slave1(192.168.122.70:3307): Resetting slave info succeeded.
Master failover to slave1(192.168.122.70:3307) completed successfully. # failover succeeded
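
The `IN SCRIPT TEST====...` lines above come from a customized `master_ip_failover` script: during "Dead Master Shutdown" MHA calls it with `--command=stopssh` to take the VIP down on the old master, and during "Master IP activate" with `--command=start` to bring it up on the new master. The real sample script shipped with MHA is Perl; below is a minimal shell sketch of just the VIP-switching logic, with the interface name and VIP taken from the log. The `vip_action`/`run`/`DRY_RUN` names are hypothetical, for illustration only:

```shell
#!/bin/bash
# Hypothetical sketch of the VIP portion of a master_ip_failover script.
# VIP and interface mirror the log above; DRY_RUN=1 prints commands instead
# of executing them, so the logic can be tested on any machine.
VIP="192.168.122.99/24"
IFACE="eth0:88"
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"        # print the command instead of running it
    else
        "$@"
    fi
}

vip_action() {
    case "$1" in
        start)           # on the new master: bring the VIP up
            run /sbin/ifconfig "$IFACE" "$VIP"
            ;;
        stop|stopssh)    # on the old/dead master: take the VIP down
            run /sbin/ifconfig "$IFACE" down
            ;;
        status)
            run /sbin/ifconfig "$IFACE"
            ;;
        *)
            echo "Usage: $0 {start|stop|stopssh|status}" >&2
            return 1
            ;;
    esac
}

# MHA first deactivates the VIP on the dead master, then activates it
# on the newly promoted one -- the same order seen in the log.
vip_action stopssh   # -> /sbin/ifconfig eth0:88 down
vip_action start     # -> /sbin/ifconfig eth0:88 192.168.122.99/24
```

A production script would additionally send a gratuitous ARP (e.g. `arping`) after raising the VIP so that clients and switches learn the new MAC address quickly.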