标签归档:备份

MySql 慢日志分析

最近老是碰上MySql报错:1203:User already has more than ‘max_user_connections’ active,之前都没出现过,感觉应该是慢查询导致的。向运维拷贝慢日志分析,慢日志开、启配置参考这里
拷贝出来的日志很大,需要按故障时间点做一下切割,以便缩小排查范围。按照这里提供的cutlogbytime.pl脚本运行却报错

[vagrant@centos64 mysql-log-filter-1.9]$ ./cutlogbytime.pl slow.log 1443103200 1443117600 > yestoday.log
: command not foundline 1:
: command not foundline 4:
./cutlogbytime.pl: line 5: use: command not found
: command not foundline 5:
./cutlogbytime.pl: line 6: use: command not found
: command not foundline 6:
: command not foundline 7:
'/cutlogbytime.pl: line 8: syntax error near unexpected token `{
'/cutlogbytime.pl: line 8: `if (@ARGV<2){

去掉顶行的空格后再运行,还是报错

[vagrant@centos64 mysql-log-filter-1.9]$ ./cutlogbytime.pl slow.log 1443103200 1443117600 > today.log
-bash: ./cutlogbytime.pl: /usr/bin/perl^M: bad interpreter: No such file or directory

最后参考stackoverflow上面的答案更改运行方式为Perl(而不是shell),就可以了。

[vagrant@centos64 mysql-log-filter-1.9]$ perl cutlogbytime.pl slow.log 1443103200 1443117600 > today.log

利用mysqlslowdump(perl脚本)来分析日志,-s参数表示排序方式:r表示影响行数(Rows),t表示耗时(Time),c表示查询次数(Count)

[vagrant@entos64 mysql-log-filter-1.9]$  perl mysqldumpslow.pl -s r -t 10 today4.log

Reading mysql slow query log from today4.log
Count: 1  Time=190.48s (190s)  Lock=0.00s (0s)  Rows=21829854.0 (21829854), xx[xxxx]@[192.168.10.139]
  SELECT /*!N SQL_NO_CACHE */ * FROM `errormessage`

Count: 32791  Time=40.95s (1342865s)  Lock=0.05s (1512s)  Rows=1.0 (32791), xx[xxxx]@10hosts
  select  *  from connectinfo where  ID=N  and AppType=N  ORDER BY CreateDatetime DESC LIMIT N

Count: 3  Time=3.71s (11s)  Lock=0.02s (0s)  Rows=300.0 (900), xx[xxxx]@2hosts
  select SeverName from errormessage where  ID='S'  and ServerType=N  and level=N  and MsgType <= N

第一个语句返回行数21829854,查看具体慢日志,之后需要插入这张表的进程均处于等待状态。

# Time: 150924  1:03:12
# User@Host: xx[xxxx] @  [192.168.10.139]  Id: 1493761
# Query_time: 190.479062  Lock_time: 0.000000 Rows_sent: 21829854  Rows_examined: 21829854
SET timestamp=1443027792;
SELECT /*!40001 SQL_NO_CACHE */ * FROM `errormessage`;
# Time: 150924  1:03:14
# User@Host: xx[xxxx] @  [192.168.10.168]  Id: 1498010
# Query_time: 59.669817  Lock_time: 57.159403 Rows_sent: 0  Rows_examined: 0
SET timestamp=1443027794;
insert into errormessage (`ID`,`ServerType`,`MsgType`,`Level`,`dev`,`content`,`EventTime`,`SeverName`) values ( '1217', '3', '4', '4', '827', 'erc:start erc error,songid=46243,keymd5=ee1275b26762e85a7f00e9890bdc092e,ercmd5=abbc3ea9102dbd003b7aa0547dcbf6fa', '2015-09-23 21:49:27', '192.168.15.117');
# User@Host: xx[xxxx] @  [192.168.10.205]  Id: 1494756
# Query_time: 157.211158  Lock_time: 154.673647 Rows_sent: 0  Rows_examined: 0
SET timestamp=1443027794;
insert into errormessage (`ID`,`ServerType`,`MsgType`,`Level`,`dev`,`content`,`EventTime`,`SeverName`) values ( '865', '3', '1', '2', '106', '检测正常!', '2015-09-24 01:01:18', '192.168.0.33');
# User@Host: xx[xxxx] @  [192.168.10.213]  Id: 1496479
# Query_time: 100.733230  Lock_time: 98.210902 Rows_sent: 0  Rows_examined: 0
SET timestamp=1443027794;
insert into errormessage (`ID`,`ServerType`,`MsgType`,`Level`,`dev`,`content`,`EventTime`,`SeverName`) values ( '2472', '3', '2', '4', '809', 'videoseripnoconfig', '2015-09-24 01:02:26', '192.168.0.18');

分析这几天的日志,发现故障时间点附近都是这个语句引起后面的SQL堵塞。原来是每天早上1点开始备份并同步全表数据,锁住了这个表导致后面的所有这个表的insert操作处于等待状态。mysqldump应该使用–single-transaction来避免锁表,类似下面这个

mysqldump –uuser -p --skip-opt -q -R  --single-transaction --default-character-set=utf8 --master-data=2  --create-option --no-autocommit –S ${sock} -B ${DBName}  > backup.sql

但是这样仍然是全表扫描进行备份,如果能够增量备份的话,影响要小很多,或者数据做冷热/新旧之分,定期将新(每天)/热数据转入旧(历史)/冷数据中。后面运维的解决方案是:升级数据库机器从虚拟机变为实体机,配置从机,并从从机进行备份同步。

上面mysqlslowdump使用影响行数来排序,事实上用另外两个类型(时间,次数)分析结果是connectinfo比较频繁,一直以来都认为是这个表的操作引起的。这里还尝试了其他工具来分析,使用mysqlsla.pl进行分析,相关参数选项参考这里

[vagrant@centos64 mysql-log-filter-1.9]$  perl mysqlsla.pl today.log
Auto-detected logs as slow logs
Report for slow logs: today4.log
60.57k queries total, 17 unique
Sorted by 't_sum'
Grand Totals: Time 5.38M s, Lock 3.22M s, Rows sent 21.86M, Rows Examined 184.46M


______________________________________________________________________ 001 ___
Count         : 25.59k  (42.24%)
Time          : 3905525.574451 s total, 152.643069 s avg, 113.07488 s to 2720.338946 s max  (72.64%)
  95% of Time : 3260112.482495 s total, 134.12789 s avg, 113.07488 s to 282.366041 s max
Lock Time (s) : 3168076.975558 s total, 123.820721 s avg, 108.548105 s to 311.639359 s max  (98.45%)
  95% of Lock : 2961933.212121 s total, 121.860167 s avg, 108.548105 s to 123.487106 s max
Rows sent     : 0 avg, 0 to 0 max  (0.00%)
Rows examined : 54 avg, 0 to 4.92k max  (0.75%)
Database      :
Users         :
        xx@ 192.168.10.147 : 10.65% (2724) of query, 10.26% (6215) of all users
        xx@ 192.168.10.209 : 10.33% (2643) of query, 10.16% (6156) of all users
        xx@ 192.168.10.205 : 10.16% (2599) of query, 9.97% (6036) of all users
        xx@ 192.168.10.211 : 10.13% (2591) of query, 9.98% (6042) of all users
        xx@ 192.168.10.207 : 9.93% (2541) of query, 9.95% (6024) of all users
        xx@ 192.168.10.161 : 9.83% (2515) of query, 9.84% (5960) of all users
        xx@ 192.168.10.149 : 9.81% (2510) of query, 9.95% (6028) of all users
        xx@ 192.168.10.215 : 9.76% (2498) of query, 9.85% (5963) of all users
        xx@ 192.168.10.168 : 9.71% (2485) of query, 9.69% (5868) of all users
        xx@ 192.168.10.213 : 9.69% (2480) of query, 9.66% (5851) of all users

Query abstract:
SET timestamp=N; UPDATE connectinfo SET devicetag='S', connectipaddress='S', updatedatetime=now() WHERE ID=N AND apptype=N;

Query sample:
SET timestamp=1443027797;
update connectinfo set DeviceTag='1070A416AF000000', ConnectIPAddress='60.174.116.165', UpdateDatetime=now() where ID=5358 and AppType=0;

______________________________________________________________________ 002 ___
Count         : 32.79k  (54.14%)
Time          : 1344378.871914 s total, 40.99841 s avg, 2.000747 s to 1944.548192 s max  (25.01%)
  95% of Time : 587407.556704 s total, 18.85678 s avg, 2.000747 s to 233.465042 s max
Lock Time (s) : 1512.917798 s total, 46.138 ms avg, 76 ▒s to 114.302 ms max  (0.05%)
  95% of Lock : 1414.978902 s total, 45.423 ms avg, 76 ▒s to 50.514 ms max
Rows sent     : 1 avg, 1 to 1 max  (0.15%)
Rows examined : 4.92k avg, 4.92k to 4.92k max  (87.41%)
Database      :
Users         :
        xx@ 192.168.10.209 : 10.24% (3359) of query, 10.16% (6156) of all users
        xx@ 192.168.10.149 : 10.16% (3331) of query, 9.95% (6028) of all users
        xx@ 192.168.10.147 : 10.11% (3315) of query, 10.26% (6215) of all users
        xx@ 192.168.10.211 : 10.03% (3288) of query, 9.98% (6042) of all users
        xx@ 192.168.10.207 : 10.02% (3285) of query, 9.95% (6024) of all users
        xx@ 192.168.10.161 : 9.97% (3268) of query, 9.84% (5960) of all users
        xx@ 192.168.10.215 : 9.96% (3266) of query, 9.85% (5963) of all users
        xx@ 192.168.10.205 : 9.92% (3254) of query, 9.97% (6036) of all users
        xx@ 192.168.10.168 : 9.86% (3234) of query, 9.69% (5868) of all users
        xx@ 192.168.10.213 : 9.73% (3191) of query, 9.66% (5851) of all users

Query abstract:
SET timestamp=N; SELECT * FROM connectinfo WHERE ID=N AND apptype=N ORDER BY createdatetime DESC LIMIT N;

Query sample:
SET timestamp=1443027795;
select  *  from connectinfo where  ID=7646  and AppType=0  ORDER BY CreateDatetime DESC LIMIT 1;

______________________________________________________________________ 003 ___
Count         : 842  (1.39%)
Time          : 66663.314786 s total, 79.172583 s avg, 2.011408 s to 673.604537 s max  (1.24%)
  95% of Time : 56684.989954 s total, 70.944919 s avg, 2.011408 s to 193.623235 s max
Lock Time (s) : 48221.988255 s total, 57.27077 s avg, 69 ▒s to 185.402303 s max  (1.50%)
  95% of Lock : 40627.196184 s total, 50.847555 s avg, 69 ▒s to 166.67704 s max
Rows sent     : 0 avg, 0 to 0 max  (0.00%)
Rows examined : 0 avg, 0 to 0 max  (0.00%)
Database      :
Users         :
        xx@ 192.168.10.207 : 11.64% (98) of query, 9.95% (6024) of all users
        xx@ 192.168.10.205 : 11.28% (95) of query, 9.97% (6036) of all users
        xx@ 192.168.10.213 : 10.93% (92) of query, 9.66% (5851) of all users
        xx@ 192.168.10.161 : 10.45% (88) of query, 9.84% (5960) of all users
        xx@ 192.168.10.149 : 10.33% (87) of query, 9.95% (6028) of all users
        xx@ 192.168.10.211 : 9.74% (82) of query, 9.98% (6042) of all users
        xx@ 192.168.10.147 : 9.38% (79) of query, 10.26% (6215) of all users
        xx@ 192.168.10.215 : 9.38% (79) of query, 9.85% (5963) of all users
        xx@ 192.168.10.168 : 9.03% (76) of query, 9.69% (5868) of all users
        xx@ 192.168.10.209 : 7.84% (66) of query, 10.16% (6156) of all users

Query abstract:
SET timestamp=N; INSERT INTO errormessage (id,servertype,msgtype,level,dev,content,eventtime,severname) VALUES ( 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S')1;

使用mysqlsla可以看SQL语句的执行数量/比例,影响行数,用户,占比等。,从这里看很可能认为是connectinfo表(95%以上)引起,SHOW PROCESSLIST也是如此 。
由于这个文件我是单独下载回来的,运行mysqlsla.pl时候碰到很多错误,逐个安装解决了

[vagrant@centos64 mysql-log-filter-1.9]$  perl mysqlsla.pl today.log
Can't locate Time/HiRes.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at mysqlsla.pl line 2092.
BEGIN failed--compilation aborted at mysqlsla.pl line 2092.
#解决方法
[vagrant@centos64 mysql-log-filter-1.9]$ sudo yum install perl-Time-HiRes

[vagrant@centos64 mysql-log-filter-1.9]$ perl -MCPAN -e 'install DBI'
Can't locate CPAN.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .).
BEGIN failed--compilation aborted.

#解决方法
[vagrant@centos64 mysql-log-filter-1.9]$ sudo yum install perl-DBI

正确的方法是应该检测对应根目录下面的Makefile.PL,比如

[vagrant@centos64 percona-toolkit-2.2.15]$ perl Makefile.PL
#如果报以下错误,需要先安装对应模块,简单点就是
#sudo yum install perl-devel
Can't locate ExtUtils/MakeMaker.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at Makefile.PL line 1.
BEGIN failed--compilation aborted at Makefile.PL line 1.

然后安装对应模块,参考这里
使用Percona公司的工具pt-query-digest来分析也得到了同mysqlsla类似的结果

[vagrant@vagrant-centos64 bin]$ pt-query-digest ../../today4.log

# 9.8s user time, 700ms system time, 21.05M rss, 73.66M vsz
# Current date: Mon Oct  5 05:52:01 2015
# Hostname: vagrant-centos64.vagrantup.com
# Files: ../../today.log
# Overall: 60.57k total, 17 unique, 4.54 QPS, 402.68x concurrency ________
# Time range: 2015-09-23 22:17:29 to 2015-09-24 02:00:00
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time        5376198s      2s   2720s     89s    258s    118s     57s
# Lock time        3217840s       0    312s     53s    118s     60s    48ms
# Rows sent         20.85M       0  20.82M  361.00    0.99  84.46k    0.99
# Rows examine     175.91M       0  20.82M   2.97k   4.71k  84.48k   4.71k
# Query size         7.85M      64     597  135.90  151.03   27.56  112.70

# Profile
# Rank Query ID           Response time      Calls R/Call   V/M   Item
# ==== ================== ================== ===== ======== ===== ========
#    1 0xF1132168DB0BFC57 3905525.5745 72.6% 25586 152.6431 61.61 UPDATE connectinfo
#    2 0xD4B317E755A0ABD7 1344378.8719 25.0% 32791  40.9984 30... SELECT connectinfo
#    3 0xE23849EE6FB19DAE   66663.3148  1.2%   842  79.1726 62.99 INSERT errormessage
...

# Query 1: 7.52 QPS, 1.15kx concurrency, ID 0xF1132168DB0BFC57 at byte 16243195
# This item is included in the report because it matches --limit.
# Scores: V/M = 61.61
# Time range: 2015-09-24 01:03:17 to 02:00:00
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         42   25586
# Exec time     72 3905526s    113s   2720s    153s    271s     97s    124s
# Lock time     98 3168077s    109s    312s    124s    118s     14s    118s
# Rows sent      0       0       0       0       0       0       0       0
# Rows examine   0   1.33M       0   4.80k   54.39       0  504.65       0
# Query size    48   3.78M     149     157  154.94  151.03    0.52  151.03
# String:
# Hosts        192.168.10.147 (2724/10%)... 9 more
# Users        gate
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s
#  10s+  ################################################################
# Tables
#    SHOW TABLE STATUS LIKE 'connectinfo'\G
#    SHOW CREATE TABLE `connectinfo`\G
update connectinfo set DeviceTag='10705BDDCD000000', ConnectIPAddress='115.231.63.78', UpdateDatetime=now() where ID=6912 and AppType=0\G
# Converted for EXPLAIN
# EXPLAIN /*!50100 PARTITIONS*/
select  DeviceTag='10705BDDCD000000', ConnectIPAddress='115.231.63.78', UpdateDatetime=now() from connectinfo where  ID=6912 and AppType=0\G

PS:这个问题解决后没几天,数据库又出问题,以为又是慢SQL什么的,结果是交换机网口问题。。。数据传输太慢,导致主从不同步。。。

参考链接:
mysql 慢日志分析
慢日志按时间截取
MySQL 之 slow log
MySQL优化—工欲善其事,必先利其器(2)
日志常用统计技巧
性能优化之MySQL优化(一)
mysqlsla的安装与使用
安装DBI组件。 Can’t locate DBI.pm in @INC-mysql接口
Issue 12: Can’t locate Time/HiRes.pm
打开MySQL的慢查询记录
“Can’t locate ExtUtils/MakeMaker.pm” while compile git
analysing slow MySQL queries with pt-query-digest
mysqldump备份原理
mysqldump –single-transaction, yet update queries are waiting for the backup
mysql 利用binlog增量备份,还原实例
MySQl备份恢复策略(完全+增量备份策略)