
Using a MongoDB Replica Set from PHP

MongoDB is a document-oriented NoSQL database that stores JSON-like documents directly and handles very large data volumes. An internal single sign-on (SSO) system uses MongoDB to store user sessions so that login state can be shared across applications and servers, fixing the problem of users being logged out when switching servers. The SSO front end is written in PHP: it authenticates against LDAP and provides the login UI and APIs. MongoDB runs in Replica Set mode for high availability.
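
To make the session sharing concrete, here is a minimal sketch of how a session document might be read and written with the legacy mongo extension; the sso database, sessions collection, and field names are illustrative assumptions, not the actual SSO code.

<?php
// Sketch: shared SSO session storage in MongoDB (legacy mongo extension).
// Database, collection, and field names are assumptions for illustration.
$m = new \MongoClient("mongodb://admin:XXXXXXXXX@192.168.33.10:27017/?replicaSet=SSOReplSet");
$sessions = $m->selectDB("sso")->selectCollection("sessions");

// Upsert the session document keyed by the PHP session id.
$sessions->update(
    array("_id" => session_id()),
    array('$set' => array(
        "user"     => "jdoe",
        "expireAt" => new \MongoDate(time() + 3600),
    )),
    array("upsert" => true)
);

// Any application server can now look the same session up.
$doc = $sessions->findOne(array("_id" => session_id()));
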
To install MongoDB on CentOS 6.5, first add the yum repository:

$ sudo vim /etc/yum.repos.d/mongodb.repo
[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=0

Install, start, and connect to MongoDB:

$ sudo yum --enablerepo=mongodb install mongodb-org
$ sudo /sbin/chkconfig --levels 235 mongod on
$ sudo mongod --port 27017 --dbpath /data/db1
$ mongo --port 27017

Create the SSOReplUser user:

> use admin 
> db.createUser( { user: "SSOReplUser", pwd: "<password>", roles: [ { role: "root", db: "admin" } ] });

Generate the key file the replica set members will share:

$ sudo openssl rand -base64 741 > /home/sso/mongodb-keyfile 
$ sudo chmod 600 /home/sso/mongodb-keyfile

Edit the MongoDB configuration file:

$ sudo vim /etc/mongod.conf
# Replication Options 
# in replicated mongo databases, specify the replica set name here 
replSet=SSOReplSet 
# maximum size in megabytes for replication operation log 
#oplogSize=1024 
# path to a key file storing authentication info for connections
# between replica set members
keyFile=/home/sso/mongodb-keyfile

Restart the mongod service:

$ sudo /etc/init.d/mongod stop   
$ sudo /usr/bin/mongod -f /etc/mongod.conf

Log in as SSOReplUser:

> use admin 
> db.auth("SSOReplUser", "<password>");

Initialize the replica set:

> rs.initiate()
> rs.conf()   
{
  "_id": "SSOReplSet",
  "version": 1,
  "members": [
    {
      "_id": 1,
      "host": "192.168.33.10:27017"
    }
  ]
}

Install MongoDB on the other machines following the steps above, update their configuration, copy the mongodb-keyfile over, and start mongod.
Then go back to 192.168.33.10 and add the members:

SSOReplSet:PRIMARY> rs.add("192.168.33.11:27017") 
SSOReplSet:PRIMARY> rs.add("192.168.33.12:27017") 
SSOReplSet:PRIMARY> rs.add("192.168.33.13:27017")
#SSOReplSet:PRIMARY> rs.remove("192.168.33.13:27017")

Authenticate on a secondary and check the status:

SSOReplSet:SECONDARY> use admin 
SSOReplSet:SECONDARY> db.auth("SSOReplUser", "<password>");   
SSOReplSet:SECONDARY> rs.status()

A MongoDB Replica Set needs at least three machines so that when the PRIMARY fails a new PRIMARY can be elected, which requires n/2+1 of the machines to vote in agreement. With the four machines configured above, 4/2+1 = 3 votes are needed, so the set survives one machine failing; a five-machine set survives two failures. It therefore makes sense to add one more machine as an Arbiter, which stores no data.
Install it following the steps above and copy the mongodb-keyfile over, but change mongod.conf:

$ sudo vim /etc/mongod.conf
nojournal=true

Then start mongod and, on the PRIMARY, add the new machine to the set as an arbiter:

SSOReplSet:PRIMARY> use admin   
SSOReplSet:PRIMARY> db.auth("SSOReplUser", "<password>");   
SSOReplSet:PRIMARY> rs.addArb("<ip of arbiter>");

Installing the PHP mongo extension is simple: download the matching version and enable it. In use, though, logins felt a bit slow, so I turned on the extension's logging (MongoLog) and checked the PHP log:

<?php
\MongoLog::setLevel(\MongoLog::ALL);
\MongoLog::setModule(\MongoLog::ALL);
try{
    echo  microtime(true) . PHP_EOL;
    echo "aaa01(primary)" . PHP_EOL;
    $m = new \MongoClient("mongodb://admin:XXXXXXXXX@192.168.33.10:27017/?replicaSet=SSOReplSet");
    echo  microtime(true) . PHP_EOL;
    echo "aaa01(primary), bbb01(secondary), ccc01(secondary), ddd01(secondary)" . PHP_EOL;
    $m = new \MongoClient("mongodb://admin:XXXXXXXXX@192.168.33.10:27017,192.168.33.11:27017,192.168.33.12:27017,192.168.33.13:27017/?replicaSet=SSOReplSet");
    echo  microtime(true) . PHP_EOL;
} catch (\MongoConnectionException $e) {
    var_dump($e);
}

The log showed that the PHP driver connects to the MongoDB cluster as follows:

  • 1) PHP opens a connection to every server listed in the connection string and queries each of them
  • 2) From each server it fetches the full member list of the replica set, then pings and connects to every member, creating connections that are missing or mismatched and destroying invalid ones
  • 3) It merges the member lists returned by all the servers
  • 4) It picks the server closest to itself and uses the connection to that server
By default a MongoDB write returns success once it completes on the Primary, and reads also go to the Primary by default, so the driver has to discover which server that is. Our SSO application is deployed across several data centers, where network jitter hurts and cross-datacenter discovery queries are slow, so connecting to and querying the whole list on every request is expensive. Worse, if PHP is configured with IPs while the Replica Set members are registered by hostname, the connection names never match, so new connections keep being created and old ones destroyed, which also costs time; with hostnames, DNS resolution is needed instead. Since every PHP server already had HA health checks, each application ended up configured with a single MongoDB server, and the replica set members were configured uniformly by IP. The fastest connection is the one to the Primary, so the HA check can key off whether the local machine is the Primary:

    #!/usr/bin/env bash
    # HA check: keep a flag file only while the local mongod is PRIMARY,
    # and prune failed members while we hold that role.
    count=`ps -fe | grep "mongod" | grep -v "grep" | wc -l`
    FILE="/home/scripts-bits/master.conf"
    SERVER=`hostname`

    if [ $count -lt 1 ]; then
        # mongod is not running: drop the master flag.
        rm -f $FILE
    else
        PRIMARY=`/usr/bin/mongo ${SERVER}:27017 --quiet --eval 'printjson(db.isMaster().ismaster);'`
        if [ "$PRIMARY" == "true" ]; then
            if [ ! -f "$FILE" ]; then
                touch "$FILE"
            fi
            # Remove failed members (see mongo_status.js below).
            REMOVE=`/usr/bin/mongo ${SERVER}:27017/admin --quiet /home/scripts-bits/mongo_status.js`
        fi
    fi
    

    The mongo_status.js script removes members in state 8 ("not reachable/healthy"):

    // Authenticate, then drop any member whose state is 8 (not reachable/healthy).
    db.auth('admin','<password>');
    conf=rs.status();
    members=conf["members"];
    for(i in members){
        if(members[i]["state"] == 8){
            rs.remove(members[i]["name"]);
        }
    }
    

    Network or machine failures did occasionally leave PHP waiting until the connection timed out; removing the failed machine from the set fixes that. When the server itself is up (reachable) but mongod has failed, however, there is no such timeout; lowering connectTimeoutMS shortens the wait.
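
    A shorter connect timeout can be set directly in the connection string. A minimal sketch (the 1000 ms value is an assumption; tune it to your network):

    <?php
    // Fail fast when a listed server cannot be reached, instead of waiting
    // out the default connect timeout. connectTimeoutMS is a standard URI
    // option of the legacy mongo extension.
    $m = new \MongoClient(
        "mongodb://admin:XXXXXXXXX@192.168.33.10:27017/?replicaSet=SSOReplSet&connectTimeoutMS=1000"
    );
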
    MongoDB writes its logs under /var/log/mongodb/ by default and they keep growing, so a Python script clears them on a schedule:

    #!/bin/env python
    import commands
    import datetime,time
    
    def rotate_log(path, expire = 30):
        # Delete files under `path` whose date is older than `expire` days.
        str_now = time.strftime("%Y-%m-%d")
        dat_now = time.strptime(str_now, "%Y-%m-%d")
        array_dat_now = datetime.datetime(dat_now[0], dat_now[1], dat_now[2])
        # List "date filename" pairs for everything in the log directory.
        lns = commands.getoutput("/bin/ls --full-time %s|awk '{print $6, $9}'" % path)
        for ln in lns.split('\n'):
            ws = ln.split()
            if len(ws) != 2:
                continue
            ws1 = time.strptime(ws[0], "%Y-%m-%d")
            ws2 = datetime.datetime(ws1[0], ws1[1], ws1[2])
            if (array_dat_now - ws2).days > expire:
                v_del = commands.getoutput("/bin/rm -rf %s/%s" % (path, ws[1]))
    
    
    def rotate_mongo():
        # get mongo pid
        mongo_pid = commands.getoutput("/sbin/pidof mongod")
        #print mongo_pid
        # SIGUSR1 tells mongod to rotate its log file
        if mongo_pid != '':
            cmd = "/bin/kill -USR1 %s" % (mongo_pid)
            # print cmd
            mongo_rotate = commands.getoutput(cmd)
        else:
            print "mongod is not running..."
    
    if __name__ == "__main__":
        log_path = "/var/log/mongodb/"
        expire = 30
        rotate_mongo()
        rotate_log(log_path, expire)
    
    

    Add it to crontab:

    10 1 * * * /usr/bin/python /home/sso/mongo_rotate.py > /dev/null 2>&1
    

    References:
    Pros and cons of MongoDB compared with MySQL
    PHP MongoDB replica sets
    MongoDB Replication
    MongoDB Enable Auth
    MongoDB Rotate Log Files
    Blocking connect() leads to cumulative timeouts for multiple inaccessible servers
    How Raft consensus algorithm will make replication even better in MongoDB 3.2
    Write Concern for Replica Sets
    MongoDB Read Preference

    MySQL Slow Log Analysis

    Recently MySQL kept reporting error 1203: User already has more than 'max_user_connections' active connections, which had never happened before; slow queries seemed the likely cause. I asked ops for a copy of the slow log to analyze (see here for how to enable and configure the slow log).
    The copied log was huge and had to be cut around the incident window to narrow things down. Running the cutlogbytime.pl script provided here failed:

    [vagrant@centos64 mysql-log-filter-1.9]$ ./cutlogbytime.pl slow.log 1443103200 1443117600 > yestoday.log
    : command not foundline 1:
    : command not foundline 4:
    ./cutlogbytime.pl: line 5: use: command not found
    : command not foundline 5:
    ./cutlogbytime.pl: line 6: use: command not found
    : command not foundline 6:
    : command not foundline 7:
    '/cutlogbytime.pl: line 8: syntax error near unexpected token `{
    '/cutlogbytime.pl: line 8: `if (@ARGV<2){
    
    

    Removing the leading whitespace at the top of the file and rerunning still failed:

    [vagrant@centos64 mysql-log-filter-1.9]$ ./cutlogbytime.pl slow.log 1443103200 1443117600 > today.log
    -bash: ./cutlogbytime.pl: /usr/bin/perl^M: bad interpreter: No such file or directory
    

    Finally, per an answer on Stack Overflow, invoking it with perl (instead of executing it as a shell script) worked:

    [vagrant@centos64 mysql-log-filter-1.9]$ perl cutlogbytime.pl slow.log 1443103200 1443117600 > today.log
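
    For reference, the cut-by-time idea itself is simple. A hypothetical PHP equivalent (not the original Perl script) that keeps only entries whose SET timestamp=N falls inside the window:

    <?php
    // Hypothetical slow-log cutter. Usage:
    //   php cutlogbytime.php slow.log 1443103200 1443117600 > today.log
    list(, $file, $from, $to) = $argv;
    $entry = array();
    $prevTime = false;
    $fh = fopen($file, 'r');
    while (($line = fgets($fh)) !== false) {
        $isTime = (strncmp($line, '# Time:', 7) === 0);
        $isUser = (strncmp($line, '# User@Host:', 12) === 0);
        // An entry starts at "# Time:", or at "# User@Host:" when no
        // "# Time:" line directly precedes it.
        if ($isTime || ($isUser && !$prevTime)) {
            flush_entry($entry, $from, $to);
            $entry = array();
        }
        $entry[] = $line;
        $prevTime = $isTime;
    }
    flush_entry($entry, $from, $to);

    // Print a buffered entry if its "SET timestamp=N;" lies inside the window.
    function flush_entry($entry, $from, $to) {
        foreach ($entry as $l) {
            if (preg_match('/^SET timestamp=(\d+);/', $l, $m)) {
                if ($m[1] >= $from && $m[1] <= $to) {
                    echo implode('', $entry);
                }
                return;
            }
        }
    }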
    

    Use mysqldumpslow (a Perl script) to analyze the log. The -s option picks the sort order: r by rows sent (Rows), t by time (Time), c by query count (Count):

    [vagrant@centos64 mysql-log-filter-1.9]$  perl mysqldumpslow.pl -s r -t 10 today4.log
    
    Reading mysql slow query log from today4.log
    Count: 1  Time=190.48s (190s)  Lock=0.00s (0s)  Rows=21829854.0 (21829854), xx[xxxx]@[192.168.10.139]
      SELECT /*!N SQL_NO_CACHE */ * FROM `errormessage`
    
    Count: 32791  Time=40.95s (1342865s)  Lock=0.05s (1512s)  Rows=1.0 (32791), xx[xxxx]@10hosts
      select  *  from connectinfo where  ID=N  and AppType=N  ORDER BY CreateDatetime DESC LIMIT N
    
    Count: 3  Time=3.71s (11s)  Lock=0.02s (0s)  Rows=300.0 (900), xx[xxxx]@2hosts
      select SeverName from errormessage where  ID='S'  and ServerType=N  and level=N  and MsgType <= N
    
    

    The first statement returned 21,829,854 rows; in the raw slow log, every process that needed to insert into this table afterwards sat waiting:

    # Time: 150924  1:03:12
    # User@Host: xx[xxxx] @  [192.168.10.139]  Id: 1493761
    # Query_time: 190.479062  Lock_time: 0.000000 Rows_sent: 21829854  Rows_examined: 21829854
    SET timestamp=1443027792;
    SELECT /*!40001 SQL_NO_CACHE */ * FROM `errormessage`;
    # Time: 150924  1:03:14
    # User@Host: xx[xxxx] @  [192.168.10.168]  Id: 1498010
    # Query_time: 59.669817  Lock_time: 57.159403 Rows_sent: 0  Rows_examined: 0
    SET timestamp=1443027794;
    insert into errormessage (`ID`,`ServerType`,`MsgType`,`Level`,`dev`,`content`,`EventTime`,`SeverName`) values ( '1217', '3', '4', '4', '827', 'erc:start erc error,songid=46243,keymd5=ee1275b26762e85a7f00e9890bdc092e,ercmd5=abbc3ea9102dbd003b7aa0547dcbf6fa', '2015-09-23 21:49:27', '192.168.15.117');
    # User@Host: xx[xxxx] @  [192.168.10.205]  Id: 1494756
    # Query_time: 157.211158  Lock_time: 154.673647 Rows_sent: 0  Rows_examined: 0
    SET timestamp=1443027794;
    insert into errormessage (`ID`,`ServerType`,`MsgType`,`Level`,`dev`,`content`,`EventTime`,`SeverName`) values ( '865', '3', '1', '2', '106', '检测正常!', '2015-09-24 01:01:18', '192.168.0.33');
    # User@Host: xx[xxxx] @  [192.168.10.213]  Id: 1496479
    # Query_time: 100.733230  Lock_time: 98.210902 Rows_sent: 0  Rows_examined: 0
    SET timestamp=1443027794;
    insert into errormessage (`ID`,`ServerType`,`MsgType`,`Level`,`dev`,`content`,`EventTime`,`SeverName`) values ( '2472', '3', '2', '4', '809', 'videoseripnoconfig', '2015-09-24 01:02:26', '192.168.0.18');
    

    Going through several days of logs, around each incident it was this statement that blocked the SQL behind it. It turned out that a full-table backup and sync kicked off at 1 a.m. every day, locking the table so that every subsequent insert into it had to wait. mysqldump should use --single-transaction to avoid locking the table, something like:

    mysqldump -uuser -p --skip-opt -q -R --single-transaction --default-character-set=utf8 --master-data=2 --create-option --no-autocommit -S ${sock} -B ${DBName} > backup.sql
    

    But that still scans the whole table to back it up. Incremental backups would have far less impact, or the data could be split hot/cold (new/old), periodically moving new (daily)/hot rows into old (historical)/cold storage. Ops eventually solved it by moving the database from a VM onto a physical machine, setting up a replica, and taking backups from the replica.
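
    The hot/cold split can be as simple as a nightly job that moves aged rows into a history table. A hypothetical sketch (the errormessage_hist table and the 30-day cutoff are assumptions):

    <?php
    // Hypothetical nightly job: move errormessage rows older than 30 days
    // into a same-schema history table so the hot table stays small.
    $pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
    $pdo->beginTransaction();
    $pdo->exec("INSERT INTO errormessage_hist
                SELECT * FROM errormessage
                WHERE EventTime < NOW() - INTERVAL 30 DAY");
    $pdo->exec("DELETE FROM errormessage
                WHERE EventTime < NOW() - INTERVAL 30 DAY");
    $pdo->commit();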

    mysqldumpslow above sorted by rows; sorting by the other two keys (time, count) makes connectinfo look like the busy table, which is what we had always assumed. I also tried other tools: mysqlsla.pl, with its options documented here:

    [vagrant@centos64 mysql-log-filter-1.9]$  perl mysqlsla.pl today.log
    Auto-detected logs as slow logs
    Report for slow logs: today4.log
    60.57k queries total, 17 unique
    Sorted by 't_sum'
    Grand Totals: Time 5.38M s, Lock 3.22M s, Rows sent 21.86M, Rows Examined 184.46M
    
    
    ______________________________________________________________________ 001 ___
    Count         : 25.59k  (42.24%)
    Time          : 3905525.574451 s total, 152.643069 s avg, 113.07488 s to 2720.338946 s max  (72.64%)
      95% of Time : 3260112.482495 s total, 134.12789 s avg, 113.07488 s to 282.366041 s max
    Lock Time (s) : 3168076.975558 s total, 123.820721 s avg, 108.548105 s to 311.639359 s max  (98.45%)
      95% of Lock : 2961933.212121 s total, 121.860167 s avg, 108.548105 s to 123.487106 s max
    Rows sent     : 0 avg, 0 to 0 max  (0.00%)
    Rows examined : 54 avg, 0 to 4.92k max  (0.75%)
    Database      :
    Users         :
            xx@ 192.168.10.147 : 10.65% (2724) of query, 10.26% (6215) of all users
            xx@ 192.168.10.209 : 10.33% (2643) of query, 10.16% (6156) of all users
            xx@ 192.168.10.205 : 10.16% (2599) of query, 9.97% (6036) of all users
            xx@ 192.168.10.211 : 10.13% (2591) of query, 9.98% (6042) of all users
            xx@ 192.168.10.207 : 9.93% (2541) of query, 9.95% (6024) of all users
            xx@ 192.168.10.161 : 9.83% (2515) of query, 9.84% (5960) of all users
            xx@ 192.168.10.149 : 9.81% (2510) of query, 9.95% (6028) of all users
            xx@ 192.168.10.215 : 9.76% (2498) of query, 9.85% (5963) of all users
            xx@ 192.168.10.168 : 9.71% (2485) of query, 9.69% (5868) of all users
            xx@ 192.168.10.213 : 9.69% (2480) of query, 9.66% (5851) of all users
    
    Query abstract:
    SET timestamp=N; UPDATE connectinfo SET devicetag='S', connectipaddress='S', updatedatetime=now() WHERE ID=N AND apptype=N;
    
    Query sample:
    SET timestamp=1443027797;
    update connectinfo set DeviceTag='1070A416AF000000', ConnectIPAddress='60.174.116.165', UpdateDatetime=now() where ID=5358 and AppType=0;
    
    ______________________________________________________________________ 002 ___
    Count         : 32.79k  (54.14%)
    Time          : 1344378.871914 s total, 40.99841 s avg, 2.000747 s to 1944.548192 s max  (25.01%)
      95% of Time : 587407.556704 s total, 18.85678 s avg, 2.000747 s to 233.465042 s max
    Lock Time (s) : 1512.917798 s total, 46.138 ms avg, 76 µs to 114.302 ms max  (0.05%)
      95% of Lock : 1414.978902 s total, 45.423 ms avg, 76 µs to 50.514 ms max
    Rows sent     : 1 avg, 1 to 1 max  (0.15%)
    Rows examined : 4.92k avg, 4.92k to 4.92k max  (87.41%)
    Database      :
    Users         :
            xx@ 192.168.10.209 : 10.24% (3359) of query, 10.16% (6156) of all users
            xx@ 192.168.10.149 : 10.16% (3331) of query, 9.95% (6028) of all users
            xx@ 192.168.10.147 : 10.11% (3315) of query, 10.26% (6215) of all users
            xx@ 192.168.10.211 : 10.03% (3288) of query, 9.98% (6042) of all users
            xx@ 192.168.10.207 : 10.02% (3285) of query, 9.95% (6024) of all users
            xx@ 192.168.10.161 : 9.97% (3268) of query, 9.84% (5960) of all users
            xx@ 192.168.10.215 : 9.96% (3266) of query, 9.85% (5963) of all users
            xx@ 192.168.10.205 : 9.92% (3254) of query, 9.97% (6036) of all users
            xx@ 192.168.10.168 : 9.86% (3234) of query, 9.69% (5868) of all users
            xx@ 192.168.10.213 : 9.73% (3191) of query, 9.66% (5851) of all users
    
    Query abstract:
    SET timestamp=N; SELECT * FROM connectinfo WHERE ID=N AND apptype=N ORDER BY createdatetime DESC LIMIT N;
    
    Query sample:
    SET timestamp=1443027795;
    select  *  from connectinfo where  ID=7646  and AppType=0  ORDER BY CreateDatetime DESC LIMIT 1;
    
    ______________________________________________________________________ 003 ___
    Count         : 842  (1.39%)
    Time          : 66663.314786 s total, 79.172583 s avg, 2.011408 s to 673.604537 s max  (1.24%)
      95% of Time : 56684.989954 s total, 70.944919 s avg, 2.011408 s to 193.623235 s max
    Lock Time (s) : 48221.988255 s total, 57.27077 s avg, 69 µs to 185.402303 s max  (1.50%)
      95% of Lock : 40627.196184 s total, 50.847555 s avg, 69 µs to 166.67704 s max
    Rows sent     : 0 avg, 0 to 0 max  (0.00%)
    Rows examined : 0 avg, 0 to 0 max  (0.00%)
    Database      :
    Users         :
            xx@ 192.168.10.207 : 11.64% (98) of query, 9.95% (6024) of all users
            xx@ 192.168.10.205 : 11.28% (95) of query, 9.97% (6036) of all users
            xx@ 192.168.10.213 : 10.93% (92) of query, 9.66% (5851) of all users
            xx@ 192.168.10.161 : 10.45% (88) of query, 9.84% (5960) of all users
            xx@ 192.168.10.149 : 10.33% (87) of query, 9.95% (6028) of all users
            xx@ 192.168.10.211 : 9.74% (82) of query, 9.98% (6042) of all users
            xx@ 192.168.10.147 : 9.38% (79) of query, 10.26% (6215) of all users
            xx@ 192.168.10.215 : 9.38% (79) of query, 9.85% (5963) of all users
            xx@ 192.168.10.168 : 9.03% (76) of query, 9.69% (5868) of all users
            xx@ 192.168.10.209 : 7.84% (66) of query, 10.16% (6156) of all users
    
    Query abstract:
    SET timestamp=N; INSERT INTO errormessage (id,servertype,msgtype,level,dev,content,eventtime,severname) VALUES ( 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S')1;
    
    

    mysqlsla shows each statement's execution count and share, rows, users, percentages and so on. On that view it is tempting to blame the connectinfo table (over 95% of queries), and SHOW PROCESSLIST pointed the same way.
    Because I had downloaded this script on its own, running mysqlsla.pl hit a series of missing-module errors, fixed by installing them one by one:

    [vagrant@centos64 mysql-log-filter-1.9]$  perl mysqlsla.pl today.log
    Can't locate Time/HiRes.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at mysqlsla.pl line 2092.
    BEGIN failed--compilation aborted at mysqlsla.pl line 2092.
    # Fix:
    [vagrant@centos64 mysql-log-filter-1.9]$ sudo yum install perl-Time-HiRes
    
    [vagrant@centos64 mysql-log-filter-1.9]$ perl -MCPAN -e 'install DBI'
    Can't locate CPAN.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .).
    BEGIN failed--compilation aborted.
    
    # Fix:
    [vagrant@centos64 mysql-log-filter-1.9]$ sudo yum install perl-DBI
    

    The right way is to run the Makefile.PL in the tool's root directory and let it check dependencies, for example:

    [vagrant@centos64 percona-toolkit-2.2.15]$ perl Makefile.PL
    # If the error below appears, the missing module must be installed first;
    # the quick fix is: sudo yum install perl-devel
    Can't locate ExtUtils/MakeMaker.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at Makefile.PL line 1.
    BEGIN failed--compilation aborted at Makefile.PL line 1.
    

    Then install whatever modules it reports, as described here.
    Percona's pt-query-digest tool produced results much like mysqlsla's:

    [vagrant@vagrant-centos64 bin]$ pt-query-digest ../../today4.log
    
    # 9.8s user time, 700ms system time, 21.05M rss, 73.66M vsz
    # Current date: Mon Oct  5 05:52:01 2015
    # Hostname: vagrant-centos64.vagrantup.com
    # Files: ../../today.log
    # Overall: 60.57k total, 17 unique, 4.54 QPS, 402.68x concurrency ________
    # Time range: 2015-09-23 22:17:29 to 2015-09-24 02:00:00
    # Attribute          total     min     max     avg     95%  stddev  median
    # ============     ======= ======= ======= ======= ======= ======= =======
    # Exec time        5376198s      2s   2720s     89s    258s    118s     57s
    # Lock time        3217840s       0    312s     53s    118s     60s    48ms
    # Rows sent         20.85M       0  20.82M  361.00    0.99  84.46k    0.99
    # Rows examine     175.91M       0  20.82M   2.97k   4.71k  84.48k   4.71k
    # Query size         7.85M      64     597  135.90  151.03   27.56  112.70
    
    # Profile
    # Rank Query ID           Response time      Calls R/Call   V/M   Item
    # ==== ================== ================== ===== ======== ===== ========
    #    1 0xF1132168DB0BFC57 3905525.5745 72.6% 25586 152.6431 61.61 UPDATE connectinfo
    #    2 0xD4B317E755A0ABD7 1344378.8719 25.0% 32791  40.9984 30... SELECT connectinfo
    #    3 0xE23849EE6FB19DAE   66663.3148  1.2%   842  79.1726 62.99 INSERT errormessage
    ...
    
    # Query 1: 7.52 QPS, 1.15kx concurrency, ID 0xF1132168DB0BFC57 at byte 16243195
    # This item is included in the report because it matches --limit.
    # Scores: V/M = 61.61
    # Time range: 2015-09-24 01:03:17 to 02:00:00
    # Attribute    pct   total     min     max     avg     95%  stddev  median
    # ============ === ======= ======= ======= ======= ======= ======= =======
    # Count         42   25586
    # Exec time     72 3905526s    113s   2720s    153s    271s     97s    124s
    # Lock time     98 3168077s    109s    312s    124s    118s     14s    118s
    # Rows sent      0       0       0       0       0       0       0       0
    # Rows examine   0   1.33M       0   4.80k   54.39       0  504.65       0
    # Query size    48   3.78M     149     157  154.94  151.03    0.52  151.03
    # String:
    # Hosts        192.168.10.147 (2724/10%)... 9 more
    # Users        gate
    # Query_time distribution
    #   1us
    #  10us
    # 100us
    #   1ms
    #  10ms
    # 100ms
    #    1s
    #  10s+  ################################################################
    # Tables
    #    SHOW TABLE STATUS LIKE 'connectinfo'\G
    #    SHOW CREATE TABLE `connectinfo`\G
    update connectinfo set DeviceTag='10705BDDCD000000', ConnectIPAddress='115.231.63.78', UpdateDatetime=now() where ID=6912 and AppType=0\G
    # Converted for EXPLAIN
    # EXPLAIN /*!50100 PARTITIONS*/
    select  DeviceTag='10705BDDCD000000', ConnectIPAddress='115.231.63.78', UpdateDatetime=now() from connectinfo where  ID=6912 and AppType=0\G
    

    PS: A few days after this was fixed, the database acted up again. I assumed more slow SQL, but it turned out to be a faulty switch port... data transfer was so slow that the replica fell out of sync with the master.

    References:
    MySQL slow log analysis
    Cutting the slow log by time
    MySQL slow log
    MySQL optimization: sharpen the tools first (2)
    Common log statistics tricks
    Performance tuning: MySQL optimization (1)
    Installing and using mysqlsla
    Installing the DBI module: Can't locate DBI.pm in @INC (mysql interface)
    Issue 12: Can't locate Time/HiRes.pm
    Enabling the MySQL slow query log
    "Can't locate ExtUtils/MakeMaker.pm" while compiling git
    analysing slow MySQL queries with pt-query-digest
    How mysqldump backups work
    mysqldump --single-transaction, yet update queries are waiting for the backup
    Incremental backup and restore with the MySQL binlog
    MySQL backup and restore strategy (full + incremental)

    MySQL Flash Sale Optimization

    Today I studied Lou Fangxin's talk "A SQL-Based Flash Sale Solution", on locating and fixing the problems in flash-sale ("seckill") workloads.
    It starts with the inventory business: inventory divides into front-end, back-end, and physical inventory. During a flash sale the main problems are:

    • Inaccurate stock: only after ordering and paying do users learn stock is zero; items oversell or undersell
    • Many dead orders: users order but never pay, so conversion is low
    • A hot item can drag down the whole site

    The technical points a flash sale has to get right include:

    • Decrementing the balance by one
    • An operation ledger, for tracing and reconciliation and to stop one account from entering repeatedly
    • A complete transaction, so the ledger entry and the stock decrement succeed together
    • Durable storage; in-memory data cannot be trusted

    Several inventory designs were tried against these requirements, such as MySQL + read/write cache. The read-cache design suffers stale reads that hurt the user experience; the write-cache design goes inconsistent when several app servers write. MySQL + cache + NoSQL was too complex and was never implemented.

    So the work came back to optimizing MySQL itself. MySQL's strengths are its mature transaction machinery and stability; the hard parts are single-row concurrency, hot items, and the instantaneous spike, with tens of millions of users in the first minute, that easily clogs and drags down the site. Optimization proceeded on three fronts:

    • Transaction optimization: the single-row update
    • Concurrency optimization: capping maximum concurrency
    • Queueing optimization: ordering contention for the same item

    Analyzing the flash-sale logic and scanning the system code showed that most programs wait to confirm the UPDATE's row count before committing the transaction:

    • Begin a transaction
    • INSERT the inventory detail row
    • UPDATE the inventory balance
    • Commit

    With a good design, MySQL INSERTs that avoid auto-increment columns do not block one another. But UPDATEs of the same row are serialized: the lock is released only after the remote client sends COMMIT, letting other sessions proceed. Ignoring I/O and lock contention, a simple update executes in about 0.1 ms, while a typical network round trip takes 0.4-0.8 ms, so waiting for the commit notification lasts several times longer than the SQL itself.
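
    In PDO terms the pattern looks roughly like this. A sketch with hypothetical stock and stock_detail tables; every statement below is one client-server round trip, and the final COMMIT is the wait OneSQL eliminates:

    <?php
    // Sketch of the classic flash-sale transaction (hypothetical schema).
    $itemId = 42; $userId = 7;
    $pdo = new PDO('mysql:host=db;dbname=shop;charset=utf8', 'user', 'pass');
    $pdo->beginTransaction();
    // 1) Record the operation detail (an INSERT does not block other sessions).
    $pdo->prepare("INSERT INTO stock_detail (item_id, user_id) VALUES (?, ?)")
        ->execute(array($itemId, $userId));
    // 2) Single-row UPDATE: the row lock is held from here on...
    $upd = $pdo->prepare(
        "UPDATE stock SET balance = balance - 1 WHERE item_id = ? AND balance > 0");
    $upd->execute(array($itemId));
    if ($upd->rowCount() == 1) {
        $pdo->commit();    // ...until this COMMIT arrives, ~0.4-0.8 ms later
    } else {
        $pdo->rollBack();  // sold out
    }
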
    So the SQL syntax was extended (OneSQL) to commit automatically as soon as the UPDATE finishes, without waiting for the client's COMMIT, saving that round trip of transaction wait and improving throughput.

    When a flash sale brings a flood of requests they need to be queued, so that too many requests do not drag MySQL down:

    • Queueing in the application layer has drawbacks: applications must be reworked onto a common framework (with cross-language issues), and when the app cluster scales out, control becomes imprecise (connection allocation)
    • Queueing inside MySQL has advantages: almost no application changes, only a few SQL statements to modify, no common framework, precise queueing, and InnoDB performs at its best

    So a MySQL-compatible distributed data access layer (OneProxy) was developed to queue concurrent requests.

    Hot items were also split out into their own databases and tuned separately. These days items are listed before Singles' Day and users can favorite or pre-order them, so merchants can prepare more stock and the platform can spot hot items in advance and optimize for them.

    To sum up: business optimization is incremental. Understand the business logic and the technical points deeply and compare solutions; even a routine UPDATE has room for optimization. Targeted measures, such as queueing under high concurrency and separating hot data, are needed as well.

    Beyond back-end database tuning, for flash-sale and lottery traffic the crux is capping the volume per unit time so that it never exceeds what the back end can handle. Front-end tactics include (see the sketch after this list):

    • Sell in batches (small amounts, many rounds)
    • Make users play a game first, e.g. a lottery, before they can buy
    • Randomly drop a share of requests, letting only some through, e.g. 1 in 10
    • Threshold control: once the threshold is hit, accept no new requests
    • Reservation numbers: users without a number are turned away (user triage)
    • CAPTCHA checks
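
    As an illustration of the random-filter and threshold tactics, a sketch with made-up numbers (the APCu counter is an assumption):

    <?php
    // Randomly admit roughly 1 request in 10; turn the rest away up front.
    if (mt_rand(1, 10) != 1) {
        http_response_code(503);
        exit('Sold out, please try again later');
    }
    // Threshold control with a shared per-server counter (sketched with APCu).
    apcu_add('seckill_requests', 0);
    if (apcu_inc('seckill_requests') > 100000) {
        http_response_code(503);
        exit('Event closed');
    }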

    Separately, the connection pooling OneProxy provides is very useful for PHP. PHP runs under CGI, so every incoming request creates a fresh database connection to MySQL; under heavy concurrency this produces "too many connections" and can even drag the database down ("MySQL server has gone away"), hurting other services. A MySQL connection pool therefore matters a great deal for PHP.
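
    Short of a proxy, PDO's persistent connections at least reuse one connection per PHP worker. A minimal sketch:

    <?php
    // A persistent connection is reused across requests served by the same
    // FPM/CGI worker, capping connection churn (not a full pool like OneProxy).
    $pdo = new PDO(
        'mysql:host=db;dbname=shop;charset=utf8',
        'user', 'pass',
        array(PDO::ATTR_PERSISTENT => true)
    );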

    Update: when Xiaomi built its purchase-rush system, the earliest PHP + MySQL version hit similar problems, e.g. concurrency and data consistency. OneSQL had already addressed these, but Xiaomi instead rewrote the system in Go as its "big seckill" system (BigTap).

    References:
    How to guarantee correctness in limited-quantity flash sales and other high-concurrency events?
    MySQL 5.6.17 / Percona 5.6.16 / MariaDB 10.0.11 / OneSQL 5.6.16 TpmC benchmark
    Website performance techniques, starting from 12306.cn
    The story behind the "Mi Fan Festival": building Xiaomi's purchase-rush system
    Large-scale concurrency in web systems: e-commerce flash sales
    OneProxy: how to add connection pooling to PHP pages and Ruby/Python/Go programs
    A MySQL connection pool based on Swoole

    Building a Database of Chinese Provinces and Counties with a PHP Script

    A courier-tracking feature needed a database of China's provinces, cities and counties. The previous developer had handled administrative regions with a static JavaScript file, which is awkward for dynamic use in the back end. I came across an excellent article on building a Chinese administrative region database and am bookmarking the approach here so it doesn't get lost. First download the latest division list from the National Bureau of Statistics (open the newest page, copy everything, and save it as region.txt). Then create a region table:

    CREATE TABLE `region` (
      `id` int(10) unsigned NOT NULL,
      `parent_id` int(10) unsigned NOT NULL,
      `name` varchar(20) NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
    

    Create a file region.php in the same directory to parse the list into the table.
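
    Since the original code is not reproduced here, a hypothetical sketch of the idea, assuming region.txt holds one "6-digit-code name" pair per line and deriving parent_id from the code structure (XX0000 province, XXYY00 city, XXYYZZ county):

    <?php
    // Hypothetical loader: parse region.txt into the region table.
    $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'pass');
    $stmt = $pdo->prepare("REPLACE INTO region (id, parent_id, name) VALUES (?, ?, ?)");
    foreach (file('region.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if (!preg_match('/(\d{6})\s+(\S+)/u', $line, $m)) {
            continue;   // skip anything that is not "code name"
        }
        list(, $code, $name) = $m;
        if (substr($code, 2, 4) == '0000') {
            $parent = 0;                                    // province
        } elseif (substr($code, 4, 2) == '00') {
            $parent = (int)(substr($code, 0, 2) . '0000');  // city -> province
        } else {
            $parent = (int)(substr($code, 0, 4) . '00');    // county -> city
        }
        $stmt->execute(array((int)$code, $parent, $name));
    }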