Packing and unpacking binary data with pack/unpack

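In web development, API protocols are usually text-based and carried over HTTP(S), with payloads in JSON or XML. That works well for communication across programming languages, but binary encodings such as Apache Thrift or gRPC/Protocol Buffers are sometimes used instead for better performance. Many languages offer functions for encoding binary data; in PHP, pack and unpack encode and decode binary data, much like Perl's pack function.
PHP's pack takes at least two arguments: the first is the format string, and the remaining arguments are the values to pack. Each format code may be followed by a repeater: for the string codes (a, A, Z) it is the field length in bytes, and for the other codes it is the number of values to consume; * repeats to the end of the input data. See the pack page of the PHP manual for the full list of format codes. Consider the following example, pack.php: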

<?php
declare(strict_types=1);

$data = array(
    array(
        'title' => 'C Programming',
        'author' => 'Nuha Ali',
        'subject' => 'C Programming Tutorial',
        'book_id' => 6495407
    ),
    array(
        'title' => 'Telecom Billing',
        'author' => 'Zara Ali',
        'subject' => 'Telecom Billing Tutorial',
        'book_id' => 6495700
    ),
    array(
        'title' => 'PHP Programinng',
        'author' => 'Channing Huang',
        'subject' => 'PHP Programing Tutorial',
        'book_id' => 6495701
    )
);
$fp = \fopen("example","wb");
foreach ($data as $row) {
    $bin = \pack('Z50Z50Z100i', ...\array_values($row));
    \fwrite($fp, $bin);
}
\fclose($fp);

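Here we have an array of books. Each book has four fields, title, author, subject and book_id, packed with the formats Z50, Z50, Z100 and i. Z50 packs a string into a 50-byte field and pads the rest with NUL bytes. a50 is similar, except that Z always keeps the last byte as NUL (so at most 49 bytes of the string are stored), while a50 can use all 50; to pad with spaces instead, use A50. Z100 works like Z50 but with a 100-byte field. i packs a signed integer with the machine's native size and byte order, corresponding to C's signed int, which is 4 bytes on this machine. Each book therefore takes 50 + 50 + 100 + 4 = 204 bytes, and the three books take 612 bytes in total. Check the file size: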

[vagrant@localhost bin]$ ls -la | grep example
-rw-r--r--. 1 vagrant vagrant   612 Aug 14 02:22 example
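
The 204-byte record layout can also be reproduced outside PHP. The following is a minimal Go sketch, assuming the id is stored little-endian (which is what i produces on x86); the names putZ and packBook are illustrative and do not come from the original code.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Field widths taken from the format string Z50Z50Z100i.
const (
	titleLen   = 50
	authorLen  = 50
	subjectLen = 100
	recordLen  = titleLen + authorLen + subjectLen + 4 // 204 bytes
)

// putZ copies s into dst the way Z<n> does: at most len(dst)-1 bytes are
// copied, so the field always ends with at least one NUL byte.
func putZ(dst []byte, s string) {
	copy(dst[:len(dst)-1], s)
}

// packBook builds one fixed-width record. The id is written as a
// little-endian 32-bit integer, an assumption about the machine that ran
// pack.php rather than a property of the "i" format itself.
func packBook(title, author, subject string, id int32) []byte {
	rec := make([]byte, recordLen) // make zero-fills, which gives the NUL padding
	putZ(rec[0:titleLen], title)
	putZ(rec[titleLen:titleLen+authorLen], author)
	putZ(rec[titleLen+authorLen:titleLen+authorLen+subjectLen], subject)
	binary.LittleEndian.PutUint32(rec[recordLen-4:], uint32(id))
	return rec
}

func main() {
	rec := packBook("C Programming", "Nuha Ali", "C Programming Tutorial", 6495407)
	fmt.Println(len(rec))          // 204
	fmt.Printf("% x\n", rec[200:]) // af 1c 63 00
}
```

The last four bytes match the end of the first record in the xxd dump (10101111 00011100 01100011 00000000).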

xxd shows the binary content and confirms that the unused part of each field is padded with NUL bytes:

[vagrant@localhost bin]$ xxd -b example 
0000000: 01000011 00100000 01010000 01110010 01101111 01100111  C Prog
0000006: 01110010 01100001 01101101 01101101 01101001 01101110  rammin
000000c: 01100111 00000000 00000000 00000000 00000000 00000000  g.....
0000012: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000018: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000001e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000024: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000002a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000030: 00000000 00000000 01001110 01110101 01101000 01100001  ..Nuha
0000036: 00100000 01000001 01101100 01101001 00000000 00000000   Ali..
000003c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000042: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000048: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000004e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000054: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000005a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000060: 00000000 00000000 00000000 00000000 01000011 00100000  ....C 
0000066: 01010000 01110010 01101111 01100111 01110010 01100001  Progra
000006c: 01101101 01101101 01101001 01101110 01100111 00100000  mming 
0000072: 01010100 01110101 01110100 01101111 01110010 01101001  Tutori
0000078: 01100001 01101100 00000000 00000000 00000000 00000000  al....
000007e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000084: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000008a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000090: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000096: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000009c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000a2: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000a8: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000ae: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000b4: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000ba: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000c0: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000c6: 00000000 00000000 10101111 00011100 01100011 00000000  ....c.
00000cc: 01010100 01100101 01101100 01100101 01100011 01101111  Teleco
00000d2: 01101101 00100000 01000010 01101001 01101100 01101100  m Bill
00000d8: 01101001 01101110 01100111 00000000 00000000 00000000  ing...
00000de: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000e4: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000ea: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000f0: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000f6: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000fc: 00000000 00000000 01011010 01100001 01110010 01100001  ..Zara
0000102: 00100000 01000001 01101100 01101001 00000000 00000000   Ali..
0000108: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000010e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000114: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000011a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000120: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000126: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000012c: 00000000 00000000 00000000 00000000 01010100 01100101  ....Te
0000132: 01101100 01100101 01100011 01101111 01101101 00100000  lecom 
0000138: 01000010 01101001 01101100 01101100 01101001 01101110  Billin
000013e: 01100111 00100000 01010100 01110101 01110100 01101111  g Tuto
0000144: 01110010 01101001 01100001 01101100 00000000 00000000  rial..
000014a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000150: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000156: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000015c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000162: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000168: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000016e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000174: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000017a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000180: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000186: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000018c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000192: 00000000 00000000 11010100 00011101 01100011 00000000  ....c.
0000198: 01010000 01001000 01010000 00100000 01010000 01110010  PHP Pr
000019e: 01101111 01100111 01110010 01100001 01101101 01101001  ogrami
00001a4: 01101110 01101110 01100111 00000000 00000000 00000000  nng...
00001aa: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001b0: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001b6: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001bc: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001c2: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001c8: 00000000 00000000 01000011 01101000 01100001 01101110  ..Chan
00001ce: 01101110 01101001 01101110 01100111 00100000 01001000  ning H
00001d4: 01110101 01100001 01101110 01100111 00000000 00000000  uang..
00001da: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001e0: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001e6: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001ec: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001f2: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001f8: 00000000 00000000 00000000 00000000 01010000 01001000  ....PH
00001fe: 01010000 00100000 01010000 01110010 01101111 01100111  P Prog
0000204: 01110010 01100001 01101101 01101001 01101110 01100111  raming
000020a: 00100000 01010100 01110101 01110100 01101111 01110010   Tutor
0000210: 01101001 01100001 01101100 00000000 00000000 00000000  ial...
0000216: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000021c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000222: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000228: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000022e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000234: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000023a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000240: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000246: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000024c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000252: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000258: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000025e: 00000000 00000000 11010101 00011101 01100011 00000000  ....c.

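PHP's unpack takes three arguments: the format string, the packed data, and an optional starting offset. It returns an associative array; the key of each element is the name written after the corresponding format code, and the elements in the format string are separated by /.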

<?php
declare(strict_types=1);

$fp = \fopen("example","rb");
while (!\feof($fp)) {
    $data = \fread($fp,204);
    if($data) {
        $arr = \unpack("Z50title/Z50author/Z100subject/ibook_id",$data);
        foreach ($arr as $key => $value) {
            echo $key. "->" . $value.PHP_EOL;
        }
        echo PHP_EOL;
    }
}
\fclose($fp);

This reads the binary file back and parses each record correctly.
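
The same decoding can be sketched in Go. This is only an illustration under stated assumptions: the record is exactly 204 bytes, the id is little-endian, and the names book, cutAtNUL and unpackBook are hypothetical. Each string field is cut at its first NUL byte, as the Z format does.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// book mirrors the four fields of one packed record.
type book struct {
	Title   string
	Author  string
	Subject string
	ID      int32
}

// cutAtNUL returns the bytes before the first NUL, like unpack's Z format.
// Anything stored after that NUL is ignored.
func cutAtNUL(field []byte) string {
	if i := bytes.IndexByte(field, 0); i >= 0 {
		return string(field[:i])
	}
	return string(field)
}

// unpackBook decodes one 204-byte record laid out as Z50 Z50 Z100 i.
func unpackBook(rec []byte) (book, error) {
	if len(rec) != 204 {
		return book{}, errors.New("record must be exactly 204 bytes")
	}
	return book{
		Title:   cutAtNUL(rec[0:50]),
		Author:  cutAtNUL(rec[50:100]),
		Subject: cutAtNUL(rec[100:200]),
		ID:      int32(binary.LittleEndian.Uint32(rec[200:204])),
	}, nil
}

func main() {
	// Build a sample record by hand instead of reading the example file.
	rec := make([]byte, 204)
	copy(rec[0:], "C Programming")
	copy(rec[50:], "Nuha Ali")
	copy(rec[100:], "C Programming Tutorial")
	binary.LittleEndian.PutUint32(rec[200:], 6495407)

	b, err := unpackBook(rec)
	fmt.Printf("%+v %v\n", b, err)
}
```
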
Other languages can read and write the same binary files. For example, C can pack the same data:

#include<stdio.h>
#include<stdlib.h>

struct book {
   char  title[50];
   char  author[50];
   char  subject[100];
   int   id;
};
 
 int main()
 {
     FILE *fp;
     struct book books[3] = {
         {"C Programming", "Nuha Ali", "C Programming Tutorial", 6495407},
         {"Telecom Billing", "Zara Ali", "Telecom Billing Tutorial", 6495700},
         {"PHP Programinng", "Channing Huang", "PHP Programing Tutorial", 6495701}
    };
    fp = fopen("newbook", "wb");
 
    if(fp == NULL)
    {
        printf("Error opening file\n");
        exit(1);
    }
    /* sizeof yields a size_t, so print it with %zu rather than %lu */
    printf("sizeof books %10zu\n", sizeof(books));
    printf("sizeof book struct %10zu\n", sizeof(struct book));
    printf("sizeof book1 title %10zu\n", sizeof(books[0].title));
    printf("sizeof book1 author %10zu\n", sizeof(books[0].author));
    printf("sizeof book1 subject %10zu\n", sizeof(books[0].subject));
    printf("sizeof book1 id %10zu\n", sizeof(books[0].id));
    fwrite(books, sizeof(books), 1, fp);
    fclose(fp);
    return 0;
 }
 

Compile and run it:

[vagrant@localhost bin]$ gcc -o pack pack.c
[vagrant@localhost bin]$ ./pack
sizeof books        612
sizeof book struct        204
sizeof book1 title         50
sizeof book1 author         50
sizeof book1 subject        100
sizeof book1 id          4
[vagrant@localhost bin]$ ls -lah | grep newbook
-rw-r--r--. 1 vagrant vagrant  612 Aug 14 02:50 newbook

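The generated file has the same size as the one above, xxd shows the same content, and unpack.php can parse it once it is pointed at newbook. Likewise, C can parse the binary data packed by PHP: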

#include<stdio.h>
#include<stdlib.h>

struct book {
   char  title[50];
   char  author[50];
   char  subject[100];
   int   id;
};
 
 int main()
 {
     int i;
     FILE *fp;
     struct book Book;
     fp = fopen("example", "rb");
 
    if(fp == NULL)
    {
        printf("Error opening file\n");
        exit(1);
    }
    for (i=0;i<3; i++)
    {
        /* stop if a full record could not be read */
        if (fread(&Book, sizeof(struct book), 1, fp) != 1)
            break;
        printf( "Book title : %s\n", Book.title);
        printf( "Book author : %s\n", Book.author);
        printf( "Book subject : %s\n", Book.subject);
        printf( "Book book_id : %d\n", Book.id);
    }
    fclose(fp);
    return 0;
 }
 

Compile and run it:

[vagrant@localhost bin]$ gcc -o unpack unpack.c
[vagrant@localhost bin]$ ./unpack
Book title : C Programming
Book author : Nuha Ali
Book subject : C Programming Tutorial
Book book_id : 6495407
Book title : Telecom Billing
Book author : Zara Ali
Book subject : Telecom Billing Tutorial
Book book_id : 6495700
Book title : PHP Programinng
Book author : Channing Huang
Book subject : PHP Programing Tutorial
Book book_id : 6495701

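Using a C struct to define the packed layout raises two issues: alignment and NUL padding. Alignment can make the binary file larger than the sum of the struct's fields. For example, if title is shortened to 40 bytes, a book would be expected to take 194 bytes but actually takes 196 (a multiple of 4): the compiler inserts 2 padding bytes after subject so that the int id starts on a 4-byte boundary. Keep this in mind when designing the struct, to avoid wasting space on padding. A C string is a character array terminated by NUL (\0). When the array is initialized as above, the compiler fills the rest of it with NUL. If it is declared without an initializer and assigned later, for example with strcpy, the bytes after the terminating NUL may hold arbitrary values, and PHP would show those stray characters if it decoded the whole field. This is why the PHP side unpacks with Z rather than A: Z stops at the first NUL and ignores whatever follows. You can also call memset to fill the buffer with '\0' beforehand, which guarantees that the remainder is all NUL.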
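
The alignment arithmetic can be checked with a small calculation. The Go sketch below applies the usual rule (each member starts at a multiple of its alignment, and the total size is rounded up to the largest alignment); it assumes a 4-byte int aligned to 4 bytes, and the names field, roundUp and layout are made up for this illustration.

```go
package main

import "fmt"

// field describes one struct member by its size and alignment in bytes.
type field struct {
	name  string
	size  int
	align int
}

// roundUp rounds n up to the next multiple of align.
func roundUp(n, align int) int {
	return (n + align - 1) / align * align
}

// layout returns the offset of every member and the total struct size.
func layout(fields []field) (map[string]int, int) {
	offsets := make(map[string]int)
	size, maxAlign := 0, 1
	for _, f := range fields {
		size = roundUp(size, f.align) // padding before the member, if needed
		offsets[f.name] = size
		size += f.size
		if f.align > maxAlign {
			maxAlign = f.align
		}
	}
	return offsets, roundUp(size, maxAlign) // tail padding
}

func main() {
	// struct book with title shortened to 40 bytes.
	offsets, size := layout([]field{
		{"title", 40, 1},
		{"author", 50, 1},
		{"subject", 100, 1},
		{"id", 4, 4},
	})
	fmt.Println(offsets["id"], size) // 192 196
}
```

The char arrays end at offset 190, so id is moved to offset 192 and the struct grows from 194 to 196 bytes. With the original 50-byte title the arrays end at 200, which is already a multiple of 4, so no padding is needed.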
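Other types such as unsigned short, unsigned long, float and double have their own format codes, and multi-byte values also raise the question of byte order: big-endian or little-endian. Network protocols conventionally use big-endian (network byte order), while most CPUs, x86 included, are little-endian. The transport does not convert anything; the two sides only have to agree on one byte order and encode and decode accordingly. For example, PHP can send a number (4 bytes) followed by 8 characters (8 bytes):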

<?php
declare(strict_types=1);

$host = "127.0.0.1";
$port = 9872;

$socket = \socket_create(AF_INET, SOCK_STREAM, SOL_TCP)
  or die("Unable to create socket\n");

@\socket_connect($socket, $host, $port) or die("Connect error.\n");

if ($err = \socket_last_error($socket))
{

  \socket_close($socket);
  die(\socket_strerror($err) . "\n");
}

// N: unsigned 32-bit big-endian integer; a8: 8-byte NUL-padded string
$binarydata = \pack("Na8", 256, "Channing");
echo \bin2hex($binarydata).PHP_EOL;
$len = \socket_write ($socket , $binarydata, strlen($binarydata));
\socket_close($socket);

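The format code used for the number in PHP is N, an unsigned 32-bit integer in big-endian byte order. The Go receiver decodes it as a big-endian number: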

package main

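import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
)

// BUF_SIZE is the fixed message size: a 4-byte number plus 8 bytes of text.
const BUF_SIZE = 12

func handleConnection(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, BUF_SIZE)
	// A single conn.Read may return fewer than BUF_SIZE bytes on a TCP
	// stream, so io.ReadFull is used to wait for the whole message.
	n, err := io.ReadFull(conn, buf)

	if err != nil {
		fmt.Printf("err: %v\n", err)
		return
	}

	// The first 4 bytes are a big-endian uint32 (PHP format N); the rest is text.
	fmt.Printf("\nreceive:%d bytes,data:%d, %s\n", n, binary.BigEndian.Uint32(buf[:4]), buf[4:])
}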

func main() {
	ln, err := net.Listen("tcp", ":9872")

	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}

	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handleConnection(conn)
	}
}

Run receive.go and send.php; the message is sent and received correctly:

[root@localhost bin]$ php send.php
000001004368616e6e696e67


[root@localhost bin]$ go run receive.go 

receive:12 bytes,data:256, Channing

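If PHP uses a format code that does not match, such as i (machine byte order, little-endian on x86) or n (16-bit big-endian), the Go side decodes the wrong value. Little-endian would work equally well, as long as both sides encode and decode with it.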
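
To see what a byte-order mismatch looks like, here is a minimal Go sketch that encodes 256 in big-endian and then decodes the same four bytes both ways. The helper names encodeBE, decodeBE and decodeLE are illustrative only.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeBE returns v as 4 big-endian bytes, the layout PHP's N format produces.
func encodeBE(v uint32) []byte {
	buf := make([]byte, 4)
	binary.BigEndian.PutUint32(buf, v)
	return buf
}

// decodeBE reads the first 4 bytes as a big-endian uint32.
func decodeBE(b []byte) uint32 { return binary.BigEndian.Uint32(b) }

// decodeLE reads the first 4 bytes as a little-endian uint32.
func decodeLE(b []byte) uint32 { return binary.LittleEndian.Uint32(b) }

func main() {
	wire := encodeBE(256)
	fmt.Printf("% x\n", wire)   // 00 00 01 00
	fmt.Println(decodeBE(wire)) // 256
	fmt.Println(decodeLE(wire)) // 65536
}
```

The bytes on the wire are identical in both cases; only the interpretation differs, which is why both sides must agree on the byte order in advance.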
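Packed binary data can be inspected as hexadecimal with bin2hex; chr converts a byte value (0-255) into the corresponding one-byte string, and ord does the reverse.
We usually read and write text files, whereas here we are dealing with binary files. Text becomes bytes through a character encoding such as ASCII or UTF-8. In fact text files and image files are all binary files and can all be viewed with xxd; what differs is which software can interpret (decode) the bytes. In its right-hand column xxd shows only printable ASCII characters and prints a dot for everything else.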
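
The dot substitution is easy to reproduce. The sketch below is not how xxd is implemented; it simply applies the same rule, assuming that printable ASCII means the range 0x20 to 0x7e, and asciiColumn is a hypothetical name.

```go
package main

import "fmt"

// asciiColumn mimics the right-hand column of a dump: printable ASCII
// bytes are kept, every other byte is shown as a dot.
func asciiColumn(data []byte) string {
	out := make([]byte, len(data))
	for i, b := range data {
		if b >= 0x20 && b <= 0x7e {
			out[i] = b
		} else {
			out[i] = '.'
		}
	}
	return string(out)
}

func main() {
	fmt.Println(asciiColumn([]byte("g\x00\x00\x00\x00\x00")))                 // g.....
	fmt.Println(asciiColumn([]byte{0x00, 0x00, 0xaf, 0x1c, 0x63, 0x00})) // ....c.
}
```

The second call uses the bytes around the first book_id in the dump above: only 0x63 is printable (the letter c), so the id 6495407 shows up as "....c." in the text column.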
