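In web development, API protocols are usually text formats carried over HTTP(S), such as JSON or XML. That is sufficient for communication across programming languages, but binary formats such as Apache Thrift or gRPC/Protocol Buffers are sometimes used for better performance. Many languages offer pack/unpack-style functions for this; in PHP, pack and unpack encode and decode binary data, and they closely resemble Perl's pack.
PHP's pack function takes at least two arguments: the first is the format string, and the values to pack follow from the second argument onward. Each format code may be followed by a count (for the string codes used here, the field length in bytes), and * makes a code consume all the remaining data. The full list of format codes is on the pack page of the PHP manual. Consider the following example, pack.php: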
<?php
declare(strict_types=1);

// Three sample books; each one is written as a fixed-size record.
$data = array(
    array(
        'title' => 'C Programming',
        'author' => 'Nuha Ali',
        'subject' => 'C Programming Tutorial',
        'book_id' => 6495407
    ),
    array(
        'title' => 'Telecom Billing',
        'author' => 'Zara Ali',
        'subject' => 'Telecom Billing Tutorial',
        'book_id' => 6495700
    ),
    array(
        'title' => 'PHP Programinng',
        'author' => 'Channing Huang',
        'subject' => 'PHP Programing Tutorial',
        'book_id' => 6495701
    )
);

$fp = \fopen("example", "wb");
foreach ($data as $row) {
    // Z50/Z50/Z100: NUL-padded strings; i: signed int in machine byte order.
    $bin = \pack('Z50Z50Z100i', ...\array_values($row));
    \fwrite($fp, $bin);
}
\fclose($fp);
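Here we have an array of books, each with four fields: title, author, subject, and book_id, matching the format codes Z50, Z50, Z100, and i. Z50 packs the string into 50 bytes and pads the remainder with NUL bytes. a50 behaves similarly, except that Z always reserves the last byte for a NUL terminator; to pad with spaces instead, use A50. Z100 works like Z50 with a length of 100 bytes. i packs a signed integer of machine-dependent size and byte order, corresponding to C's signed int, which takes 4 bytes here. Each book therefore occupies 50+50+100+4=204 bytes, and the three books total 612 bytes. Checking the file size: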
[vagrant@localhost bin]$ ls -la | grep example
-rw-r--r--. 1 vagrant vagrant 612 Aug 14 02:22 example
Viewing the contents in binary with xxd confirms that the unused part of each field is padded with NUL bytes:
[vagrant@localhost bin]$ xxd -b example
0000000: 01000011 00100000 01010000 01110010 01101111 01100111 C Prog
0000006: 01110010 01100001 01101101 01101101 01101001 01101110 rammin
000000c: 01100111 00000000 00000000 00000000 00000000 00000000 g.....
0000012: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000018: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000001e: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000024: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000002a: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000030: 00000000 00000000 01001110 01110101 01101000 01100001 ..Nuha
0000036: 00100000 01000001 01101100 01101001 00000000 00000000 Ali..
000003c: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000042: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000048: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000004e: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000054: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000005a: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000060: 00000000 00000000 00000000 00000000 01000011 00100000 ....C
0000066: 01010000 01110010 01101111 01100111 01110010 01100001 Progra
000006c: 01101101 01101101 01101001 01101110 01100111 00100000 mming
0000072: 01010100 01110101 01110100 01101111 01110010 01101001 Tutori
0000078: 01100001 01101100 00000000 00000000 00000000 00000000 al....
000007e: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000084: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000008a: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000090: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000096: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000009c: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000a2: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000a8: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000ae: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000b4: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000ba: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000c0: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000c6: 00000000 00000000 10101111 00011100 01100011 00000000 ....c.
00000cc: 01010100 01100101 01101100 01100101 01100011 01101111 Teleco
00000d2: 01101101 00100000 01000010 01101001 01101100 01101100 m Bill
00000d8: 01101001 01101110 01100111 00000000 00000000 00000000 ing...
00000de: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000e4: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000ea: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000f0: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000f6: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00000fc: 00000000 00000000 01011010 01100001 01110010 01100001 ..Zara
0000102: 00100000 01000001 01101100 01101001 00000000 00000000 Ali..
0000108: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000010e: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000114: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000011a: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000120: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000126: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000012c: 00000000 00000000 00000000 00000000 01010100 01100101 ....Te
0000132: 01101100 01100101 01100011 01101111 01101101 00100000 lecom
0000138: 01000010 01101001 01101100 01101100 01101001 01101110 Billin
000013e: 01100111 00100000 01010100 01110101 01110100 01101111 g Tuto
0000144: 01110010 01101001 01100001 01101100 00000000 00000000 rial..
000014a: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000150: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000156: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000015c: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000162: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000168: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000016e: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000174: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000017a: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000180: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000186: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000018c: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000192: 00000000 00000000 11010100 00011101 01100011 00000000 ....c.
0000198: 01010000 01001000 01010000 00100000 01010000 01110010 PHP Pr
000019e: 01101111 01100111 01110010 01100001 01101101 01101001 ogrami
00001a4: 01101110 01101110 01100111 00000000 00000000 00000000 nng...
00001aa: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001b0: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001b6: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001bc: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001c2: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001c8: 00000000 00000000 01000011 01101000 01100001 01101110 ..Chan
00001ce: 01101110 01101001 01101110 01100111 00100000 01001000 ning H
00001d4: 01110101 01100001 01101110 01100111 00000000 00000000 uang..
00001da: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001e0: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001e6: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001ec: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001f2: 00000000 00000000 00000000 00000000 00000000 00000000 ......
00001f8: 00000000 00000000 00000000 00000000 01010000 01001000 ....PH
00001fe: 01010000 00100000 01010000 01110010 01101111 01100111 P Prog
0000204: 01110010 01100001 01101101 01101001 01101110 01100111 raming
000020a: 00100000 01010100 01110101 01110100 01101111 01110010 Tutor
0000210: 01101001 01100001 01101100 00000000 00000000 00000000 ial...
0000216: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000021c: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000222: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000228: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000022e: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000234: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000023a: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000240: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000246: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000024c: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000252: 00000000 00000000 00000000 00000000 00000000 00000000 ......
0000258: 00000000 00000000 00000000 00000000 00000000 00000000 ......
000025e: 00000000 00000000 11010101 00011101 01100011 00000000 ....c.
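PHP's unpack function takes three arguments: the format string, the packed data, and an optional starting offset. It returns an associative array; each format code can be followed by the key to use for that value, with the items separated by /.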
<?php
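declare(strict_types=1);

$fp = \fopen("example", "rb");
while (!\feof($fp)) {
    // Each record is 204 bytes: 50 + 50 + 100 + 4.
    $data = \fread($fp, 204);
    // Skip the empty read at end of file and any truncated record.
    if ($data !== false && \strlen($data) === 204) {
        $arr = \unpack("Z50title/Z50author/Z100subject/ibook_id", $data);
        foreach ($arr as $key => $value) {
            echo $key . "->" . $value . PHP_EOL;
        }
        echo PHP_EOL;
    }
}
\fclose($fp);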
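This reads the binary file and parses each record correctly.
Other programming languages can read and write the same binary file. For example, the data above can be packed in C: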
#include <stdio.h>
#include <stdlib.h>

/* Fixed-size record with the same layout as the PHP format 'Z50Z50Z100i'. */
struct book {
    char title[50];
    char author[50];
    char subject[100];
    int id;
};

int main(void)
{
    FILE *fp;
    struct book books[3] = {
        {"C Programming", "Nuha Ali", "C Programming Tutorial", 6495407},
        {"Telecom Billing", "Zara Ali", "Telecom Billing Tutorial", 6495700},
        {"PHP Programinng", "Channing Huang", "PHP Programing Tutorial", 6495701}
    };

    fp = fopen("newbook", "wb");
    if (fp == NULL)
    {
        printf("Error opening file\n");
        exit(1);
    }
    /* sizeof yields a size_t, so print it with %zu. */
    printf("sizeof books %10zu\n", sizeof(books));
    printf("sizeof book struct %10zu\n", sizeof(struct book));
    printf("sizeof book1 title %10zu\n", sizeof(books[0].title));
    printf("sizeof book1 author %10zu\n", sizeof(books[0].author));
    printf("sizeof book1 subject %10zu\n", sizeof(books[0].subject));
    printf("sizeof book1 id %10zu\n", sizeof(books[0].id));
    /* Write all three records in one call. */
    fwrite(books, sizeof(books), 1, fp);
    fclose(fp);
    return 0;
}
Compile and run it:
[vagrant@localhost bin]$ gcc -o pack pack.c
[vagrant@localhost bin]$ ./pack
sizeof books 612
sizeof book struct 204
sizeof book1 title 50
sizeof book1 author 50
sizeof book1 subject 100
sizeof book1 id 4
[vagrant@localhost bin]$ ls -lah | grep newbook
-rw-r--r--. 1 vagrant vagrant 612 Aug 14 02:50 newbook
The generated file has the same size as the one above, xxd shows the same contents, and unpack.php can parse it. In the same way, C can parse the binary data packed by PHP:
#include <stdio.h>
#include <stdlib.h>

/* Must match the layout written by pack.php ('Z50Z50Z100i'). */
struct book {
    char title[50];
    char author[50];
    char subject[100];
    int id;
};

int main(void)
{
    int i;
    FILE *fp;
    struct book Book;

    fp = fopen("example", "rb");
    if (fp == NULL)
    {
        printf("Error opening file\n");
        exit(1);
    }
    for (i = 0; i < 3; i++)
    {
        /* Stop if a complete record could not be read. */
        if (fread(&Book, sizeof(struct book), 1, fp) != 1)
        {
            printf("Error reading record %d\n", i);
            break;
        }
        printf("Book title : %s\n", Book.title);
        printf("Book author : %s\n", Book.author);
        printf("Book subject : %s\n", Book.subject);
        printf("Book book_id : %d\n", Book.id);
    }
    fclose(fp);
    return 0;
}
Compile and run it:
[vagrant@localhost bin]$ gcc -o unpack unpack.c
[vagrant@localhost bin]$ ./unpack
Book title : C Programming
Book author : Nuha Ali
Book subject : C Programming Tutorial
Book book_id : 6495407
Book title : Telecom Billing
Book author : Zara Ali
Book subject : Telecom Billing Tutorial
Book book_id : 6495700
Book title : PHP Programinng
Book author : Channing Huang
Book subject : PHP Programing Tutorial
Book book_id : 6495701
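The C programs above use a struct to define the packed layout, which raises two issues: alignment and NUL padding. Alignment can make the binary file larger than the sum of the fields declared in the struct. For example, if title is shortened to 40 bytes, a book would be expected to take 194 bytes but actually takes 196 (a multiple of 4), because the compiler inserts 2 padding bytes in front of id so that the int starts on a 4-byte boundary. Keep this in mind when designing the struct so that little space is wasted on padding.
A C string is a character array terminated by NUL (\0). When the strings are initialized as above, the compiler fills the rest of each array with NUL. If a field is declared without an initializer and assigned later, for example with strcpy, the bytes after the terminating NUL may hold non-NUL garbage, and PHP would then parse out those unwanted characters and display the wrong text. That is why the PHP side unpacks with Z rather than A: Z stops at the first NUL and ignores whatever follows. You can also fill the unused part with '\0' using memset, which guarantees that the rest of the field is all NUL.
For unsigned short, unsigned long, float, and double, byte order (big-endian versus little-endian) matters as well. Network byte order is conventionally big-endian, while most CPUs (x86, for example) are little-endian. The two sides only need to agree on one byte order; there is no automatic conversion in transit to worry about. For example, PHP can send a number (4 bytes) followed by 8 characters (8 bytes):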
<?php
declare(strict_types=1);

$host = "127.0.0.1";
$port = 9872;
$socket = \socket_create(AF_INET, SOCK_STREAM, SOL_TCP)
    or die("Unable to create socket\n");
@\socket_connect($socket, $host, $port) or die("Connect error.\n");
if ($err = \socket_last_error($socket))
{
    \socket_close($socket);
    die(\socket_strerror($err) . "\n");
}
// N: unsigned 32-bit integer, big-endian; a8: 8-byte NUL-padded string.
$binarydata = \pack("Na8", 256, "Channing");
echo \bin2hex($binarydata) . PHP_EOL;
$len = \socket_write($socket, $binarydata, \strlen($binarydata));
\socket_close($socket);
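On the PHP side the format code for the number is N, an unsigned 32-bit integer in big-endian byte order. The Go receiver parses the number as big-endian: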
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
)

// bufSize is the message length: a 4-byte number plus 8 characters.
const bufSize = 12

func handleConnection(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, bufSize)
	// A single Read may return fewer bytes than requested;
	// io.ReadFull keeps reading until all 12 bytes have arrived.
	n, err := io.ReadFull(conn, buf)
	if err != nil {
		fmt.Printf("err: %v\n", err)
		return
	}
	//fmt.Printf("\nreceive:%d bytes,data:%x, %s\n", n, buf[:4], buf[4:])
	fmt.Printf("\nreceive:%d bytes,data:%d, %s\n", n, binary.BigEndian.Uint32(buf[:4]), buf[4:])
}

func main() {
	ln, err := net.Listen("tcp", ":9872")
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		// Handle each connection in its own goroutine.
		go handleConnection(conn)
	}
}
Running receive.go and send.php shows that the data is sent and received correctly:
[root@localhost bin]$ php send.php
000001004368616e6e696e67
[root@localhost bin]$ go run receive.go
receive:12 bytes,data:256, Channing
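If the PHP side uses a mismatched format code, such as i (machine byte order, which is little-endian on x86) or n (a 16-bit big-endian value, only 2 bytes), the Go side will parse the number incorrectly. Little-endian encoding and parsing work just as well, as long as both sides use it.
Packed binary data can be converted to hexadecimal with bin2hex for inspection, and chr turns a single byte value into its character.
We usually read and write text files, whereas here we are reading and writing binary files. Turning text into bytes involves a character encoding such as ASCII or UTF-8. In fact text files and image files are all binary files and can all be inspected with xxd; it is just that specific software knows how to parse (decode) the corresponding bytes. In its right-hand column xxd shows only printable ASCII characters and represents everything else with a dot.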