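In web development, API protocols are usually text formats carried over HTTP(S), such as JSON or XML. That is perfectly adequate for communication between programming languages, but binary encodings are sometimes used instead for better performance, for example Apache Thrift or gRPC with Protocol Buffers. Most languages offer some way to pack and unpack binary data. PHP provides the pack/unpack functions for binary encoding and decoding, and they closely resemble Perl's pack function.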
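PHP's pack function takes a format string as its first argument, followed by the values to pack. Each format code can be followed by a repeater: for the string codes (a, A, Z) it is the field length in bytes, and for the integer codes it is the number of values to consume. A * repeater instead consumes all the remaining data. The full list of format codes is on the pack function's manual page. Consider the following example, pack.php: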
<?php
declare(strict_types=1);

$data = array(
    array(
        'title'   => 'C Programming',
        'author'  => 'Nuha Ali',
        'subject' => 'C Programming Tutorial',
        'book_id' => 6495407
    ),
    array(
        'title'   => 'Telecom Billing',
        'author'  => 'Zara Ali',
        'subject' => 'Telecom Billing Tutorial',
        'book_id' => 6495700
    ),
    array(
        'title'   => 'PHP Programinng',
        'author'  => 'Channing Huang',
        'subject' => 'PHP Programing Tutorial',
        'book_id' => 6495701
    )
);

$fp = \fopen("example", "wb");
if ($fp === false) {
    die("Unable to open file\n");
}
foreach ($data as $row) {
    // Z50/Z50/Z100: NUL-padded fixed-width strings; i: signed int in native size and byte order.
    $bin = \pack('Z50Z50Z100i', ...\array_values($row));
    \fwrite($fp, $bin);
}
\fclose($fp);
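Here we have a books array. Each book has four fields (title, author, subject and book_id), packed with the format codes Z50, Z50, Z100 and i. Z50 packs a string into a 50-byte field and pads the remainder with NUL bytes. a50 is similar, except that Z always keeps the last byte of the field as NUL, so a Z50 field holds at most 49 characters. To pad with spaces instead, use A50. Z100 works like Z50 with a 100-byte field. i packs a signed integer in the machine's native size and byte order, matching C's signed int, which is 4 bytes on this machine. Each book therefore occupies 50+50+100+4=204 bytes, and the three books take 612 bytes in total. Check the file size: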
[vagrant@localhost bin]$ ls -la | grep example
-rw-r--r--. 1 vagrant vagrant 612 Aug 14 02:22 example
Inspect the binary content with xxd to confirm that the unused part of each field is padded with NUL bytes:
[vagrant@localhost bin]$ xxd -b example
0000000: 01000011 00100000 01010000 01110010 01101111 01100111  C Prog
0000006: 01110010 01100001 01101101 01101101 01101001 01101110  rammin
000000c: 01100111 00000000 00000000 00000000 00000000 00000000  g.....
0000012: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000018: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000001e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000024: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000002a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000030: 00000000 00000000 01001110 01110101 01101000 01100001  ..Nuha
0000036: 00100000 01000001 01101100 01101001 00000000 00000000   Ali..
000003c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000042: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000048: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000004e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000054: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000005a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000060: 00000000 00000000 00000000 00000000 01000011 00100000  ....C 
0000066: 01010000 01110010 01101111 01100111 01110010 01100001  Progra
000006c: 01101101 01101101 01101001 01101110 01100111 00100000  mming 
0000072: 01010100 01110101 01110100 01101111 01110010 01101001  Tutori
0000078: 01100001 01101100 00000000 00000000 00000000 00000000  al....
000007e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000084: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000008a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000090: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000096: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000009c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000a2: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000a8: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000ae: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000b4: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000ba: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000c0: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000c6: 00000000 00000000 10101111 00011100 01100011 00000000  ....c.
00000cc: 01010100 01100101 01101100 01100101 01100011 01101111  Teleco
00000d2: 01101101 00100000 01000010 01101001 01101100 01101100  m Bill
00000d8: 01101001 01101110 01100111 00000000 00000000 00000000  ing...
00000de: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000e4: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000ea: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000f0: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000f6: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00000fc: 00000000 00000000 01011010 01100001 01110010 01100001  ..Zara
0000102: 00100000 01000001 01101100 01101001 00000000 00000000   Ali..
0000108: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000010e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000114: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000011a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000120: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000126: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000012c: 00000000 00000000 00000000 00000000 01010100 01100101  ....Te
0000132: 01101100 01100101 01100011 01101111 01101101 00100000  lecom 
0000138: 01000010 01101001 01101100 01101100 01101001 01101110  Billin
000013e: 01100111 00100000 01010100 01110101 01110100 01101111  g Tuto
0000144: 01110010 01101001 01100001 01101100 00000000 00000000  rial..
000014a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000150: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000156: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000015c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000162: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000168: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000016e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000174: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000017a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000180: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000186: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000018c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000192: 00000000 00000000 11010100 00011101 01100011 00000000  ....c.
0000198: 01010000 01001000 01010000 00100000 01010000 01110010  PHP Pr
000019e: 01101111 01100111 01110010 01100001 01101101 01101001  ogrami
00001a4: 01101110 01101110 01100111 00000000 00000000 00000000  nng...
00001aa: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001b0: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001b6: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001bc: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001c2: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001c8: 00000000 00000000 01000011 01101000 01100001 01101110  ..Chan
00001ce: 01101110 01101001 01101110 01100111 00100000 01001000  ning H
00001d4: 01110101 01100001 01101110 01100111 00000000 00000000  uang..
00001da: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001e0: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001e6: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001ec: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001f2: 00000000 00000000 00000000 00000000 00000000 00000000  ......
00001f8: 00000000 00000000 00000000 00000000 01010000 01001000  ....PH
00001fe: 01010000 00100000 01010000 01110010 01101111 01100111  P Prog
0000204: 01110010 01100001 01101101 01101001 01101110 01100111  raming
000020a: 00100000 01010100 01110101 01110100 01101111 01110010   Tutor
0000210: 01101001 01100001 01101100 00000000 00000000 00000000  ial...
0000216: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000021c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000222: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000228: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000022e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000234: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000023a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000240: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000246: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000024c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000252: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000258: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000025e: 00000000 00000000 11010101 00011101 01100011 00000000  ....c.
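PHP's unpack function takes three arguments: the format string, the packed data, and an optional starting offset. It returns an associative array. Each format code in the format string can be followed by the key to use for that value, and the codes are separated by /.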
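<?php
declare(strict_types=1);

$fp = \fopen("example", "rb");
if ($fp === false) {
    die("Unable to open file\n");
}
while (!\feof($fp)) {
    // Each record is 50 + 50 + 100 + 4 = 204 bytes.
    $data = \fread($fp, 204);
    // Skip the empty read at end of file and any truncated record.
    if ($data !== false && \strlen($data) === 204) {
        $arr = \unpack("Z50title/Z50author/Z100subject/ibook_id", $data);
        foreach ($arr as $key => $value) {
            echo $key . "->" . $value . PHP_EOL;
        }
        echo PHP_EOL;
    }
}
\fclose($fp);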
This reads the binary file back and parses every record correctly.
Any programming language can read and write binary files. For example, the same data can be packed in C:
#include <stdio.h>
#include <stdlib.h>

struct book {
    char title[50];
    char author[50];
    char subject[100];
    int  id;
};

int main(void)
{
    FILE *fp;
    /* Bytes not covered by an initializer are zeroed, so every field is NUL-padded. */
    struct book books[3] = {
        {"C Programming", "Nuha Ali", "C Programming Tutorial", 6495407},
        {"Telecom Billing", "Zara Ali", "Telecom Billing Tutorial", 6495700},
        {"PHP Programinng", "Channing Huang", "PHP Programing Tutorial", 6495701}
    };

    fp = fopen("newbook", "wb");
    if (fp == NULL) {
        printf("Error opening file\n");
        exit(1);
    }

    /* sizeof yields a size_t, which is printed with %zu. */
    printf("sizeof books %10zu\n", sizeof(books));
    printf("sizeof book struct %10zu\n", sizeof(struct book));
    printf("sizeof book1 title %10zu\n", sizeof(books[0].title));
    printf("sizeof book1 author %10zu\n", sizeof(books[0].author));
    printf("sizeof book1 subject %10zu\n", sizeof(books[0].subject));
    printf("sizeof book1 id %10zu\n", sizeof(books[0].id));

    /* Write all three structs in one call, exactly as they are laid out in memory. */
    fwrite(books, sizeof(books), 1, fp);
    fclose(fp);
    return 0;
}
Compile and run it:
[vagrant@localhost bin]$ gcc -o pack pack.c
[vagrant@localhost bin]$ ./pack
sizeof books        612
sizeof book struct        204
sizeof book1 title         50
sizeof book1 author         50
sizeof book1 subject        100
sizeof book1 id          4
[vagrant@localhost bin]$ ls -lah | grep newbook
-rw-r--r--. 1 vagrant vagrant 612 Aug 14 02:50 newbook
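The generated file has the same size as the one produced by PHP, xxd shows identical content, and unpack.php can parse it once it is pointed at newbook. In the same way, C can parse the binary data packed by PHP: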
#include <stdio.h>
#include <stdlib.h>

struct book {
    char title[50];
    char author[50];
    char subject[100];
    int  id;
};

int main(void)
{
    int i;
    FILE *fp;
    struct book Book;

    fp = fopen("example", "rb");
    if (fp == NULL) {
        printf("Error opening file\n");
        exit(1);
    }

    for (i = 0; i < 3; i++) {
        /* Stop if a complete 204-byte record could not be read. */
        if (fread(&Book, sizeof(struct book), 1, fp) != 1) {
            printf("Error reading record %d\n", i);
            break;
        }
        printf("Book title : %s\n", Book.title);
        printf("Book author : %s\n", Book.author);
        printf("Book subject : %s\n", Book.subject);
        printf("Book book_id : %d\n", Book.id);
    }

    fclose(fp);
    return 0;
}
Compile and run it:
[vagrant@localhost bin]$ gcc -o unpack unpack.c
[vagrant@localhost bin]$ ./unpack
Book title : C Programming
Book author : Nuha Ali
Book subject : C Programming Tutorial
Book book_id : 6495407
Book title : Telecom Billing
Book author : Zara Ali
Book subject : Telecom Billing Tutorial
Book book_id : 6495700
Book title : PHP Programinng
Book author : Channing Huang
Book subject : PHP Programing Tutorial
Book book_id : 6495701
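These examples use a C struct to define the packed layout, which raises two issues: alignment and NUL padding. Alignment can make the binary file larger than the sum of the struct's fields. For example, if title is shortened to 40 bytes, a book would be expected to take 194 bytes but actually takes 196 (a multiple of 4): the compiler inserts 2 padding bytes in front of the int member id so that it starts on a 4-byte boundary. Design the struct with this in mind to avoid wasting space on padding. A C string is a character array terminated by NUL (\0). When the array is initialized as above, the compiler zero-fills the rest of it. But if the struct is declared without an initializer and the fields are assigned later, for example with strcpy, the bytes after the terminating NUL keep whatever values were already in memory, and a parser that does not stop at the first NUL returns those stray bytes. That is why the PHP side unpacks with Z rather than A: Z stops at the first NUL and ignores whatever follows. To guarantee that the remainder of each field is all NUL, clear the struct with memset (filling it with '\0') before assigning the fields.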
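Multi-byte types such as unsigned short, unsigned long, float and double also raise the question of byte order, big-endian or little-endian; see the byte order reference listed at the end. Network protocols conventionally use big-endian (network byte order), while most CPUs, x86 included, are little-endian. The network delivers bytes exactly as they were sent and performs no conversion, so all that matters is that both sides agree on big-endian or little-endian. For example, PHP can send a number (4 bytes) followed by 8 characters (8 bytes):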
<?php
declare(strict_types=1);

$host = "127.0.0.1";
$port = 9872;

$socket = \socket_create(AF_INET, SOCK_STREAM, SOL_TCP) or die("Unable to create socket\n");
if (!@\socket_connect($socket, $host, $port)) {
    $err = \socket_last_error($socket);
    \socket_close($socket);
    die("Connect error: " . \socket_strerror($err) . "\n");
}

// N: unsigned 32-bit big-endian integer; a8: 8-byte NUL-padded string.
$binarydata = \pack("Na8", 256, "Channing");
echo \bin2hex($binarydata) . PHP_EOL;

\socket_write($socket, $binarydata, \strlen($binarydata));
\socket_close($socket);
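In PHP the format code for the number is N, an unsigned 32-bit integer in big-endian byte order. On the Go side, receive the bytes and decode the number as big-endian: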
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
)

// bufSize is the message length: a 4-byte number plus 8 characters.
const bufSize = 12

func handleConnection(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, bufSize)
	// A single Read may return fewer bytes than requested; ReadFull waits for all 12.
	n, err := io.ReadFull(conn, buf)
	if err != nil {
		fmt.Printf("err: %v\n", err)
		return
	}
	// The first 4 bytes are the big-endian number, the remaining 8 are the string.
	fmt.Printf("\nreceive:%d bytes,data:%d, %s\n", n, binary.BigEndian.Uint32(buf[:4]), buf[4:])
}

func main() {
	ln, err := net.Listen("tcp", ":9872")
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handleConnection(conn)
	}
}
Run receive.go and send.php in separate terminals; the data is sent and received correctly:
[root@localhost bin]$ php send.php
000001004368616e6e696e67
[root@localhost bin]$ go run receive.go
receive:12 bytes,data:256, Channing
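If PHP uses a format code that does not match, such as i (native size and byte order, which is little-endian on x86) or n (a 16-bit big-endian integer), Go decodes the wrong number. Little-endian works just as well, provided both sides use it: V in PHP and binary.LittleEndian in Go.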
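Packed binary data can be converted to hexadecimal with bin2hex for inspection. chr converts a single byte value (0 to 255) to the corresponding character, and ord does the reverse.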
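We usually read and write text files, whereas here we read and write binary files. A text file consists of bytes too, produced by encoding characters with a character encoding such as ASCII or UTF-8. Text files and image files alike are binary data and can be inspected with xxd; it simply takes the right software to decode the bytes. In its right-hand column xxd shows only printable ASCII characters and prints a dot for every other byte.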
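A small Go sketch can illustrate the point that text is just encoded bytes. The helper asciiColumn is made up for this example. It approximates the right-hand column of xxd by keeping printable ASCII (0x20 to 0x7e) and replacing everything else with a dot.

```go
package main

import "fmt"

// asciiColumn keeps printable ASCII bytes and replaces the rest with '.',
// roughly what xxd prints next to the hex or binary columns.
func asciiColumn(data []byte) string {
	out := make([]byte, len(data))
	for i, c := range data {
		if c >= 0x20 && c <= 0x7e {
			out[i] = c
		} else {
			out[i] = '.'
		}
	}
	return string(out)
}

func main() {
	// Go strings are byte sequences; this literal is UTF-8 encoded.
	text := []byte("PHP \u4e2d\u6587")
	fmt.Printf("% x\n", text)      // 50 48 50 20 e4 b8 ad e6 96 87
	fmt.Println(asciiColumn(text)) // PHP ......
}
```

The two Chinese characters take three bytes each in UTF-8, and none of those bytes is printable ASCII, so they show up as six dots.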
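References:
Apache Thrift: a scalable cross-language service development framework
Using Google Protocol Buffer and how it works
Protobuf: an efficient data compression and encoding scheme
PHP: pack/unpack in depth
C Programming Files I/O
Unpacking binary data in PHP
Big endian and Little endian
PHP RPC development with Thrift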