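Text Format Language Specification

The Protocol Buffer Text Format Language specifies a syntax for representing protobuf data in text form, which is often useful for configurations or tests.

This format is distinct from the syntax used within a `.proto` schema file. This document contains reference documentation written using the syntax specified in ISO/IEC 14977 EBNF.

Note

This is a draft spec reverse-engineered from the C++ text format implementation, and it may change based on further discussion and review. While an effort has been made to keep the text format consistent across all supported languages, incompatibilities may still exist.

Example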

convolution_benchmark {
  label: "NHWC_128x20x20x56x160"
  input {
    dimension: [128, 56, 20, 20]
    data_type: DATA_HALF
    format: TENSOR_NHWC
  }
}

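Parsing Overview

The language elements in this spec are split into lexical and syntactic categories. Lexical elements must match the input text exactly as described, whereas syntactic elements may be separated by optional `WHITESPACE` and `COMMENT` tokens.

For example, a signed floating-point value comprises two syntactic elements: the sign (`-`) and the `FLOAT` literal. Optional whitespace and comments may appear between the sign and the number, but not within the number. Example: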

value: -2.0   # Valid: no additional whitespace.
value: - 2.0  # Valid: whitespace between '-' and '2.0'.
value: -
  # comment
  2.0         # Valid: whitespace and comments between '-' and '2.0'.
value: 2 . 0  # Invalid: the floating point period is part of the lexical
              # element, so no additional whitespace is allowed.

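There is one edge case that requires special attention: a number token (`FLOAT`, `DEC_INT`, `OCT_INT`, or `HEX_INT`) may not be immediately followed by an `IDENT` token. Example: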

foo: 10 bar: 20           # Valid: whitespace separates '10' and 'bar'
foo: 10,bar: 20           # Valid: ',' separates '10' and 'bar'
foo: 10[com.foo.ext]: 20  # Valid: '10' is followed immediately by '[', which is
                          # not an identifier.
foo: 10bar: 20            # Invalid: no space between '10' and identifier 'bar'.

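Lexical Elements

The lexical elements described below fall into two categories: uppercase primary elements and lowercase fragments. Only primary elements are included in the output stream of tokens used during syntactic analysis; fragments exist only to simplify the construction of primary elements.

When parsing input text, the longest matching primary element wins. Example: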

value: 10   # '10' is parsed as a DEC_INT token.
value: 10f  # '10f' is parsed as a FLOAT token, despite containing '10' which
            # would also match DEC_INT. In this case, FLOAT matches a longer
            # subsequence of the input.

Characters

char    = ? Any non-NUL unicode character ? ;
newline = ? ASCII #10 (line feed) ? ;

letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M"
       | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
       | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m"
       | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
       | "_" ;

oct = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" ;
dec = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
hex = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
    | "A" | "B" | "C" | "D" | "E" | "F"
    | "a" | "b" | "c" | "d" | "e" | "f" ;

Whitespace and Comments

COMMENT    = "#", { char - newline }, [ newline ] ;
WHITESPACE = " "
           | newline
           | ? ASCII #9  (horizontal tab) ?
           | ? ASCII #11 (vertical tab) ?
           | ? ASCII #12 (form feed) ?
           | ? ASCII #13 (carriage return) ? ;

Identifiers

IDENT = letter, { letter | dec } ;

Numeric Literals

dec_lit   = "0"
          | ( dec - "0" ), { dec } ;
float_lit = ".", dec, { dec }, [ exp ]
          | dec_lit, ".", { dec }, [ exp ]
          | dec_lit, exp ;
exp       = ( "E" | "e" ), [ "+" | "-" ], dec, { dec } ;

DEC_INT   = dec_lit ;
OCT_INT   = "0", oct, { oct } ;
HEX_INT   = "0", ( "X" | "x" ), hex, { hex } ;
FLOAT     = float_lit, [ "F" | "f" ]
          | dec_lit,   ( "F" | "f" ) ;

Decimal integers can be cast as floating-point values by using the `F` and `f` suffixes. Example:

foo: 10    # This is an integer value.
foo: 10f   # This is a floating-point value.
foo: 1.0f  # Also optional for floating-point literals.
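
The longest-match rule and the restriction on identifiers that directly follow a number can be sketched in Python. This is a minimal illustration under stated assumptions, not the tokenizer of any protobuf implementation: the function name `longest_numeric_token` and the regular expressions are hypothetical and were transcribed by hand from the grammar above.

```python
import re

# Fragments (lowercase in the grammar) are only building blocks.
DEC_LIT = r"(?:0|[1-9][0-9]*)"
EXP = r"(?:[Ee][+-]?[0-9]+)"
FLOAT_LIT = rf"(?:\.[0-9]+{EXP}?|{DEC_LIT}\.[0-9]*{EXP}?|{DEC_LIT}{EXP})"

# Primary elements (uppercase in the grammar) are the tokens that get emitted.
TOKEN_PATTERNS = {
    "DEC_INT": re.compile(DEC_LIT),
    "OCT_INT": re.compile(r"0[0-7]+"),
    "HEX_INT": re.compile(r"0[Xx][0-9A-Fa-f]+"),
    "FLOAT": re.compile(rf"(?:{FLOAT_LIT}[Ff]?|{DEC_LIT}[Ff])"),
}


def longest_numeric_token(text):
    """Return (token_name, lexeme) for the longest numeric token at the start of text."""
    best = None
    for name, pattern in TOKEN_PATTERNS.items():
        match = pattern.match(text)
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    if best is None:
        raise ValueError(f"no numeric token at start of {text!r}")
    rest = text[len(best[1]):]
    # A number token may not be immediately followed by an IDENT token.
    if re.match(r"[A-Za-z_]", rest):
        raise ValueError(f"number {best[1]!r} is immediately followed by an identifier")
    return best
```

For instance, `longest_numeric_token("10f")` yields a `FLOAT` token because it covers more input than the `DEC_INT` match `10`, while `longest_numeric_token("10bar: 20")` raises an error.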

String Literals

STRING = single_string | double_string ;
single_string = "'", { escape | char - "'" - newline - "\" }, "'" ;
double_string = '"', { escape | char - '"' - newline - "\" }, '"' ;

escape = "\a"                        (* ASCII #7  (bell)                 *)
       | "\b"                        (* ASCII #8  (backspace)            *)
       | "\f"                        (* ASCII #12 (form feed)            *)
       | "\n"                        (* ASCII #10 (line feed)            *)
       | "\r"                        (* ASCII #13 (carriage return)      *)
       | "\t"                        (* ASCII #9  (horizontal tab)       *)
       | "\v"                        (* ASCII #11 (vertical tab)         *)
       | "\?"                        (* ASCII #63 (question mark)        *)
       | "\\"                        (* ASCII #92 (backslash)            *)
       | "\'"                        (* ASCII #39 (apostrophe)           *)
       | '\"'                        (* ASCII #34 (quote)                *)
       | "\", oct, [ oct, [ oct ] ]  (* octal escaped byte value         *)
       | "\x", hex, [ hex ]          (* hexadecimal escaped byte value   *)
       | "\u", hex, hex, hex, hex    (* Unicode code point up to 0xffff  *)
       | "\U000",
         hex, hex, hex, hex, hex     (* Unicode code point up to 0xfffff *)
       | "\U0010",
         hex, hex, hex, hex ;        (* Unicode code point between 0x100000 and 0x10ffff *)

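Octal escape sequences consume up to three octal digits. Additional digits are passed through without escaping. For example, when unescaping the input `\1234`, the parser consumes three octal digits (123) to unescape the byte value 0x53 (ASCII 'S', 83 in decimal), and the subsequent '4' passes through as the byte value 0x34 (ASCII '4'). To ensure correct parsing, write octal escape sequences with 3 octal digits, using leading zeros as needed, such as `\000`, `\001`, `\063`, `\377`. Fewer than three digits are consumed when a non-numeric character follows the numeric characters, as in `\5Hello`.

Hexadecimal escape sequences consume up to two hexadecimal digits. For example, when unescaping `\x213`, the parser consumes only the first two digits (21) to unescape the byte value 0x21 (ASCII '!'), and the trailing '3' passes through unescaped. To ensure correct parsing, write hexadecimal escape sequences with 2 hexadecimal digits, using leading zeros as needed, such as `\x00`, `\x01`, `\xFF`. Fewer than two digits are consumed when a non-hexadecimal character follows the numeric character, as in `\xFHello` or `\x3world`.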
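
The digit-consumption rules for octal and hexadecimal escapes can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the behavior of any particular protobuf runtime: the helper name `unescape_byte_escapes` is hypothetical, only octal and hex escapes are handled (an escaped backslash such as `\\` is not treated specially), and rejecting octal values above 0xFF is a choice made for this sketch.

```python
import re

# Octal escapes take one to three digits; hex escapes take one or two.
_BYTE_ESCAPE = re.compile(rb"\\([0-7]{1,3})|\\x([0-9A-Fa-f]{1,2})")


def unescape_byte_escapes(raw: bytes) -> bytes:
    """Replace octal and hex byte escapes in raw with the bytes they denote."""

    def replace(match):
        octal, hexa = match.group(1), match.group(2)
        value = int(octal, 8) if octal is not None else int(hexa, 16)
        if value > 0xFF:
            # Assumption of this sketch: out-of-range octal values are an error.
            raise ValueError(f"escape out of byte range: {match.group(0)!r}")
        return bytes([value])

    return _BYTE_ESCAPE.sub(replace, raw)
```

With this sketch, `unescape_byte_escapes(rb"\1234")` consumes `\123` and leaves the `4` untouched, and `unescape_byte_escapes(rb"\x213")` consumes `\x21` and leaves the `3` untouched.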

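Use byte escapes only for fields of type `bytes`. While it is possible to use byte escapes in fields of type `string`, those escape sequences are required to form valid UTF-8 sequences. Using byte escapes to express UTF-8 sequences is error-prone. Prefer Unicode escape sequences for unprintable characters and line-breaking characters in literals for `string`-type fields.

Longer strings can be broken into several quoted strings on successive lines. For example: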

  quote:
      "When we got into office, the thing that surprised me most was to find "
      "that things were just as bad as we'd been saying they were.\n\n"
      "  -- John F. Kennedy"

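Unicode code points are interpreted per Unicode 13 Table A-1 Extended BNF and are encoded as UTF-8.

Warning

The C++ implementation currently interprets escaped high-surrogate code points as UTF-16 code units, and expects a `\uHHHH` low-surrogate code point to follow immediately, without any split across separate quoted strings. In addition, unpaired surrogates are rendered directly into invalid UTF-8. Both of these are non-conforming behaviors[^surrogates] and should not be relied on.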
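
As a rough illustration of how a Unicode escape maps to UTF-8 bytes, the following Python sketch converts the hexadecimal digits of a `\u` or `\U` escape into the encoded code point. The function name `encode_unicode_escape` is hypothetical, and rejecting surrogate code points outright is an assumption of this sketch, chosen because the surrogate handling described above is non-conforming.

```python
def encode_unicode_escape(hex_digits: str) -> bytes:
    """Encode the code point named by the hex digits of a \\u or \\U escape as UTF-8."""
    code_point = int(hex_digits, 16)
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption of this sketch: surrogates are not valid scalar values.
        raise ValueError(f"surrogate code point U+{code_point:04X}")
    # chr() raises ValueError for values above 0x10FFFF.
    return chr(code_point).encode("utf-8")
```

For example, the digits of `\u00e9` encode to the two bytes `0xC3 0xA9`, and the digits of `\U0001F600` encode to four bytes.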

Syntax Elements

Message

A message is a collection of fields. A text format file is a single message.

Message = { Field } ;

Literals

Field literal values can be numbers, strings, or identifiers such as `true` or enum values.

String             = STRING, { STRING } ;
Float              = [ "-" ], FLOAT ;
Identifier         = IDENT ;
SignedIdentifier   = "-", IDENT ;   (* For example, "-inf" *)
DecSignedInteger   = "-", DEC_INT ;
OctSignedInteger   = "-", OCT_INT ;
HexSignedInteger   = "-", HEX_INT ;
DecUnsignedInteger = DEC_INT ;
OctUnsignedInteger = OCT_INT ;
HexUnsignedInteger = HEX_INT ;

A single string value can comprise multiple quoted parts separated by optional whitespace. Example:

a_string: "first part" 'second part'
          "third part"
no_whitespace: "first""second"'third''fourth'
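
The joining of adjacent quoted parts can be sketched as a small Python scanner. This is a simplified illustration under stated assumptions: the name `join_string_parts` is hypothetical, comments between parts are not handled, and escape sequences are skipped over but left undecoded.

```python
def join_string_parts(text: str) -> str:
    """Concatenate adjacent single- or double-quoted parts of one string value."""
    parts = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in " \t\n\r\v\f":  # optional whitespace between parts
            i += 1
            continue
        if ch not in "'\"":
            raise ValueError(f"expected a quote at offset {i}")
        end = i + 1
        while end < len(text) and text[end] != ch:
            if text[end] == "\\":  # skip the character after a backslash
                end += 1
            end += 1
        if end >= len(text):
            raise ValueError("unterminated string literal")
        parts.append(text[i + 1:end])
        i = end + 1
    return "".join(parts)
```

Applied to the `no_whitespace` value above, the sketch returns `firstsecondthirdfourth`.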

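Field Names

Fields that are part of the containing message use simple `Identifiers` as names. `Extension` and `Any` field names are wrapped in square brackets and are fully qualified. `Any` field names are prefixed with a qualifying domain name, such as `type.googleapis.com/`.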

FieldName     = ExtensionName | AnyName | IDENT ;
ExtensionName = "[", TypeName, "]" ;
AnyName       = "[", Domain, "/", TypeName, "]" ;
TypeName      = IDENT, { ".", IDENT } ;
Domain        = IDENT, { ".", IDENT } ;

Regular fields and extension fields can have scalar or message values. `Any` fields are always messages. Example:

reg_scalar: 10
reg_message { foo: "bar" }

[com.foo.ext.scalar]: 10
[com.foo.ext.message] { foo: "bar" }

any_value {
  [type.googleapis.com/com.foo.any] { foo: "bar" }
}

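Unknown Fields

Text format parsers cannot support unknown fields represented as a raw field number in place of a field name, because three of the six wire types are represented in the same way in text format. Some text format serializer implementations encode unknown fields using a field number and a numeric representation of the value, but this is inherently lossy because the wire-type information is discarded. By comparison, the wire format is lossless because it includes the wire type in each field tag, in the form `(field_number << 3) | wire_type`. For more information on encoding, see the Encoding topic.

Without field type information from the message schema, the value cannot be correctly encoded into a wire-format proto message.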
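
The tag formula can be made concrete with a few lines of Python. The helper names `make_tag` and `split_tag` are hypothetical; the wire type numbers follow the protobuf Encoding documentation.

```python
# Wire type numbers as listed in the protobuf Encoding documentation.
VARINT, I64, LEN, SGROUP, EGROUP, I32 = 0, 1, 2, 3, 4, 5


def make_tag(field_number: int, wire_type: int) -> int:
    """Pack a field number and a wire type into a single tag value."""
    return (field_number << 3) | wire_type


def split_tag(tag: int):
    """Recover (field_number, wire_type) from a tag value."""
    return tag >> 3, tag & 0x7
```

Because the low three bits carry the wire type, `split_tag(make_tag(2, LEN))` recovers both the field number and how the payload is laid out. A text rendering that keeps only the field number and a numeric value has no equivalent of those three bits.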

Fields

Field values can be literals (strings, numbers, or identifiers) or nested messages.

Field        = ScalarField | MessageField ;
MessageField = FieldName, [ ":" ], ( MessageValue | MessageList ), [ ";" | "," ] ;
ScalarField  = FieldName, ":",     ( ScalarValue  | ScalarList  ), [ ";" | "," ] ;
MessageList  = "[", [ MessageValue, { ",", MessageValue } ], "]" ;
ScalarList   = "[", [ ScalarValue,  { ",", ScalarValue  } ], "]" ;
MessageValue = "{", Message, "}" | "<", Message, ">" ;
ScalarValue  = String
             | Float
             | Identifier
             | SignedIdentifier
             | DecSignedInteger
             | OctSignedInteger
             | HexSignedInteger
             | DecUnsignedInteger
             | OctUnsignedInteger
             | HexUnsignedInteger ;

The `:` delimiter between the field name and value is required for scalar fields but optional for message fields (including lists). Example:

scalar: 10          # Valid
scalar  10          # Invalid
scalars: [1, 2, 3]  # Valid
scalars  [1, 2, 3]  # Invalid
message: {}         # Valid
message  {}         # Valid
messages: [{}, {}]  # Valid
messages  [{}, {}]  # Valid

Values of message fields can be surrounded by curly brackets or angle brackets:

message: { foo: "bar" }
message: < foo: "bar" >

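Fields marked `repeated` can have multiple values specified by repeating the field, by using the special `[]` list syntax, or by some combination of both. The order of values is maintained. Example: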

repeated_field: 1
repeated_field: 2
repeated_field: [3, 4, 5]
repeated_field: 6
repeated_field: [7, 8, 9]
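
The accumulation of repeated values can be sketched in Python as follows. The function `collect_repeated` is a hypothetical helper that receives the occurrences of one field in source order, where a Python list stands in for the `[]` list syntax.

```python
def collect_repeated(occurrences):
    """Flatten single values and list-syntax values into one ordered list."""
    values = []
    for occurrence in occurrences:
        if isinstance(occurrence, list):
            values.extend(occurrence)  # list syntax contributes several values
        else:
            values.append(occurrence)  # a plain occurrence contributes one value
    return values
```

Feeding the example above as `collect_repeated([1, 2, [3, 4, 5], 6, [7, 8, 9]])` produces the values 1 through 9 in order.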

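Non-`repeated` fields cannot use the list syntax. For example, `[0]` is not valid for `optional` or `required` fields. Fields marked `optional` can be omitted or specified once. Fields marked `required` must be specified exactly once.

Fields not specified in the associated *.proto* message are not allowed unless the field name is present in the message's `reserved` field list. `reserved` fields, if present in any form (scalar, list, message), are simply ignored by text format.

Value Types

When a field's associated *.proto* value type is known, the following value descriptions and constraints apply. For the purposes of this section, we declare the following container elements: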

signedInteger   = DecSignedInteger | OctSignedInteger | HexSignedInteger ;
unsignedInteger = DecUnsignedInteger | OctUnsignedInteger | HexUnsignedInteger ;
integer         = signedInteger | unsignedInteger ;

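| .proto Type | Values |
| --- | --- |
| `float`, `double` | A `Float`, `DecSignedInteger`, or `DecUnsignedInteger` element, or an `Identifier` or `SignedIdentifier` element whose `IDENT` portion is equal to "inf", "infinity", or "nan" (case-insensitive). Overflows are treated as infinity or -infinity. Octal and hexadecimal values are not valid. Note: "nan" should be interpreted as quiet NaN. |
| `int32`, `sint32`, `sfixed32` | Any `integer` element in the range *-0x80000000* to *0x7FFFFFFF*. |
| `int64`, `sint64`, `sfixed64` | Any `integer` element in the range *-0x8000000000000000* to *0x7FFFFFFFFFFFFFFF*. |
| `uint32`, `fixed32` | Any `unsignedInteger` element in the range *0* to *0xFFFFFFFF*. Note that signed values (*-0*) are not valid. |
| `uint64`, `fixed64` | Any `unsignedInteger` element in the range *0* to *0xFFFFFFFFFFFFFFFF*. Note that signed values (*-0*) are not valid. |
| `string` | A `String` element containing valid UTF-8 data. Any escape sequences must form valid UTF-8 byte sequences when unescaped. |
| `bytes` | A `String` element, possibly including invalid UTF-8 escape sequences. |
| `bool` | An `Identifier` element or any `unsignedInteger` element matching one of the following values. True values: "True", "true", "t", 1. False values: "False", "false", "f", 0. Any unsigned integer representation of 0 or 1 is permitted: 00, 0x0, 01, 0x1, etc. |
| Enum values | An `Identifier` element containing an enum value name, or any `integer` element in the range *-0x80000000* to *0x7FFFFFFF* containing an enum value number. It is not valid to specify a name that is not a member of the field's `enum` type definition. Depending on the particular protobuf runtime implementation, it may or may not be valid to specify a number that is not a member of the field's `enum` type definition. Text format processors that are not tied to a particular runtime implementation (such as IDE support) may choose to issue a warning when a provided number is not a valid member. Note that certain names that are valid keywords in other contexts, such as "true" or "infinity", are also valid enum value names. |
| Message values | A `MessageValue` element. |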
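
Two of the constraints above, the accepted `bool` spellings and the integer ranges, can be checked with a short Python sketch. The names `parse_unsigned`, `parse_bool`, `INT_RANGES`, and `in_range` are hypothetical, and the sketch covers only these two rules. In particular, the rule that a signed literal such as `-0` is invalid for unsigned types is syntactic and is not captured by a numeric range check.

```python
import re

_UNSIGNED = re.compile(r"0[Xx][0-9A-Fa-f]+|0[0-7]+|0|[1-9][0-9]*")

INT_RANGES = {
    "int32": (-0x80000000, 0x7FFFFFFF),
    "int64": (-0x8000000000000000, 0x7FFFFFFFFFFFFFFF),
    "uint32": (0, 0xFFFFFFFF),
    "uint64": (0, 0xFFFFFFFFFFFFFFFF),
}


def parse_unsigned(token: str) -> int:
    """Parse a decimal, octal, or hexadecimal unsigned integer token."""
    if not _UNSIGNED.fullmatch(token):
        raise ValueError(f"not an unsigned integer: {token!r}")
    if token[:2].lower() == "0x":
        return int(token, 16)
    if len(token) > 1 and token[0] == "0":
        return int(token, 8)
    return int(token, 10)


def parse_bool(token: str) -> bool:
    """Map an identifier or unsigned integer token to a bool value."""
    if token in ("True", "true", "t"):
        return True
    if token in ("False", "false", "f"):
        return False
    value = parse_unsigned(token)
    if value not in (0, 1):
        raise ValueError(f"not a valid bool: {token!r}")
    return value == 1


def in_range(proto_type: str, value: int) -> bool:
    """Check a parsed integer against the range for the given .proto type."""
    low, high = INT_RANGES[proto_type]
    return low <= value <= high
```

For example, `parse_bool("0x0")` is false, `parse_bool("01")` is true, and `in_range("uint32", -1)` is false.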

Extension Fields

Extension fields are specified using their qualified names. Example:

local_field: 10
[com.example.ext_field]: 20

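Extension fields are often defined in other *.proto* files. The text format language does not provide a mechanism for specifying the locations of the files that define extension fields; instead, the parser must have prior knowledge of their locations.

`Any` Fields

Text format supports an expanded form of the `google.protobuf.Any` well-known type, using a special syntax resembling extension fields. Example: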

local_field: 10

# An Any value using regular fields.
any_value {
  type_url: "type.googleapis.com/com.example.SomeType"
  value: "\x0a\x05hello"  # serialized bytes of com.example.SomeType
}

# The same value using Any expansion
any_value {
  [type.googleapis.com/com.example.SomeType] {
    field1: "hello"
  }
}

In this example, `any_value` is a field of type `google.protobuf.Any`, and it stores a serialized `com.example.SomeType` message containing `field1: hello`.
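
The serialized bytes `"\x0a\x05hello"` in the example can be reproduced with a small Python sketch. It assumes, since the schema of `com.example.SomeType` is not shown, that `field1` is a `string` field with field number 1. The helper `encode_string_field` is hypothetical and only handles tags and lengths that fit in a single byte.

```python
def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one length-delimited string field in wire format (small cases only)."""
    payload = text.encode("utf-8")
    if not 1 <= field_number < 16 or len(payload) >= 128:
        raise ValueError("sketch only handles single-byte tags and lengths")
    tag = (field_number << 3) | 2  # wire type 2 is length-delimited
    return bytes([tag, len(payload)]) + payload
```

Under that assumption, `encode_string_field(1, "hello")` yields the tag byte `0x0a`, the length byte `0x05`, and the five bytes of `hello`, which matches the `value` shown in the first form of the example.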

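`group` Fields

In text format, a `group` field uses a normal `MessageValue` element as its value, but it is specified using the capitalized group name rather than the implicit lowercased field name. Example: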

message MessageWithGroup {
  optional group MyGroup = 1 {
    optional int32 my_value = 1;
  }
}

Given the above *.proto* definition, the following text format is a valid `MessageWithGroup`:

MyGroup {
  my_value: 1
}

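As with message fields, the `:` delimiter between the group name and value is optional.

`map` Fields

Text format does not provide a custom syntax for specifying map field entries. When a `map` field is defined in a *.proto* file, an implicit `Entry` message is defined containing `key` and `value` fields. Map fields are always repeated and accept multiple key/value entries. Example: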

message MessageWithMap {
  map<string, int32> my_map = 1;
}

Given the above *.proto* definition, the following text format is a valid `MessageWithMap`:

my_map { key: "entry1" value: 1 }
my_map { key: "entry2" value: 2 }

# You can also use the list syntax
my_map: [
  { key: "entry3" value: 3 },
  { key: "entry4" value: 4 }
]

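Both the `key` and `value` fields are optional and default to the zero value of their respective types if unspecified. If a key is duplicated, only the last-specified value is retained in the parsed map.

The order of map entries is not maintained in textprotos.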
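
The defaulting and last-value-wins rules for map entries can be sketched in Python. The function `build_map` is hypothetical and models parsed entries as plain dictionaries; the default key and value shown correspond to the `map<string, int32>` example.

```python
def build_map(entries, default_key="", default_value=0):
    """Fold parsed Entry messages into a map, keeping the last value per key."""
    result = {}
    for entry in entries:
        key = entry.get("key", default_key)          # missing key: zero value
        result[key] = entry.get("value", default_value)  # later entries overwrite
    return result
```

For example, two entries with the key `"entry1"` leave only the second value in the result, and an entry without a `key` is stored under the empty string.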

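`oneof` Fields

While there is no special syntax related to `oneof` fields in text format, only one `oneof` member may be specified at a time. Specifying multiple members concurrently is not valid. Example: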

message OneofExample {
  message MessageWithOneof {
    optional string not_part_of_oneof = 1;
    oneof Example {
      string first_oneof_field = 2;
      string second_oneof_field = 3;
    }
  }
  repeated MessageWithOneof message = 1;
}

The above *.proto* definition results in the following text format behavior:

# Valid: only one field from the Example oneof is set.
message {
  not_part_of_oneof: "always valid"
  first_oneof_field: "valid by itself"
}

# Valid: the other oneof field is set.
message {
  not_part_of_oneof: "always valid"
  second_oneof_field: "valid by itself"
}

# Invalid: multiple fields from the Example oneof are set.
message {
  not_part_of_oneof: "always valid"
  first_oneof_field: "not valid"
  second_oneof_field: "not valid"
}
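
A validator for this rule can be sketched in a few lines of Python. The function `check_oneof` and the constant `EXAMPLE_ONEOF` are hypothetical; the sketch looks only at which distinct member names appear in one message and does not address a single member being repeated.

```python
EXAMPLE_ONEOF = frozenset({"first_oneof_field", "second_oneof_field"})


def check_oneof(field_names, members=EXAMPLE_ONEOF):
    """Return the single oneof member that is set, or None if none is set."""
    present = sorted(set(field_names) & members)
    if len(present) > 1:
        raise ValueError(f"multiple oneof members set: {present}")
    return present[0] if present else None
```

The first two messages above pass this check, while the third raises an error because both members of `Example` are present.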

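Text Format Files

A text format file uses the `.txtpb` filename suffix and contains a single `Message`. Text format files are UTF-8 encoded. An example textproto file is provided below.

Important

`.txtpb` is the canonical text format file extension and should be preferred to the alternatives. This suffix is preferred for its brevity and its consistency with the official wire-format file extension `.binpb`. The legacy canonical extension `.textproto` is still widely used and has tooling support. Some tooling also supports the legacy extensions `.textpb` and `.pbtxt`. All other extensions besides the above are **strongly** discouraged; in particular, extensions such as `.protoascii` wrongly imply that text format is ASCII-only, and others such as `.pb.txt` are not recognized by common tooling.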

# This is an example of Protocol Buffer's text format.
# Unlike .proto files, only shell-style line comments are supported.

name: "John Smith"

pet {
  kind: DOG
  name: "Fluffy"
  tail_wagginess: 0.65f
}

pet <
  kind: LIZARD
  name: "Lizzy"
  legs: 4
>

string_value_with_escape: "valid \n escape"
repeated_values: [ "one", "two", "three" ]

The header comments `proto-file` and `proto-message` inform developer tools of the schema, so that they can provide various features.

# proto-file: some/proto/my_file.proto
# proto-message: MyMessage
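
A tool might read these header comments roughly as in the Python sketch below. This is an assumption about how such a reader could work, not a description of any existing tool: the function `read_schema_headers` is hypothetical, and it only inspects the leading block of comment lines.

```python
import re

_HEADER = re.compile(r"^#\s*(proto-file|proto-message):\s*(\S+)\s*$")


def read_schema_headers(text: str) -> dict:
    """Collect proto-file and proto-message values from leading comment lines."""
    headers = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not end the header block
        if not stripped.startswith("#"):
            break  # first non-comment line ends the header block
        match = _HEADER.match(stripped)
        if match:
            headers[match.group(1)] = match.group(2)
    return headers
```

Applied to the two header lines above, the sketch returns a mapping from `proto-file` to `some/proto/my_file.proto` and from `proto-message` to `MyMessage`.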

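Working with the Format Programmatically

Because individual Protocol Buffer implementations emit neither a consistent nor a canonical text format, tools or libraries that modify TextProto files or emit TextProto output must explicitly use https://github.com/protocolbuffers/txtpbfmt to format their output.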