Simple binary serialization#

Simple binary serialization (SBS) features:

  • schema based

  • small schema language optimized for human readability

  • schema re-usability and organization into modules

  • small number of built in types

  • polymorphic data types

  • binary serialization

  • designed to enable simple and efficient implementation

  • optimized for small encoded data size

Example of SBS schema:

module Module

Entry(K, V) = Record {
    key: K
    value: V
}

Collection(K) = Choice {
    null: None
    bool: Entry(K, Boolean)
    int: Entry(K, Integer)
    float: Entry(K, Float)
    str: Entry(K, String)
    bytes: Entry(K, Bytes)
}

IntKeyCollection = Collection(Integer)

StrKeyCollection = Collection(String)

Schema definition#

SBS schemas are written as UTF-8 encoded files with the .sbs file extension. The characters ,, space, \t, \r and \n are considered white-space characters and are ignored except as separators. The characters (, ), {, }, :, = and the white-space characters act as delimiters between identifiers. All other valid identifiers match the regex r'[A-Za-z][A-Za-z0-9_]*'. The character # starts a comment that spans to the end of the line.
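
The lexical rules above can be sketched as a small tokenizer. This is only an illustration, not the reference parser: the function name `tokenize` is hypothetical, and treating `.` as a punctuation token is an assumption taken from the module-qualified type references described later in this document.

```python
import re

# One alternative per lexical class described in the text.
_TOKEN = re.compile(
    r"(?P<comment>#[^\r\n]*)"            # comment runs to the end of the line
    r"|(?P<ws>[, \t\r\n]+)"              # comma counts as white-space
    r"|(?P<ident>[A-Za-z][A-Za-z0-9_]*)"
    r"|(?P<punct>[(){}:=.])"             # '.' is an assumption (qualified names)
)


def tokenize(text: str) -> list:
    """Return identifiers and delimiters, dropping comments and white-space."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup in ("ident", "punct"):
            tokens.append(match.group())
        pos = match.end()
    return tokens


print(tokenize("module M # comment\nT(a, b) = Array(a)"))
```

Because the comma is white-space, `T(a, b)` and `T(a b)` produce the same token sequence.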

Each file represents a single SBS schema module. The name of the module is defined by the module <name> directive, where <name> is the user-defined module name. This directive is the only mandatory part of an SBS schema and should be placed at the beginning of each .sbs file. Example of a minimal valid SBS schema:

module ModuleName

The rest of an .sbs file contains an arbitrary number of user-defined type definitions. Each type definition is written as <new_type>(<t1> <t2> ...) = <other_type> where:

  • <new_type>

    Name of new user-defined type.

  • <t1>, <t2>, …

    Identifiers representing parametric data type arguments used in the <other_type> definition. If there are no such arguments, the parentheses can be omitted.

  • <other_type>

    Another user-defined or built-in type whose encoding is used for encoding <new_type>. User-defined types are referenced as <module_name>.<type_name>(<t1> <t2> ...). If <type_name> refers to a type defined in the same module, the <module_name>. prefix can be omitted. If the user-defined type is not a parametric data type, the parentheses should be omitted.

Data types#

Built-in data types include:

  • simple data types

    • None

      Data type without value.

    • Boolean

      Data type with two possible values representing true and false.

    • Integer

      Unconstrained signed integer value.

    • Float

      Floating point value that can be encoded with 8 bytes according to IEEE 754.

    • String

      UTF-8 encoded string value.

    • Bytes

      Array of byte values of arbitrary length.

  • composite data types

    • Array(<t>)

      Parametric data type that defines an array of arbitrary length in which every element is of the type <t>.

    • Record { <entry1>: <t1>, <entry2>: <t2>, ... }

      Collection of user-defined entries where each entry has an entry identifier (<entry1>, <entry2>, …) and an entry type (<t1>, <t2>, …). Encoded data must contain all entries specified by the type definition. The number of entries should be greater than zero.

    • Choice { <entry1>: <t1>, <entry2>: <t2>, ... }

      Type that can represent one of the types defined by <t1>, <t2>, … Encoded data must contain exactly one entry, identified by its entry identifier (<entry1>, <entry2>, …). A Choice definition should contain at least one entry definition.

  • derived data types

    These include predefined types that can be expressed as:

    Optional(a) = Choice {
        none: None
        value: a
    }
    

PEG grammar#

Module          <- OWS 'module' MWS Identifier TypeDefinitions OWS EOF
TypeDefinitions <- (MWS TypeDefinition (MWS TypeDefinition)*)?
TypeDefinition  <- Identifier OWS ArgNames? OWS '=' OWS Type

Type            <- TSimple
                 / TArray
                 / TRecord
                 / TChoice
                 / TIdentifier
TSimple         <- 'None'
                 / 'Boolean'
                 / 'Integer'
                 / 'Float'
                 / 'String'
                 / 'Bytes'
TArray          <- 'Array' OWS '(' OWS Type OWS ')'
TRecord         <- 'Record' OWS '{' OWS Entries OWS '}'
TChoice         <- 'Choice' OWS '{' OWS Entries OWS '}'
TIdentifier     <- Identifier ('.' Identifier)? (OWS ArgTypes)?

Entries         <- Entry (MWS Entry)*
Entry           <- Identifier OWS ':' OWS Type

ArgNames        <- '(' OWS Identifiers? OWS ')'
ArgTypes        <- '(' OWS Types? OWS ')'
Identifiers     <- Identifier (MWS Identifier)*
Types           <- Type (MWS Type)*

Identifier      <- [A-Za-z][A-Za-z0-9_]*

# mandatory white-space
MWS             <- (WS / Comment)+
# optional white-space
OWS             <- (WS / Comment)*

Comment         <- '#' (!EOL .)* EOL
WS              <- ',' / ' ' / '\t' / EOL
EOL             <- '\r\n' / '\n' / '\r'
EOF             <- !.

Data encoding#

None#

A None value is represented by an empty byte array.

Boolean#

A Boolean value is encoded as a single byte: 0x01 for true and 0x00 for false.
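
A minimal Python sketch of the None and Boolean encodings follows. The function names are hypothetical, and rejecting bytes other than 0x00 and 0x01 while decoding is an assumption; the text does not say how such input should be handled.

```python
def encode_none() -> bytes:
    """None carries no data, so it encodes to zero bytes."""
    return b""


def encode_boolean(value: bool) -> bytes:
    """Encode true as 0x01 and false as 0x00."""
    return b"\x01" if value else b"\x00"


def decode_boolean(data: bytes, offset: int = 0):
    """Return (value, next_offset); strict validation is an assumption."""
    byte = data[offset]
    if byte not in (0x00, 0x01):
        raise ValueError(f"invalid Boolean byte 0x{byte:02x}")
    return byte == 0x01, offset + 1


print(encode_boolean(True), encode_boolean(False), encode_none())
```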

Integer#

Signed integer values are encoded as a variable-length byte array. The most significant bit of every byte except the last one is set to 0; the most significant bit of the last byte is set to 1. The concatenation of the remaining 7 bits of each byte is the big-endian two’s complement binary representation of the integer value.

+-----------------+-------+-----------------+
|        0        |       |        m        |
| 7 6 5 4 3 2 1 0 |       | 7 6 5 4 3 2 1 0 |
+-----------------+  ...  +-----------------+
| 0 xn ... x(n-7) |       | 1   x6 ... x0   |
+-----------------+-------+-----------------+
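
The Integer encoding can be sketched in Python as shown below. The sketch follows the flag-bit convention exactly as stated in this section (only the last byte has its most significant bit set) and assumes that the shortest sign-correct representation is wanted; the function names are hypothetical.

```python
def encode_integer(value: int) -> bytes:
    """Encode a signed integer as 7-bit big-endian two's complement groups."""
    groups = []
    while True:
        group = value & 0x7F
        value >>= 7  # arithmetic shift keeps the sign
        groups.append(group)
        # Stop once the rest is pure sign extension and the top payload bit
        # (0x40) of the current group already carries the correct sign.
        if (value == 0 and not group & 0x40) or (value == -1 and group & 0x40):
            break
    groups.reverse()     # most significant group first
    groups[-1] |= 0x80   # flag bit marks the last byte
    return bytes(groups)


def decode_integer(data: bytes, offset: int = 0):
    """Return (value, next_offset) for the Integer starting at offset."""
    # Sign-extend from the top payload bit of the first byte.
    value = -1 if data[offset] & 0x40 else 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # last byte reached
            return value, offset


for number in (0, 1, -1, 64, 300, -65):
    print(number, encode_integer(number).hex())
```

Under these assumptions 0 encodes to `80`, -1 to `ff`, 64 to `00c0` and 300 to `02ac`. The value 64 needs two bytes because a single group `0x40` would read back as -64.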

Float#

Floating point values are encoded according to IEEE 754 binary64 (double precision) format.
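
As an illustration, Python's `struct` module can produce the 8-byte binary64 form. The byte order is not stated in this section, so the big-endian format `">d"` used here is an assumption, as are the function names.

```python
import struct


def encode_float(value: float) -> bytes:
    """Pack as IEEE 754 binary64; big-endian byte order is an assumption."""
    return struct.pack(">d", value)


def decode_float(data: bytes, offset: int = 0):
    """Return (value, next_offset); a Float always occupies 8 bytes."""
    (value,) = struct.unpack_from(">d", data, offset)
    return value, offset + 8


print(encode_float(1.0).hex())  # 3ff0000000000000
```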

Bytes#

A Bytes array is encoded “as is”, prefixed with its byte count encoded as Integer.

String#

String value is encoded as UTF-8 encoded Bytes.
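
The following sketch shows the length-prefixed Bytes and String encodings. It bundles a compact Integer encoder that follows the Integer section of this document; all function names are hypothetical.

```python
def encode_integer(value: int) -> bytes:
    """Integer encoding as described in the Integer section (sketch)."""
    groups = []
    while True:
        group = value & 0x7F
        value >>= 7
        groups.append(group)
        if (value == 0 and not group & 0x40) or (value == -1 and group & 0x40):
            break
    groups.reverse()
    groups[-1] |= 0x80  # flag bit marks the last byte
    return bytes(groups)


def encode_bytes(value: bytes) -> bytes:
    """Prefix the raw bytes with their count encoded as Integer."""
    return encode_integer(len(value)) + value


def encode_string(value: str) -> bytes:
    """Encode the UTF-8 form of the string as Bytes."""
    return encode_bytes(value.encode("utf-8"))


print(encode_bytes(b"abc").hex())  # 83616263
print(encode_string("é").hex())    # 82c3a9
```

The prefix counts bytes, not characters, so the one-character string "é" is prefixed with 2.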

Array#

An Array is encoded as the sequential concatenation of the encodings of its elements. These concatenated bytes are prefixed with the array’s element count encoded as Integer.
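
A possible Python sketch of the Array encoding is given below, using Array(Integer) as the example. Passing the element codec as a function argument is a design assumption of this sketch, and the function names are hypothetical; the Integer helpers follow the Integer section of this document.

```python
def encode_integer(value: int) -> bytes:
    """Integer encoding as described in the Integer section (sketch)."""
    groups = []
    while True:
        group = value & 0x7F
        value >>= 7
        groups.append(group)
        if (value == 0 and not group & 0x40) or (value == -1 and group & 0x40):
            break
    groups.reverse()
    groups[-1] |= 0x80  # flag bit marks the last byte
    return bytes(groups)


def decode_integer(data: bytes, offset: int = 0):
    """Return (value, next_offset)."""
    value = -1 if data[offset] & 0x40 else 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, offset


def encode_array(items, encode_item) -> bytes:
    """Element count as Integer, followed by each element's encoding."""
    return encode_integer(len(items)) + b"".join(encode_item(i) for i in items)


def decode_array(data: bytes, decode_item, offset: int = 0):
    """Read the count, then decode that many elements in sequence."""
    count, offset = decode_integer(data, offset)
    items = []
    for _ in range(count):
        item, offset = decode_item(data, offset)
        items.append(item)
    return items, offset


encoded = encode_array([1, 2, 3], encode_integer)
print(encoded.hex())                          # 83818283
print(decode_array(encoded, decode_integer))  # ([1, 2, 3], 4)
```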

Record#

A Record is encoded as the sequential concatenation of the encodings of its entries, in the entry order defined by the schema.

Choice#

A Choice is encoded as the encoding of its single selected entry, prefixed with that entry’s zero-based index encoded as Integer.
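
The Record and Choice encodings can be sketched together as follows. Representing a type's entries as an ordered list of (name, encoder) pairs, a Record value as a dict and a Choice value as a (name, value) tuple are assumptions of this sketch, as are the function names. `ENTRY` and `OPTIONAL_INT` mirror `Entry(Integer, Boolean)` and `Optional(Integer)` from the schema examples in this document.

```python
def encode_integer(value: int) -> bytes:
    """Integer encoding as described in the Integer section (sketch)."""
    groups = []
    while True:
        group = value & 0x7F
        value >>= 7
        groups.append(group)
        if (value == 0 and not group & 0x40) or (value == -1 and group & 0x40):
            break
    groups.reverse()
    groups[-1] |= 0x80  # flag bit marks the last byte
    return bytes(groups)


def encode_boolean(value: bool) -> bytes:
    return b"\x01" if value else b"\x00"


def encode_none(_value=None) -> bytes:
    return b""


def encode_record(entries, value: dict) -> bytes:
    """Concatenate entry encodings in schema order; no names are written."""
    return b"".join(encode(value[name]) for name, encode in entries)


def encode_choice(entries, value) -> bytes:
    """Write the selected entry's zero-based index, then its encoding."""
    name, inner = value
    for index, (entry_name, encode) in enumerate(entries):
        if entry_name == name:
            return encode_integer(index) + encode(inner)
    raise ValueError(f"unknown choice entry {name!r}")


ENTRY = [("key", encode_integer), ("value", encode_boolean)]
OPTIONAL_INT = [("none", encode_none), ("value", encode_integer)]

print(encode_record(ENTRY, {"key": 7, "value": True}).hex())  # 8701
print(encode_choice(OPTIONAL_INT, ("value", 5)).hex())        # 8185
print(encode_choice(OPTIONAL_INT, ("none", None)).hex())      # 80
```

Since neither entry names nor types are written to the output, both sides need the same schema to interpret the bytes.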