Simple binary serialization

Simple binary serialization (SBS) features:

  • schema based

  • small schema language optimized for human readability

  • schema re-usability and organization into modules

  • small number of built in types

  • polymorphic data types

  • binary serialization

  • designed to enable simple and efficient implementation

  • optimized for small encoded data size

Example of SBS schema:

module Module

Entry(K, V) = Record {
    key: K
    value: V
}

Collection(K) = Choice {
    null: None
    bool: Entry(K, Boolean)
    int: Entry(K, Integer)
    float: Entry(K, Float)
    str: Entry(K, String)
    bytes: Entry(K, Bytes)
}

IntKeyCollection = Collection(Integer)

StrKeyCollection = Collection(String)

Schema definition

SBS shemas are written as UTF-8 encoded files with .sbs file extension. Characters ,, space, \t, \r and \n are considered white-space characters and are ignored. Characters (, ), {, }, :, = and white-space characters are used as delimiters between other identifiers. All other valid identifiers are defined by regex r'[A-Za-z][A-Za-z0-9_]*'. Character # is used as start of comment that spans to the end of line.

Each file represents single SBS schema module. Name of SBS module is defined by module <name> directive where <name> represents user-defined module name. This directive is only mandatory part of each SBS schema and should be placed at the beginning of each .sbs file. Example of minimal valid SBS schema:

module ModuleName

Rest of .sbs files contains arbitrary number of user-defined types. Each type definition is written as <new_type>(<t1> <t2> ...) = <other_type> where:

  • <new_type>

    Name of new user-defined type.

  • <t1>, <t2>, …

    Identifiers representing parametric data type arguments used in <other_type> definition. If these arguments are not used, parenthesis can be omitted.

  • <other_type>

    Other user defined or built in type which encoding should be used for encoding of <new_type>. User defined types are specified as <module_name>.<type_name>(<t1> <t2> ...). If <type_name> refers to type defined in same module, <module_name>. can be omitted. If user defined type is not parametric data type, parenthesis should be omitted.

Data types

Builtin data types include:

  • simple data types

    • None

      Data type without value.

    • Boolean

      Data type with two possible values representing true and false.

    • Integer

      Unconstrained signed integer value.

    • Float

      Floating point value that can be encoded with 8 bytes according to IEEE 754.

    • String

      UTF-8 encoded string value.

    • Bytes

      Array of byte values of arbitrary length.

  • composite data types

    • Array(<t>)

      Parametric data type that defines arbitrary length Array where all elements are of type defined by <t>.

    • Record { <entry1>: <t1>, <entry2>: <t2>, ... }

      Collection of user-defined entries where each entry has entry identifier (<entry1>, <entry2>, …) and entry type (<t1>, <t2>, …). Encoded data must contain all entries specified by type definition. Number of entries should be greather than zero.

    • Choice { <entry1>: <t1>, <entry2>: <t2>, ... }

      Type that can represent one of types defined by <t1>, <t2>, … Encoded data must contain only single entry identified by entry identifier (<entry1>, <entry2>, …). Choice definition should contain at least one entry definition.

  • derived data types

    These include predefined types that can be expressed as:

    Optional(a) = Choice {
        none: None
        value: a
    }
    

PEG grammar

Module          <- OWS 'module' MWS Identifier TypeDefinitions OWS EOF
TypeDefinitions <- (MWS TypeDefinition (MWS TypeDefinition)*)?
TypeDefinition  <- Identifier OWS ArgNames? OWS '=' OWS Type

Type            <- TSimple
                 / TArray
                 / TRecord
                 / TChoice
                 / TIdentifier
TSimple         <- 'None'
                 / 'Boolean'
                 / 'Integer'
                 / 'Float'
                 / 'String'
                 / 'Bytes'
TArray          <- 'Array' OWS '(' OWS Type OWS ')'
TRecord         <- 'Record' OWS '{' OWS Entries OWS '}'
TChoice         <- 'Choice' OWS '{' OWS Entries OWS '}'
TIdentifier     <- Identifier ('.' Identifier)? (OWS ArgTypes)?

Entries         <- Entry (MWS Entry)*
Entry           <- Identifier OWS ':' OWS Type

ArgNames        <- '(' OWS Identifiers? OWS ')'
ArgTypes        <- '(' OWS Types? OWS ')'
Identifiers     <- Identifier (MWS Identifier)*
Types           <- Type (MWS Type)*

Identifier      <- [A-Za-z][A-Za-z0-9_]*

# mandatory white-space
MWS             <- (WS / Comment)+
# optional white-space
OWS             <- (WS / Comment)*

Comment         <- '#' (!EOL .)* EOL
WS              <- ',' / ' ' / '\t' / EOL
EOL             <- '\r\n' / '\n' / '\r'
EOF             <- !.

Data encoding

None

None value is represented with empty byte array.

Boolean

Boolean value is encoded as single byte with value 0x01 as true and 0x00 as false.

Integer

Signed integer values are encoded as variable length byte array. Most significant bit in all bytes, except last one, is set to 0 (last bytes most significant bit is 1). Concatenation of other bits represent big-endian encoded two’s complement binary representation of integer value.

+-----------------+-------+-----------------+
|        0        |       |        m        |
| 7 6 5 4 3 2 1 0 |       | 7 6 5 4 3 2 1 0 |
+-----------------+  ...  +-----------------+
| 0 xn ... x(n-7) |       | 1   x6 ... x0   |
+-----------------+-------+-----------------+

Float

Floating point values are encoded according to IEEE 754 binary64 (double precision) format.

Bytes

Bytes array is encoded “as is” and prefixed with bytes count encoded as Integer.

String

String value is encoded as UTF-8 encoded Bytes.

Array

Array is encoded as sequential concatenation of each element encoding. This concatenated bytes are prefixed with array’s element count encoded as Integer.

Record

Record is encoded as sequential concatenation of record’s elements encoding according to elements order defined by schema.

Choice

Choice encodes single element prefixed with encoded element’s zero-based index as Integer.