Bytewise EBNF

This is a description of the EBNF language used in the bytewise EBNF parser. It draws from several sources:


Rules

The left-hand side of a rule is separated from the right by an equal sign (=). A rule is terminated by a semicolon (;).

EBNF

grammar = rule*; rule = left-hand side, "=", element list, ";";

Terminals

Terminals get their format from Python strings. They begin either with a single quote (') or a double quote ("). Within quote deliniated by one type of quote, that type of quote must be backslash escaped and the other doesn't. Standard backslash escapes are respected as well as some new ones:

EscapeMeaning
\"double quote (0x22)
\'single quote (0x27)
\\backslash (0x5C)
\0null (0x00)
\abell (0x07)
\fformfeed (0x0C)
\nlinefeed (0x0A)
\N{name}unicode character named name
\rcarriage return (0x0D)
\thorizontal tab (0x09)
\uhhhhunicode character with 16-bit hexadecimal value hhhh
\Uhhhhhhhhunicode character with 32-bit hexadecimal value hhhhhhhh
\vvertical tab (0x0B)
\xhhcharacter with hexadecimal value hh

\bbbbbbbbbbinary byte specification
\b{bb…}variable length binary bitfield specification (must be a multiple of 8)
\ehhlittle endian binary byte specification
\e{hh…}little endian variable length hexadecimal bitfield specification
\Ehhbig endian binary byte specification
\E{hh…}big endian variable length bitfield specification
\x{hh…}variable length hexadecimal bitfield specification

Notes

Binary and hexadecimal values have an alternate value of 'x' which means "unimportant", so "\b11xxxxxx" would match any bytes starting with 11.

EBNF

terminal = single quoted string | double quoted string | character list | dot; single quoted string = "'", string character*, "'"; double quoted string = '"', string character*, '"'; string character = [^\\] | escaped character | character specification | bitfield specification | endianness designation; escaped character = "\\", [^beEuUx]; character specification = "\\", (("b", [01xX]{8}) | ("x", hex digit{2}) | ("u", hex digit{4}) | ("U", hex digit{8})); bitfield specification = "\\", (("b{", ([01xX]{8})+, "}") | ("x{", hex digit+, "}")); endianness designation = "\\", [eE], "{", (hex digit+ | (character specification | bitfield specification)+), "}"; hex digit = [a-fA-F0-9xX];

Examples

"This is a \"terminal\" with 'single quotes' in it" 'This is a "snowman:" \u2603'

Character Classes

In addition to strings, bytes may be matched by character lists which are enclosed in brackets ([]). Within those lists, there are shorthand character classes. Posix character classes are recognized:

Notes

Character lists may be negated by making the first character a caret (^). To use a caret as a character in a character list, either backslash escape it or don't use it as the first character. Similarly, backslash escaping may be used to include the characters: \, - and ].

ClassMeaning
[:alnum:]Alphanumeric characters ([[:alpha:][:digit:]])
[:alpha:]Alphabetic characters ([a-zA-Z])
[:blank:]Space and TAB characters ([ \t\v])
[:cntrl:]Control characters
[:digit:]Numeric characters ([0-9])
[:graph:]Characters that are both printable and visible. (A space is printable but not visible, whereas an `a' is both.)
[:lower:]Lowercase alphabetic characters ([a-z])
[:print:]Printable characters ([^[:control:]])
[:punct:]Punctuation characters ([^[:alnum:][:control:][:space:]])
[:space:]Space characters ([ \t\v\n\r])
[:upper:]Uppercase alphabetic characters ([A-Z])
[:xdigit:]Characters that are hexadecimal digits ([a-fA-F0-9])

EBNF

character list = "[", ([^\\[] | escaped character | character specification | character class)+, "]"; character class = "[:", ("alnum" | "alpha" | "blank" | "cntrl" | "digit" | "graph" | "lower" | "print" | "punct" | "space" | "upper" | "xdigit"), ":]";

Non-Terminals

Non-terminals are pretty much any string starting with an alpha-numeric character. Spaces are allowed in non-terminal names. Characters that mark the end of a non-terminal: comma (,), bar (|), semicolon (;). Non-terminals may not contain open parentheses (().

Notes

Non-terminals are processed in a lexagraphic step that compresses all whitespace in between words and trims any leading or trailing whitespace. This means that non-terminals may not be distinguished by whitespace.

EBNF

non-terminal = [^"'(,|[], [^(,|]*;

Examples

Null Terminated String 2nd length byte

Left-Hand Side of a Rule: Non-Terminal List

The left-hand side is composed of one or more non-terminals spearated by bars.

EBNF

left-hand side = non-terminal, ("|", non-terminal)*;

Examples

Comment Number | Long String | Text

Alternation

Choices between multiple terminals and non-terminals may be represented using a bar (|).

EBNF

element choice = element, ("|", element)*; element = (group | terminal | non-terminal | named group), repetition?;

Examples

Up | Down "one" | "two"{2} | "three"{3}

Concatenation

Multiple terminals and non-terminals may be combined using a comma (,).

Note

The precedence of concatenation is higher than alternation, so A, B | C, D = (A, B) | (C, D).

EBNF

element list = element choice, (",", element choice)*;

Examples

Start, "middle", End "test", "one", delimiter, "two", delimiter, "three"

Grouping

Sets of terminals and nonterminals may be combined into groups using parenthesis.

EBNF

group = "(", element list, ")";

Examples

(Subject, Predicate) | Command ("ch-"*, "check") | (Mike Check, "click"), Rock

Repetition

Repetition specifies that a terminal or non-terminal is repeated multiple times.

SymbolMeaning
{m,n}Repeated at least m times, but no more than n
{n}Equivalent to {n,n}
*Equivalent to {0,∞}
+Equivalent to {1,∞}
?Equivalent to {0,1}

EBNF

repetition = [*+] | ("{", expression, (",", expression)?, "}"); expression = sum; sum = (sum, [+-], term) | term; term = (term, [*/], shift) | shift; shift = (shift, ("<<" | ">>"), operand) | operand; operand = ("(", expression, ")") | named group | number; number = [0-9]+;

Examples

"check"* Whitespace{15,30}

Named Groups

Named groups allow for the extension of the grammar to account for non-context-free situations. Normally the character preceeding a group will be:

If the characters preceeding an open parenthesis are not one of these, then they are taken to be the name of the group. Handlers may register with the parser and recieve notification both when named groups are started and finished. At this point they have access to the parser and may modify both the grammar and the parse tree. (This will need further information once I know the exact restrictions that will be in place.)

There are a few functions defined in the parser already:

FunctionExecution PointMeaning
$(name) before processing replaces the non-terminal name with a terminal with the value of the last expansion of name
size-of(expansion) after processing replaces the current expansion with the number of bytes in expression

Examples

ABCs = As, "b"{size-of($(As))}, "c"{size-of($(As))} As = "a"*;