Bytewise EBNF

This is a description of the EBNF language used in the bytewise EBNF parser. It draws from several sources:

ISO-14977: ISO standard for EBNF
Python Programming Language

Rules

The left-hand side of a rule is separated from the right by an equal sign (=). A rule is terminated by a semicolon (;).

EBNF

grammar = rule*; rule = left-hand side, "=", element list, ";";

Terminals

Terminals get their format from Python strings. They begin either with a single quote (') or a double quote ("). Within quote deliniated by one type of quote, that type of quote must be backslash escaped and the other doesn't. Standard backslash escapes are respected as well as some new ones:

Escape	Meaning
\"	double quote (0x22)
\'	single quote (0x27)
\\	backslash (0x5C)
\0	null (0x00)
\a	bell (0x07)
\f	formfeed (0x0C)
\n	linefeed (0x0A)
\N{name}	unicode character named name
\r	carriage return (0x0D)
\t	horizontal tab (0x09)
\uhhhh	unicode character with 16-bit hexadecimal value hhhh
\Uhhhhhhhh	unicode character with 32-bit hexadecimal value hhhhhhhh
\v	vertical tab (0x0B)
\xhh	character with hexadecimal value hh

\bbbbbbbbb	binary byte specification
\b{bb…}	variable length binary bitfield specification (must be a multiple of 8)
\ehh	little endian binary byte specification
\e{hh…}	little endian variable length hexadecimal bitfield specification
\Ehh	big endian binary byte specification
\E{hh…}	big endian variable length bitfield specification
\x{hh…}	variable length hexadecimal bitfield specification

Notes

Binary and hexadecimal values have an alternate value of 'x' which means "unimportant", so "\b11xxxxxx" would match any bytes starting with 11.

EBNF

terminal = single quoted string | double quoted string | character list | dot; single quoted string = "'", string character*, "'"; double quoted string = '"', string character*, '"';

string character = [^\\] | escaped character | character specification | bitfield specification | endianness designation;

escaped character = "\\", [^beEuUx]; character specification = "\\", (("b", [01xX]{8}) | ("x", hex digit{2}) | ("u", hex digit{4}) | ("U", hex digit{8})); bitfield specification = "\\", (("b{", ([01xX]{8})+, "}") | ("x{", hex digit+, "}")); endianness designation = "\\", [eE], "{", (hex digit+ | (character specification | bitfield specification)+), "}"; hex digit = [a-fA-F0-9xX];

Examples

"This is a \"terminal\" with 'single quotes' in it" 'This is a "snowman:" \u2603'

Character Classes

In addition to strings, bytes may be matched by character lists which are enclosed in brackets ([]). Within those lists, there are shorthand character classes. Posix character classes are recognized:

Notes

Character lists may be negated by making the first character a caret (^). To use a caret as a character in a character list, either backslash escape it or don't use it as the first character. Similarly, backslash escaping may be used to include the characters: \, - and ].

Class	Meaning
[:alnum:]	Alphanumeric characters (`[[:alpha:][:digit:]]`)
[:alpha:]	Alphabetic characters (`[a-zA-Z]`)
[:blank:]	Space and TAB characters (`[ \t\v]`)
[:cntrl:]	Control characters
[:digit:]	Numeric characters (`[0-9]`)
[:graph:]	Characters that are both printable and visible. (A space is printable but not visible, whereas an `a' is both.)
[:lower:]	Lowercase alphabetic characters (`[a-z]`)
[:print:]	Printable characters (`[^[:control:]]`)
[:punct:]	Punctuation characters (`[^[:alnum:][:control:][:space:]]`)
[:space:]	Space characters (`[ \t\v\n\r]`)
[:upper:]	Uppercase alphabetic characters (`[A-Z]`)
[:xdigit:]	Characters that are hexadecimal digits (`[a-fA-F0-9]`)

EBNF

character list = "[", ([^\\[] | escaped character | character specification | character class)+, "]";

character class = "[:", ("alnum" | "alpha" | "blank" | "cntrl" | "digit" | "graph" | "lower" | "print" | "punct" | "space" | "upper" | "xdigit"), ":]";

Symbol	Meaning
`{m,n}`	Repeated at least m times, but no more than n
`{n}`	Equivalent to `{n,n}`
`*`	Equivalent to `{0,∞}`
`+`	Equivalent to `{1,∞}`
`?`	Equivalent to `{0,1}`

EBNF

repetition = [*+] | ("{", expression, (",", expression)?, "}"); expression = sum; sum = (sum, [+-], term) | term; term = (term, [*/], shift) | shift; shift = (shift, ("<<" | ">>"), operand) | operand; operand = ("(", expression, ")") | named group | number; number = [0-9]+;

Examples

"check"* Whitespace{15,30}

Named Groups

Named groups allow for the extension of the grammar to account for non-context-free situations. Normally the character preceeding a group will be:

equal sign: (=) - start of a rule
comma: (,) - concatenation with the previous element
bar: (|) - choice between the previous element and this one

If the characters preceeding an open parenthesis are not one of these, then they are taken to be the name of the group. Handlers may register with the parser and recieve notification both when named groups are started and finished. At this point they have access to the parser and may modify both the grammar and the parse tree. (This will need further information once I know the exact restrictions that will be in place.)

There are a few functions defined in the parser already:

Function	Execution Point	Meaning
`$(name)`	before processing	replaces the non-terminal name with a terminal with the value of the last expansion of name
`size-of(expansion)`	after processing	replaces the current expansion with the number of bytes in expression

Examples

ABCs = As, "b"{size-of($(As))}, "c"{size-of($(As))} As = "a"*;