This is a description of the EBNF language used in the bytewise EBNF parser. It draws from several sources:
The left-hand side of a rule is separated from the right by an equal sign (=
). A rule is terminated by a semicolon (;
).
grammar = rule*;
rule = left-hand side, "=", element list, ";";
Terminals get their format from Python strings. They begin either with a single quote ('
) or a double quote ("
). Within quote deliniated by one type of quote, that type of quote must be backslash escaped and the other doesn't. Standard backslash escapes are respected as well as some new ones:
Escape | Meaning |
---|---|
\" | double quote (0x22) |
\' | single quote (0x27) |
\\ | backslash (0x5C) |
\0 | null (0x00) |
\a | bell (0x07) |
\f | formfeed (0x0C) |
\n | linefeed (0x0A) |
\N{name} | unicode character named name |
\r | carriage return (0x0D) |
\t | horizontal tab (0x09) |
\uhhhh | unicode character with 16-bit hexadecimal value hhhh |
\Uhhhhhhhh | unicode character with 32-bit hexadecimal value hhhhhhhh |
\v | vertical tab (0x0B) |
\xhh | character with hexadecimal value hh |
\bbbbbbbbb | binary byte specification |
\b{bb…} | variable length binary bitfield specification (must be a multiple of 8) |
\ehh | little endian binary byte specification |
\e{hh…} | little endian variable length hexadecimal bitfield specification |
\Ehh | big endian binary byte specification |
\E{hh…} | big endian variable length bitfield specification |
\x{hh…} | variable length hexadecimal bitfield specification |
Binary and hexadecimal values have an alternate value of 'x
' which means "unimportant", so "\b11xxxxxx
" would match any bytes starting with 11
.
terminal = single quoted string | double quoted string | character list | dot;
single quoted string = "'", string character*, "'";
double quoted string = '"', string character*, '"';
string character = [^\\] | escaped character | character specification | bitfield specification | endianness designation;
escaped character = "\\", [^beEuUx];
character specification = "\\", (("b", [01xX]{8}) | ("x", hex digit{2}) | ("u", hex digit{4}) | ("U", hex digit{8}));
bitfield specification = "\\", (("b{", ([01xX]{8})+, "}") | ("x{", hex digit+, "}"));
endianness designation = "\\", [eE], "{", (hex digit+ | (character specification | bitfield specification)+), "}";
hex digit = [a-fA-F0-9xX];
"This is a \"terminal\" with 'single quotes' in it"
'This is a "snowman:" \u2603'
In addition to strings, bytes may be matched by character lists which are enclosed in brackets ([]
). Within those lists, there are shorthand character classes. Posix character classes are recognized:
Character lists may be negated by making the first character a caret (^
). To use a caret as a character in a character list, either backslash escape it or don't use it as the first character. Similarly, backslash escaping may be used to include the characters: \
, -
and ]
.
Class | Meaning |
---|---|
[:alnum:] | Alphanumeric characters ([[:alpha:][:digit:]] ) |
[:alpha:] | Alphabetic characters ([a-zA-Z] ) |
[:blank:] | Space and TAB characters ([ \t\v] ) |
[:cntrl:] | Control characters |
[:digit:] | Numeric characters ([0-9] ) |
[:graph:] | Characters that are both printable and visible. (A space is printable but not visible, whereas an `a' is both.) |
[:lower:] | Lowercase alphabetic characters ([a-z] ) |
[:print:] | Printable characters ([^[:control:]] ) |
[:punct:] | Punctuation characters ([^[:alnum:][:control:][:space:]] ) |
[:space:] | Space characters ([ \t\v\n\r] ) |
[:upper:] | Uppercase alphabetic characters ([A-Z] ) |
[:xdigit:] | Characters that are hexadecimal digits ([a-fA-F0-9] ) |
character list = "[", ([^\\[] | escaped character | character specification | character class)+, "]";
character class = "[:", ("alnum" | "alpha" | "blank" | "cntrl" | "digit" | "graph" | "lower" | "print" | "punct" | "space" | "upper" | "xdigit"), ":]";
Non-terminals are pretty much any string starting with an alpha-numeric character. Spaces are allowed in non-terminal names. Characters that mark the end of a non-terminal: comma (,
), bar (|
), semicolon (;
). Non-terminals may not contain open parentheses ((
).
Non-terminals are processed in a lexagraphic step that compresses all whitespace in between words and trims any leading or trailing whitespace. This means that non-terminals may not be distinguished by whitespace.
non-terminal = [^"'(,|[], [^(,|]*;
Null Terminated String
2nd length byte
The left-hand side is composed of one or more non-terminals spearated by bars.
left-hand side = non-terminal, ("|", non-terminal)*;
Comment
Number | Long String | Text
Choices between multiple terminals and non-terminals may be represented using a bar (|
).
element choice = element, ("|", element)*;
element = (group | terminal | non-terminal | named group), repetition?;
Up | Down
"one" | "two"{2} | "three"{3}
Multiple terminals and non-terminals may be combined using a comma (,
).
The precedence of concatenation is higher than alternation, so A, B | C, D
= (A, B) | (C, D)
.
element list = element choice, (",", element choice)*;
Start, "middle", End
"test", "one", delimiter, "two", delimiter, "three"
Sets of terminals and nonterminals may be combined into groups using parenthesis.
group = "(", element list, ")";
(Subject, Predicate) | Command
("ch-"*, "check") | (Mike Check, "click"), Rock
Repetition specifies that a terminal or non-terminal is repeated multiple times.
Symbol | Meaning |
---|---|
{m,n} | Repeated at least m times, but no more than n |
{n} | Equivalent to {n,n} |
* | Equivalent to {0,∞} |
+ | Equivalent to {1,∞} |
? | Equivalent to {0,1} |
repetition = [*+] | ("{", expression, (",", expression)?, "}");
expression = sum;
sum = (sum, [+-], term) | term;
term = (term, [*/], shift) | shift;
shift = (shift, ("<<" | ">>"), operand) | operand;
operand = ("(", expression, ")") | named group | number;
number = [0-9]+;
"check"*
Whitespace{15,30}
Named groups allow for the extension of the grammar to account for non-context-free situations. Normally the character preceeding a group will be:
=
) - start of a rule,
) - concatenation with the previous element|
) - choice between the previous element and this oneIf the characters preceeding an open parenthesis are not one of these, then they are taken to be the name of the group. Handlers may register with the parser and recieve notification both when named groups are started and finished. At this point they have access to the parser and may modify both the grammar and the parse tree. (This will need further information once I know the exact restrictions that will be in place.)
There are a few functions defined in the parser already:
Function | Execution Point | Meaning |
---|---|---|
$(name) |
before processing | replaces the non-terminal name with a terminal with the value of the last expansion of name |
size-of(expansion) |
after processing | replaces the current expansion with the number of bytes in expression |
ABCs = As, "b"{size-of($(As))}, "c"{size-of($(As))}
As = "a"*;