Text Data Definition Language

While working on the Text Data Definition Language and the parser that uses it, we focused on the following main goals:
  • The parser must be as similar as possible to the binary parser in BinaryDOM.
  • The language must be as compatible as possible with widely known languages used for structured text grammars.
  • The parser should be able to process fixed text formats where every symbol counts (such as tab-separated values, where a tab is not whitespace but an important separator), as well as complex programming language constructions where empty spaces and comments should be ignored.
As a result, we have a kind of recursive descent parser driven by a universal schema language.
The Yacc/Bison language was taken as a model for the schema language. But because we have neither an LALR parser nor a need to generate any C code, it was significantly simplified:
  • There is no need for separate lexer and syntax analyzer files. All text rule declarations look and work identically.
  • There are no host language code snippets, special sections with token declarations, etc. There is only one language construction - the rule definition. That's all.
A Text Data Definition file consists of a file header and a set of rule definitions. C-like multiline comments between /* and */ are supported.

File header


At the beginning of TDDL file, you have to insert the following pseudo-structure:

MiraplacidTextDataDefinitionFileInfo {
version <version_number>;
credits <text>;
license <text>;
comments <text>;
extensions <list_of_values>;
mime_types <list_of_values>;
include <filename>;
IgnoreWhiteSpace;
greedy;
CommentTokens <start_string> <end_string>;
start <start_rule_name>;
}
All fields of this structure are optional.
The "version" value is used by the schema parser to verify its ability to parse this schema file correctly. The current schema version is 2.
"Credits", "license", and "comments" are arbitrary text. If a text value contains spaces or punctuation symbols, it must be enclosed in double quotation marks.
"Extensions" is a list of file extensions that helps the loader recognize files. Comma-separated values enclosed in double quotation marks.
"Mime_types" is a list of MIME types that helps the loader recognize files. Comma-separated values enclosed in double quotation marks.
"Include" includes another grammar file into the current file. It can be used several times in the header. The filename is in double quotes.
The "IgnoreWhiteSpace" option (it takes no value) means that the parser will ignore unnecessary spaces in the file. This allows us, in the absence of a specialized lexical file, to keep only significant values in the grammar file. If you do not use this option, every symbol is used for parsing, and for general-purpose languages you will need to insert whitespace placeholder rules everywhere whitespace can appear. On the other hand, leaving this option off allows handling fixed formats.
The "greedy" option (it takes no value) should be used in schemas ported from YACC/Bison. In those schemas, the '|' (enumeration) operator takes precedence over the ordered flow of values. In TDDL, without this option, that is not the case and the enumeration operator applies only to the values it separates.
"CommentTokens" is a pair of start and stop comment values. The parser internally strips comments from the document content. This option may be used several times in the header if the language has several variants of comments. If it is not used, the file is assumed to have no comments.
"Start" points to the start rule for parsing. If it is not used, the schema compiler detects all suitable top-level rules anyway and tries them all during parsing, one by one.
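Putting these options together, here is what a header might look like for a hypothetical schema describing a toy expression language (all values below are illustrative, not taken from any shipped schema):

MiraplacidTextDataDefinitionFileInfo {
version 2;
credits "Example author";
comments "Hypothetical grammar for a toy expression language";
extensions "expr, exp";
mime_types "text/x-expression";
IgnoreWhiteSpace;
CommentTokens "/*" "*/";
start Program;
}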

Rule definitions


Apart from the header, the only language element of TDDL is the rule definition. It is based on the standard EBNF construction and allows most of the existing variations of such definitions:

Definition: Rule Equ_symbol Element+ ';'
Rule: '<'? name '>'?
Equ_symbol: ":" | "=" | "::="
Element: (RuleReference | String | CaseInsensitiveString | Character | Group) Flag?
RuleReference: '{'? name '}'?
String: "'" (Symbol | Escaped_char_for_string)* "'" | '"' (Symbol | Escaped_char_for_string)* '"'
CaseInsensitiveString: '$"' (Symbol | Escaped_char_for_string)* '"'
Character: '[' '^'? (Symbol | Character_range | Escaped_char_for_class)+ ']'
Group: Sequential_group | Enumeration_group
Sequential_group: '(' Element (','? Element)* ')'
Enumeration_group: '(' Element ('|' Element)* ')'
Escaped_char_for_string: "\'" | "\"" | "\r" | "\n" | "\t" | "\f" | "\\" | "\s" | "\xNUMBER"
Escaped_char_for_class: Escaped_char_for_string | "\]" | "\-" | "\."
Character_range: (Escaped_char_for_class | Symbol) '-' (Escaped_char_for_class | Symbol)
Flag: '*' | '?' | '+'
Comments:
  • Rule is the name of a language rule. It is allowed to enclose the rule name, on either the left or the right side of a rule definition, in angle brackets (<>). Use this if the rule name contains spaces.
    Rule references in other rule definitions build up the document hierarchy structure.
    It is possible to include the definition of one rule in another by enclosing the referenced rule name in curly brackets ({}). In that case the hierarchy is not affected; one definition is literally included into the other.
  • Some characters must be represented as escape sequences to be included in strings: \' (single quote), \" (double quote), \r (carriage return), \n (line feed), \t (tab), \f (form feed), \\ (backslash itself), \s (space character), \xNUMBER (hex value of the corresponding Unicode symbol).
  • A separate kind of String is the CaseInsensitiveString. If a string is prefixed with '$', it is compared in a case-insensitive manner while parsing/validating a document.
  • Character defines one Unicode character of a specified class. It can be an arbitrary mix of immediate characters, character ranges, and escape sequences. In addition to the common escapes, character definitions may contain \] (to include a closing bracket in a character class), \- (to use the minus character), and \. (to use the period character). A single period character (.) is the "any character" placeholder.
  • There is no difference between single-quoted and double-quoted strings.
  • Flags, like in most versions of xBNF, have the following meanings: '*' (zero or more), '?' (zero or one), '+' (one or more).
  • Sequential_group represents an ordered list of elements. Elements may be separated by commas. This is the default type of grouping. Use parentheses if you have a mix of enumeration and sequential groups, or if a group of elements shares a common flag.
  • Enumeration_group is a number of variants to be tried during parsing. The parser tries the variants one by one, from first to last, so order is very important.
    As mentioned before, the greedy header option influences enumeration precedence. For example, suppose we have the following rule:
    
    PostfixExpression = LeftHandSideExpression
                           | PostfixExpression '++'
                           | PostfixExpression '--';
    
    If "greedy" is present, this will be converted into:
    
    PostfixExpression = LeftHandSideExpression
                           | (PostfixExpression '++')
                           | (PostfixExpression '--');
    
    If "greedy" is off:
    
    PostfixExpression = (LeftHandSideExpression | PostfixExpression) 
                        ('++' | PostfixExpression)
                        '--';
    
  • Symbol is any printable character which does not need to be escaped.
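To show how these constructions combine, here is a sketch of a few hypothetical rule definitions (a toy identifier/number/assignment grammar, invented for illustration only):

/* lexical-style rules: only strings and character classes */
Identifier = [A-Za-z_] [A-Za-z0-9_]*;
Number = [0-9]+ ('.' [0-9]+)?;
Keyword = $"begin" | $"end"; /* case-insensitive strings */
<Quoted String> = '\"' ([^\"\\] | '\\' .)* '\"';
Assignment = Identifier '=' (Number | {Identifier}) ';';

Here <Quoted String> demonstrates angle brackets around a rule name containing a space, and {Identifier} demonstrates literal inclusion of one definition into another without affecting the hierarchy.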

Important parser implementation details and useful hints


To build new text definition schemas or port existing yacc/bison grammars, it is useful to know some important details of the parser implementation:
  1. Character class specifications are processed from left to right. Spaces are not allowed. If the first symbol after the opening bracket is '^', the character specification is negative - "the character is not a...".
  2. As mentioned before, the parser processes enumerations from beginning to end, so order is important. Place more specific elements (which are easier to detect) before less specific ones in enumerations.
  3. Our parser does not need special terminal lexemes and has no special syntax for them. However, some rules can be "specialized" for lexical constructions. If a rule has only character classes and strings in its definition (possibly in complex grouping), the parser merges all individually recognized strings and characters into a single string value placed as the sole child of that rule.
    It is useful to have such a set of rules representing the lexical structure of the language/document.
  4. The schema parser performs some optimizations, including elimination of unnecessary groups, minimization of enumeration variants, fixing of direct and indirect left recursion, etc. For example, the "PostfixExpression" rule mentioned above will be optimized to
    
    PostfixExpression = LeftHandSideExpression ("++" | "--")*;
    
    The schema parser tries to fix indirect left recursion by converting the last rule in the indirect recursion chain into an expression with direct left recursion, which is then fixed by the direct left recursion fixing mechanism.
    Sometimes it is not possible to fix direct left recursion (when there are no variants in the enumeration other than those beginning with the same rule). In this case the left recursion is uncorrectable and an error is thrown.
    We recommend checking how your schemas look after optimization, with the help of the TextExamples\SchemaParser sample application.
  5. If your document is in a complex language and there is no need to handle whitespace and comments, it is better to use the "IgnoreWhiteSpace" header option. Your schema will not contain unnecessary (whitespace) elements and parsing will work faster. But it will not be possible to re-create the document with the Content property, because all spaces in the document will be lost.
  6. The schema parser detects all unused rules in the document. If no "start" option is used in the header, the parser tries them all as root rules. This can degrade performance and may lead to incorrect results, so it is good practice to use "start" in all cases.
  7. During optimization, the schema parser detects rules that may contain no data (all elements of the definition marked with '*' or '?' flags). If a rule which should contain data does not get any content, the parser throws an error.
  8. If your schema root rule definition looks like
    
    RootElement = (Declarations Definitions)*;
    
    you are probably in trouble. It may look like a good idea to have a weak start rule that allows almost anything, including an empty file. But at the same time this allows any file to pass validation: even though nothing has passed the rule checks, the root rule says "OK, there is no content in this document, and that is also fine for us!" - because of the (*), which allows zero language elements to be valid. Moreover, all error messages detected during parsing will be "swallowed" by the system, because everything is formally OK.
    This is why, as the least evil, we throw an error when the parser gets an empty document after parsing.
    Try to avoid these situations; this applies to every rule definition, not only the root one.
  9. How to troubleshoot schemas. Here are some tips:
    • Use as small a document as possible; even create a very simple artificial one to isolate the problem. The schema may need to be simplified as well.
    • Use the document image generated by the ToString() method, as in the TextExamples\TextParser example, to see the content of the parsed document.
    • Use Info events to build a map of the rules the parser visited during parsing, with results. This works even if some errors were "swallowed" because of false positives, as in the TextExamples\TextParser example.
    • Try to convert "weak" rules into "specific" ones, perhaps just for debugging. This lets you see the errors which really happen.
    • If you disagree with the results produced by the schema parser optimization engine, try to optimize the problematic rules or rule chains manually.
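As an illustration of points 1 and 5, here is a hypothetical fixed-format schema sketch for tab-separated values. Because tabs and line breaks are significant separators in this format, the "IgnoreWhiteSpace" header option must not be used:

/* hypothetical TSV schema: tabs and newlines are significant,
   so IgnoreWhiteSpace stays off */
File = Line+;
Line = Field ('\t' Field)* '\r'? '\n';
Field = [^\t\r\n]+;

With "IgnoreWhiteSpace" off, every tab and newline is matched explicitly by the rules, which is exactly what a fixed format needs.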

See also: