ANTLR is a parser generator that has some surface similarities to cxlang. This document describes the fundamental differences in approach between the two projects.
ANTLR operates as two components: an offline tool that compiles an entire grammar into source code, and a runtime context in which that generated code parses input. This technique is useful where the grammar is fixed as part of your development cycle and never changes at runtime. ANTLR itself is an input mechanism: it is intended to be embedded in a larger process, where it tokenises and parses an input stream and hands the resulting data to the host process. The host process is then expected to perform any further processing on the data and, as necessary, generate any output.
cxlang differs from this approach in several ways. While it is possible to embed cxlang and use it as an input source, that is not the only possible usage. cxlang is capable of parsing input, performing one or more intermediate processing stages, and generating output. As such, it can be used as a standalone utility that implements an entire workflow, in which the workflow is defined by the grammar rather than by code in a host process.
ACTIONs vs Codeblocks
Compiling an ANTLR grammar produces source files suitable for inclusion in a build project, for example C code.
An ANTLR grammar has two aspects: a description of the input tokens, and associated ACTIONs which are written in the target language. The ACTIONs are emitted verbatim at the appropriate locations in the generated source, which allows substantial customisation of the generated parser. This approach has several downsides:
- Any ACTIONs must be written to suit the target language; if the host environment changes, then the ACTIONs must be rewritten.
- The content of any ACTIONs is not visible to the ANTLR grammar processor, meaning that the grammar cannot be optimised or validated based on the contents of the ACTIONs.
- Back-and-forth between the grammar proper and the ACTIONs is convoluted and largely hidden from the grammar proper. As a result, grammar developers tend to avoid such constructs and push any non-trivial logic into the ACTIONs, keeping it out of the grammar proper entirely.
- Since the target language is inevitably not transactional, it is not possible to roll back an ACTION. This means that ACTIONs with side effects can only be used once the input has been parsed, not to assist in parsing it. This substantially limits the flexibility of ACTIONs in terms of enhancing the grammar. Further, it is the grammar developer’s responsibility to ensure that ACTIONs do not have inappropriate side effects; it is not possible for ANTLR to enforce this.
cxlang uses codeblocks in place of ACTIONs, which avoids these downsides:
- Codeblocks are written in cxlang, and are agnostic to their host environment.
- Codeblocks are fully visible to the cxlang grammar processor, meaning that it is possible to extend the grammar processor to optimise or validate the grammar including the contents of any codeblocks.
- Back-and-forth between the grammar proper and the codeblocks is expected and is a native component of the syntax.
- Since cxlang is transactional, any codeblock can be rolled back. This allows codeblocks to be fully utilised in speculative parsing without any additional work on the part of the grammar developer. Codeblocks are expected to be used to enhance the parser, and allow simplifications to both the cxlang syntax and any developed grammars.
- Transactional programming means that codeblocks can apply state which affects further parsing, and which is transparently rolled back if the parsing attempt does not find a match. This also permits techniques such as sub-parsing and code-based rejection of grammar based on local or global state.
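The rollback behaviour described in the last two bullets can be illustrated with a small sketch. This is Python rather than cxlang, and every name in it (ParseState, attempt, set_var and so on) is hypothetical; it assumes only that a speculative attempt snapshots the parser state and restores it on failure, and it says nothing about how cxlang actually implements transactions.

```python
from dataclasses import dataclass, field


@dataclass
class ParseState:
    """Hypothetical parser state: input position plus codeblock-visible variables."""
    text: str
    pos: int = 0
    vars: dict = field(default_factory=dict)

    def snapshot(self):
        # Copy everything a codeblock might change.
        return self.pos, dict(self.vars)

    def restore(self, saved):
        self.pos, self.vars = saved[0], dict(saved[1])


def attempt(state, rule):
    """Run a rule speculatively and undo all of its effects if it fails."""
    saved = state.snapshot()
    if rule(state):
        return True
    state.restore(saved)
    return False


def literal(word):
    """Rule that matches a fixed string at the current position."""
    def rule(state):
        if state.text.startswith(word, state.pos):
            state.pos += len(word)
            return True
        return False
    return rule


def sequence(*rules):
    """Rule that matches when every sub-rule matches in order."""
    def rule(state):
        return all(r(state) for r in rules)
    return rule


def set_var(name, value):
    """Stand-in for a codeblock with a side effect on parser state."""
    def rule(state):
        state.vars[name] = value
        return True
    return rule


state = ParseState("let x")
# The first alternative sets a variable and then fails on its literal.
first = sequence(set_var("kind", "constant"), literal("const"))
second = sequence(set_var("kind", "variable"), literal("let"))

matched = attempt(state, first) or attempt(state, second)
```

In this sketch the side effect of the failed first alternative is discarded before the second alternative runs, so the rule author does not need to write any cleanup code.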
Lexer vs Parser
ANTLR distinguishes between “lexing” (character processing) and “parsing” (token processing). cxlang does not require any such distinction.
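As a general illustration of parsing without a separate lexing stage, the following Python sketch reads structure directly from characters, with no intermediate token stream. It is not cxlang syntax and does not claim to reflect how cxlang works internally; the function parse_sum and the tiny input language (integers joined by "+") are assumptions made for the example.

```python
DIGITS = "0123456789"


def parse_sum(text):
    """Parse and evaluate input such as '12 + 30+4' directly from characters."""
    pos = 0

    def skip_spaces():
        nonlocal pos
        while pos < len(text) and text[pos] == " ":
            pos += 1

    def number():
        # Character-level work that a separate lexer would normally do.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] in DIGITS:
            pos += 1
        if start == pos:
            raise ValueError(f"expected a digit at offset {start}")
        return int(text[start:pos])

    # Structure-level work that a token parser would normally do.
    skip_spaces()
    total = number()
    skip_spaces()
    while pos < len(text) and text[pos] == "+":
        pos += 1
        skip_spaces()
        total += number()
        skip_spaces()
    if pos != len(text):
        raise ValueError(f"unexpected character at offset {pos}")
    return total
```

Here the same set of rules handles both whitespace and digits and the overall shape of the expression, which is the distinction a lexer/parser split would otherwise enforce.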
Dynamic vs Static Grammar
Since ANTLR produces the source code for a parser, rather than being the parser itself, it is not possible for the parser to adjust the grammar at runtime. This means that a self-modifying grammar (“dynamic grammar”) is impossible.
cxlang does not have this restriction; a cxlang grammar may be extended during the run. This allows escaping and experimental self-extending parsers.
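The idea of a grammar that is extended during the run can be sketched as follows. This is a Python illustration of the general concept, not cxlang code; the command names say and alias, and the run function, are hypothetical. The rule table starts with two entries, and the input itself adds further rules while it is being processed.

```python
def run(lines):
    """Interpret lines using a rule table that the input itself can extend."""
    output = []
    rules = {}

    def say(arg):
        output.append(arg)

    def alias(arg):
        # 'alias NEW OLD' adds a new rule while parsing is under way.
        new, old = arg.split()
        rules[new] = rules[old]

    # Initial grammar: only these two keywords are recognised.
    rules["say"] = say
    rules["alias"] = alias

    for line in lines:
        keyword, _, arg = line.partition(" ")
        if keyword not in rules:
            raise ValueError(f"no rule matches {keyword!r}")
        rules[keyword](arg)
    return output
```

For example, run(["alias print say", "print hello"]) accepts the second line only because the first line extended the rule table; without it, "print" is rejected. A parser whose rules are fixed when its source code is generated cannot do this without being regenerated.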
Performance
The current cxlang implementation relies heavily on speculative parsing and is expected to perform substantially worse than ANTLR-generated parsers or other traditional parsers. This could probably be improved by the addition of a rule optimisation pass within cxlang.
For an equivalent parsing algorithm, cxlang will always be slower than a custom-built, natively compiled parser such as the ANTLR output. It may approach that performance but will not exceed it, because the cxlang parser will always incur some overhead in maintaining transactional state. However, depending on the grammar in question, cxlang’s additional flexibility may allow an improvement to the parsing algorithm itself, which may result in a performance win in some cases.