September 27, 2012
EMFText is a tool for defining textual syntax for Ecore-based metamodels. It enables developers to define their own textual languages—be it domain specific languages (e.g., a language for describing forms) or general purpose languages (e.g., Java)—and generates accompanying tool support for these languages. It provides a domain specific language (DSL) for syntax specification from which it generates a full-fledged Eclipse editor and components to load and store model instances.
To give a quick overview, some of the most compelling features of EMFText are outlined in the following.
EMFText uses a generative approach where all artifacts that form the tooling for a textual language are generated. This includes a parser for loading textual models, a printer for storing model instances and the editor with all its customizable components.
EMFText comes with a simple but rich syntax specification language—the Concrete Syntax Specification Language (CS). It is based on EBNF and follows the concept of convention over configuration. This allows for very compact and intuitive syntax specifications, but still supports tweaking specifics where needed (cf. Chapter 3).
Editors generated by EMFText provide many advanced features that are known from, e.g., the Eclipse Java editor. This includes code completion (with customizable completion proposals cf. Section 4.2.2 and Section 4.2.8), customizable syntax and occurrence highlighting via preference pages, advanced bracket handling, code folding, hyperlinks and text hovers for quick navigation, an outline view and instant error reporting.
EMFText provides numerous other interesting features, some of them outlined below.
Generating an advanced Eclipse editor for a new language with EMFText requires only a few specifications and a generation step. Basically, a language specification for EMFText consists of a language metamodel and a concrete syntax specification. From these specifications, the EMFText generator derives an advanced textual editor that uses a likewise generated parser and printer to parse language expressions to EMF models and to print EMF models to language expressions, respectively.
The basic language development process with EMFText is depicted in Fig. 2.1. It is an iterative process that can be repeated several times and consists of the following basic tasks:
Each of these tasks will be explained and exemplified in the subsequent sections:
To kick-start the development of a new language you can use the EMFText project wizard. Select File > New > Other... > EMFText Project. In the Wizard (cf. Fig. 2.2) you just enter the language name and EMFText will initialise a new EMFText Project containing a metamodel folder that holds an initial metamodel and syntax specification.
As EMFText is tightly integrated with the Eclipse Modeling Framework (EMF) [SBPM08], language metamodels are specified using the Ecore Metamodelling Language. The metamodel specifies the abstract syntax of a new language. It can be built from classes with attributes that are related using references. References are further distinguished into containment references and non-containment references. It is important to notice this difference, as both reference types have different semantics in EMF and are also handled differently in EMFText. Containment references are used to relate a parent model element and a child model element that is declared in the context of the parent element. An example, which can be found for instance in object-oriented programming languages, is the declaration of a method within the body of a class declaration. Non-containment references are used to relate a model element with an element that is declared elsewhere (not as one of its children). An example for programming languages is a method call (declared in a statement in the body of a method declaration) that relates to the method it calls using a non-containment reference. The referenced method, however, is declared elsewhere: in the class that holds it through a containment reference.
Example. To define a metamodel for a language, we have to consider the concepts this language deals with, how they interrelate and what attributes they have. In the following we discuss the concepts of an exemplary language to specify forms and how they are represented in a forms metamodel.
The subsequent listing depicts a textual representation of the corresponding EMF metamodel. Besides mapping the forms concepts to Ecore, it also refines the multiplicities and types. A new text.ecore metamodel is created by selecting File > New > Other... > EMFText .text.ecore file. For a detailed introduction to the basics of Ecore metamodelling we refer to [SBPM08].
Each Ecore metamodel is accompanied by a .genmodel. You can create the .genmodel by selecting File > New > Other... > EMF Generator Model. The generator model is used to configure various options for EMF code generation (e.g., the targeted Java runtime). From the root element of the .genmodel you can now start the generation of Java code implementing your metamodel specification. By default the generated files can be found in the src folder of the metamodel plug-in, but this can also be configured in the .genmodel. We suggest changing the code generation folder to src-gen to better separate generated code from hand-written code.
After defining a metamodel, we can start specifying our concrete syntax. The concrete syntax specification defines the textual representation of all metamodel concepts. For that purpose, EMFText provides the cs-language. As a starting point, EMFText provides a syntax generator that can automatically create a cs specification conforming to HUTN (Human-Usable Textual Notation) [Obj02] from the language metamodel. To manually specify the concrete syntax, create a new syntax specification by selecting File > New > Other... > EMFText .cs file.
The listing at the end of this section depicts a syntax specification for the forms language. It consists of five sections:
The syntax specification rules used in the cs-language are derived from the EBNF syntax specification language to support arbitrary context-free languages. They are meant to define syntax for EMF-based metamodels and, thus, are specifically related to the Ecore metamodelling concepts. Therefore, the cs-language provides Ecore-specific specialisations of classical EBNF constructs like terminals and nonterminals. This specialisation enables EMFText to provide advanced support during syntax specification, e.g., errors and warnings if the syntax specification is inconsistent with the metamodel. Furthermore, it enables the EMFText parser generator to derive a parser that directly instantiates EMF models from language expressions.
In the following we summarise the most important syntax specification constructs found in the cs-language and their relation to EBNF and Ecore metamodels. For an extensive overview of the syntax specification language we refer to Sect. 3. Each syntax construct is also related to examples taken from the listing at the end of this section.
Given a complete syntax specification the EMFText code generator can be used to derive an advanced textual editor and an accompanying customisable language infrastructure. There are two alternative ways to use the code generator: Manually within Eclipse or from an Apache Ant script.
Manual code generation can be triggered from the context menu of the concrete syntax specification. To do so, right-click the cs file and select Generate Text Resource. This starts the EMFText code generator, which produces a number of plug-ins. Fig. 2.3 depicts the plug-ins generated for our exemplary forms language. In the following we briefly discuss their purpose:
Besides the files implementing the language tooling, a number of extension points specific for the language are generated to the schema folder. They can be used to further customise language tooling. For details we refer to Sect. 4.1.3.
A second way of starting the EMFText code generator is to use Apache Ant scripts. For this purpose, EMFText contributes a number of tasks for Apache Ant, which are automatically registered to the Eclipse platform using the naming scheme emftext.taskName. The following tasks are shipped with EMFText:
GenerateTextResource This task can be used to generate all language implementation plug-ins. The following listing exemplifies the application of this task and its obligatory parameters:
Further parameters are generateANTLRPlugin="[true|false]", which specifies whether the additional plug-in containing the ANTLR parsing runtime should be generated, and preprocessor="[qualified class name]" referring to an implementation of the org.emftext.sdk.ant.SyntaxProcessor interface, which is provided for realising Java-based syntax specification preprocessors.
RegisterEcoreResourceFactory This task registers an Ecore model’s resource factory for a certain type. This is especially useful for testing purposes without a running Eclipse platform. The following listing exemplifies its application:
RegisterURIMapping This task adds a URI mapping to the EMF URI map, which is useful for mapping symbolic namespace URIs to physical locations, i.e., for locating Ecore models. The following listing exemplifies its application:
RemoveURIMapping This task removes a URI mapping from EMF’s URI map, which is useful for removing unwanted symbolic URI mappings. The following listing exemplifies its application:
To execute an Ant script that uses EMFText tasks from within your Eclipse runtime, you have to adjust the script’s run configuration. To do so, select Run > External Tools > External Tools Configurations... and select your Ant script’s run configuration. In the JRE tab you have to activate the option Run in the same JRE as the workspace to make the EMFText tasks available to the script.
The previous steps are mandatory to generate an initial implementation of basic tooling for your language. The generated text editor already comes with a number of advanced features that make editing language expressions considerably easier. However, there are various ways to make your language tooling more useful. EMFText helps you customise your language tooling with a number of additional functions, ranging from semantic validation, compilation and interpretation of language expressions to editor functions like folding, custom quickfixes, extended code completion, refactoring and more. To discover the full spectrum of possibilities, see Sect. 4.
An EMFText syntax specification must be contained in a file with the extension .cs and consists of four main blocks:
In the following sections, these four main blocks will be explained in more detail.
The first required piece of information is the file extension that shall be used for the files that will contain your models:
Note: The file extension must not contain the dot character.
Second, EMFText needs to know the EMF generator model (.genmodel) that contains the metaclasses for which the syntax is specified. EMFText uses the generator model rather than the Ecore model because it requires information about the code generated from the Ecore model (e.g., the fully qualified names of the classes generated by EMF). The genmodel can be referred to by its namespace URI:
To find the generator model with the given namespace URI, EMFText tries to load it from the generator model registry. If it is not registered, EMFText looks for a .genmodel file with the same name as the syntax definition. For example, if the syntax specification is contained in a file yourdsl.cs, EMFText looks for a file called yourdsl.genmodel in the same folder.
If your genmodel is not contained in the same folder or is called differently from the syntax file name or if you do not want to use the one in the registry, the optional parameter yourGenmodelLocation can be used:
The value of yourGenmodelLocation must be a URI pointing to the generator model. The URI can be absolute or relative to the syntax specification folder.
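The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not EMFText's implementation: the function name, the dictionary standing in for the generator model registry, and the rule that an explicitly given location takes precedence over the registry are assumptions.

```python
import os
from typing import Dict, Optional


def locate_genmodel(ns_uri: str,
                    syntax_file: str,
                    registry: Dict[str, str],
                    explicit_location: Optional[str] = None) -> str:
    """Return the location of the generator model for a syntax file.

    Hypothetical helper: `registry` maps namespace URIs to genmodel
    locations and stands in for the generator model registry.
    """
    syntax_dir = os.path.dirname(syntax_file)

    # Assumption: an explicitly given location overrides everything else.
    if explicit_location is not None:
        if "://" in explicit_location or os.path.isabs(explicit_location):
            return explicit_location
        # Relative locations are resolved against the syntax folder.
        return os.path.join(syntax_dir, explicit_location)

    # First choice: the generator model registry.
    if ns_uri in registry:
        return registry[ns_uri]

    # Fallback: a .genmodel with the same base name in the same folder.
    base, _ = os.path.splitext(os.path.basename(syntax_file))
    return os.path.join(syntax_dir, base + ".genmodel")
```

With an empty registry and no explicit location, a syntax file dsl/yourdsl.cs leads to dsl/yourdsl.genmodel, which mirrors the example given above.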
Third, the root element (start symbol) must be given. The root element must be a metaclass from the metamodel:
A CS specification can also have multiple root elements, which must be separated by a comma:
Typical candidates for root elements are metaclasses that do not have incoming containment edges.
Altogether a typical header for a .cs file looks something like:
Sometimes it is required to import additional metamodels, e.g., if they are only referenced in the current one and a syntax for some or all of its concepts needs to be specified or reused. Metamodels and syntax specifications can be imported in a dedicated import section, which must follow after the start symbols:
The list of imports must contain at least one entry. If no imports are needed, the whole section must be left out. An import entry consists of a prefix, which can be used to refer to imported elements in rules, the metamodel namespace URI and optionally the name of a concrete syntax defined for that metamodel. If a syntax is imported, all its rules are reused and need not be specified in the current cs specification. Importing syntax rules is optional. One can also just import the metamodel contained in the generator model.
The two locations are again optional. For resolving the generator model the same rules as for the “main” generator model (declared after the FOR keyword) apply. For locating the syntax, EMFText looks up the registry of registered syntax specifications. If no registered syntax is found, locationOfTheSyntax is used to find the .cs file to import. Again, locationOfTheSyntax must be a relative or absolute URI.
EMFText’s code generation can be configured using various options. These are specified in a dedicated optional OPTIONS section:
The list of valid options and their documentation can be found in Appendix A1.
EMFText allows you to specify custom tokens. Each token type has a name and is defined by a regular expression. This expression is used to group characters from the DSL files into tokens. Tokens are the smallest unit processed by the generated parser. By default, EMFText implicitly uses a set of predefined standard tokens, namely:
The predefined tokens can be explicitly excluded by using the usePredefinedTokens option:
To define custom tokens, a TOKENS section must be added to the .cs file. This section has the following form:
Every token name has to start with a capital letter. A regular expression must conform to the ANTLRv3 syntax for regular expressions (without semantic annotations). However, don’t worry: EMFText will complain if there is a problem with your regular expressions, such as typos or overlaps of regular expressions.
Sometimes, regular expressions are quite repetitive and one wants to reuse simple expressions to compose them to more complex ones. To do so, one can refer to other token definitions by their name. For example:
If token definitions are merely used as “helper” tokens, they can be tagged as FRAGMENT. This means the helper token itself is used in other token definitions, but not anywhere else in the syntax specification:
The regular expressions are composed the same way strings are composed in Java programs. Therefore, make sure to put parentheses around expressions where needed.
EMFText does automatically sort token definitions. However, sometimes token definitions might be ambiguous (i.e., the regular expressions defined for two different tokens are not disjoint). In such cases EMFText will always prefer more specific tokens over more general tokens. That is, if one token definition includes another one, the latter is preferred over the former. If the automatic token sorting fails, EMFText will report an error. In this case one must turn off the automatic sorting using the disableTokenSorting option and sort the tokens manually. If automatic token sorting is turned off, one can give a higher priority to imported tokens by using the following directive:
The PRIORITIZE directive can also be used with the predefined tokens TEXT, LINEBREAK and WHITESPACE.
To define the default syntax highlighting for a language, a special section TOKENSTYLES can be used. For each token or keyword the color and style (BOLD, ITALIC, STRIKETHROUGH, UNDERLINE) can be specified as follows:
The default highlighting can still be customized at runtime by using the generated preference pages.
For each concrete metaclass you can define a syntax rule. The rule specifies what the text that represents instances of the class looks like. Rules have two sides: a left-hand side and a right-hand side. The left-hand side denotes the name of the metaclass, while the right-hand side defines the syntax elements. If you have imported additional metamodels you can refer to their metaclasses using the prefix you’ve defined in the import statement. For example, pre.MetaClassA refers to MetaClassA from the metamodel with the prefix pre.
The most basic form of a syntax rule is:
This rule states that whenever the text someKeyword is found, an instance of YourMetaClass must be created. Besides text elements that are expected “as is”, parts of the syntax can be optional or repeating. For example the syntax rule:
states that instances of YourMetaClassWithOptionalSyntax can be represented both by #someKeyword and someKeyword. Similar behavior can be defined using a star instead of a question mark. The syntax enclosed in the parentheses can then be repeated. For example,
allows instances of metaclass YourMetaClassWithRepeatingSyntax to be represented by writing someKeyword, #someKeyword, ##someKeyword, or any other number of hash symbols followed by someKeyword. One can also use a plus sign instead of a star or question mark. In this case, the syntax enclosed in the parentheses can be repeated, but must appear at least once.
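The three repetition operators behave like their counterparts in regular expressions. The following Python sketch models the accepted texts with regular expressions; it is an analogy only, it ignores tokenisation and whitespace handling of the generated parser, and the pattern and function names are made up for illustration.

```python
import re

# Regular expressions standing in for the three repetition operators.
OPTIONAL = re.compile(r"#?someKeyword")       # question mark: zero or one
REPEATING = re.compile(r"#*someKeyword")      # star: zero or more
AT_LEAST_ONCE = re.compile(r"#+someKeyword")  # plus: one or more


def accepts(pattern, text):
    """Return True if the whole text matches the given pattern."""
    return pattern.fullmatch(text) is not None
```

For instance, accepts(AT_LEAST_ONCE, "someKeyword") is False, because the plus sign requires at least one hash symbol, whereas the star variant accepts the same text.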
If metaclasses have attributes, we can also specify syntax for their values. To do so, simply add brackets after the name of the attribute:
Optionally, one can specify the name of a token inside the brackets. For example:
If the token name is omitted, as in the first example, EMFText uses the predefined token TEXT, which includes alphanumeric characters (see Sect. 3.2). The found text is automatically converted to the type of the attribute. If this conversion is not successful, an error is raised when opening a file containing wrong syntax. For details on customizing the conversion of tokens, see Sect. 4.2.1.
Another possibility to specify the token definition that shall be used to match the text for the attribute value is to do it inline. For example,
can be used to express that the text for the value of the attribute yourAttribute must be enclosed in parentheses. Between the parentheses, arbitrary characters (except the closing parenthesis) are allowed. Other characters can be used as prefix and suffix here as well.
By default, the suffix character (in the example above, the closing parenthesis) cannot be part of the text for the attribute value. To allow this, an escape character needs to be supplied:
Here the backslash can be used inside the parentheses to escape the closing parenthesis. It must then also be used to escape itself. That is, one must write two backslash characters to represent one.
To give an example of how escaping works, consider the following text: (text(more\)). After parsing, this yields the attribute value text(more). The character sequence \) is replaced by ). Note that the opening parenthesis does not need to be escaped.
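The escaping behaviour can be reproduced with a small Python function. It is a sketch under the assumption of single-character prefix, suffix and escape strings; the function and parameter names are hypothetical and do not correspond to generated EMFText code.

```python
def read_quoted_value(text: str, prefix: str = "(", suffix: str = ")",
                      escape: str = "\\") -> str:
    """Extract an attribute value enclosed in prefix and suffix characters.

    The escape character protects the suffix and itself; the prefix
    needs no escaping inside the value.
    """
    if not text.startswith(prefix):
        raise ValueError("missing prefix")
    value = []
    i = len(prefix)
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text) and text[i + 1] in (suffix, escape):
            value.append(text[i + 1])  # drop the escape, keep the escaped character
            i += 2
        elif ch == suffix:
            return "".join(value)      # the first unescaped suffix ends the value
        else:
            value.append(ch)
            i += 1
    raise ValueError("missing suffix")
```

Applied to the text (text(more\)), the function returns text(more), as in the example above.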
For boolean attributes, EMFText provides a special feature to ease syntax specification. All that is required is to give the two strings that represent true and false. To give an example consider the following syntax rule:
This rule states that yes represents the true value and no represents false. You can also use the empty string for one of the values:
This way, the attribute is set to false by default and set to true if the text set is found.
For enumeration attributes, EMFText does also provide a special feature to ease syntax specification. For each literal of the enumeration, the corresponding string representation must be given. For example, consider the following syntax rule:
This rule states that r represents the literal red, g represents the literal green and b represents the literal blue. The literals of the enumeration are identified by their name. You can also use the empty string for one of the values:
This way, the attribute is set to blue by default.
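The mapping between strings and boolean or enumeration values, including the role of the empty string as default, can be sketched as a simple lookup. The following Python snippet is illustrative only; the function name and the two mapping tables are assumptions modelled on the examples above.

```python
from typing import Dict, Optional


def read_literal(found: Optional[str], mapping: Dict[str, object]) -> object:
    """Map the text found for an attribute to the attribute value.

    `found` is None when none of the non-empty strings occurred in the
    text; the value mapped to the empty string then acts as the default.
    """
    if found is None:
        found = ""
    if found not in mapping:
        raise ValueError("no value is represented by %r" % found)
    return mapping[found]


# Boolean attribute: the text "set" means true, its absence means false.
BOOLEAN_SYNTAX = {"set": True, "": False}

# Enumeration attribute: literals are identified by name, blue is the default.
COLOR_SYNTAX = {"r": "red", "g": "green", "": "blue"}
```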
Metaclasses can have references and consequently there is a way to specify syntax for these. EMF distinguishes between containment and non-containment references. In an EMF model, the elements that are referenced with the former type are contained in the parent elements. EMFText thus expects the text for the contained elements (children) to be also contained in the parent’s text.
Elements referenced by the latter (non-containment) type are merely pointed to; they are contained in some other (parent) element. Thus, EMFText does not expect text that represents the referenced element, but a symbolic identifier that refers to the element. This is very similar to the declaration and use of variables in Java. The declaration of a variable consists of the complete text that is required to describe a variable (e.g., its type). In contrast, when the variable is used at some other place it is simply referred to by its name. Non-containment references are similar to uses of variables.
A basic example for defining a rule for a metaclass that has a containment reference looks like this:
It allows instances of YourContainerMetaClass to be represented using the keyword CONTAINER followed by one instance of the type that yourContainmentReference points to. If multiple children need to be contained, the following rule can be used:
In addition, each containment reference can be restricted to allow only certain types, for example:
allows only instances of SubClass after the keyword CONTAINER, even though the reference yourContainmentReference may have a more general type. One can also add multiple subclass restrictions, which must then be separated by a comma:
A basic example for defining a rule for a metaclass that has a non-containment reference looks like this:
The rule is very similar to the one for containment references, but uses the additional brackets after the name of the reference. Within the brackets the token that the symbolic name must match can be defined. In the case above, the default token TEXT is used. Therefore, the syntax for an example instance of class YourPointerMetaClass can be POINTER a.
Since a is just a symbolic name that must be resolved to an actual model element, EMFText generates a Java class that resolves a to a target model element. This class can be customized to specify how symbolic names are resolved to model elements. The default implementation of the resolver looks for all model elements that have the correct type (the type of yourNonContainmentReference) and that have a name or id attribute that matches the symbolic name. For details on how to customize the resolving of references, see Sect. 4.2.2.
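The default resolving strategy can be pictured with the following Python sketch. It is not the generated Java resolver: the Element class is a hypothetical stand-in for a model element, and the sketch compares type names exactly, ignoring subtypes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a model element (an EObject in EMF)."""
    type_name: str
    name: Optional[str] = None
    id: Optional[str] = None


def resolve(identifier: str, expected_type: str,
            model: Iterable[Element]) -> List[Element]:
    """Return all elements of the expected type whose name or id matches."""
    return [
        element for element in model
        if element.type_name == expected_type
        and identifier in (element.name, element.id)
    ]
```

An empty result corresponds to an unresolved reference, and more than one result to an ambiguous one; how such cases are reported is beyond this sketch.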
By default, EMFText can print all kinds of models. It also preserves the layout of the textual representation when models are parsed and printed later on. However, to print models that have been created in memory, additional information can be passed to EMFText to customize the print result. This (optional) information includes the number of whitespaces and line breaks to be inserted between keywords, attribute values, references and contained elements. If you do not want to print models to text, printing instructions are not needed in your .cs file.
To explicitly print whitespace characters, the # operator can be used on the right side of syntax rules:
It is followed by a number that determines the number of whitespaces to be printed. In the example above, two whitespace characters are printed between the keyword and the attribute value.
To explicitly print line breaks, the ! operator can be used on the right side of syntax rules:
It is followed by a number that determines the number of tab characters that shall be printed after the line break. In the example above, a line break is printed after keyword. The number of tabs refers to the current model element (i.e., EObject), which is printed. To print contained objects with an indentation of one tab, you can use a rule like this:
Here, the first line break operator (!1) makes sure that all the contained objects appear on a new line and that they are preceded by one tab character. The second line break operator (!0) tells EMFText to print the closing brace (}) also on a new line, but without a leading tab.
When defining syntax for an expression language (e.g., arithmetic expressions), EMFText’s standard mechanisms for specifying syntax can lead to structures that cannot be handled optimally by an interpreter or evaluator. Furthermore, the underlying parser generator technology used by EMFText causes problems if left-recursive rules are required to build an optimal expression tree, which is the case for all expression languages with left-associative binary operators (e.g., -). Therefore, EMFText provides a special feature called operator precedence annotations (@Operator). These annotations can be added to all rules that refer to expression metaclasses with a common superclass. For example, the rule:
defines syntax for a metaclass Additive. The references left and right must be containment references and have the type Expression, which is the abstract supertype for all metaclasses of the expression metamodel.
The type attribute specifies the kind of expression at hand, which can be binary (either left_associative or right_associative), unary_prefix, unary_postfix or primitive.
The weight attribute specifies the priority of one expression type over another. For example, if a second rule:
is present, EMFText will create an expression tree, where Multiplicative nodes are created last (i.e., multiplicative expressions take precedence over additive expressions).
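To illustrate how weights and left associativity shape an expression tree, the following Python sketch uses the well-known precedence climbing technique. It is not the parser EMFText generates; the weight table is hypothetical and assumes that a higher weight means tighter binding.

```python
import re

# Hypothetical weights: a higher weight binds more tightly.
WEIGHTS = {"+": 1, "-": 1, "*": 2, "/": 2}


def tokenize(text):
    """Split an arithmetic expression into numbers and operator symbols."""
    return re.findall(r"\d+|[-+*/]", text)


def parse(tokens, min_weight=1):
    """Parse left-associative binary expressions by precedence climbing.

    Consumes tokens from the front of the list and returns nested tuples
    of the form (operator, left, right); numbers are primitive expressions.
    """
    left = int(tokens.pop(0))  # primitive expression
    while tokens and WEIGHTS[tokens[0]] >= min_weight:
        operator = tokens.pop(0)
        # Requiring a strictly higher weight on the right-hand side makes
        # operators of equal weight group to the left.
        right = parse(tokens, WEIGHTS[operator] + 1)
        left = (operator, left, right)
    return left
```

Parsing 1 - 2 - 3 yields ("-", ("-", 1, 2), 3), the tree a left-associative operator requires, while in 1 + 2 * 3 the multiplicative node ends up below the additive one.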
Unary expressions can be defined as follows:
There is also the option to define unary_postfix rules.
Primitive expressions can be defined as follows:
They should be used for literals (e.g., numbers, constants or variables).
One can certainly mix syntax rules that use the @Operator annotation with ones that do not in the same CS specification. However, one must be careful with the inheritance hierarchy in the metamodel in this case. All rules that use the @Operator annotation must refer to a metaclass that extends the metaclass specified with the superclass attribute. For subclasses of this superclass there must not be other non-@Operator rules. One could say that subtrees of the metaclass hierarchy must be either consistently specified as @Operator rules or not. Mixing is not possible.
For examples of how to use @Operator annotations, see the SimpleMath language in the EMFText Syntax Zoo and the ThreeValuedLogic DSL. These also come with an interpreter, which shows how expression trees can be evaluated.
EMFText supports partial reuse of syntax definitions by importing them and overriding rules. Rules can be redefined in the importing syntax by adding an @Override annotation to the overriding rule. You can also remove imported rules by using @Override(remove="true").
Please replace importPrefix with the prefix that you have assigned to the imported syntax in the import statement (see Sect. 3.1.2).
To suppress warnings issued by EMFText in .cs files one can use the @SuppressWarnings annotation. This annotation can be added to rules, token definitions or complete syntax definitions. One can either suppress all warnings or just specific types. To suppress all warnings for a syntax use the following syntax:
A list of all warning types can be found in Appendix A2. For example, to suppress warnings about features without syntax, you may use:
To adjust DSL plug-ins generated by EMFText to specific needs, there are three different customization techniques. Each of the subsequent sections describes one of them.
The simplest way to customize generated artifacts is to tell EMFText that it must not override a specific class or file that needs to be changed. For all artifacts that are generated by EMFText there is an override option, which can be set to false to preserve such manual changes (see Appendix A1 for a complete list). For example, to customize the hover text shown when the mouse pointer rests on an element in the editor, the overrideHoverTextProvider option must be set to false.
For all files that do not depend on the rules defined in the .cs file, this customization technique is fine. These files do not change, if new rules are added or existing ones are changed. Thus, manual changes will not cause conflicts if the syntax evolves. Only when EMFText is updated and the code generators are replaced, one may want to compare the manually adjusted files with the ones generated by the new EMFText version to see whether all customizations are still correct. This does particularly apply to generated manifest files and plug-in descriptors. A list of all classes that are syntax dependent can be found in Appendix A3.
For all files that do depend on the rules defined in the .cs file, another customization technique is more appropriate. Instead of setting the override option to false for the artifact that needs to be changed, one can set the override option for the meta information classes to false.
Each of the two generated resource plug-ins contains a meta information class. These are called XyzMetaInformation and XyzUIMetaInformation. Both classes provide factory methods to create instances of some important classes (e.g., createParser() or createPrinter()). To customize these classes (e.g., the printer) one can change the create methods to return instances of subclasses of the original classes. By using subclasses instead of overriding the classes directly, one can regenerate the resource plug-ins and thereby obtain new up-to-date classes, but still make customizations by overriding individual methods.
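The idea of customizing through factory methods and subclasses can be sketched as follows. The classes are Python stand-ins with hypothetical names and methods; the actual generated classes are Java classes whose methods are not shown in this section.

```python
class Printer:
    """Stand-in for a generated printer class that may be regenerated."""

    def print_keyword(self, keyword: str) -> str:
        return keyword

    def print_element(self, keyword: str, value: str) -> str:
        return self.print_keyword(keyword) + " " + value


class UpperCasePrinter(Printer):
    """Hand-written subclass that overrides a single method."""

    def print_keyword(self, keyword: str) -> str:
        return keyword.upper()


class MetaInformation:
    """Stand-in for XyzMetaInformation, which is excluded from regeneration."""

    def create_printer(self) -> Printer:
        # Manual change: return the subclass instead of the generated class.
        return UpperCasePrinter()
```

Because only the factory method and the subclass are written by hand, the Printer base class can be regenerated without losing the customization.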
In addition to overriding generated classes—either directly or using the meta information factory methods—one can use the extension points that are generated by EMFText for all DSLs. Currently EMFText generates two extension points for each DSL—default_load_options and additional_extension_parser.
The former can be used to customize how resources are loaded. For example, post processors can be registered which apply changes to the models that are created from their textual representation (see Sect. 4.2.3). Also, pre processors can be registered to process the input before it is actually passed to the parser. This is particularly useful for handling Unicode characters (see the JaMoPP implementation for an example of how to use it).
The latter extension point can be used to register additional parsers which can handle a particular file extension. EMF on its own maps one file extension to one resource factory, but sometimes it is useful to have multiple resource types for the same file extension. An example of how to use this extension point can be found in the textual syntax for Ecore.
To create models from their textual representation, it is necessary to convert the plain text found in Domain-specific Language (DSL) documents to attribute values (i.e., data types). For example, if the string