A primer on implementing DSLs in ABAP
I’ve been interested in programming language implementation techniques and domain-specific languages for quite some time now and have implemented a few DSLs, but not yet in ABAP. Browsing around a bit, I did not find much about implementing DSLs in ABAP; one notable exception is “Use Macros. Use Them Wisely”, which shows a way of writing an internal DSL in ABAP.
With this post I’d like to propose a recipe for implementing an external DSL in ABAP. I will provide some helper classes as well, but my main intention is to show how a DSL can be parsed without too much effort in pure ABAP. I think that the main obstacle with DSLs is getting started and finding an approach that works.
The process of parsing a DSL can be broken down into a sequence of rather small steps. So let’s get started…
As an example, we’ll create a parser for a list of dates and date ranges. For example, our parser shall be able to transform an input string
2015-01-01 - 2015-01-15, 2015-01-20, 2015-10-10 - 2015-10-12
into an internal table of (start date, end date) entries. This is not the most practical DSL on earth, for sure, but it allows us to explore the stepwise implementation of a DSL parser without getting caught up in too many details.
Step 1 – Define an EBNF grammar for your DSL
The first step is to come up with an EBNF grammar for our DSL.
date-list = date-entry { "," date-entry }
date-entry = date [ "-" date ]
date = "\d\d\d\d-\d\d?-\d\d?"
A date-list is a non-empty, comma-separated list of date-entries.
A date-entry is either a single date or a date range.
A date is described as a regular expression which can be used for matching date strings.
If you are not familiar with grammars described in EBNF or a similar form, I suggest you search the web a bit and take a look at different grammars. The exact syntax you choose for describing the grammar is not that important.
In the following steps I’ll use the following terms:
- a terminal is a symbol given by a regular expression
- if the terminal is given by “name = regex” then I call it a named terminal
- if the terminal is just a regex (like “,” and “-” above) then I call it an unnamed terminal
- a non-terminal is a symbol defined by a production rule which contains the non-terminal on the left-hand side and terminals, non-terminals and EBNF-symbols on the right-hand side
Step 2 – Check out the DSL toolkit
For getting up and running with implementing a DSL, check out the include source code in ZDSLTK_CORE_1.abap in the source folder of my DSL toolkit GitHub repository. Create the include in your test system and create a report for the upcoming implementation. In the report, include ZDSLTK_CORE_1.
We’ll talk about the local classes in the include in the following steps.
Step 3 – Define a custom node class for your DSL
The include ZDSLTK_CORE_1 contains the definition of a node class lcl_node. Each node will have a token type ID (attribute mv_token_id). The token types depend on the concrete DSL. In our case, we need token type ids for the date-list, date-entry and date (non-)terminals.
Therefore I create a sub-class lcl_dates_node of lcl_node and define constants as follows:
CLASS lcl_dates_node DEFINITION INHERITING FROM lcl_node.
  PUBLIC SECTION.
    CONSTANTS: BEGIN OF gc_token_type,
                 date_list  TYPE lif_dsltk_types=>mvt_token_id VALUE 1,
                 date_entry TYPE lif_dsltk_types=>mvt_token_id VALUE 2,
                 date       TYPE lif_dsltk_types=>mvt_token_id VALUE 3,
               END OF gc_token_type.
ENDCLASS.
The rule is: create a separate token type for each non-terminal and for each named terminal of your grammar.
Step 4 – Create a parser sub-class
Our small DSL toolkit contains an abstract parser class lcl_parser. In this and all of the following steps we will create a concrete sub-class, which implements our parsing logic.
Create a sub-class lcl_dates_parser of lcl_parser. Redefine the protected methods parse_input and create_node with (for now) empty implementations.
Now implement create_node to return a new instance of lcl_dates_node, passing on all parameters to the constructor. create_node serves as a factory method for node objects. Since lcl_dates_node neither redefines any method nor adds any attributes, the factory method is admittedly not very exciting in this example.
After this step my parser class looks like this:
CLASS lcl_dates_parser DEFINITION INHERITING FROM lcl_parser.
  PROTECTED SECTION.
    METHODS: parse_input REDEFINITION,
             create_node REDEFINITION.
ENDCLASS.
CLASS lcl_dates_parser IMPLEMENTATION.
  METHOD parse_input.
  ENDMETHOD.
  METHOD create_node.
    ro_node = NEW lcl_dates_node(
      iv_token_id = iv_token_id
      iv_token    = iv_token
      is_code_pos = is_code_pos ).
  ENDMETHOD.
ENDCLASS.
Step 5 – Implement read methods for the terminals
The parser class lcl_parser provides a protected method read_token which we’ll use to create individual read_… methods for our terminal tokens.
Step 5.1: For each unnamed terminal X, create a new private method read_X in your parser class that has no parameters but declares lcx_parsing as an exception. Implement the method by calling read_token with the regular expression that shall be used for matching the unnamed terminal.
Step 5.2: For each named terminal Y, create a new private method read_Y in your parser class that returns an lcl_node object and may raise lcx_parsing. Implement the method by calling read_token with the following parameters:
– iv_regex = the regular expression to use for matching the token, including parts of the regex enclosed in (…) to extract the token text
– iv_token_id = the token ID that shall be used for the created node object
– iv_token_text = the text that shall be used for instantiating a lcx_parsing in case that the requested token cannot be parsed
For our dates parser this looks as follows:
CLASS lcl_dates_parser DEFINITION INHERITING FROM lcl_parser.
  …
  PRIVATE SECTION.
    METHODS:
      read_comma RAISING lcx_parsing,
      read_dash  RAISING lcx_parsing,
      read_date  RETURNING VALUE(ro_node) TYPE mrt_node RAISING lcx_parsing.
ENDCLASS.
CLASS lcl_dates_parser IMPLEMENTATION.
  …
  METHOD read_comma.
    read_token( ',' ).
  ENDMETHOD.
  METHOD read_dash.
    read_token( '-' ).
  ENDMETHOD.
  METHOD read_date.
    ro_node = read_token(
      iv_regex      = '(\d\d\d\d-\d\d?-\d\d?)'
      iv_token_id   = lcl_dates_node=>gc_token_type-date
      iv_token_text = 'date' ).
  ENDMETHOD.
ENDCLASS.
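To make the idea behind a read_token-style method more tangible, here is a minimal, self-contained sketch of matching a regular expression at the current input position. It is not the toolkit's actual implementation: the class lcl_scanner_sketch, the exception lcx_no_match and the method read are names I made up for illustration, and the sketch works on a single string instead of a table of source lines.

```abap
" Hypothetical exception, standing in for the toolkit's lcx_parsing.
CLASS lcx_no_match DEFINITION INHERITING FROM cx_static_check.
ENDCLASS.

" Hypothetical scanner: keeps an offset into one input string.
CLASS lcl_scanner_sketch DEFINITION.
  PUBLIC SECTION.
    METHODS:
      constructor IMPORTING iv_input TYPE string,
      " Matches iv_regex at the current position (leading whitespace is
      " skipped), returns the matched text and advances the position.
      read IMPORTING iv_regex       TYPE string
           RETURNING VALUE(rv_text) TYPE string
           RAISING   lcx_no_match.
  PRIVATE SECTION.
    DATA: mv_input  TYPE string,
          mv_offset TYPE i.
ENDCLASS.

CLASS lcl_scanner_sketch IMPLEMENTATION.
  METHOD constructor.
    mv_input = iv_input.
  ENDMETHOD.

  METHOD read.
    DATA lv_length TYPE i.
    " Nothing left to read.
    IF mv_offset >= strlen( mv_input ).
      RAISE EXCEPTION TYPE lcx_no_match.
    ENDIF.
    " Anchor the pattern at the start of the remaining input.
    DATA(lv_rest)  = substring( val = mv_input off = mv_offset ).
    DATA(lv_regex) = `^\s*(` && iv_regex && `)`.
    FIND REGEX lv_regex IN lv_rest
      MATCH LENGTH lv_length
      SUBMATCHES rv_text.
    IF sy-subrc <> 0.
      RAISE EXCEPTION TYPE lcx_no_match.
    ENDIF.
    " Consume the whitespace and the token.
    mv_offset = mv_offset + lv_length.
  ENDMETHOD.
ENDCLASS.
```

Calling read with the date regex and then with `-` on the input `2015-01-01 - 2015-01-15` consumes one token per call. A failed match leaves the position untouched, which is what makes trying alternatives cheap.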
Step 6 – Implement parse methods for the non-terminals
In the previous step we created methods for reading individual terminals. Now we’ll put these parts together and create methods for parsing larger parts of the input. We use the production rules of the grammar for structuring this: For each non-terminal X we create a new private method parse_X that returns an lcl_node object and may raise lcx_parsing (i.e. the same signature as we used for the read methods for named terminals):
    METHODS:
      parse_date_list  RETURNING VALUE(ro_node) TYPE mrt_node RAISING lcx_parsing,
      parse_date_entry RETURNING VALUE(ro_node) TYPE mrt_node RAISING lcx_parsing.
In the method implementations we do the following steps:
1. Create a new node object
2. Use the right-hand side of the production rule to invoke other parse_… or read_… methods
  METHOD parse_date_entry.
    " Production rule:
    " date-entry = date [ "-" date ]
    ro_node = create_node( lcl_dates_node=>gc_token_type-date_entry ).
    DATA(lo_date_1) = read_date( ).
    ro_node->add_child( lo_date_1 ).
    TRY.
        push_offsets( ).
        read_dash( ).
        DATA(lo_date_2) = read_date( ).
        ro_node->add_child( lo_date_2 ).
        pop_offsets( ).
      CATCH lcx_parsing.
        reset_offsets( ).
    ENDTRY.
  ENDMETHOD.
Here we see the right-hand side of the production rule disguised as “read_date() … read_dash()… read_date().” The rest of the code is housekeeping:
– if we read a named terminal we add it as child to our new date-entry node
– the optional part of the production rule is enclosed in a TRY-CATCH block together with saving and resetting the current source code position correctly
  METHOD parse_date_list.
    " Production rule:
    " date-list = date-entry { "," date-entry }
    ro_node = create_node( lcl_dates_node=>gc_token_type-date_list ).
    DATA(lo_date_entry) = parse_date_entry( ).
    ro_node->add_child( lo_date_entry ).
    DO.
      TRY.
          push_offsets( ).
          read_comma( ).
          lo_date_entry = parse_date_entry( ).
          ro_node->add_child( lo_date_entry ).
          pop_offsets( ).
        CATCH lcx_parsing.
          reset_offsets( ).
          EXIT.
      ENDTRY.
    ENDDO.
  ENDMETHOD.
Here we see how a repetition can be implemented: we use a DO loop that runs until parsing fails, and we remember to update the source code position correctly.
Finally we can implement the redefined parse_input method by just delegating to the parse_… method of our start symbol:
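The offset housekeeping used in both parse methods can be pictured as a small stack of saved positions. The following is only a sketch of that idea under my own assumptions, not the code from ZDSLTK_CORE_1: the class lcl_offset_stack_sketch and its methods advance, push_offset, pop_offset and reset_offset are hypothetical, and a single integer stands in for the source code position.

```abap
" Hypothetical position bookkeeping for backtracking.
CLASS lcl_offset_stack_sketch DEFINITION.
  PUBLIC SECTION.
    DATA mv_offset TYPE i READ-ONLY.
    METHODS:
      advance IMPORTING iv_by TYPE i,
      push_offset,   " remember the current position
      pop_offset,    " attempt succeeded: drop the saved position
      reset_offset.  " attempt failed: return to the saved position
  PRIVATE SECTION.
    DATA mt_saved TYPE STANDARD TABLE OF i WITH EMPTY KEY.
ENDCLASS.

CLASS lcl_offset_stack_sketch IMPLEMENTATION.
  METHOD advance.
    mv_offset = mv_offset + iv_by.
  ENDMETHOD.

  METHOD push_offset.
    APPEND mv_offset TO mt_saved.
  ENDMETHOD.

  METHOD pop_offset.
    " Keep the current position, forget the most recently saved one.
    DATA(lv_last) = lines( mt_saved ).
    DELETE mt_saved INDEX lv_last.
  ENDMETHOD.

  METHOD reset_offset.
    " Restore the most recently saved position and remove it.
    DATA(lv_last) = lines( mt_saved ).
    mv_offset = mt_saved[ lv_last ].
    DELETE mt_saved INDEX lv_last.
  ENDMETHOD.
ENDCLASS.
```

Because saved positions are stacked, nested optional parts (a date-entry inside the repetition of a date-list) each restore their own starting point. Every push must be paired with exactly one pop or reset.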
  METHOD parse_input.
    ro_node = parse_date_list( ).
  ENDMETHOD.
Now we’ve implemented our parser class fully, so it’s time for a first test.
START-OF-SELECTION.
  DATA(go_parser) = NEW lcl_dates_parser( ).
  BREAK-POINT.
  DATA(lt_error_messages) = go_parser->parse( it_input = VALUE #(
    ( ` 2015-01-01 - 2015-01-15, 2015-01-20, 2015-10-10 ` )
    ( ` - 2015-10-12 ` )
  ) ).
  LOOP AT lt_error_messages INTO DATA(ls_msg).
    WRITE: /, ls_msg.
  ENDLOOP.
  BREAK-POINT.
If you take a look at go_parser->mo_root_node in the debugger at the second break-point you’ll see that the parser indeed created a parse tree.
Step 7 – Decide what to do with the parse tree
After the last step our parser is indeed finished and we could work with the created parse tree. However, for lots of applications we don’t really need the full parse tree, but can perform our actions inside the parse methods.
In our running example we could e.g. add a sorted table of (start date, end date) entries to our parser class lcl_dates_parser and populate it directly in either the parse_date_list method or the parse_date_entry method. If parsing finished without errors, we could then retrieve this table and work with it afterwards instead of passing around the parse tree.
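As a rough sketch of such a result table, the following stand-alone class collects (start date, end date) entries from date strings in the format of our grammar. The class lcl_date_ranges_sketch and its members are hypothetical names of mine; in the real parser the same logic would live in lcl_dates_parser and be fed with the token texts of the date nodes, which I have left out because it depends on the lcl_node API.

```abap
" Hypothetical collector for the parsed date ranges.
CLASS lcl_date_ranges_sketch DEFINITION.
  PUBLIC SECTION.
    TYPES: BEGIN OF mst_range,
             start_date TYPE d,
             end_date   TYPE d,
           END OF mst_range,
           mtt_ranges TYPE SORTED TABLE OF mst_range
                      WITH NON-UNIQUE KEY start_date end_date.
    DATA mt_ranges TYPE mtt_ranges READ-ONLY.
    " A single date is stored as a range with identical start and end.
    METHODS add_entry
      IMPORTING iv_start TYPE string
                iv_end   TYPE string OPTIONAL.
  PRIVATE SECTION.
    METHODS to_date
      IMPORTING iv_text        TYPE string
      RETURNING VALUE(rv_date) TYPE d.
ENDCLASS.

CLASS lcl_date_ranges_sketch IMPLEMENTATION.
  METHOD add_entry.
    DATA ls_range TYPE mst_range.
    ls_range-start_date = to_date( iv_start ).
    IF iv_end IS NOT INITIAL.
      ls_range-end_date = to_date( iv_end ).
    ELSE.
      ls_range-end_date = ls_range-start_date.
    ENDIF.
    INSERT ls_range INTO TABLE mt_ranges.
  ENDMETHOD.

  METHOD to_date.
    " Converts YYYY-M-D or YYYY-MM-DD into the internal format YYYYMMDD.
    DATA: lv_month TYPE n LENGTH 2,
          lv_day   TYPE n LENGTH 2.
    SPLIT iv_text AT '-' INTO DATA(lv_year_text)
                              DATA(lv_month_text)
                              DATA(lv_day_text).
    lv_month = lv_month_text.  " '1' becomes '01'
    lv_day   = lv_day_text.
    rv_date  = lv_year_text && lv_month && lv_day.
  ENDMETHOD.
ENDCLASS.
```

Keeping the table sorted by start date means callers can loop over the ranges in chronological order regardless of the order in the input. Note that to_date does not check whether the date exists in the calendar.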
I think we covered a lot of ground in this post, although there is more to be done: we could (and should!) add more syntax checking (2015-13-01 should be recognized as illegal input, for example).
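One possible shape for such a check is sketched below: a plain calendar test on year, month and day that could be called from read_date before the node is accepted. The class lcl_date_check_sketch and its method is_valid are hypothetical and not part of the toolkit; how a failure is turned into an lcx_parsing error is left open here.

```abap
" Hypothetical plausibility check for a parsed date.
CLASS lcl_date_check_sketch DEFINITION.
  PUBLIC SECTION.
    CLASS-METHODS is_valid
      IMPORTING iv_year         TYPE i
                iv_month        TYPE i
                iv_day          TYPE i
      RETURNING VALUE(rv_valid) TYPE abap_bool.
ENDCLASS.

CLASS lcl_date_check_sketch IMPLEMENTATION.
  METHOD is_valid.
    DATA lv_days TYPE i VALUE 31.

    IF iv_month < 1 OR iv_month > 12 OR iv_day < 1.
      RETURN.  " rv_valid stays abap_false
    ENDIF.

    " Number of days in the given month (Gregorian leap year rule).
    CASE iv_month.
      WHEN 4 OR 6 OR 9 OR 11.
        lv_days = 30.
      WHEN 2.
        IF ( iv_year MOD 4 = 0 AND iv_year MOD 100 <> 0 )
           OR iv_year MOD 400 = 0.
          lv_days = 29.
        ELSE.
          lv_days = 28.
        ENDIF.
    ENDCASE.

    rv_valid = xsdbool( iv_day <= lv_days ).
  ENDMETHOD.
ENDCLASS.
```

With this check, month 13 in 2015-13-01 is rejected, as is 29 February in a non-leap year such as 2015.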
I hope you enjoyed our tour into parsing techniques! Let me know your thoughts on DSLs in ABAP. I’m looking forward to some interesting discussions with you.
I wish you all a merry Christmas and a happy new year!