I've been doing some work recently with Antlr, parsing a syntax file from an ERP system that lets you export the changes you've made on one system and apply them to another.
The problem was that the syntax was simple enough to be parsed by plain string cutting, yet complicated and ambiguous enough to make building an entire hand-rolled framework around parsing it a headache.
So I turned to Antlr. The thing is, Antlr is great at parsing all kinds of stuff, but it's not so good with fixed-length strings. There's no native way to tell the lexer that you want exactly a certain length of text and no more, especially when that length comes from a previously matched token. For example, in "4test" the 4 is the length of the text and "test" is the text itself.
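To make the format concrete, here is a minimal sketch in plain C#, without Antlr, of what reading such a length-prefixed string involves. The class and method names (LengthPrefixed, Read) are hypothetical and exist only for this illustration; it assumes the digits sit directly in front of the text, as in "4test".

```csharp
using System;

public static class LengthPrefixed
{
    // Reads "<digits><text>", where <digits> gives the length of <text>.
    // Returns the text and sets 'rest' to whatever follows it.
    public static string Read(string input, out string rest)
    {
        // Find where the leading run of digits ends.
        int i = 0;
        while (i < input.Length && input[i] >= '0' && input[i] <= '9')
        {
            i++;
        }
        if (i == 0)
        {
            throw new FormatException("Expected a length prefix.");
        }

        int length = int.Parse(input.Substring(0, i));
        if (i + length > input.Length)
        {
            throw new FormatException("Input is shorter than the declared length.");
        }

        // Take exactly 'length' characters and no more.
        rest = input.Substring(i + length);
        return input.Substring(i, length);
    }
}
```

Calling LengthPrefixed.Read("4testXYZ", out rest) takes exactly four characters after the prefix and leaves the remainder in rest. This is easy to do by hand, but a lexer rule has no built-in way to say "match exactly this many characters".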
I did some googling and didn't find a conclusive answer to this problem, so I wrote a bit of code that works fine for me, though you'd probably have to change it a little to match your requirements.
In the lexer, you add the following token:
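// Note: 'start' is not declared in this rule. It is assumed to be an int
// declared elsewhere in the lexer (for example in @lexer::members).
FIXEDLENGTHSTRING : length=INT {start = input.Mark();} WS
    {
        bool failedToMatch = false;
        int textLength = int.Parse( $length.Text );
        StringBuilder sb = new StringBuilder();
        if (textLength > 0)
        {
            // Try to consume exactly textLength characters.
            for (int i = 0 ; i < textLength ; i++)
            {
                if (!failedToMatch)
                {
                    int currentChar = input.LA(1);
                    // must start with a capital letter
                    if ((i == 0) && (currentChar < 'A' || currentChar > 'Z'))
                    {
                        failedToMatch = true;
                    }
                    else
                    {
                        // only letters, digits and spaces are allowed
                        if ((currentChar >= 'A' && currentChar <= 'Z') || (currentChar >= 'a' && currentChar <= 'z') || (currentChar == ' ') || (currentChar >= '0' && currentChar <= '9'))
                        {
                            sb.Append((char)currentChar);
                            input.Consume();
                        }
                        else
                        {
                            failedToMatch = true;
                        }
                    }
                }
            }
            if (failedToMatch)
            {
                // Not a fixed length string after all: go back to the mark and emit an INT.
                input.Rewind(start);
                $type = INT;
            }
            else
            {
                // The token's text is just the string, without the length prefix.
                $text = sb.ToString();
            }
        }
        else
        {
            // A length of zero is treated as a plain INT.
            input.Rewind(start);
            $type = INT;
        }
    };
WS : (' '|'\t')+ ;
INT : ('0'..'9')+ ;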
Here's a quick overview. In my case, the format was <number> <space> <text>, so as you can see, I've defined the INT token as a sequence of digits, and the WS token as one or more spaces or tabs.
Sometimes it was ambiguous whether the upcoming text was really a fixed-length string or just a number that happened to be followed by a space, so in some cases I had to backtrack and declare that this was actually an INT token.
If you look at the first line of the rule, you'll see that I mark the position in the input character stream right after the INT token. If I have to roll back, I rewind to this spot and claim that the token is actually an INT. Rewinding also restores any characters I might have consumed, so that other tokens can have a go at them.
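The mark-and-rewind idea can be sketched outside Antlr as well. MiniCharStream below is a made-up stand-in for the lexer's character stream, not Antlr's own class; it only assumes that a marker can be as simple as a saved position. TryConsumeLetters is likewise a hypothetical helper for this illustration.

```csharp
using System;

// Hypothetical stand-in for a lexer's character stream (not Antlr's class).
public class MiniCharStream
{
    private readonly string data;
    private int position;

    public MiniCharStream(string data) { this.data = data; }

    // Look at the next character without consuming it; -1 at end of input.
    public int LA1() { return position < data.Length ? data[position] : -1; }

    // Move past the next character, if there is one.
    public void Consume() { if (position < data.Length) position++; }

    // In this sketch a marker is simply the current position.
    public int Mark() { return position; }

    // Go back to a previously marked position.
    public void Rewind(int marker) { position = marker; }

    public string Remaining() { return data.Substring(position); }
}

public static class MarkRewindDemo
{
    // Tries to consume 'count' letters. On failure the stream is rewound,
    // so it looks as if nothing had been consumed at all.
    public static bool TryConsumeLetters(MiniCharStream input, int count)
    {
        int start = input.Mark();
        for (int i = 0; i < count; i++)
        {
            int c = input.LA1();
            if (c < 0 || !char.IsLetter((char)c))
            {
                input.Rewind(start);
                return false;
            }
            input.Consume();
        }
        return true;
    }
}
```

On the input "ab1cd", asking for three letters fails and leaves the stream untouched, while asking for two succeeds and leaves "1cd". The lexer rule does the same thing, with the extra step of changing the token type to INT when it rewinds.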
You can see the rollback in two places toward the end of the code, as input.Rewind(start) followed by $type = INT; once when the match fails and once when the length is zero.
Note that I'm capturing the length token with "length=INT" and converting its text to an integer later on, while I'm still inside the same lexer rule.
The parts after that are just a simple loop that checks some prerequisites to make sure that this text is really what I want, consuming the input with input.Consume() if it's OK.
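The prerequisites the loop checks can also be written as a standalone predicate, which is handy for trying out inputs without regenerating the lexer. This is only a sketch: FixedLengthRules, IsAllowed and Matches are hypothetical names, and the rules are the ones from my grammar (first character A-Z, the rest letters, digits or spaces), so adjust them to your own format.

```csharp
public static class FixedLengthRules
{
    // Same character classes as the lexer action: A-Z, a-z, 0-9 and space.
    public static bool IsAllowed(int c)
    {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == ' ';
    }

    // True when the first 'length' characters of 'candidate' start with a
    // capital letter and contain only allowed characters.
    public static bool Matches(string candidate, int length)
    {
        if (length <= 0 || candidate.Length < length)
        {
            return false;
        }
        if (candidate[0] < 'A' || candidate[0] > 'Z')
        {
            return false;
        }
        for (int i = 0; i < length; i++)
        {
            if (!IsAllowed(candidate[i]))
            {
                return false;
            }
        }
        return true;
    }
}
```

With these rules, Matches("Test rest", 4) is true, while "test" (no leading capital), "Te-t" (a disallowed character) and "Te" (too short) are all rejected for a length of 4. Those are the cases where the lexer rule falls back to INT.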
The trick is in the line "$text = sb.ToString();". It lets me handle the FIXEDLENGTHSTRING token as a single entity later on in the parser and get back just the text that I really want. Using it in the parser grammar becomes as simple as "firstEntityName=FIXEDLENGTHSTRING", followed by "{ currentNode.SubItems.Add(new LinkEntitiesUpdateDetail($firstEntityName.text,$firstEntityType.text,$secondEntityName.text,$secondEntityType.text,$secondEntityOrder.text)); }"
Footnotes:
1. This is relevant to Antlr 3.0
2. I'm referring to INT and WS as tokens, but since we're inside the lexer, this isn't accurate; they're really just pieces of text at this point.
3. In Antlr, internal variables are referenced with $variable, so writing $type is converted to _type, and $variable.text is converted to variable.Text(). Those variable names may not have changed since Antlr 2.x, but using the $ format makes sure that if they do change, you'll be safe. Also, Antlr has quite a lot of code generation language targets, so this makes it easier to copy and paste your grammar between targets.
Posted Apr 28 2008, 11:07 AM by admin