syntax.xle – syntaktischer zucker (II)

Posted on June 8, 2021 by na

english title: how to not get mad, yet, parsing

general disclaimer: these lecture notes are personal and might make as little sense to you as they do to me. I will only share a small subset of them, when I think it might be more helpful than confusing and more correct than wrong. use at own risk. if something you read here evokes strong emotions (like deep disagreement or amour fou) and you can’t stand it, feel free to contact me, if you know how: narnold@cl…

in this article a simple framework of debugging shortcuts will be built around local xlerc files. to install it, simply copy the text into a file called xlerc and run xle from the same working directory.

setup / preamble

# fixes äüß etc. input from rc files
set-character-encoding stdio utf-8   

# adjust filename here to fit. 
# starting xle will immediately (try to) load this grammar
# and create the parser
create-parser g5.lfg;


# if you change this xlerc file and want to reload it:
proc sorc {} {source xlerc}

## h for help
proc h {} {help}

shortcuts

typing single letters will be our main method of speeding up the debugging process. c is short for create-parser and r is short for reload, the name that is more common in interactive shells today. both do the same thing:

proc r {} {create-parser g5.lfg; lex .}
proc c {} {r}

note that the grammar file is hardcoded here. if you want to pass the filename (without .lfg) as an argument, use the following:

proc rc {A} {create-parser $A.lfg}

xle often opens a lot of windows. you can close them with the command close-all; we will use killall and kill as alternative names and ka and k as shortcuts.

# killall

proc k {} {close-all}
proc ka {} {k}
proc kill {} {k}
proc killall {} {k}

more advanced debugging

there are a few more useful commands for debugging (probably there are more, but those are the ones I got working and found useful).

just to get some kind of mnemonic structure into our shortcuts, we use 3-letter abbreviations for the more advanced debugging commands, mainly to avoid potential overlaps.

proc lex {word} {print-lex-entry $word}
proc rul {left} {print-rule $left}

how to prepare for concrete debugging

let’s start by creating a convenient shortcut trying to parse the whole sentence in one go:

proc 1 {} {
parse {Peter geht weinend in den Keller .}
}

think your grammar is ready for it? reload the new xlerc with source xlerc or sorc, enter 1 and hit go. (and let xle teach you better)

so my big advice for developing xle grammars is to first create an analysis of c- and f-structure on paper. in the next step you transfer blocks of your analysis into debugging commands:

e.g. if my analysis leads me to expect the following phrase structure rules, I will create a small sequence of commands which prints those rules as they are currently present (or not) in the grammar:

proc 1rul {} {
rul NP;
rul PP;
rul VP;
rul IP;
rul I';
}

run 1rul in xle and see, if all of those rules are provided. if at any point you want to check if the rules are correct, you run 1rul and can inspect those rules only - no need to navigate around the .lfg file itself.

interested in a specific rule? use rul NP and glance at it.

we should also check, if all lexical entries are provided and take a look at them in comparison with our analysis:

proc 1lex {} {
lex Peter;
lex geht; 
lex weinend; 
lex in; 
lex den; 
lex Keller;
}
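
typing one lex line per word gets tedious for longer sentences. the following is a minimal sketch, in plain shell outside of xle, that prints such a block for a given sentence. the function name make_lex_proc is made up for this illustration, and skipping punctuation tokens is just an assumption about what you want to look up:

```shell
# Hypothetical helper (not part of xle): print a Tcl proc that calls
# the lex shortcut once for every word of a sentence.
# First argument: sentence number; remaining arguments: the tokens.
make_lex_proc() {
    name=$1
    shift
    printf 'proc %slex {} {\n' "$name"
    for word in "$@"; do
        case "$word" in
            "." | "," | "!" | "?" | ";")
                # assumption: punctuation tokens are not looked up here
                ;;
            *)
                printf 'lex %s;\n' "$word"
                ;;
        esac
    done
    printf '}\n'
}

generated=$(make_lex_proc 1 Peter geht weinend in den Keller .)
echo "$generated"
```

the printed text can be pasted into the rc file by hand, or appended with a shell redirect.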

run sorc ; 1lex to check the lexicon. note: if there is a syntax error in the .lfg, xle might not crash until you try to access the lexicon… (maybe it even crashes only when accessing a token that is defined after the syntax error, not so sure about this yet).

however, to make recreating the parser fail on syntax errors in the lexicon specification as well, we can look up the . token right away and always put it as the last entry of the lexicon. that’s why we added lex . to the r command.

now, as parsing the whole sentence will fail for a long long time from now on, we will already be happy if any parts of 1 will pass the parse.

so we could create a dynamic sequence of parts - sadly only the last one will

proc 1all {} {

parse {I: geht};
parse {A: weinend};
parse {NP: den Keller};
parse {PP: in den Keller};

1;
}

sadly xle is not very good at partial parsing: e.g. parsing just a verb which has an argument structure in its PRED attribute will lead to a failing parse. afaik there is no simple way to tell xle something like parse {true* in den Keller}, which should accept “whatever” is needed in front of “in den Keller” and optimistically treat all unmet requirements of a partial structure as no big problem, saying: well, this PP is fine with me, just ensure the surroundings provide everything that’s needed.

this means a substructure might be completely valid as part of the final tree, while the same substructure fails on its own. this makes incrementally approximating the correct solution by building the substructures first harder than necessary.

after parsing the whole sequence, you can close all windows with ka and look at the console output to see, which parses failed and which already worked.

if you also want direct access to elements of the sequence, consider the following setup, where 11 is a shorthand for sentence 1 part 1:

proc 11 {} {parse {I: geht};}
proc 12 {} {parse {A: weinend};}
proc 13 {} {parse {NP: den Keller};}
proc 14 {} {parse {PP: in den Keller};}

proc 1all {} {
11; 12; 13; 14; 1;
}

you should see that 1all still acts the same and the things looking like numbers eleven to fourteen are just commands.

complete xlerc file

# fixes äüß etc. input from rc files
set-character-encoding stdio utf-8   

# adjust filename here to fit. 
# starting xle will immediately (try to) load this grammar
# and create the parser
create-parser g5.lfg;


# if you change this xlerc file and want to reload it:
proc sorc {} {source xlerc}

## h for help
proc h {} {help}

proc r {} {create-parser g5.lfg; lex .}
proc c {} {r}

proc rc {A} {create-parser $A.lfg}

proc k {} {close-all}
proc ka {} {k}
proc kill {} {k}
proc killall {} {k}

proc lex {word} {print-lex-entry $word}
proc rul {left} {print-rule $left}

# load task specific commands from debugrc 
source debugrc

you can move the specifics of your debugging session into a separate file and source it from xlerc. we call it debugrc, and it might look like the following:

proc 1 {} {
parse {Peter geht weinend in den Keller .}
}

proc 1rul {} {
rul NP;
rul PP;
rul VP;
rul IP;
rul I';
}

proc 1lex {} {
lex Peter;
lex geht; 
lex weinend; 
lex in; 
lex den; 
lex Keller;
}

proc 11 {} {parse {I: geht};}
proc 12 {} {parse {A: weinend};}
proc 13 {} {parse {NP: den Keller};}
proc 14 {} {parse {PP: in den Keller};}

proc 1all {} {
11; 12; 13; 14; 1;
}

future work

It’s possible to not enter the xle shell, but invoke xle directly from the command line: xle -e "create-parser grammarfile; parse {John sleeps.}". This has the potential of using common capabilities of the shell interpreter for even better automation, skipping the limitations of the xle shell and adding control on a higher level.
Update (06-11): looks like this command effectively just opens the xle shell and runs the two commands, so there is no difference from putting the commands into xlerc. I had hoped that xle -e would return exit codes and return the user to the main shell instead of keeping them in the xle shell. In that case, it would have been possible to chain parsing attempts and try “incremental parsing”, parsing more and more complex structures as long as the parse passes, using logical short-circuit connectives like || and &&. Study the following examples in your shell to understand why this would be useful, and inspect the output and exit (error) codes (echo $?):

true ; echo $?
false ; echo $?

true || false

true && echo "needed"

false && echo "no one needs me"

echo "this" && echo "prints" && echo "until" && false && echo "oh, an error happened"
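
To make the idea concrete, here is a rough sketch of what such a chain could look like. Since xle does not seem to return usable exit codes, try_parse is a purely hypothetical stand-in that fakes the result from a hardcoded list (PASSING); it does not call xle at all and only shows the short-circuit behaviour:

```shell
# Hypothetical stand-in for "parse this fragment and report success".
# It does NOT call xle; it succeeds only for fragments listed in PASSING.
PASSING="I: geht|A: weinend|NP: den Keller"

try_parse() {
    case "|$PASSING|" in
        *"|$1|"*) echo "ok:   $1"; return 0 ;;
        *)        echo "FAIL: $1"; return 1 ;;
    esac
}

# Chain the fragments from simple to complex; && stops at the first failure,
# so the full sentence is never attempted while the PP still fails.
result=$(
    try_parse "I: geht" &&
    try_parse "A: weinend" &&
    try_parse "NP: den Keller" &&
    try_parse "PP: in den Keller" &&
    try_parse "Peter geht weinend in den Keller ."
)
status=$?

echo "$result"
echo "exit code: $status"
```

With the list above, the chain stops at the PP and the exit code is non-zero. Adding "PP: in den Keller" to PASSING lets the chain advance to the full sentence.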