The Dailies. November 5, 2019
Did you work on your language today? Create any new rules of grammar or syntax? New progress on a script? New words in your lexicon?
On the other hand, do any excavating or reading or enjoying stuff you’ve already created? Do you have any favorites to share?
How did you conlang today?
One thought on “The Dailies. November 5, 2019”
This isn’t conlanging so much as metaconlanging, but I figured I’d write out my thoughts here anyway:
I want to make WordGen, my program for generating words or sentences based on a generating context-free grammar, more robust, part of which is making regex backreferences into an optional feature, so that fast matching can be guaranteed. As it is, basically the entire text processing stage runs in linear time except for the regex matcher, which, because of the backreferences, runs in exponential worst-case time. (For the record, the grammar expansion stage runs in linear time in the size of the input grammar, I believe—I haven’t verified that because it’s a much more complex process than the transformer.)
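As a rough illustration of how matching can stay linear once backreferences are out of the picture, here is a minimal Python sketch for the simplest backreference pattern, a doubled character, written (.)\1 as a regex. A single pass that remembers only the previous character finds the same non-overlapping matches. The function names are made up for this example and are not WordGen's; the regex version is only there for comparison, and it uses re.DOTALL so that . also covers newlines.

```python
import re

def find_doubled(text):
    """Return (start, char) for each non-overlapping doubled character.

    One pass, constant extra state: only the previous character is kept.
    """
    hits = []
    prev = None  # the machine's entire memory
    for i, ch in enumerate(text):
        if ch == prev:
            hits.append((i - 1, ch))
            prev = None  # consume both characters, like a regex match would
        else:
            prev = ch
    return hits

def find_doubled_regex(text):
    """Same result using a backreference, for comparison only."""
    return [(m.start(), m.group(1))
            for m in re.finditer(r"(.)\1", text, re.DOTALL)]

print(find_doubled("bookkeeper"))
```

Resetting prev after a hit is what keeps the matches non-overlapping, so a run like "aaa" yields one match rather than two, the same as the regex.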
I checked, and the only thing I’m using backreferences for is extremely simple cases like (.)\1, which there is unfortunately no other practical way to express as a regular expression. However, WordGen has *two* transformation engines, and the other (state machines) should be able to handle this, though it currently can’t. I have some ideas for how to extend the state machines to be more usable:
I also want to add Unicode normalization commands to the transformation phase: it’s often convenient to add an accent to a character by just using a combining mark, but then those become annoying to recognize and transform, because they’re two code points. Unicode defines a few normalization forms that maximally compose or decompose diacritical marks, which is an ideal solution to this problem. You could just write "normalize: nfc" and it would combine all your marks for you, where possible. (NFD is for decomposition.) This isn’t a state machine feature, it’s a parallel feature, so I didn’t put it in the list.
With some of these features in place, it would be a lot easier to extend my English and Firen generators (which are currently the most complex ones I have) to generate more complex sentences and to have more vocabulary.
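To make the normalization idea concrete, here is a small Python sketch using the standard library's unicodedata.normalize. The apply_normalize function is a hypothetical handler for a "normalize: nfc" style command; how WordGen would actually parse or dispatch such a command is an assumption, not something taken from its source.

```python
import unicodedata

def apply_normalize(text, form="nfc"):
    """Hypothetical handler for a `normalize: <form>` command.

    Accepts nfc, nfd, nfkc or nfkd in any letter case.
    """
    return unicodedata.normalize(form.upper(), text)

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT: two code points
composed = apply_normalize(decomposed, "nfc")

print(len(decomposed), len(composed))
print(composed == "\u00e9")  # U+00E9 is the precomposed e-acute

# "Where possible": q with an acute has no precomposed code point,
# so NFC leaves it as two code points.
print(len(apply_normalize("q\u0301", "nfc")))
```

After an NFC pass, a rule that matches é only has to look for one code point, and an NFD pass does the reverse if the rules would rather treat the accent separately.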
I’m mostly writing here to organize and document my thoughts. If anyone is interested in WordGen, though, I’m happy to explain it and/or to provide the source code (as a Python script). Unfortunately it seems not to match up well with most people’s thinking styles, so I’ve gotten limited interest from others. Its main strength, in my opinion, is that you can describe phonotactics and morphology very well using it. The Firen data file has been refined to the point where I barely ever even see a word I would consider ugly in the results, much less an unpronounceable one, and it’s a better specification of verb morphology than any of my English notes.