The Dailies. November 5, 2019

Did you work on your language today? Create any new rules of grammar or syntax? New progress on a script? New words in your lexicon?

On the other hand, do any excavating or reading or enjoying stuff you’ve already created? Do you have any favorites to share?

How did you conlang today?


One thought on “The Dailies. November 5, 2019”

  1. This isn’t conlanging so much as metaconlanging, but I figured I’d write out my thoughts here anyway:

    I want to make WordGen, my program for generating words or sentences based on a generating context-free grammar, more robust, part of which is making regex backreferences into an optional feature, so that fast matching can be guaranteed. As it is, basically the entire text processing stage runs in linear time except for the regex matcher, which, because of the backreferences, runs in exponential worst-case time. (For the record, the grammar expansion stage runs in linear time in the size of the input grammar, I believe—I haven’t verified that because it’s a much more complex process than the transformer.)

    I checked, and the only thing I’m using backreferences for is extremely simple cases like (.)\1, which there is unfortunately no other practical way to express as a regular expression. However, WordGen has *two* transformation engines, and the other (state machines) should be able to handle this, though it currently can’t. I have some ideas for how to extend the state machines to be more usable:

    1. Give the state machine access to the contents of the string seen so far, enabling it to detect if the current character is equivalent to the previous character, or to the character before that, etc.
    2. Introduce some form of variables to the state machines.
      • This is equivalent to giving them a third tape, with read/write access, and would be a major change that I’m not sure I like. I like that, as transducers, they only have an input tape and an output tape, and their only state is, well, the state they are in.
      • (The previous change would technically change that too, making it so that their state includes the memory of the string seen so far, but that information would have already been provided to the machine. It’s just making it accessible with fewer explicit states.)
      • This would be hard to accomplish neatly, because state machines are using the same template evaluator as the rest of the program, which has largely unsequenced evaluation, making variables less than useful. The state machines are fully sequenced and technically imperative, but I would rather avoid introducing a third novel mini-language for WordGen.
    3. Change the step function to do arbitrary string matching.
      • This isn’t really a solution to this problem specifically, but the whole idea of “code point at a time” processing for Unicode text is a bit unnatural, so I probably want to do this anyway.
      • It would drastically reduce the number of states required to do things, which is currently the main problem with the state machines.
        • For the English generator file, a state machine recognizing V#1010 requires 6 states (for the start state and the 5 characters before the last one), but most of those states have no interesting content. It would be much better to match ‘V#’, ‘101’, ‘0’ instead, taking only 3 states. There’s more than one verb in the file, of course, so the intermediate states are used more than once, so the reduction is a bit less on average, but it’s still significant.
      • This would make the input and output tape handling more consistent: output is not limited to single characters at a time, and most rules do in fact change the number of characters. For instance, the V#… states above all transform to the empty string until the final 0 is encountered, at which point “see” is output all at once. (V#101 represents the verb “to see”, and the final digit determines the inflection. There are 8 forms for English verbs, of which literally only “to be” has all of them distinct; most (all?) other verbs have 5 or fewer distinct forms, so unfortunately the English data file is a little redundant in its vocabulary.) By allowing both tapes to advance multiple characters at a time, I think the programming model is improved.
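
    As a rough sketch of what ideas 1 and 3 might look like in Python (this is not WordGen's actual code; the function names, the rule-table layout, and the state names "start", "verb" and "see" are all made up for illustration): a single pass that compares each character with the previous one can stand in for (.)\1, and a step function that matches whole strings lets V#1010 be handled with 3 states.

```python
def has_doubled_char(text):
    # Linear-time stand-in for the regex (.)\1: compare each character with
    # the one before it. Newlines are skipped because "." does not match
    # them by default.
    return any(a == b and a != "\n" for a, b in zip(text, text[1:]))


def transduce(rules, start, text):
    # Hypothetical transducer whose step function matches whole strings.
    # rules maps a state name to a list of (match, output, next_state).
    # Simplification: when no rule applies, one code point is copied to the
    # output and the machine returns to the start state. Anything consumed
    # by earlier partial matches is not restored.
    state = start
    pos = 0
    out = []
    while pos < len(text):
        # Try the longest match strings first.
        candidates = sorted(rules.get(state, []), key=lambda r: -len(r[0]))
        for match, output, next_state in candidates:
            # Empty match strings are ignored so the loop always advances.
            if match and text.startswith(match, pos):
                out.append(output)
                pos += len(match)
                state = next_state
                break
        else:
            out.append(text[pos])
            pos += 1
            state = start
    return "".join(out)


# Three states instead of six for V#1010, following the split 'V#', '101', '0'.
RULES = {
    "start": [("V#", "", "verb")],
    "verb": [("101", "", "see")],
    "see": [("0", "see", "start")],
}

print(has_doubled_char("letter"))                 # True
print(transduce(RULES, "start", "I V#1010 it"))   # I see it
```

    Both tapes advance by whole strings here: each rule consumes len(match) input characters and emits an output of any length, including the empty string for the intermediate states.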

    I also want to add Unicode normalization commands to the transformation phase: it’s often convenient to add an accent to a character by just using a combining mark, but then those become annoying to recognize and transform, because they’re two code points. Unicode has a few normalization forms that maximally compose or decompose all diacritical marks, which is an ideal solution to this problem. You could just write normalize: nfc and it would combine all your marks for you, where possible. (NFD is for decomposition.) This isn’t a state machine feature but a parallel one, so I didn’t put it in the list.
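
    For illustration, Python's standard unicodedata module already provides these forms, so a normalize command could be a thin wrapper around it. The wrapper name and its lowercase form argument are assumptions modelled on the normalize: nfc syntax above, not existing WordGen features.

```python
import unicodedata


def normalize(text, form="nfc"):
    # Hypothetical wrapper: accepts "nfc" or "nfd" in lowercase, as in the
    # proposed "normalize: nfc" command, and defers to the standard library.
    if form not in ("nfc", "nfd"):
        raise ValueError(f"unsupported normalization form: {form!r}")
    return unicodedata.normalize(form.upper(), text)


decomposed = "e\u0301"                 # "e" followed by COMBINING ACUTE ACCENT
composed = normalize(decomposed, "nfc")

print(len(decomposed), len(composed))  # 2 1
print(normalize(composed, "nfd") == decomposed)  # True
```

    After NFC, an accented letter that has a precomposed form is a single code point, so a code-point-at-a-time matcher sees it as one unit.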

    With some of these features in place, it would be a lot easier to extend my English and Firen generators (which are currently the most complex ones I have) to generate more complex sentences and to have more vocabulary.

    I’m mostly writing here to organize and document my thoughts. If anyone is interested in WordGen, though, I’m happy to explain it and/or to provide the source code (as a Python script). Unfortunately it seems not to match up well with most people’s thinking styles, so I’ve gotten limited interest from others. Its main strength, in my opinion, is that you can describe phonotactics and morphology very well using it. The Firen data file has been refined to the point where I barely ever even see a word I would consider ugly in the results, much less an unpronounceable one, and it’s a better specification of verb morphology than any of my English notes.

