English Sentence Generation

November 11, 2017 killerbee13

Today, I’ll describe the inner workings of the English sentence generator I talked about a while ago. (I’m quite overdue, but I figure better late than never. This sat as a draft for a month and a half or so, so the other editors may already have seen it; I don’t know how WordPress handles permissions on drafts.)

This project leverages most of the features of my word generator, so I recommend reading its documentation first if you want to fully understand this post. However, I’ll try to explain things as I go along anyway.

It’s still somewhat rudimentary at this stage, but it produces sentences that are mostly technically correct (though it often produces sentences with so many deeply-nested phrases that they’re hard to keep track of, and negation is quite complicated so sometimes that gets messy).

The data file for English is available here (GitHub) or here (my personal site). If you’ve read the documentation, then you might be able to understand what it’s doing.

Generation starts with the Sentence node, which contains a top-level description of the grammar of English sentences: the most basic branch is simply an independent clause followed by a period. An independent clause is a subject, a space, and a predicate, and the subject and predicate must agree in number and person. (Though it’s actually pretty simple because English’s agreement categories are mostly merged.) A subject is either a pronoun or a noun, and if it’s a pronoun we have to select the subject form. If a pronoun is selected, then we’ve finally reached the bottom of this recursion, and reached the Pronoun node, which looks like this:

Pronoun:
 - val: "NP1s" # I/me
   gloss: "1s"
 - val: "NP1p" # we/us
   gloss: "1p"
 - val: "NP2s" # you/you
   gloss: "2s"
 - val: "NP2p" # you/you
   gloss: "2p"
 - val: "NP3si" # it/it
   gloss: "3si"
 - val: "NP3sf" # she/her
   gloss: "3sf"
 - val: "NP3sm" # he/him
   gloss: "3sm"
 - val: "NP3sn" # xe/xem
   gloss: "3sn"
 - val: "NP3s" # they/them
   gloss: "3s"
 - val: "NP3so" # one/one
   gloss: "3.indef"
 - val: "NP3p" # they/them
   gloss: "3p"

This is the interesting part, in my opinion. Rather than simply including the lexical forms of each pronoun, I numbered them. A later stage of processing will convert the numbers into text. This allows earlier stages of the generation process to easily identify the part of speech each entry belongs to. It also simplifies the treatment of words like I/me (pronouns or anything else) whose forms are unrelated to each other. The noun phrase that includes a pronoun is responsible for selecting its case, by appending one of the following: S, O, G, D, R (subject, object, genitive, demonstrative, reflexive).
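As a rough illustration (not the generator's actual code), the case-selection step could be sketched in Python like this. The table `PRONOUN_FORMS` and the function `expand_pronoun` are hypothetical names, and only the subject and object forms shown in the comments above are filled in; the G, D, and R forms are left out.

```python
# Hypothetical lookup table: pronoun code -> {case letter -> surface form}.
# Only S (subject) and O (object) are listed, based on the comments in the
# Pronoun node; the other case letters (G, D, R) are omitted in this sketch.
PRONOUN_FORMS = {
    "NP1s": {"S": "I", "O": "me"},
    "NP2s": {"S": "you", "O": "you"},
    "NP2p": {"S": "you", "O": "you"},
    "NP3sf": {"S": "she", "O": "her"},
    "NP3sm": {"S": "he", "O": "him"},
    "NP3p": {"S": "they", "O": "them"},
}


def expand_pronoun(token: str) -> str:
    """Split a token such as 'NP3sfO' into its code and case letter, then look it up."""
    code, case = token[:-1], token[-1]
    return PRONOUN_FORMS[code][case]


print(expand_pronoun("NP2pS"))   # you
print(expand_pronoun("NP3sfO"))  # her
```

Because the code and the case letter are looked up together, unrelated forms such as I/me need no special handling.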

General noun phrases are rather more complicated than pronouns, so they need more than just one node. Noun phrases at the top level are split between plural and singular, and between definite and indefinite. (Definiteness is provided as an “argument” to the top-level noun phrase node, which simply passes it down to the next level, determiners and adjectives.) Determiners so far are articles, demonstrative pronouns, possessors, and quantifiers. So far the only adjectives it knows are colors, because I didn’t want to add the whole complex adjective order business. (The reason adjective phrases need to know about definiteness is comparatives and superlatives. Superlatives are definite by nature, with only rare exceptions.)

Moving over to the predicates, there is a distinction between intransitive, transitive, and copular predicates. Adjectives are allowed to be the objects of a copular verb. The copula is also treated as an auxiliary verb, and auxiliary verbs are in turn treated as both transitive and intransitive verbs. Objects are simply noun phrases, except that they don’t have to agree with the verb. A predicate can also be negated. (I haven’t added anything to prevent it from negating multiple times, so sometimes you will see three or four ~s in a row if you turn on the transformation history. Multiple negations are simply ignored, rather than canceling each other.)

The “sentence” that has now been generated looks something like this: “NP2pS V#020 NPDddp N#06p.” Now the generator starts to apply transformation phases. (These are at the bottom of the file in the section named “replace”.) The first is verb negation, more specifically do-supporting verbs for the purpose of negation. In this case, there is no negation so the stage does nothing. Next is lexeme expansion. This is performed by a finite-state transducer which I will briefly describe.

Basically, a state machine is a series of rules like the following:

Start:
- Is the next character a “V”? Print nothing, go to state V.
- Is it an “N”? Print nothing, go to state N.
- Is it an “A”? Print nothing, go to state A.
- Is it a “~”? Print “not”.
- [implicitly] Otherwise, print the character and don’t change state.

By having a whole series of these, a prefix matcher is implemented. For nouns, there is special treatment for plurals: the common types of plurals (-s, -es, -en, and null plurals) are all handled uniformly. Irregular or internally-marked plurals are handled individually, as with “geese”.

 N0:
   '1': ["chicken", regPlu]
   '2': ["box", ePlu]
   '3': ["thing", regPlu]
   '4': ["ox", nPlu]
   '5': ["", N05]
   '6': ["fish", noPlu]
   '7': ["deer", noPlu]
 regPlu:
   p: ["s", S]
   s: ["", S]
 ePlu:
   p: ["es", S]
   s: ["", S]
 nPlu:
   p: ["en", S]
   s: ["", S]
 noPlu:
   p: ["", S]
   s: ["", S]
 N05:
   p: ["geese", S]
   s: ["goose", S]
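
To make the mechanism concrete, here is a minimal Python sketch of a transducer driven by a table like the one above. It is not the generator's real implementation: the `STATES` dictionary and `transduce` function are hypothetical, the intermediate states `N` and `N#` are guesses at how the `N#0` prefix gets consumed, and `S` in the table is assumed to mean "return to the start state".

```python
# Each state maps an input character to (text to emit, next state).
# The noun entries mirror the table above. The "N" and "N#" states are
# assumptions about how the "N#0" prefix is consumed.
STATES = {
    "Start": {"N": ("", "N"), "~": ("not", "Start")},
    "N": {"#": ("", "N#")},
    "N#": {"0": ("", "N0")},
    "N0": {
        "1": ("chicken", "regPlu"),
        "2": ("box", "ePlu"),
        "3": ("thing", "regPlu"),
        "4": ("ox", "nPlu"),
        "5": ("", "N05"),
        "6": ("fish", "noPlu"),
        "7": ("deer", "noPlu"),
    },
    "regPlu": {"p": ("s", "Start"), "s": ("", "Start")},
    "ePlu": {"p": ("es", "Start"), "s": ("", "Start")},
    "nPlu": {"p": ("en", "Start"), "s": ("", "Start")},
    "noPlu": {"p": ("", "Start"), "s": ("", "Start")},
    "N05": {"p": ("geese", "Start"), "s": ("goose", "Start")},
}


def transduce(text: str) -> str:
    """Run the state machine over text, emitting output as rules fire."""
    out = []
    state = "Start"
    for ch in text:
        rules = STATES[state]
        if ch in rules:
            emitted, state = rules[ch]
            out.append(emitted)
        elif state == "Start":
            # Implicit default rule: copy the character, stay in Start.
            out.append(ch)
        else:
            raise ValueError(f"unexpected {ch!r} in state {state}")
    return "".join(out)


print(transduce("N#05s of N#02p"))  # goose of boxes
print(transduce("N#06p."))          # fish.
```

Sharing the plural states (`regPlu`, `ePlu`, and so on) across nouns is what keeps the table short: each regular noun needs one line, and only irregulars such as goose/geese need a state of their own.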

Verbs are similar, though more complicated, and adjectives are comparatively simple since they uniformly have 3 forms. Another phase now turns any remaining Gs into possessive -’s clitics.

The next transformation phase attempts to determine which form of the indefinite article to use given the following words. There is no indefinite article in this sentence, but if there were, it would be represented by an “@” in the initial form.
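
A minimal sketch of such a phase in Python, assuming a simple spelling-based rule (a following vowel letter selects "an"). The function name `resolve_indefinite_article` is hypothetical, and the real phase may use different criteria; a spelling rule like this one gets words such as "hour" or "unicorn" wrong.

```python
import re


def resolve_indefinite_article(text: str) -> str:
    """Replace each '@' placeholder with 'a' or 'an' based on the next word."""

    def pick(match: re.Match) -> str:
        next_word = match.group(1)
        # Assumption: decide by the first letter only, not by pronunciation.
        article = "an" if next_word[0].lower() in "aeiou" else "a"
        return f"{article} {next_word}"

    return re.sub(r"@ (\w+)", pick, text)


print(resolve_indefinite_article("@ goose of some red goose is bluer than that."))
# a goose of some red goose is bluer than that.
```

Running this phase after lexeme expansion matters: the placeholder can only be resolved once the following word has been spelled out.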

The next phase simply contracts most auxiliary verbs with regular expressions. Finally, the first letter of the sentence is capitalized.

The series of transformations has thus been:

NP2pS V#020 NPDddp N#06p. → NP2pS V#020 NPDddp N#06p. → you do these fish. → you do these fish. → you do these fish. → you do these fish. → you do these fish. → You do these fish.

A more complicated sentence is transformed like this:

@ N#05s of some A#01- N#05s V#011 A#02> than NPDdps. → @ N#05s of some A#01- N#05s V#011 A#02> than NPDdps. → @ goose of some red goose is bluer than that. → @ goose of some red goose is bluer than that. → a goose of some red goose is bluer than that. → a goose of some red goose is bluer than that. → a goose of some red goose is bluer than that. → A goose of some red goose is bluer than that.

Hopefully that was more explanatory than rambly. I’ll be glad to answer any questions (assuming I haven’t bored everyone already).

The Firen sentence generator does not use this implementation, since Firen doesn’t have nearly as many reasons to move entire words around in sentences, and most of its inflections are regular affixes.
