Statistical Sentence Generation


One way of generating a character’s speech, for something like a chatbot or an NPC in a game, instead of scripting out utterances is to statistically generate a conversation. There are many schools of thought on how to approach the subject, ranging from simple to syntax-tagging and analysis. Here we will introduce a tried and true way of coming up with a half-way relevant sentence and provide the python code.

The idea is that we can generate a sentence based on the last two words and the probability of the words that could come after it. We first need a database of sentences to generate those probabilities. In the resulting data structure, we keep track of each word, the word that directly follows it, the word that directly follows that, and how often that combination occurs. Then we can get the probability of one sequence of three words appearing versus a sequence of the first two words and any other third word.

After we have those probabilities, we simply choose one randomly, weighted by those probabilities, and repeat until we reach a word limit or sentence end symbol.

So far, though, we have no control over what the sentence is about (and in truth even by the end we still won’t really, but let’s talk about how to at least include a keyword). If we choose a keyword, arbitrarily or inputted, then we can use the preceding data structure to generate all the words in the sentence that come AFTER that word, and this way we have at most half an idea. For the other half, we actually have to generate a whole second data structure (using the same database) that keeps track of each word, the word directly before it, the word directly before that, and how often that sequence occurs.

So now for some python code:

First we need to import a few libraries:

import re, random

re is regex, a string regular expression library. We’ll use this to separate the words in our text. random will give us random integers.

Next i establish a few constants, a unique, unlikely-to-be-found-in-regular-text word that I can represent the end of a sentence and another for the start of the sentence. Then our regular expression phrase.

Note: Regular expressions are a fast, flexible way of processing strings and finding patterns in them, I recommend reading into them but they are beyond the scope of this tutorial. Basically all you need to know about them here is we want each word, which we’ll say is each thing with a space between it, that may or may not include periods, question marks, quotations, things like that. To dial in how the sentences are generated, this is probably the most important part: how the database separates the words. So I recommend learning regular expressions and adjusting this to suit your needs. I’ve set this one to do a couple interesting things that you may not want like include open and closing quotations with the words.

staticStartChar = '#@start'
staticEndChar = '#@end'
wordRegex = re.compile( r'''(
[a-z0-9"“”\'’;:]+|\.|\!|\?|"|, # any number of characters or numbers
)''', re.VERBOSE )

After that we need two objects/classes/structures/whatever they’re called in python to use for our database. We need two because one will take the text and sort it left-to-right and the other will go right-to-left.

class LRDatabase(object):
    """Table of words and how often three specific words appear from left to right"""
    # this database is a dictionary of dictionaries of dictionaries of integers

    def __init__(self, pathToFile ):
        self.base = dict()
        file = open( pathToFile, encoding="utf-8" )
        self.processText( file.read() )
        file.close()

    def addOccurence(self, seqA, seqB, seqC ):
        if not seqA in self.base.keys():
            # if A isn't even in our database yet
            # add this word, sequence, and occurence to the database
            self.base[seqA] = { seqB : { seqC : 1 } }
        else:
            # this pairing (seqA to seqB)
            dicA = self.base[seqA]
            if not seqB in dicA.keys():
                # if B has not occured after A yet
                # add it to the occurences
                dicA[seqB] = { seqC : 1 }
            else:
                # A|B has already occured at least once
                dicB = dicA[seqB]
                if seqC not in dicB.keys():
                    # A|B|C has not occured yet
                    # add it
                    dicB[seqC] = 1
                else:
                    #A|B|C HAS been observed before
                    # all we want to do is increment it
                    dicB[seqC] += 1

    def processText( self, text ):
        text = text.lower()

        allWords = wordRegex.findall( text )
        self.addOccurence( staticStartChar, allWords[0], allWords[1] )
        for i in range( len(allWords)-2 ):
            self.addOccurence( allWords[i], allWords[i+1], allWords[i+2] )
        self.addOccurence( allWords[len(allWords)-2], allWords[len(allWords)-1], staticEndChar )

    def generateUtterance( self, keyword, maxLength ):
        # we're working forwards( left to right) until we run out of options
        # or reach max length
        if self.base.get( keyword, {} ).items() == 0:
            # if we haven't had any occurences of A then we're out of options
            return ''
        # set the first word
        wordA = keyword
        # grab a random second word
        setA = self.base.get( wordA, {} )
        keyList = list( setA.keys() )
        if len( keyList )==0:
             print( 'GenerateUtterance has no word Bs')
             return ''
        elif len( keyList )==1:
            wordB = keyList[0]
        else:
            index = random.randint( 0, len( keyList )-1 )
            wordB = keyList[ index ]
        if wordB=='.' or wordB=='?' or wordB=='!':
            return ''
        # initialize our result
        resultText = wordB
        length = 1
        for i in range(maxLength):
            # all occurences of A
            setA = self.base.get( wordA, {} )
            # all occurences of A|B
            setB = setA.get( wordB, {} )
            if setB.items() == 0:
                # if we haven't had any occurences of A|B|* then we're out of options
                return resultText
            totalOccurences = 0
            for setC in setB.values():
                totalOccurences += setC
            choice = random.randint( 0, totalOccurences )
            num = 0
            for k in setB.keys():
                num += setB[k]
                if num >= choice:
                    if k == staticEndChar:
                        return resultText
                    resultText += ' ' + k
                    length += 1
                    wordA = wordB
                    wordB = k
                    break;
            if length>=maxLength or wordB=='.' or wordB=='?' or wordB=='!':
                return resultText
        return resultText

We’ve got an __init__ function that is called on first initialization that takes a path to the file that includes our database of sentences. This function creates the dictionary of dictionaries, opens the file, and passes the file-as-string to the processText function.

The processText function uses our regular expression phrase to create a list of all the words in the file and iterates over that list entering the 3-word-combinations into the data structure. It also includes a sequence with the ‘start of sentence’ symbol, the first word in the file and the second word in the file, then the opposite for the end of the file.

The second data class is similar, just going backwards:

class RLDatabase(object):
    """Table of words and how often three specific words appear from right to left"""
    # this database is a dictionary of dictionaries of dictionaries of integers
    def __init__(self, pathToFile ):
        self.base = dict()
        file = open( pathToFile )
        self.processText( file.read() )
        file.close()

    def addOccurence(self, seqA, seqB, seqC ):
        if not seqC in self.base.keys():
            # if C isn't even in our database yet
            # add this word, sequence, and occurence to the database
            self.base[seqC] = { seqB : { seqA : 1 } }
        else:
            # this pairing (seqA to seqB)
            dicC = self.base[seqC]
            if not seqB in dicC.keys():
                # if B has not occured before C yet
                # add it to the occurences
                dicC[seqB] = { seqA : 1 }
            else:
                # B|C has already occured at least once
                dicB = dicC[seqB]
                if seqA not in dicB.keys():
                    # A|B|C has not occured yet
                    # add it
                    dicB[seqA] = 1
                else:
                    #A|B|C HAS been observed before
                    # all we want to do is increment it
                    dicB[seqA] += 1

    def processText( self, text ):
        text = text.lower()
        allWords = wordRegex.findall( text )
        self.addOccurence( allWords[1], allWords[0], staticStartChar )
        for i in range( len(allWords)-2 ):
            self.addOccurence( allWords[i], allWords[i+1], allWords[i+2] )
        self.addOccurence( staticEndChar, allWords[len(allWords)-1], allWords[len(allWords)-2] )

    def generateUtterance( self, keyword, nextWord, maxLength ):
        # we're working backwards(right to left) until we run out of options
        # or reach max length
        if self.base.get( nextWord, {} ).items() == 0:
            # if we haven't had any occurences of A then we're out of options
            return ''
        # set the third word
        wordA = nextWord
        # set the second word which is our keyword
        wordB = keyword
        # initialize our result
        resultText = ''
        length = 0
        for i in range(maxLength):
            # all occurences of A
            setA = self.base.get( wordA, {} )
            # all occurences of A|B
            setB = setA.get( wordB, {} )
            if setB.items() == 0:
                # if we haven't had any occurences of A|B|* then we're out of options
                return resultText
            totalOccurences = 0
            for setC in setB.values():
                totalOccurences += setC
            choice = random.randint( 0, totalOccurences )
            num = 0
            for k in setB.keys():
                num += setB[k]
                if num >= choice:
                    if k =='.' or k=='?' or k=='!' or k==staticStartChar:
                        return resultText
                    resultText = k + ' ' + resultText
                    length += 1
                    wordA = wordB
                    wordB = k
                    break;
        return resultText

    def generateUtteranceNoNextWord( self, keyword, maxLength ):
        # we're working forwards( left to right) until we run out of options
        # or reach max length
        if self.base.get( keyword, {} ).items() == 0:
            # if we haven't had any occurences of A then we're out of options
            return ''
        # set the first word
        wordA = keyword
        # grab a random second word
        setA = self.base.get( wordA, {} )
        keyList = list( setA.keys() )
        if len( keyList )==0:
             print( 'GenerateUtterance has no word Bs')
             return ''
        elif len( keyList )==1:
            wordB = keyList[0]
        else:
            index = random.randint( 0, len( keyList )-1 )
            wordB = keyList[ index ]
        if wordB=='.' or wordB=='?' or wordB=='!':
            return ''
        # initialize our result
        resultText = wordB
        length = 1
        for i in range(maxLength):
            # all occurences of A
            setA = self.base.get( wordA, {} )
            # all occurences of A|B
            setB = setA.get( wordB, {} )
            if setB.items() == 0:
                # if we haven't had any occurences of A|B|* then we're out of options
                return resultText
            totalOccurences = 0
            for setC in setB.values():
                totalOccurences += setC
            choice = random.randint( 0, totalOccurences )
            num = 0
            for k in setB.keys():
                num += setB[k]
                if num >= choice:
                    if k == staticEndChar:
                        return resultText
                    resultText += ' ' + k
                    length += 1
                    wordA = wordB
                    wordB = k
                    break;
            if length>=maxLength or wordB=='.' or wordB=='?' or wordB=='!':
                return resultText
        return resultText

There is one additional function that this second database has, and that is an alternative to the generateUtterance function. Both classes will take a keyword and a max sentence length and work with just that, but the second one will also take the word directly after/to-to-right-of the keyword. This way the generator starts off using the probability of the sequence it already has instead of just picking the first word randomly.

Now for the last bit of code that does the running.

#Create the left-to-right database
lrDatabase = LRDatabase( 'C:\\path\\to\\your\\database.txt' )
#Create the right-to-left database
rlDatabase = RLDatabase( 'C:\\path\\to\\your\\database.txt' )
#Pick a keyword that the sentence will be generated around
keyword = 'what'
#Generate the second half of the sentence first, up to 99 words long
secondHalf = lrDatabase.generateUtterance( keyword, 99 )
#Let's grab the first word in secondHalf (the word that will follow our keyword) for use in the first half generation
nextWord = wordRegex.search( secondHalf )
#if there isn't one for some reason we'll just generate the first half with out it
if nextWord==None:
    firstHalf = rlDatabase.generateUtteranceNoNextWord( keyword, 99 )
else:
    firstHalf = rlDatabase.generateUtterance( keyword, nextWord.group(), 99 )
#And print what we made
print( firstHalf + '' + keyword + ' ' + secondHalf )

Putting this all together will take a text file and generate a half statistically, half randomly generated sentence based on the language in that file.

It's only fair to share...Share on FacebookShare on Google+Tweet about this on TwitterShare on Reddit

Leave a comment

Your email address will not be published. Required fields are marked *