Don't Miss

# Game Of Math: The MaxEnt Algorithm and Game of Thrones.

By on 03/05/2017

Hello again everyone!

For those who don’t know us, a very brief introduction on who we are and where we go! MathIsInTheAir is an italian blog of Applied Maths and surroundings and here are the links: english version and italian version. Although our english blog contains not so many posts, we would like to expand it in the very near future. I don’t want you to get bored, so if you like you may visit our websites or you may read the “not so quick introduction” to MathIsInTheAir I did last year.

Today I just want to present our next english mathematical post, called “Game Of Math: The MaxEnt Algorithm and Game of Thrones” written by Andrea Capozio.

No spoiler on this post 😉 .

by MathIsInTheAir

# Game of Thrones.

Surely you’re wondering how does Mathematics fit in Game of Thrones? Well I confess, I could use any other TV series to tell you what follows in this post, but honestly … what is best than Game of Thrones (GoT)?
The connection to Mathematics is not so much considered in the series but in the type of data used. In particular one kind of data that has been used is the set of the comments regarding GoT posted on the IMDb site reviews to produce this infographic, on which I worked with my brother.

# Infographic.

Firstly, very briefly, let us see what types of information were extracted from various comments (671).

The map on the left shows the major lineages; each one has been placed in its city of reference, indicating:

• The number of reviews in which the single lineage is appointed;
• The number of reviews that speak positively of the family;
• The number of reviews that speak negatively of the family;
• The member of the most cited lineage;
• The not mentioned member of the family.

Instead, in the chart on the right hand side appear:

• the performance of the nominated characters from each house in five seasons of the series;
• the most cited characters that are not attributable to lineages;
• the graphs of co-occurrences between characters and between lineages.

The part on the infographics which, however, I would like to focus attention is the one related to the word-cloud in the bottom right side of the picture, where the adjectives that have been mainly used in the reviews of a lineage are reported.
It needs to be explained how to determine whether a word in a speech has to be regarded as an adjective rather than as a preposition or article. And it is here that Mathematics comes to help.

To perform this type of analysis a POS (Part of Speech) Tagger has been used, namely a tool able to make grammatical analysis of a text. The POS Tagger taken into account is based on OpenNLP library, which is essentially based on the Maximum Entropy model that we will analyze in detail.

# Information Theory and Entropy.

Before examining the MaxEnt algorithm, I would define the concept of entropy used here.
Most commonly we talk about entropy in the following areas:

• thermodynamics: the entropy of a gas is function of its temperature, volume, mass and nature; it is a reversibility index: if there is no change, the process is reversible, otherwise it is irreversible (just think of the exchange of heat between two bodies, one hot and one cold);
• statistical mechanics: the entropy is a measure of the increase of disorder (inability to predict the position and velocity of the particles); the greater is the knowledge, the lower the entropy, and viceversa.

To these interpretations of entropy, one can be added and it plays a very important role in information theory (especially in the field of signal processing) and will be the way in which we understand it in our case.

Let’s consider a source of messages XX. The amount of information transmitted from the message increases with the increase of the uncertainty of the product message. The greater our knowledge about the message produced by the source, the lower the uncertainty, the entropy and the transmitted information.
We formulate the concept of entropy, as introduced by Claude Shannon in 1948.

### Definition 1

Let XX be a source and xx a signal emitted by XX. The information given by xx is called autoinformation and is defined as

I(x)=log2p(x)I(x)=−log2⁡p(x)

where p(x)p(x) is the probability that the event xx happens.

### Definition 2

The entropy of a XX source is the expected value of the autoinformation, i.e. the average information contained in any signal from XX, in particular

H(X)=E[I(X)]=Ni=1P(xi)log2p(xi)H(X)=E[I(X)]=∑i=1NP(xi)log2⁡p(xi)
if XX is a discrete variable

H(X)=p(x)log2p(x)dxH(X)=−∫p(x)log2⁡p(x)dx
if XX is a continuous variable

### Proposition 1

Let XX be a discrete source, then

0H(X)log2N0≤H(X)≤log2⁡N

In particular the maximum of the entropy is log2Nlog2⁡N when all the NN events are equally probable.

Let’s see a simple example.

Example 1
Suppose to have a source XX that emits 11 with probability pp and 00 with probability 1p1−p.

Then

H(X)=(plog2p+(1p)log2(1p))H(X)=−(plog2⁡p+(1−p)log2⁡(1−p))

and the entropy is equal to 11 if p=1/2p=1/2 (unpredictable sources), while it is equal to 00 if p=1p=1(predictable sources, i.e. it always outputs 11 or always 00).

# The MaxEnt Algorithm.

## Introduction.

The classifier Maximum Entropy is a discriminative classifier widely used in the areas of Natural Language Processing, Speech Recognition and Information Retrieval. In particular, it is used in order to solve problems of classification of text such as language detection, topic and sentiment analysis.
The Maximum Entropy algorithm is based on the principle of Maximum Entropy and selects the model that has the maximum entropy (as enunciated by Shannon) on the training set of all tested models.

Recalling Bayes’ theorem (see here), the Max Entropy classifier is used when you do not have any information about the prior distribution of the data and it is not correct to make assumptions about.
Unlike the Naive Bayes classifier, the Maximum Entropy has not the hypothesis that the variables are independent of one another, which reflects the nature of the natural text where the variables into account are the words, that of course are not independent of one another since the grammatical rules of the language; moreover, it requires more time in training while providing more reliable results.

Example 2

Before going into the theory behind the MaxEnt, consider an example which clarifies from the outset what will be said in a formal way in the following.

Suppose we want to determine the grammatical form of the word “set.”

The word “set” can take the following forms:

• Adjective: “He has a set smile.”
• Noun: “Give me your chess set.
• Verb: “They set a fast pace.

We collect a large number of examples from which to extract information to determine the decision-making model pp. The model we’re going to build will assign the word “set” a chance to take a particular grammatical meaning.

As we don’t have other information from the data, you can impose for our pp model:

There are several pp models that hold previous identity, including:

By analyzing the data set further, let’s suppose to get other informations, such as every time the word “set” is preceded by a pronoun is a verb. This, added to the normalization condition, changes the possible chances, reducing the possible models.

The goal of the MaxEnt algorithm is to determine the model pp uniform as possible (maximizing the entropy), according to the information derived from the data, without making any additional assumptions.