Today I’ll look at the paper that established information theory: “A Mathematical Theory of Communication”, by Claude Shannon in 1948. (mirror link)

We’ll start out with my favorite sentence in the paper:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

The above sentence shows glimmers of sense, which is astounding given that it’s generated randomly without any considerations of grammar or meaning. Of course, this may not seem so impressive in a world where we have Google Autocomplete and the What would I say? Facebook app. But it all started here!

Shannon’s paper considers a very general problem: communication. Communication means taking a particular message and reproducing it at a different location. Another way of thinking about it is that you are selecting the *actual* message from a large number of *possible* messages. **Information** is a quantitative measure of the ability to select the correct message.

**The information content of English**

Initially, we can define the information content of a particular message to be the logarithm of the total number of possible messages of the same length. For example, the information content of the written words “I love you” would be the logarithm of the number of possible messages with 10 letters and spaces.^{1}

The total number of strings of 10 characters is 27^10. The information content would be log(27^10) = 10*log(27). By convention, we use log base 2, and we say that information is measured in **bits**.
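This count is easy to check directly (a minimal sketch, assuming 27 equally likely symbols: 26 letters plus the space):

```python
import math

# 27 possible symbols (26 letters + space), message of 10 characters
num_messages = 27 ** 10

# Information content in bits: log base 2 of the number of possible messages
bits = math.log2(num_messages)  # equals 10 * log2(27)

print(round(bits, 2))  # about 47.55 bits
```

So a 10-character message, with no further assumptions, carries a bit under 48 bits.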

But most of these strings of characters wouldn’t make any sense. It’s technically possible that my intended message is “sfoml rxkh”, but it’s not very likely. We need a better definition of information that takes into account that not all messages are equally likely.^{2}

One thing we know about English is that the letter “e” is really common. Theoretically, we could make messages shorter if we somehow made the letter e shorter than the other letters. Did I say “theoretically”? We’ve already done it, with Morse code! In Morse code, the letter “e” is shorter than any other letter. In fact, most letters in Morse code seem to be shorter the more common they are.

For some reason the letter “o” is an outlier.
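To see this concretely, here is a quick sketch comparing letter durations in International Morse code, using the same timing convention as below (a dot takes 2 units of time including its trailing space, a dash takes 4). The frequency ordering “ETAOINSHRDL” is a common approximation of English letter frequencies, my assumption rather than anything from the paper:

```python
# A subset of International Morse code
MORSE = {
    'E': '.', 'T': '-', 'A': '.-', 'O': '---', 'I': '..', 'N': '-.',
    'S': '...', 'H': '....', 'R': '.-.', 'D': '-..', 'L': '.-..',
}

def duration(code):
    # A dot takes 2 units of time (signal + trailing space), a dash takes 4
    return sum(2 if symbol == '.' else 4 for symbol in code)

# Letters ordered roughly from most to least common in English
for letter in "ETAOINSHRDL":
    print(letter, duration(MORSE[letter]))
```

Running this shows “e” is the shortest letter (2 units) and durations mostly grow as letters get rarer, while “o” (12 units) sticks out as unusually long for how common it is.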

Not every letter is equally common, and that makes messages more predictable. That means that their information content is lower than it would first appear, and we could theoretically “compress” the message with proper encoding. Morse code is an example that takes into account letter frequencies. However, letters are even more predictable if you base your prediction on the previous letters. The sentence at the top, for instance, was created by generating each word based on the previous two words.
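That generation trick can be sketched in a few lines: build a table mapping each pair of consecutive words to the words that follow that pair in some source text, then walk the table randomly. This is my own minimal illustration, not Shannon's procedure, which he carried out by hand by sampling from a book:

```python
import random
from collections import defaultdict

def build_table(text):
    """Map each pair of consecutive words to the words that can follow it."""
    words = text.split()
    table = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        table[(w1, w2)].append(w3)
    return table

def generate(table, length=15):
    """Random walk: each new word is chosen based on the previous two."""
    pair = random.choice(list(table))
    out = list(pair)
    for _ in range(length):
        followers = table.get(pair)
        if not followers:
            break
        nxt = random.choice(followers)
        out.append(nxt)
        pair = (pair[1], nxt)
    return " ".join(out)

corpus = ("the head and in frontal attack on an english writer that the "
          "character of this point is therefore another method for the letters")
print(generate(build_table(corpus)))
```

With a large enough source text, the output shows the same “glimmers of sense” as the sentence at the top: locally grammatical, globally meaningless.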

**Redundancy** is a measure of how predictable a message is. In this paper, Shannon estimates the redundancy of English to be about 50%, meaning that the information content of an English message is 50% less than the information content of a random string of characters. In a later paper, Shannon revised this figure to 75% redundancy. That means each letter contains just over a bit of information.
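A quick back-of-the-envelope check of that last claim, again assuming 27 equally likely symbols as the redundancy-free baseline:

```python
import math

raw_bits_per_letter = math.log2(27)   # ~4.75 bits per letter with no redundancy
redundancy = 0.75                     # Shannon's later estimate for English
effective = (1 - redundancy) * raw_bits_per_letter

print(round(effective, 2))  # about 1.19 bits per letter
```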

**Information capacity**

Apart from the information content of the message, we can consider the maximum information content that can be conveyed by a particular communication channel.

I already presented the example of Morse code. In Morse code, in each unit of time, there either is a signal, or there isn’t. So Morse code conveys at most one bit of information per unit time. But in fact it’s less than that, since Morse code obeys certain restrictions. Ignoring the spaces between letters and words, Morse code consists of a series of dots and dashes. Each dot takes 2 units of time, and each dash takes 4 units of time. So let’s count up the number of possible Morse code messages of length N.

N = 2: 1 message

N = 4: 2 messages

N = 6: 3 messages

N = 8: 5 messages

N = 10: 8 messages

N = 12: 13 messages

N = 14: 21 messages

You should begin to recognize the Fibonacci sequence. For large N, the number of possible Morse code messages is approximately 1.618^(N/2). The information capacity of Morse code is therefore 1/2 * log(1.618). Of course, this is a simplistic analysis, not even including letters or spaces between letters. But it’s an example!
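A short script reproduces the counts above and the resulting capacity estimate. The recurrence is simply that any valid sequence of length N ends in either a dot (leaving N−2 units) or a dash (leaving N−4 units); this is a sketch of the simplified dots-and-dashes model only:

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n):
    """Number of dot/dash sequences taking exactly n units (dot=2, dash=4)."""
    if n < 0:
        return 0
    if n == 0:
        return 1
    return count(n - 2) + count(n - 4)

for n in range(2, 16, 2):
    print(n, count(n))  # 1, 2, 3, 5, 8, 13, 21: the Fibonacci sequence

# Capacity in bits per unit time: log2(count(n)) / n for large n,
# which approaches (1/2) * log2(1.618) ≈ 0.347
n = 200
print(round(math.log2(count(n)) / n, 3))
```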

Shannon also discusses noisy communication channels at length. Suppose that every character in a message has a small probability of being scrambled in transmission. This reduces the information capacity of the channel, not just because characters are being lost, but because the receiver doesn’t necessarily know *which* characters are being lost.

To define the information capacity of a noisy channel, we must first define equivocation. **Equivocation** is a measure of our uncertainty about the original message, given the message that we received. More precisely, the equivocation is the amount of additional information that would be needed to specify the original message. To calculate the capacity of a noisy channel, you just calculate the capacity without noise, and then subtract the equivocation.
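For a concrete instance, consider a binary channel that flips each bit with probability p; Shannon’s own example uses p = 0.01. Here the equivocation per bit is the binary entropy H(p), and the capacity is what remains after subtracting it:

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.01                     # probability that a transmitted bit is flipped
equivocation = binary_entropy(p)
capacity = 1 - equivocation  # noiseless capacity (1 bit/symbol) minus equivocation

print(round(equivocation, 3), round(capacity, 3))  # 0.081 and 0.919
```

So a 1% error rate costs about 8% of the channel’s capacity, far more than the 1% of bits actually flipped, because of the uncertainty about which bits were flipped.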

In the paper, Shannon proves that it is possible to use a noisy communication channel up to its information capacity (or arbitrarily close to it), while keeping the equivocation arbitrarily small. In order to do this, we first have to take the limit of long messages. The messages are then encoded in such a way that the probabilities are dominated by a small fraction of the total number of possible messages. When the final message is transmitted across the noisy channel, there will be some ambiguity about the original message, but the majority of the time there will be a unique message that is vastly more likely than any other possible message.

In practice, it is difficult to construct a near-perfect encoding technique, and most noisy communication channels operate significantly below their theoretical capacity. Since the proof only works in the limit of long messages, a good encoding technique may likewise only work in the limit of long messages. The letter-by-letter substitution in Morse code is suboptimal precisely because it works even for short messages: it cannot exploit the statistics of long stretches of text.

**Summary**

Messages have information content, which is roughly the logarithm of the number of possible messages of similar length. Most messages have some amount of redundancy built into them, meaning that they’re partially predictable and can theoretically be compressed.

Communication channels have an information capacity, which is the logarithm of the number of possible messages they can send in a given amount of time. If a communication channel is noisy, its capacity is lower. However, it is theoretically possible to use a communication channel with arbitrarily low error up to its (reduced) capacity.

1. We could also consider the information content of sounds, but this introduces a lot of complexity. There are a finite number of characters, but sounds can vary continuously. The second section of Shannon’s paper discusses continuously varying messages, but I will not discuss them in this post. (return)

2. Mathematically, if we assign each message i the probability p_{i}, then the information content would be:

H = -∑_{i} p_{i} log_{2}(p_{i})

The above equation resembles the Gibbs equation for entropy; this is not a coincidence. The information content of a message is a measure of how difficult it is to select the message from a collection of similar messages. Similarly, entropy is a measure of how difficult it is to select the microscopic state of a system from a collection of similar states (i.e., with the same temperature, pressure, etc.). (return)
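As a sanity check of the formula (my own illustration, not from the paper): when all messages are equally likely, it reduces to the simple logarithm-of-the-count definition from the beginning of the post.

```python
import math

def information_content(probs):
    """Shannon entropy: -sum(p_i * log2(p_i)) over messages with p_i > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# For N equally likely messages this reduces to log2(N):
uniform = [1 / 8] * 8
print(information_content(uniform))  # 3.0 bits, i.e. log2(8)

# A skewed distribution carries less information per message:
skewed = [0.7, 0.1, 0.1, 0.05, 0.05]
print(round(information_content(skewed), 3))  # less than log2(5) ≈ 2.32
```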