To help me become a better writer, please consider filling out this. Jaynes found a very deep and principled connection. If half the time one sends a single bit, and half the time one sends two bits, on average one sends one and a half bits. This is an important question because we need some notion of how good or bad our model is, in order to optimize it to do well. This book presents the fundamental concepts of Information theory in a friendly-simple language and is devoid of all kinds of fancy and pompous statements made by authors of popular science books who write on this subject. In practice, people use particular coding schemes which are efficient to different extents.
But are there other reasons we should care about it? Information gives us a powerful new framework for thinking about the world. If the distributions are the same, this difference will be zero. On the other hand, there are now two rain and sunny labels, for the probabilities of them conditional on me wearing a t-shirt and me wearing a coat respectively. Now the entropy is the volume! It is unique in its presentation of Shannon's measure of information, and the clear distinction between this concept and the thermodynamic entropy. At a high level, it goes like this: Suppose you have some data which you know comes from one of two probability distributions. That is, we should assume the possibility with the most unknown information. Note that the labels have slightly different meanings than in the previous diagram: t-shirt and coat are now marginal probabilities, the probability of me wearing that clothing without consideration of the weather.
. If you take that idea seriously, you end up with information geometry. It gives us ways of measuring and expressing uncertainty, how different two sets of beliefs are, and how much an answer to one question tells us about others: how diffuse probability is, the distance between probability distributions, and how dependent two variables are. Then we look at the probability that another variable, like my clothing, will take on a certain value conditioned on the first variable. Conclusion If we care about communicating in a minimum number of bits, these ideas are clearly fundamental. A lot of this seems to revolve around the information shared between the variables, the intersection of their information.
How much better would it have been to say 85%? Information theory gives us precise language for describing a lot of things. The variation of information between two variables is zero if knowing the value of one tells you the value of the other and increases as they become more independent. But the ideas from information theory are clean, they have really nice properties, and a principled origin. If they are very different, you might very quickly become confident. If you want, you can think of it as the way to translate between these two different ways of displaying the probability distribution! Further, he decided he only wanted to communicate in binary.
It means that our initial budget has the interesting property that, if you had a bit more to spend, it would be equally good to invest in making any codeword shorter. SlideShare utilise les cookies pour améliorer les fonctionnalités et les performances, et également pour vous montrer des publicités pertinentes. We need to be able to look at a sequence of concatenated codewords and tell where each one stops. Sadly, this would cause ambiguity when we decode encoded strings. Is there some way we can see how these four values relate to each other? In the following diagram, each subplot represents one of these 4 possibilities. The Space of Codewords There are two codes with a length of 1 bit: 0 and 1.
We can now see that in both the small case and the large case, Treatment A beats Treatment B. There is a very real sense in which one can have fractional numbers of bits of information, even if actual codes can only use whole numbers. Unfortunately, information theory can seem kind of intimidating. Often, we want one distribution to be close to another. Let me tell you about my imaginary friend, Bob.
If we care about compressing data, information theory addresses the core questions and gives us the fundamentally right abstractions. Well, of course, in practice, if you want to communicate by sending a single codeword, you have to round. Then it would be unclear what the first codeword of the encoded string 0100111 is — it could be either! Treatment B only seemed better because the patients it was applied to were more likely to survive in the first place! It measures how different they are! A short codeword requires you to sacrifice more of the space of possible codewords, preventing other codewords from being short. However, patients with small kidney stones were more likely to survive if they took treatment A. The expression as a whole is the expected difference in how many bits the two codes would use. Thanks also to my first two neural network seminar series for acting as guinea pigs for these ideas.
The cost and length contribution boundaries nicely line up. Suppose you have some system, and take some measurements like the pressure and temperature. Unfortunately, her messages were longer than they needed to be. There are four codes with a length of 2 bits: 00, 01, 10, and 11. But suppose my mom already knows the weather. If we slightly increase the length of the codeword, the message length contribution will increase in proportion to its height at the boundary, while the cost will decrease in proportion to its height at the boundary.
What does it mean to have half a bit? Information theory turns up in all these places because it offers concrete, principled formalizations for many things we need to express. Ample examples are provided which help the reader in understanding the different concepts discussed in this book. It turns out that this code is the best possible code. If we were to send more events at once, it would become smaller still. After Shannon discovered information theory, many noted suspicious similarities between equations in thermodynamics and equations in information theory.
In general, as you get more data points, your confidence should increase exponentially. The choice of which variable to start with is arbitrary. This notation is horrible for two reasons. This allows you to kind of visually slide the distributions and codes together. While this proof only works for two codewords, it easily generalizes to more.