
What is Entropy in Computer Science?

Introduction to Entropy

Entropy, in computer science and information theory, measures the average uncertainty or surprise in a system. First introduced by Claude Shannon in 1948, entropy quantifies the average amount of information contained in a message or dataset. To understand this concept intuitively, we can explore it through the lens of probability and surprise.

Understanding Entropy Through Expected Surprise

Consider a game with two possible outcomes: one occurring with 99% probability and another with 1% probability. In probability theory, we often calculate the Expected Value of such scenarios. For instance, if the 99% outcome pays 1 dollar and the 1% outcome pays 1000 dollars, the expected value would be (0.99 x 1 dollar) + (0.01 x 1000 dollars) = 10.99 dollars per game. This means that, on average, playing 100 times would yield approximately 1099 dollars, or 10.99 per play.
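As a quick sanity check, here is that expected-value calculation as a few lines of Python, using the probabilities and payouts from the example above:

```python
# Expected value: each payout weighted by the probability of its outcome.
probabilities = [0.99, 0.01]
payouts = [1, 1000]  # dollars
expected_value = sum(p * x for p, x in zip(probabilities, payouts))
print(expected_value)  # 10.99 dollars per play, on average
```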

Entropy applies this same concept of expectation, but instead of monetary value, it measures the Expected Surprise of the outcome. Intuitively, when an event is highly probable (like our 99% outcome), we experience little surprise when it occurs. Conversely, when an unlikely event happens (our 1% outcome), we experience great surprise. The question becomes: how do we quantify surprise?

While we could define surprise simply as 1/p (or rather 1/pi, as you'll see later), which gives a high surprise of 100 when p = 0.01 and a low surprise of about 1.01 when p = 0.99, information theory uses log₂(1/p) to measure surprise. This logarithmic measure offers several advantages, at least one of which is very obvious: when an event is certain (p = 1), the surprise is log₂(1/1) = 0, matching our intuition that certainty brings no surprise. However, don't take this formula as the only way to define surprise. It's just a convenient way to define it, and you could come up with your own!
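To see the difference between the two definitions, here is a small Python sketch comparing the naive 1/p surprise with the logarithmic log₂(1/p) surprise for a few probabilities. Note how only the logarithmic version drops to 0 when the event is certain:

```python
import math

for p in [0.01, 0.5, 0.99, 1.0]:
    naive_surprise = 1 / p            # the "inverse probability" definition
    log_surprise = math.log2(1 / p)   # the definition used in information theory
    print(f"p = {p:4}: 1/p = {naive_surprise:6.2f}, log2(1/p) = {log_surprise:5.2f}")
```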

Entropy, then, represents the expected surprise given the set of all possible outcomes. It calculates this by weighing each outcome's surprise by its probability of occurrence (just like we weighted each payout by its probability of occurrence), leading to the formal mathematical definition we're about to explore.

The Dreaded Mathematical Definition of Entropy

Following our intuitive understanding, you'll see Shannon's entropy formula is much simpler than you thought! As discussed, the expected surprise is the sum of each probability multiplied by its corresponding surprise measure:

$$\sum_{i=1}^n p_i \log_2(\frac{1}{p_i})$$

where pi represents the probability of the i-th outcome. In our case, p1 = 0.99 and p2 = 0.01, with n equal to 2 because there are only 2 outcomes. Now, using the logarithm property that the log of a quotient is the difference of the logs, or log(a/b) = log(a) - log(b), we can rewrite it as:

$$\sum_{i=1}^n p_i (\log_2 1 - \log_2 p_i)$$

And noting that log₂ 1 = 0:

$$\sum_{i=1}^n p_i (0 - \log_2 p_i)$$

And pulling the minus sign outside the sum, we arrive at the standard form of Shannon's entropy!

$$Entropy = -\sum_{i=1}^n p_i \log_2 p_i$$

Now you have an intuitive understanding of entropy (in information theory): it simply measures surprise as the inverse of the probability (something very reasonable), takes the log₂ of it for convenience, giving log₂(1/pi), and then averages that surprise over all possible outcomes!
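Putting the formula to work, here is a minimal Python sketch of the entropy calculation, applied to the 99%/1% game from earlier (the function name `entropy` is just for illustration):

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: the probability-weighted average of log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(entropy([0.99, 0.01]))  # ~0.0808 bits: the outcome is almost never surprising
print(entropy([0.5, 0.5]))    # 1.0 bit: maximum uncertainty for two outcomes
```

Notice how the lopsided 99%/1% game carries far less expected surprise than a fair coin flip.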

Applications of Entropy in Computer Science

Ready to see entropy in action? It turns out the concept of entropy shows up everywhere in modern computing - from helping machines learn to compressing our data. Let's look at some of the most exciting applications:

  • Binary Cross-Entropy in Deep Learning: When training neural networks, Binary Cross-Entropy is a commonly used loss function for binary classification tasks. The smaller the cross-entropy (that is, the lower the expected surprise of the true labels under the model's predictions), the better the model is performing! In other words, if a model keeps assigning high probability to the true label, how surprised can we be when it does so again? (A small sketch follows this list.)
  • Information Gain in Decision Trees: Setting aside stopping criteria, overfitting, and a few other details, the goal in decision trees is to find a series of rules (one rule per node, e.g., income ≤ 135k to branch left and income > 135k to branch right) such that, when each training example is passed through the tree's rules, each leaf ideally ends up with a bunch of examples that all share the same y label, e.g., entrepreneur or not (separating or classifying the data perfectly with these rules). Thus, as the tree branches down, the goal is to decrease entropy: if a node had a 50-50% split between entrepreneur and not-entrepreneur labels, its entropy would be at its maximum, whereas if it had only one type of label (e.g., all entrepreneur), its entropy would be 0. So, by calculating the reduction in entropy (aka information gain) between the parent's entropy and the weighted average of the children's entropies, we can determine which rules do the best job (or give us the most bang for our buck) in separating our data (see the sketch after this list).
  • Data Compression: Ever wondered how files get zipped? Entropy helps determine the theoretical minimum number of bits needed to represent data without losing information. Compression algorithms like Huffman coding use this principle to assign shorter codes to more frequent symbols - pretty clever, right? (Sketched below as well.)
  • Cryptography: In the world of security, entropy measures the unpredictability of random number generators - crucial for creating strong encryption keys. The higher the entropy, the harder it is for attackers to guess the key. It's like having a really good randomness detector! (Also sketched below.)
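To make the Binary Cross-Entropy bullet concrete, here is a minimal sketch of the loss written by hand. It uses log₂ to stay consistent with the rest of this article, whereas deep learning frameworks typically use the natural logarithm; the labels and predictions are made up for illustration:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average surprise of the true labels under the model's predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        # Surprise of the observed label: -log2(p) if the label is 1, -log2(1 - p) if it is 0.
        total += -(y * math.log2(p) + (1 - y) * math.log2(1 - p))
    return total / len(y_true)

labels    = [1, 0, 1, 1]
confident = [0.95, 0.05, 0.90, 0.99]  # close to the true labels -> low surprise
hedging   = [0.60, 0.40, 0.55, 0.70]  # unsure predictions -> higher surprise
print(binary_cross_entropy(labels, confident))  # small loss
print(binary_cross_entropy(labels, hedging))    # larger loss
```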
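Likewise, the decision-tree bullet boils down to a short calculation: the parent's entropy minus the weighted average of the children's entropies. Below is a minimal sketch assuming a toy dataset and the hypothetical income ≤ 135k rule mentioned above:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

# Toy labels: 1 = entrepreneur, 0 = not. The split simulates income <= 135k vs. > 135k.
parent = [1, 1, 1, 0, 0, 0]          # 50-50 split -> entropy is at its maximum (1 bit)
left, right = [1, 1, 1, 0], [0, 0]   # the rule mostly separates the two labels
print(information_gain(parent, left, right))  # ~0.46 bits of information gained
```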
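For the compression bullet, here is a rough sketch of Huffman's idea: more frequent symbols get shorter codes, and the resulting average code length lands just above the entropy of the symbol distribution. To keep it short, the sketch only computes code lengths (not the actual bit codes), and the sample string is arbitrary:

```python
import heapq
import math
from collections import Counter
from itertools import count

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from `text`."""
    freqs = Counter(text)
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol inside them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

text = "abracadabra"
probs = {s: c / len(text) for s, c in Counter(text).items()}
H = -sum(p * math.log2(p) for p in probs.values())
lengths = huffman_code_lengths(text)
avg_len = sum(probs[s] * lengths[s] for s in probs)
print(f"entropy ~ {H:.2f} bits/symbol, Huffman average ~ {avg_len:.2f} bits/symbol")
```

For this string the distribution's entropy is about 2.04 bits per symbol while the Huffman code averages about 2.09, nicely illustrating the "theoretical minimum" mentioned above.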
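Finally, for the cryptography bullet, one rough way to see the idea is to estimate the entropy of a generator's output from observed byte frequencies. This is only a toy diagnostic, not a real cryptographic test, and the "biased" generator below is deliberately contrived:

```python
import math
import os
from collections import Counter

def empirical_entropy_bits_per_byte(data):
    """Estimate Shannon entropy per byte from observed byte frequencies."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

good = os.urandom(100_000)                          # should approach the 8-bit maximum
biased = bytes(b % 4 for b in os.urandom(100_000))  # only 4 values -> at most 2 bits
print(f"os.urandom: ~{empirical_entropy_bits_per_byte(good):.2f} bits per byte")
print(f"biased:     ~{empirical_entropy_bits_per_byte(biased):.2f} bits per byte")
```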

Example Calculation of Entropy

You are an entropy expert at this point, but before you move on to the next article, let's look at an example together. Consider a binary message source that produces 0s with probability 0.7 and 1s with probability 0.3. How could we calculate its entropy? Easy!

$$Entropy = -0.7 \log_2(0.7) - 0.3 \log_2(0.3) \approx 0.8813\text{ bits}$$

This value indicates that, on average, each symbol in the message carries 0.8813 bits of information, less than the maximum possible entropy of 1 bit for a binary source, which would occur with equal probabilities!
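You can double-check this with a couple of lines of Python, which also confirm the 1-bit maximum for the fair (50/50) case:

```python
import math

H = -0.7 * math.log2(0.7) - 0.3 * math.log2(0.3)
H_fair = -0.5 * math.log2(0.5) - 0.5 * math.log2(0.5)
print(round(H, 4))       # 0.8813 bits per symbol
print(round(H_fair, 4))  # 1.0 bit: the maximum for a binary source
```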

Practice Quiz Time

Now that you understand entropy, let's test your knowledge with a quick quiz:

What is the maximum entropy (in bits) for a system with 4 equally likely outcomes?

If you have a theoretically fair six-sided die, what is the entropy of a single roll?

Assuming the same theoretically fair six-sided die, what is the entropy of 100 rolls?