Home smart_toyAI AssistantBeta Leader Board $GET Token GovernanceaddAdd

chevron_left Run-length encoding (RLE) is a simple data compression technique chevron_right

WikiGama

Anna Kowalski

visibility98

calendar_month2026-02-03

Run-length Encoding: Making Repetition Simple

A beginner's guide to the simple yet powerful idea of data compression by counting repeats.

Summary: Run-length Encoding (RLE)^[1] is a fundamental lossless compression technique that shrinks the size of data without losing any information. It works by scanning data and replacing long sequences, or "runs", of the same value with a compact pair: a count and the single value itself. This method is exceptionally effective for data containing large areas of repetition, such as simple images, old computer graphics, and sequenced data logs. Understanding RLE provides a clear window into the broader world of data compression, algorithms, and efficient digital storage.

The Core Idea: From Sequences to Pairs

Imagine you are tasked with writing down the colors of squares in a long row. The row looks like this: black, black, black, black, white, white, white, white, white, black, black. Writing each one takes a lot of space and time. A smarter way is to say: "4 black, 5 white, 2 black." You just used the principle of Run-length Encoding! You replaced the run of identical items with a count and the item.

In the digital world, data is often stored as a series of symbols—letters, numbers, or binary digits (bits). RLE takes advantage when these symbols repeat consecutively. The basic algorithm is incredibly straightforward:

RLE Algorithm (Simple Form):
1. Start at the beginning of the data.
2. Look for the next consecutive sequence (run) of the same symbol.
3. Output the length (count) of that run.
4. Output the symbol that is repeated.
5. Move to the symbol after the run and repeat from step 2.

This process transforms the original data into a series of (count, value) pairs. The "compression" happens because storing the pair "5 A" is much shorter than storing "A A A A A".

A Walkthrough with Letters and Pixels

Let's apply RLE to a text example. Consider the string: WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW. This is a famous example used to demonstrate RLE. Manually counting runs, we get:

12 W
1 B
12 W
3 B
24 W
1 B
14 W

The encoded data becomes: 12W1B12W3B24W1B14W. We have converted 67 characters into just 14 characters (the numbers 12,1,12,3,24,1,14 and the letters W,B,W,B,W,B,W)! That's a significant reduction. However, there's a catch: how do we know where one number ends and the next begins? Is "121" count 12 followed by value 1, or count 1 followed by value 21? This leads us to an important concept: delimiters or fixed-length counts.

A common solution is to always use a fixed number of digits for the count. For instance, if we decide every count will be exactly 2 digits, our encoding becomes: 12W01B12W03B24W01B14W. Now, we can always read two digits for the count, then one character for the value. This makes decoding perfectly reliable.

Binary Image Compression: A Classic Application

One of the most famous real-world uses of RLE is in compressing simple, black-and-white images, like old fax machine transmissions or computer icons. In a binary image, each pixel is either 0 (black) or 1 (white). Instead of storing every single pixel, we can store runs of consecutive 0s and 1s.

Let's take a tiny 8x8 pixel image of a smiley face, represented as a sequence of bits (scanning row by row from left to right, top to bottom):

11111111
10000001
10100101
10000001
10100101
10011001
10000001
11111111

As a single stream: 11111111 10000001 10100101 10000001 10100101 10011001 10000001 11111111

Now, let's perform RLE. We'll list runs of consecutive 0s and 1s. For clarity, we'll start with a white (1) run:

Run #	Pixel Value	Run Length	Description
1	1 (White)	8	First row, all white.
2	0 (Black)	6	Second row, middle six pixels.
3	1 (White)	2	Ends of second row.
...	...	...	(Continues for all rows)
Final	1 (White)	8	Last row, all white.

The full RLE sequence for the entire image (using decimal counts) might look like: 8,1,6,0,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,6,0,2,1,... and so on. Storing this list of small numbers is often much more efficient than storing 64 individual bits, especially for images with large solid areas.

Strengths, Weaknesses, and Modern Relatives

Run-length Encoding shines in specific scenarios but can fail in others. Understanding its pros and cons is key to knowing when to use it.

Strengths:

Simplicity: It is very easy to understand, implement, and debug. The algorithm can be written in just a few lines of code.
Speed: Both encoding and decoding are very fast, as they involve simple counting and copying.
High Efficiency for Repetitive Data: For data with long runs (like simple graphics, screens with solid backgrounds, or certain types of scientific data), RLE can achieve impressive compression ratios^[2].
Lossless: No information is lost during compression and decompression; the original data is perfectly restored.

Weaknesses:

Ineffective for Non-Repetitive Data: If the data has very few or very short runs (e.g., natural photographs, complex text), RLE can actually make the file larger. For example, the sequence "A B C D" would be encoded as "1A1B1C1D", which is twice the size of the original!
Run Overhead: Each run requires storing the count. For very short runs (length 1 or 2), this overhead wastes space.
Sensitive to Order: RLE only finds consecutive repeats. Scattered repetitions (like many single black pixels in a white field) are not compressed well.

Because of these limitations, RLE is rarely used alone for general-purpose compression today. Instead, it often serves as one component in more advanced algorithms. For example, the widely used PNG image format uses a filter step on scanlines that often creates sequences of similar byte values, which are then compressed using a method related to RLE. The ZIP format's DEFLATE algorithm first finds repeating strings (LZ77) and then applies Huffman coding; the output of the first step can sometimes resemble run-length data.

Hands-On: Building and Breaking an RLE Code

Let's get practical. Suppose we have a sensor that records temperature every hour, but it only changes value occasionally. The raw data for a day might be: 72, 72, 72, 72, 72, 72, 75, 75, 75, 75, 68, 68, 68, 68, 68, 68, 68, 68, 70, 70, 70, 70, 70, 70.

Step 1: Encode. We scan the list and count runs:

Six consecutive 72s $\rightarrow$ (6, 72)
Four consecutive 75s $\rightarrow$ (4, 75)
Eight consecutive 68s $\rightarrow$ (8, 68)
Six consecutive 70s $\rightarrow$ (6, 70)

The encoded data is the sequence of these pairs: 6 72 4 75 8 68 6 70. We've reduced 24 numbers down to just 8 numbers.

Step 2: Decode. To get the original data back, we read the pairs one by one and expand them:

Read (6, 72): Output "72" six times.
Read (4, 75): Output "75" four times.
Read (8, 68): Output "68" eight times.
Read (6, 70): Output "70" six times.

And we have perfectly reconstructed the original 24 readings. This demonstrates the lossless property of RLE.

Mathematical View:
If the original data is a sequence $S = s_1, s_2, s_3, ..., s_n$, RLE transforms it into a sequence of pairs $R = (r_1, v_1), (r_2, v_2), ..., (r_k, v_k)$ where:

$v_i$ is a symbol from the original data,
$r_i$ is an integer count $> 0$,
$v_i \neq v_{i+1}$ for all $i$, and
$r_1 + r_2 + ... + r_k = n$.

Compression is achieved when $2k < n$. The compression ratio can be loosely approximated as $n / (2k)$.

Important Questions

Q1: Can RLE work on data that isn't numbers or letters, like sound or video?
Yes, but it depends on the format. RLE works on any linear sequence of symbols. For sound (a sequence of sample values) or video (a sequence of frames, each being an image), RLE can be applied if the data contains runs of identical sample values or pixel colors. However, most real-world sound and video have subtle variations, so pure RLE is not very effective. It is more commonly used within the compression of the individual frames of a video (e.g., in older formats like BMP or TGA) or for specific channels of data.

Q2: What happens if we have a run longer than the maximum number our count can hold?
This is a practical issue. If we use a single byte (8 bits) to store the count, it can only hold numbers from 0 to 255. What if we have a run of 300 white pixels? The solution is to split the run. We would encode it as (255, white) followed by (45, white). The decoder, seeing two consecutive runs of the same value, will correctly output 300 white pixels in a row. The file format specification must define if merging consecutive runs of the same value is allowed during decoding.

Q3: Is RLE still used in modern computing?
Absolutely. While not typically the main compression method for complex files, RLE is alive and well in specific areas:

Database Systems: For compressing columns of data that have many repeated values (e.g., a "status" column with millions of "active" entries).
Bitmap (BMP) Image Files: The BMP format supports an RLE compression option for 4-bit and 8-bit indexed color images.
PDF Files: Used for compressing monochrome (black and white) image data in a variant called CCITT Group 3/4, which is essentially a sophisticated run-length encoding.
Scientific Data Arrays: Where large regions might have a constant value (e.g., a 3D model of an object with empty space around it).

Conclusion: Run-length Encoding is a beautiful and intuitive gateway into the world of data compression. By harnessing the power of repetition, it demonstrates how a simple algorithmic idea can lead to significant savings in storage and transmission time. Its strengths in simplicity and speed ensure it remains a useful tool, either on its own for specialized data or as a building block within more complex, modern compression schemes like PNG. Understanding RLE not only teaches you about compression but also encourages a mindset of looking for patterns and efficiency—a foundational skill in computer science and problem-solving.

Footnote

^[1] RLE (Run-length Encoding): The full English name and definition of the topic. A method for lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count.
^[2] Compression Ratio: A measure of the reduction in size of data achieved by a compression method. It is often expressed as the ratio of the original size to the compressed size (e.g., 2:1).

#AS & A Level #Data Compression #Lossless Algorithm #Binary Images #Coding for Beginners #Pattern Recognition

Did you like this article?

Blog

Go to blog See all chevron_forward