VarInt Encoding For Integers

VarInt is a way of storing integers in such a way that only the bytes required by the integer are used. This means that smaller integers require less memory than larger integers. This method is a lot more efficient than simply using, for example, a 32 bit integer to represent a value such as 200, which should only require 8 bits of storage.

Every integer is split into 8 bit components and each 8 bit component is further split into 2 more parts, where the first part consists of the first bit and the second part consists of the remaining 7 bits in the 8 bit sequence.

Most Significant Bit

The MSb refers to the first “part” of the 8 bit sequence mentioned above. Each of the 8 bit component contains an MSb. The MSb is used to determine whether their are more bytes to the integer, this is determined by the value of the MSb. If the MSb has a value of 1 then it is not the last 8 bit sequence, meaning the number is larger and more bytes follow. If the number is 0, then we know that we have reached the last byte.

7-bit Group

The 7BG refers to the second “part” of the 8 bit group mentioned above. The 7BG is used to determine the value stored in the 8 bit sequence. the value is represented using Two’s Complement and the bits are stored in Little-Endian format.

Formula

We take a look at the MSb, if it is a 1, then we know that there are more bytes to come, if it is a 0 then we know that this is the last byte in the sequence. This bit is then removed and the remaining 7 are used to calculate the value.
Because the 7 bit group is stored in Little Endian format, LSb first, we need to reverse all of the 7 bit group.
After reversing the 7 bit groups, if our VarInt is made up of more than 1 byte, we concatenate all the 7 bit groups together. We are left with the Two’s Complement representation of the decimal value.

Signed Integers:

When dealing with negative numbers, the resulting VarInt is always 10 bytes long because it gets treated like a very large unsigned integer, which is very inefficient. To avoid this inefficiency, ZigZag encoding is used.

ZigZag Encoding:

ZigZag encoding is quite simple to understand as it works like its name suggests. Instead of encoding the actual signed integer, integers are mapped to unsigned integers and then encoded. The integers are mapped by “ZigZagging” back and forth between negative and positive integers. We use the formula n x 2 for positive values and abs(n) x 2 – 1 for negative values.

Example 1:

1000 1010 0111 0000

Step 1:

The MSb of the first byte is 1, so we know that more bytes follow. We then write all the bytes, removing the MSb from each.

000 1010 111 0000

Step 2:

We now need to reverse each 7-bit group, there are 2.

0101 000 0000 111

Step 3:

We can see that our MSb is 0, and according to Two’s Complement, a leading 0 means we have a positive value. All that is left to do is calculate the decimal value.

4096 + 1024 + 4 + 2 + 1 = 5127

Conclusion

As we can see from the above formula, we have used 2 bytes to store a value of 5127.