Bits and Bytes Ordering in 5 Minutes
Posted February 2019

1. Bits and Their Encoding

Imagine that I generate a key to encrypt with AES. I use AES-128, instead of AES-256, so I need a 128-bit key.

I use whatever mechanism my OS gives me to generate a long string of bits. For example, in Python 3:

>>> import os
>>> random_number = os.urandom(16)
>>> print(bin(int.from_bytes(random_number, byteorder='big'))[2:])
11111010110001111100010010101111110101101111111011100001110000001000010100001000000010001001000110111000000111101101000000101011

These bits can be interpreted as a large number in base 2, exactly as you would interpret 18 as "eighteen" in base 10.

This is the large number in base 10:

>>> print(int.from_bytes(random_number, byteorder='big'))
333344255304826079991460895939740225579
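To make the base-2 reading concrete, here is a minimal Python 3 sketch that rebuilds a number digit by digit, the same way 18 is 1 times 10 plus 8. The helper name `bits_to_int` is made up for this illustration; the bitstring is the one printed above, split over four lines for readability.

```python
def bits_to_int(bits: str) -> int:
    """Interpret a string of '0'/'1' characters as a base-2 number, MSB first."""
    value = 0
    for bit in bits:
        # Reading one more digit multiplies everything read so far by the base (2).
        value = value * 2 + int(bit)
    return value


key_bits = (
    "11111010110001111100010010101111"
    "11010110111111101110000111000000"
    "10000101000010000000100010010001"
    "10111000000111101101000000101011"
)

print(bits_to_int("10010"))                       # 18, "eighteen" written in base 2
print(bits_to_int(key_bits) == int(key_bits, 2))  # True: agrees with Python's own parser
```

In real code you would simply call `int(key_bits, 2)`; the loop is only there to show what that call does.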

According to WolframAlpha, it reads like so in English:

333 undecillion 344 decillion 255 nonillion 304 octillion 826 septillion 79 sextillion 991 quintillion 460 quadrillion 895 trillion 939 billion 740 million 225 thousand 579.

A number this large is unwieldy to write down, so we can use letters as extra digits to shorten it into something more human-readable. Let's try base 16, which is hexadecimal:

>>> print(random_number.hex())
fac7c4afd6fee1c085080891b81ed02b

You will often see binary strings displayed this way, in a more human-readable format. Another popular one is base64 which uses, you guessed it, base 64:

>>> import base64
>>> print(base64.b64encode(random_number).decode())
+sfEr9b+4cCFCAiRuB7QKw==

As you can see, the bigger the base, the shorter the string we get. That is quite useful for keeping something short and human-readable.
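As a rough check of the "bigger base, shorter string" claim, the following Python 3 sketch encodes the same 16 bytes in each base used above and counts the characters. The variable names are only illustrative.

```python
import base64

key = bytes.fromhex("fac7c4afd6fee1c085080891b81ed02b")  # the 16 bytes from the example
number = int.from_bytes(key, byteorder="big")

encodings = {
    "base 2": bin(number)[2:],
    "base 10": str(number),
    "base 16": key.hex(),
    "base 64": base64.b64encode(key).decode("ascii"),
}

for name, text in encodings.items():
    print(f"{name:>7}: {len(text):3d} characters")
# base 2 needs 128 characters, base 10 needs 39, base 16 needs 32, base 64 needs 24
```

Note that the base64 length includes the two `=` padding characters, and that `bin()` would drop leading zero bits for a key that happened to start with them.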

2. Bytes and Bit-wise Operations

Let's go back to our bitstring:

11111010110001111100010010101111110101101111111011100001110000001000010100001000000010001001000110111000000111101101000000101011

This is quite a lot of bits, and we need a way to store them in our computer's memory.

The most common way is to pack these bits into bytes of 8 bits each (a group of 8 bits is also called an octet):

11111010 11000111 11000100 10101111 11010110 11111110 11100001 11000000 10000101 00001000 00001000 10010001 10111000 00011110 11010000 00101011

As you can see, we just split things every 8 bits. In each group of 8 bits, we keep the bit numbering with the most significant bit (MSB) first. We could have put the least significant bit (LSB) first instead, but since our larger binary string already had the MSB first, it makes sense to keep it this way. It's also more "human", as we are used to reading numbers from left to right (at least in English, French, Chinese, etc.).
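The splitting step can be sketched in a few lines of Python 3. This assumes the bitstring length is a multiple of 8, which holds for our 128-bit key; the variable names are illustrative.

```python
bits = (
    "11111010110001111100010010101111"
    "11010110111111101110000111000000"
    "10000101000010000000100010010001"
    "10111000000111101101000000101011"
)

# Cut the string into groups of 8 characters, keeping the MSB-first order inside each group.
octets = [bits[i:i + 8] for i in range(0, len(bits), 8)]
print(" ".join(octets[:4]))  # 11111010 11000111 11000100 10101111

# Turn each group into a number between 0 and 255 and pack them into a bytes object.
packed = bytes(int(octet, 2) for octet in octets)
print(packed.hex())          # fac7c4afd6fee1c085080891b81ed02b
```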

Most programming languages let you access octets rather than individual bits. For example, in Go:

a := []byte{98, 99} // a slice of bytes
b := a[0] // the byte represented by the base 10 number '98'

Acting on a specific bit takes a bit more effort, as we need to isolate it via bitwise operations like NOT, AND, OR, XOR, shifts, rotations, etc.

For example, in Go:

a := byte(98)
firstBit := a >> 7 // shifting 7 bits to the right moves the MSB into the lowest position and discards the other bits
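The same idea works for any bit position, not just the MSB. Here is a small Python 3 sketch; `get_bit` and `set_bit` are hypothetical helper names, and positions are counted from 0 (LSB) to 7 (MSB).

```python
def get_bit(byte: int, position: int) -> int:
    """Return the bit at `position` of an 8-bit value (7 is the MSB, 0 the LSB)."""
    # Shift the wanted bit down to position 0, then mask away everything else.
    return (byte >> position) & 1


def set_bit(byte: int, position: int) -> int:
    """Return `byte` with the bit at `position` forced to 1."""
    return byte | (1 << position)


a = 98  # 0b01100010
print(get_bit(a, 7))                              # 0, the MSB
print(get_bit(a, 1))                              # 1
print([get_bit(a, i) for i in range(7, -1, -1)])  # [0, 1, 1, 0, 0, 0, 1, 0]
print(set_bit(a, 0))                              # 99
```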

So far, all of this can be learned and anchored in your brain by writing code for something like the cryptopals challenges.

3. Memory

OK. How do we store a number that spans several octets in memory? Unfortunately, for historical reasons, we have two ways of doing this:

Big-Endian: the most significant byte is stored first, at the lowest memory address (00000....), and the remaining bytes follow toward higher addresses (999999...).
Little-Endian: the least significant byte is stored first, at the lowest memory address (0000000...), so the most significant byte ends up at the highest address (9999999...).

We call this Endianness.
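To visualize the two layouts, this Python 3 sketch serializes one 32-bit number both ways with the standard `int.to_bytes` method. The value `0x0A0B0C0D` is an arbitrary example, and "base" stands for whatever address the number would start at.

```python
number = 0x0A0B0C0D  # arbitrary 32-bit example value

big = number.to_bytes(4, byteorder="big")
little = number.to_bytes(4, byteorder="little")

for name, layout in (("big-endian", big), ("little-endian", little)):
    for offset, byte in enumerate(layout):
        # `offset` plays the role of the memory address relative to the start of the number.
        print(f"{name:>13}  base+{offset}: {byte:02x}")

print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a
```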

I'm sorry, but to understand the rest of this article, you are going to have to parse this small snippet of C first:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

int main(){
  uint8_t a[] = {1, 255};         // storing [1, 255]
  printf("%p: %02x\n", (void*)a, *a);           // 0x7ffdc5e78a70: 01
  printf("%p: %02x\n", (void*)(a+1), *(a+1));   // 0x7ffdc5e78a71: ff
}

As we can see, everything works as expected:

a points to an address in memory (0x7ffdc5e78a70) containing $1$
the next address (0x7ffdc5e78a71) contains the value $255$ (displayed in hexadecimal as ff)

The number 0x01ff (the 0x is a nice way to indicate that it is hexadecimal) represents the number $1 \times 16^2 + 15 \times 16^1 + 15 \times 16^0 = 511$ (remember, f represents the number 15 in hexadecimal).

So let's try to store that number in a different way in C:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

int main(){
  uint16_t b = 0x01ff;               // storing [1, 255] ?
//uint16_t b = 511;                  // these two lines are equivalent

  uint8_t *a = (uint8_t*)&b;         // getting octet pointer on b
  printf("%p: %02x\n", (void*)a, *a);           // 0x7ffd78106986: ff
  printf("%p: %02x\n", (void*)(a+1), *(a+1));   // 0x7ffd78106987: 01
}

Wait what? Why is the order of 01 and ff reversed?

This is because the machine I used to run this uses little-endianness to map values to memory (like most machines nowadays).
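If you want to check what your own machine does without writing C, Python 3 exposes the same information. This is only a sketch: `sys.byteorder` reports the native order, and `struct.pack` with the `=` prefix serializes a value in that native order.

```python
import struct
import sys

value = 0x01ff

# "=" selects the machine's native byte order; "H" is an unsigned 16-bit integer.
native = struct.pack("=H", value)

print(sys.byteorder)  # "little" on most machines nowadays, "big" on a few others
print(native.hex())   # "ff01" on a little-endian machine, "01ff" on a big-endian one

# The native layout always matches what int.to_bytes produces for sys.byteorder.
print(native == value.to_bytes(2, byteorder=sys.byteorder))  # True
```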

If you didn't know about this, it should freak you out.

But relax, this weirdness almost NEVER matters. Because:

in most languages, you do not do pointer arithmetic (what I just did when I incremented a)
in most scenarios, you do not convert back and forth between bytestrings and number types (like int or uint16_t).

And this is pretty much why most systems don't care too much about using little-endian instead of big-endian.

4. Network

Networking is usually the first challenge someone unfamiliar with endianness encounters.

When receiving bytes from a TCP socket, one usually stores them into an array. Here is a simple example in C where we receive a string from the network:

char *a = readFromNetwork(); // [104, 101, 108, 108, 111, 0]
printf("%s\n", a);          // hello

Notice that we do not necessarily know in which order (endianness) the bytes of a multi-byte value were sent, but protocols usually agree to use network byte order, which is big-endian. None of this matters for strings of single-byte characters, since each byte is simply stored in the order it arrives. But when it comes to numbers larger than 8 bits, you need to know how to reassemble them in memory depending on your machine.

Let's see why this is a problem. Imagine that we want to transmit the number $511$. We need two bytes: 0x01 and 0xff. We transmit them in this order since it is big-endian, which is the preferred network byte order. On the other side, here is how we can receive the two bytes and convert them back into a number type:

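// PRIu16 comes from <inttypes.h>; the outputs shown are from a little-endian machine
uint8_t a1[] = {1, 255};      // storing the received octets as-is (from left to right)
uint8_t a2[] = {255, 1};      // storing the octets from right to left after reversing them
uint16_t *b1 = (uint16_t*)a1; // reinterpreting two octets as one 16-bit number
uint16_t *b2 = (uint16_t*)a2; // (for illustration only: such casts are not portable C)
printf("%"PRIu16"\n", *b1);    // 65281
printf("%"PRIu16"\n", *b2);    // 511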

In this case, we see that to collect the correct number $511$ on the other end of the connection, we had to reverse the order of the bytes in memory. This is because our machine is little-endian.
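The same mix-up can be reproduced in Python 3 without touching memory directly, because `int.from_bytes` lets us choose how the two received bytes are interpreted. A minimal sketch:

```python
received = bytes([1, 255])  # the two octets, in the order they arrived

as_big = int.from_bytes(received, byteorder="big")        # what the sender meant
as_little = int.from_bytes(received, byteorder="little")  # what a little-endian cast sees

print(as_big)     # 511
print(as_little)  # 65281

# Reversing the bytes first makes the little-endian reading come out right.
print(int.from_bytes(received[::-1], byteorder="little"))  # 511
```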

This is what confuses most people!

In reality, it shouldn't. And this should reassure you, because trying to figure out the endianness of your machine before converting a series of bytes received from the network into a number can be daunting.

Instead, we can rely on bitwise operations, which act on a number's value rather than on its layout in memory, so they behave the same on every machine. Let's take a close look at this short snippet of code:

uint8_t a[] = {1, 255};           // the number 511 encoded in network byte-order
uint16_t b = (a[0] << 8) | a[1];
printf("%"PRIu16"\n", b);         // 511

Here, we placed the received big-endian bytes in the correct positions via the left shift operation: the first byte received is shifted into the high 8 bits of the number, and the second byte fills the low 8 bits. This code works on any machine. It is the key to understanding why endianness doesn't matter in most cases: bitwise operations are endianness-independent.
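The shift-and-OR trick generalizes to any number of bytes, and it runs in reverse too. Below is a Python 3 sketch with two hypothetical helpers, `decode_big_endian` and `encode_big_endian`; they only use shifts and masks, so nothing in them depends on the machine's endianness.

```python
def decode_big_endian(data: bytes) -> int:
    """Rebuild a number from bytes sent most significant byte first."""
    value = 0
    for byte in data:
        # Make room for 8 more bits, then drop the new byte into the low 8 bits.
        value = (value << 8) | byte
    return value


def encode_big_endian(value: int, length: int) -> bytes:
    """Split a number into `length` bytes, most significant byte first."""
    return bytes((value >> (8 * i)) & 0xFF for i in range(length - 1, -1, -1))


print(decode_big_endian(bytes([1, 255])))  # 511
print(encode_big_endian(511, 2).hex())     # 01ff
```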

Unless your job is to implement low-level stuff like cryptography, you do not care about endianness. This is because you will almost never convert a series of bytes to a number, or a number to a series of bytes.

If you do, because of networking perhaps, you use the built-in functions of the language (see Golang or C for example) and endianness-independent operations (like left shift), but never pointer arithmetic.
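As an example of such built-in functions, Python's standard `struct` module accepts a `!` prefix meaning network (big-endian) byte order. The sketch below packs and unpacks an imaginary two-field header; the field names `port` and `count` and their values are invented for the illustration.

```python
import struct

# "!" = network byte order, "H" = unsigned 16-bit, "I" = unsigned 32-bit.
HEADER_FORMAT = "!HI"

packet = struct.pack(HEADER_FORMAT, 511, 1_000_000)
print(packet.hex())  # 01ff000f4240

port, count = struct.unpack(HEADER_FORMAT, packet)
print(port, count)   # 511 1000000
```

Because the byte order is spelled out in the format string, this code gives the same bytes on a little-endian and on a big-endian machine.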

Comments

Tejaswi

Should be: printf("%p: %x\n", a, *a); (missing % symbol behind x)

Abdul Khadar J

Beautiful Explanation. There is a typo.
Wrong: 1 * 16^2 + 15 * 16^1 + 15 + 16^0 = 511
Right : 1 * 16^2 + 15 * 16^1 + 15 * 16^0 = 511

david

Tejaswi: thanks! It's actually x also. Double mistake :)

Abdul Khadar J: thanks for pointing that out also!

Brent shambaugh

I got an error for 1. Bits and Their Encoding:

print(bin(int(random_number,16))[2:])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 16: b'[\xc2-\xca\xc9\xde\xf7\xf6#\x9f\xda\x06\xde\x8b\xd0;'

Looking up the error gives:
https://stackoverflow.com/questions/1841565/valueerror-invalid-literal-for-int-with-base-10

My solution:

I was able to get a binary number through big endian (most significant byte first):

import os;
random_number = os.urandom(16)
print(bin(int.from_bytes(random_number, byteorder='big'))[2:])

or a different one with little endian (least significant byte first):

import os;
random_number = os.urandom(16)
print(bin(int.from_bytes(random_number, byteorder='little'))[2:])