The Engineering of Encoding
Encoding is the bridge between human-readable information and machine-parsable data. It is not encryption; it is a translation layer governed by strict standards (RFCs) that ensures interoperability across the global internet.
1. URL Encoding (Percent-Encoding)
URL encoding, officially known as Percent-Encoding, is defined in RFC 3986. It provides a mechanism for encoding information in a Uniform Resource Identifier (URI).
Why strictly ASCII?
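The internet's infrastructure was built on ASCII. While modern protocols often support wider character sets, the URI syntax is strictly limited to a subset of ASCII. Unreserved characters can appear as-is, and reserved characters can appear unencoded only where they act as delimiters. Everything else, including non-ASCII characters (typically converted to UTF-8 bytes first) and reserved characters used as plain data, must be percent-encoded.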
The Mechanics
A percent-encoded octet is encoded as a character triplet, consisting of the percent character % followed by the two hexadecimal digits representing that octet's numeric value.
Reserved vs. Unreserved
- Unreserved characters (Safe): A-Z, a-z, 0-9, -, _, ., ~. These never need encoding.
- Reserved characters (Delimiters): !, *, ', (, ), ;, :, @, &, =, +, $, ,, /, ?, #, [, ].
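The triplet rule and the unreserved set can be combined into a small encoder. The following is a minimal sketch, not a replacement for the built-in `encodeURIComponent`; the helper name `percentEncode` is hypothetical, and the sketch assumes the text is converted to UTF-8 bytes before encoding.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte that is not an
// unreserved character (A-Z, a-z, 0-9, -, _, ., ~).
function percentEncode(text) {
  const unreserved = /^[A-Za-z0-9\-_.~]$/;
  let out = "";
  for (const byte of new TextEncoder().encode(text)) {
    const ch = String.fromCharCode(byte);
    if (unreserved.test(ch)) {
      out += ch;
    } else {
      // "%" followed by the two hex digits of the octet value
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncode("a b")); // a%20b
console.log(percentEncode("é"));   // %C3%A9 (two UTF-8 bytes, two triplets)
```

Unlike this sketch, `encodeURIComponent` also leaves a few reserved characters such as `!` and `*` unencoded, so the two will not always agree.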
Technical Note: The space character is a special case. In a URL path segment, it is encoded as %20. However, in `application/x-www-form-urlencoded` data (like query strings), it is historically encoded as +. This distinction often causes bugs in API integration.
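The difference is easy to see with two standard JavaScript APIs. This short sketch uses an arbitrary sample value and a hypothetical parameter name `q`; it shows `encodeURIComponent` producing %20 while `URLSearchParams` serializes in the form-urlencoded style with +.

```javascript
const value = "hello world";

// encodeURIComponent percent-encodes the space as %20.
const pathStyle = encodeURIComponent(value);

// URLSearchParams serializes as application/x-www-form-urlencoded,
// where the space becomes +.
const formStyle = new URLSearchParams({ q: value }).toString();

console.log(pathStyle); // hello%20world
console.log(formStyle); // q=hello+world

// decodeURIComponent does not treat + as a space, so form-style data
// needs the + replaced before decoding.
const decodedForm = decodeURIComponent("hello+world".replace(/\+/g, " "));
console.log(decodedForm); // hello world
```

Decoding form data with `decodeURIComponent` alone would leave the literal + in place, which is a common source of the integration bugs mentioned above.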
2. Hexadecimal (Base16)
Hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols: 0-9 to represent values zero to nine, and A-F (or a-f) to represent values ten to fifteen.
Relation to Binary
Hexadecimal is the lingua franca of low-level computing because of its direct relationship to binary. One hexadecimal digit represents exactly four binary digits (a nibble).
| Hex | Binary | Decimal | Description |
|---|---|---|---|
| 0 | 0000 | 0 | Zero |
| F | 1111 | 15 | Max nibble value |
| FF | 1111 1111 | 255 | Max byte value (8 bits) |
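The nibble mapping in the table can be reproduced with the built-in radix conversions. This is a small sketch; `hexToBinary` is a hypothetical helper name.

```javascript
// Convert a hex string to binary, one 4-bit nibble per hex digit.
function hexToBinary(hex) {
  return [...hex]
    .map((digit) => parseInt(digit, 16).toString(2).padStart(4, "0"))
    .join(" ");
}

console.log(hexToBinary("F"));  // 1111
console.log(hexToBinary("FF")); // 1111 1111

// Built-in conversions between hex and decimal.
console.log(parseInt("FF", 16));                // 255
console.log((255).toString(16).toUpperCase());  // FF
```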
Color Theory & Memory
In web design, a hex triplet (e.g., #16A34A) is a six-digit, three-byte hexadecimal number. The bytes represent the red, green, and blue components of the color.
- 16 (Red) = 22
- A3 (Green) = 163
- 4A (Blue) = 74
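Splitting a hex triplet into its three channels takes only a few lines. The sketch below assumes the six-digit form with a leading #; `parseHexColor` is a hypothetical helper name, and shorthand forms such as #FFF are not handled.

```javascript
// Hypothetical helper: split a six-digit hex color into decimal channels.
function parseHexColor(color) {
  const match = /^#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(color);
  if (!match) throw new Error(`Not a six-digit hex color: ${color}`);
  const [red, green, blue] = match.slice(1).map((pair) => parseInt(pair, 16));
  return { red, green, blue };
}

console.log(parseHexColor("#16A34A")); // { red: 22, green: 163, blue: 74 }
```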
3. Unicode, UTF-8 & The Astral Planes
Unicode is the universal character set standard, maintained by the Unicode Consortium. As of version 15.0, it contains over 149,000 characters.
Code Points vs. Encoding Formats
It is crucial to distinguish between a Code Point (the abstract integer ID) and the Encoding (how it is stored in bits).
- Code Point: U+1F600 (😀 Grinning Face)
- UTF-8: F0 9F 98 80 (4 bytes). The standard for the web.
- UTF-16: D83D DE00 (variable width; two 16-bit code units here). Used by Java and Windows APIs.
- UTF-32: Fixed 4 bytes. Simple to process, but memory inefficient.
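The same code point can be inspected in each form from JavaScript. This sketch uses the standard `TextEncoder` for UTF-8 and reads the string's own UTF-16 code units; the `toHex` helper is a hypothetical convenience for formatting.

```javascript
// Build the character from its code point (U+1F600).
const face = String.fromCodePoint(0x1f600);

// Hypothetical formatting helper: uppercase hex, zero-padded.
const toHex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");

// UTF-8: one entry per byte.
const utf8 = [...new TextEncoder().encode(face)].map((b) => toHex(b, 2)).join(" ");

// UTF-16: one entry per 16-bit code unit of the JavaScript string.
const utf16Units = [];
for (let i = 0; i < face.length; i++) {
  utf16Units.push(toHex(face.charCodeAt(i), 4));
}
const utf16 = utf16Units.join(" ");

// UTF-32: the code point itself, stored in a fixed 4 bytes.
const utf32 = toHex(face.codePointAt(0), 8);

console.log(utf8);  // F0 9F 98 80
console.log(utf16); // D83D DE00
console.log(utf32); // 0001F600
```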
Surrogate Pairs (JavaScript's Headache)
JavaScript strings use UTF-16 encoding. Characters outside the Basic Multilingual Plane (BMP), such as emojis or rare Kanji, do not fit in a single 16-bit unit. Instead, they are represented by two 16-bit units known as a Surrogate Pair.
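The two units are derived from the code point by fixed arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A minimal sketch, with `toSurrogatePair` as a hypothetical helper name:

```javascript
// Hypothetical helper: compute the surrogate pair for a code point
// above U+FFFF.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f4a9);
console.log(high.toString(16), low.toString(16)); // d83d dca9

// Iterating a string (for example with spread) walks code points,
// not 16-bit units.
const poo = String.fromCodePoint(0x1f4a9);
console.log(poo.length);      // 2
console.log([...poo].length); // 1
```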
// The "Pile of Poo" emoji (U+1F4A9)
"💩".length === 2 // true!
"💩".charCodeAt(0) // 0xD83D (High Surrogate)
"💩".charCodeAt(1) // 0xDCA9 (Low Surrogate)
4. HTML Entities & Security (XSS)
HTML Entities ensure that reserved characters are treated as text content rather than markup code. Escaping untrusted output this way is a first line of defense against Cross-Site Scripting (XSS) attacks.
The Attack Vector
If a user inputs <script>alert('hacked')</script> and your application renders it raw, the browser executes the JavaScript.
The Defense
When the input is encoded as &lt;script&gt;alert('hacked')&lt;/script&gt;, the browser renders the characters as text without executing them.
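A basic escaping function can be sketched as follows. The name `escapeHtml` and the choice of five characters are assumptions for illustration; the sketch also escapes quotes, which matters when the value ends up inside an attribute. Production code would normally rely on a templating engine or framework that escapes by default.

```javascript
// Characters with special meaning in HTML and their entity forms.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: replace each special character with its entity.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml("<script>alert('hacked')</script>"));
// &lt;script&gt;alert(&#39;hacked&#39;)&lt;/script&gt;
```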
Types of Entities
- Named Entities: &copy; (©), &euro; (€). Easy to read, but not all characters have names.
- Decimal Entities: &#169; (©). Universally supported.
- Hexadecimal Entities: &#xA9; (©). Useful when working with Unicode code points.
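Numeric entities are just the code point written in decimal or hexadecimal, so they can be generated for any character. A minimal sketch, with `numericEntities` as a hypothetical helper name:

```javascript
// Hypothetical helper: build the decimal and hexadecimal entity for the
// first code point of a string.
function numericEntities(char) {
  const codePoint = char.codePointAt(0);
  return {
    decimal: `&#${codePoint};`,
    hex: `&#x${codePoint.toString(16).toUpperCase()};`,
  };
}

console.log(numericEntities("©")); // { decimal: '&#169;', hex: '&#xA9;' }
console.log(numericEntities("€")); // { decimal: '&#8364;', hex: '&#x20AC;' }
```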
5. Punycode & International Domain Names (IDN)
The Domain Name System (DNS) is historically ASCII-only. To support International Domain Names (IDNs) containing Unicode characters (e.g., münchen.de), the IETF standard Punycode (RFC 3492) was developed.
The Bootstring Algorithm
Punycode uniquely and reversibly transforms Unicode strings into the limited ASCII character set allowed in host names (letters, digits, and hyphens).
- Prefix: Every Punycode-encoded label in a domain name starts with xn-- (the ACE prefix).
- Separation: ASCII characters in the original string are copied first, followed by a delimiter (-) when any are present.
- Encoding: Non-ASCII characters are encoded as a delta sequence at the end.
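The three parts are visible in a real label. The sketch below relies on the standard `URL` class, which converts international host names to their ASCII form when parsing; `describeAceLabel` is a hypothetical helper that only splits the label and does not decode the delta sequence.

```javascript
// The URL parser converts the Unicode host name to its ASCII form.
const asciiHost = new URL("http://münchen.de").hostname;
console.log(asciiHost); // xn--mnchen-3ya.de

// Hypothetical helper: split an encoded label into its visible parts.
function describeAceLabel(label) {
  const ACE_PREFIX = "xn--";
  if (!label.startsWith(ACE_PREFIX)) return null;
  const body = label.slice(ACE_PREFIX.length);
  // The last hyphen separates the copied ASCII characters from the deltas.
  const delimiterIndex = body.lastIndexOf("-");
  return {
    basic: delimiterIndex === -1 ? "" : body.slice(0, delimiterIndex),
    encodedDeltas: body.slice(delimiterIndex + 1),
  };
}

console.log(describeAceLabel("xn--mnchen-3ya"));
// { basic: 'mnchen', encodedDeltas: '3ya' }
```

Here "mnchen" is the ASCII part of "münchen", and "3ya" encodes where the "ü" goes and which character it is.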
Homograph Attacks
Punycode enables a new class of phishing attacks where a malicious actor registers a domain with characters that look identical to a target (e.g., using a Cyrillic 'a' instead of a Latin 'a'). This is called a Homograph Attack. Modern browsers use heuristic policies to decide whether to display the Unicode version or the raw Punycode (xn--...) to warn users.
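One simple heuristic is to flag labels that mix scripts. The sketch below checks only for Latin and Cyrillic together, using Unicode property escapes in a regular expression; the function name is hypothetical, and real browser policies are considerably more elaborate than this.

```javascript
// Hypothetical check: does the label contain both Latin and Cyrillic letters?
function mixesLatinAndCyrillic(label) {
  return /\p{Script=Latin}/u.test(label) && /\p{Script=Cyrillic}/u.test(label);
}

// "\u0430" is the Cyrillic small letter a, which looks like the Latin "a".
const spoofed = "\u0430pple";
const genuine = "apple";

console.log(spoofed === genuine);            // false, despite looking alike
console.log(mixesLatinAndCyrillic(spoofed)); // true
console.log(mixesLatinAndCyrillic(genuine)); // false
```

A label written entirely in one script passes this check, so it would not catch a spoof built only from Cyrillic look-alikes.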
6. Base64 Encoding (RFC 4648)
Base64 is a binary-to-text encoding scheme that represents binary data in an ASCII string format. It is designed to carry data across channels that only reliably support text content.
How it Works
Base64 divides every three bytes (24 bits) into four 6-bit units. Each 6-bit unit is then mapped to one of 64 characters in the Base64 alphabet: A-Z, a-z, 0-9, +, and /.
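The regrouping of 24 bits into four 6-bit indices can be written out directly. This sketch handles exactly one complete three-byte group and ignores padding; `encodeTriple` is a hypothetical helper name.

```javascript
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encode exactly three bytes as four Base64 characters.
function encodeTriple(b0, b1, b2) {
  // Pack the three bytes into one 24-bit number.
  const bits = (b0 << 16) | (b1 << 8) | b2;
  // Read it back as four 6-bit indices into the alphabet.
  return (
    ALPHABET[(bits >> 18) & 0x3f] +
    ALPHABET[(bits >> 12) & 0x3f] +
    ALPHABET[(bits >> 6) & 0x3f] +
    ALPHABET[bits & 0x3f]
  );
}

// The bytes of the ASCII string "Man" are 0x4D, 0x61, 0x6E.
console.log(encodeTriple(0x4d, 0x61, 0x6e)); // TWFu
```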
The Padding (=)
If the input is not a multiple of three bytes, padding characters (=) are added to the end of the encoded string so that the final length is a multiple of four.
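The padding rule is easiest to see by encoding inputs of one, two, and three bytes. This sketch uses the built-in `btoa`, which is available in browsers and in recent versions of Node.js.

```javascript
// Encode inputs of 1, 2, and 3 bytes and compare the padding.
const inputs = ["M", "Ma", "Man"];
const padded = inputs.map((input) => btoa(input));

inputs.forEach((input, i) => {
  console.log(`${input.length} byte(s): ${padded[i]}`);
});
// 1 byte(s): TQ==
// 2 byte(s): TWE=
// 3 byte(s): TWFu
```

One leftover byte produces two padding characters, two leftover bytes produce one, and a complete group needs none.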
Base64 vs. Base64URL
Standard Base64 uses + and /, which have special meanings in URLs. Base64URL replaces these with - and _ respectively, and usually omits the padding. This makes it safe for use in filenames and URL query parameters.
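Converting between the two alphabets is a matter of character substitution plus removing or restoring the padding. The helper names below are hypothetical, and the sketch assumes well-formed input.

```javascript
// Hypothetical helper: standard Base64 -> Base64URL (padding removed).
function toBase64Url(base64) {
  return base64.replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Hypothetical helper: Base64URL -> standard Base64 (padding restored).
function fromBase64Url(base64url) {
  const base64 = base64url.replace(/-/g, "+").replace(/_/g, "/");
  return base64.padEnd(Math.ceil(base64.length / 4) * 4, "=");
}

// The two bytes 0xFB 0xFF happen to use both special characters.
const standard = btoa("\xfb\xff");
console.log(standard);                             // +/8=
console.log(toBase64Url(standard));                // -_8
console.log(fromBase64Url(toBase64Url(standard))); // +/8=
```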
7. JSON Web Tokens (JWT)
JSON Web Token (JWT) is an open standard (RFC 7519) that defines a compact and self-contained way for securely transmitting information between parties as a JSON object.
The Anatomy of a JWT
A JWT consists of three parts separated by dots (.):
- Header: Contains metadata about the token (e.g., algorithm used).
- Payload: Contains the claims (the actual data being transmitted).
- Signature: Used to verify that the sender of the JWT is who it says it is and to ensure that the message wasn't changed along the way.
Each part is individually Base64URL encoded.
Security Note: Decoding a JWT is not the same as verifying it. Anyone with the token can decode it to see the payload. Never store sensitive information like passwords in a JWT payload unless it is also encrypted (JWE).
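To illustrate how little is needed to read a token, the sketch below builds a sample token and decodes it without any key. The claim values, the placeholder signature, and the helper names are all made up for illustration; nothing here verifies the signature, so the output must not be trusted for authorization decisions.

```javascript
// Hypothetical helper: decode one Base64URL-encoded part to a string.
function base64UrlDecode(part) {
  const base64 = part.replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64.padEnd(Math.ceil(base64.length / 4) * 4, "=");
  const bytes = Uint8Array.from(atob(padded), (c) => c.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

// Hypothetical helper: read header and payload WITHOUT verifying anything.
function decodeJwtUnverified(token) {
  const [header, payload] = token.split(".");
  return {
    header: JSON.parse(base64UrlDecode(header)),
    payload: JSON.parse(base64UrlDecode(payload)),
  };
}

// Build a sample token; the signature is a placeholder, not a real one.
const encodePart = (obj) =>
  btoa(JSON.stringify(obj)).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
const sampleToken = [
  encodePart({ alg: "HS256", typ: "JWT" }),
  encodePart({ sub: "user-123" }),
  "signature-placeholder",
].join(".");

console.log(decodeJwtUnverified(sampleToken));
// { header: { alg: 'HS256', typ: 'JWT' }, payload: { sub: 'user-123' } }
```

Verification requires checking the signature against a key with a maintained JWT library, which this sketch deliberately leaves out.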
8. Modular Arithmetic & Classical Ciphers
Before the era of computer cryptography, encryption relied on simple mathematical substitutions.
The Caesar Cipher
This is a simple substitution cipher that replaces each letter with the letter n places down the alphabet. Mathematically, it operates modulo 26.
E(x) = (x + n) mod 26
D(x) = (x - n) mod 26
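The two formulas translate directly into code. This is a minimal sketch for the 26 letters of the basic Latin alphabet, leaving every other character unchanged; `caesarShift` is a hypothetical helper name, and decryption is simply a shift by -n.

```javascript
// Hypothetical helper: shift each Latin letter by n positions, modulo 26.
function caesarShift(text, n) {
  // Normalize n so that negative shifts (decryption) also work.
  const shift = ((n % 26) + 26) % 26;
  return text.replace(/[a-z]/gi, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // char code of "A" or "a"
    const x = ch.charCodeAt(0) - base; // letter index 0..25
    return String.fromCharCode(((x + shift) % 26) + base);
  });
}

const encrypted = caesarShift("Hello, World!", 3);
console.log(encrypted);                  // Khoor, Zruog!
console.log(caesarShift(encrypted, -3)); // Hello, World!
```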
ROT13 (Rotate by 13)
ROT13 is a specific instance of the Caesar cipher where n = 13. Since the Latin alphabet has 26 characters, applying ROT13 twice returns the original text.
ROT13(ROT13(x)) = x
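The self-inverse property can be checked in a few lines. A minimal sketch, with `rot13` as a hypothetical helper name that only touches the 26 basic Latin letters:

```javascript
// Hypothetical helper: rotate each Latin letter by 13 positions.
function rot13(text) {
  return text.replace(/[a-z]/gi, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // char code of "A" or "a"
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

const once = rot13("Hello");
const twice = rot13(once);
console.log(once);  // Uryyb
console.log(twice); // Hello
```

Because 13 + 13 = 26, the same function serves as both the encoder and the decoder.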
Security Warning: These ciphers offer zero security. They are trivially broken using Frequency Analysis (comparing the frequency of letters in the ciphertext to standard language averages). They should only be used for obfuscation (e.g., hiding spoilers), never for protecting sensitive data.