The Engineering of Encoding

Encoding is the bridge between human-readable information and machine-parsable data. It is not encryption; it is a translation layer governed by strict standards (RFCs) that ensures interoperability across the global internet. This guide covers 12 encoding systems, from the fundamentals of binary and ASCII to the modern JWT standard, with interactive tool links throughout.


1. URL Encoding (Percent-Encoding)

URL encoding, officially known as Percent-Encoding, is defined in RFC 3986. It provides a mechanism for encoding information in a Uniform Resource Identifier (URI) so that any character can be safely transmitted over the web, regardless of its origin.

Why strictly ASCII?

The internet's infrastructure was built on ASCII. While modern protocols often support wider character sets, the URI syntax is strictly limited to a subset of ASCII. Any character that is not an "Unreserved Character" must be encoded before it appears in a URL. This includes spaces, Unicode letters, punctuation marks, and control characters.

The Mechanics

A percent-encoded octet is encoded as a character triplet: the percent character % followed by the two hexadecimal digits representing that octet's numeric value. For example, a space character (ASCII value 32, or 0x20 in hexadecimal) becomes %20.

Reserved vs. Unreserved Characters

  • Unreserved characters (always safe): A-Z, a-z, 0-9, -, _, ., ~. These never need encoding.
  • Reserved characters (structural delimiters): !, *, ', (, ), ;, :, @, &, =, +, $, ,, /, ?, #, [, ]. These must be encoded when used as data, not as structure.

Technical Note: The space character is a special case. In a URL path segment, it is encoded as %20. However, in application/x-www-form-urlencoded data (like HTML form query strings), it is historically encoded as +. This distinction causes frequent bugs in API integration, so always check which convention your framework uses.
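The two conventions can be seen side by side in JavaScript's built-ins: encodeURIComponent follows RFC 3986 percent-encoding, while URLSearchParams emits application/x-www-form-urlencoded output.

```javascript
// RFC 3986 percent-encoding: space becomes %20
const path = encodeURIComponent("hello world"); // "hello%20world"

// Form encoding: URLSearchParams uses + for space
const query = new URLSearchParams({ q: "hello world" }).toString(); // "q=hello+world"
```

Both are standard browser and Node.js APIs; the choice between them depends on whether you are building a path segment or a form-style query string.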

2. Hexadecimal (Base16)

Hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols: 0-9 to represent values zero to nine, and A-F (or a-f) to represent values ten to fifteen. Hex is pervasive in computing because it compactly represents raw binary data in a human-readable form.

Relation to Binary

Hexadecimal is the lingua franca of low-level computing because of its direct relationship to binary. One hexadecimal digit represents exactly four binary digits (a nibble). Two hex digits represent one byte (8 bits), which is why memory addresses, color codes, and cryptographic hashes are all displayed in hex.

Hex   Binary      Decimal   Description
0     0000        0         Zero
A     1010        10        First letter digit
F     1111        15        Max nibble value
FF    1111 1111   255       Max byte value (8 bits)
41    0100 0001   65        ASCII 'A'

Where You'll See Hex in Practice

  • Colors: A web color like #16A34A is three bytes: red 16 (22), green A3 (163), blue 4A (74).
  • Memory addresses: Debuggers and disassemblers display addresses like 0x7fff5fbff8a8.
  • Hashes: A SHA-256 hash is 32 bytes displayed as 64 hex characters.
  • Escape sequences: URL encoding uses hex digits, e.g., %20 for space.
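The color example above can be verified directly with parseInt and a radix of 16:

```javascript
// Split a hex color into its three byte values
const hex = "#16A34A";
const r = parseInt(hex.slice(1, 3), 16); // 0x16 = 22
const g = parseInt(hex.slice(3, 5), 16); // 0xA3 = 163
const b = parseInt(hex.slice(5, 7), 16); // 0x4A = 74

// And back: every byte is exactly two hex digits
const roundTrip = "#" + [r, g, b]
  .map((v) => v.toString(16).toUpperCase().padStart(2, "0"))
  .join(""); // "#16A34A"
```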

3. Unicode, UTF-8 & The Astral Planes

Unicode is the universal character set standard, maintained by the Unicode Consortium. As of version 15.0, it contains over 149,000 characters spanning 161 scripts, from Latin and Cyrillic to Emoji, Ancient Egyptian Hieroglyphs, and Mathematical symbols. Unicode solves the Babel problem of computing: every character on every system gets one unambiguous identity.

Code Points vs. Encoding Formats

It is crucial to distinguish between a Code Point (the abstract integer ID, e.g., U+1F600) and the Encoding (how that integer is stored as bytes on disk or in memory):

  • Code Point: U+1F600 (😀 Grinning Face), the abstract identity
  • UTF-8: F0 9F 98 80 (4 bytes). The dominant encoding on the web, variable width, backward compatible with ASCII.
  • UTF-16: Variable width (2 or 4 bytes). Used internally by JavaScript, Java, and Windows APIs.
  • UTF-32: Fixed 4 bytes. Simple to index but memory inefficient; rare in practice.

Why UTF-8 Won

UTF-8 encodes the 128 ASCII characters as single bytes identical to ASCII, making it backward compatible with all existing ASCII-only systems. Non-ASCII characters use 2-4 bytes with a specific bit pattern that makes the encoding self-synchronizing; you can find character boundaries even in the middle of a stream.
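The byte layouts described above are easy to inspect with TextEncoder, which always emits UTF-8:

```javascript
const utf8 = new TextEncoder();

// ASCII stays one byte, identical to ASCII
const ascii = utf8.encode("A");  // Uint8Array [ 0x41 ]

// U+1F600 Grinning Face takes four bytes: F0 9F 98 80
const emoji = utf8.encode("😀"); // Uint8Array [ 0xF0, 0x9F, 0x98, 0x80 ]
```

TextEncoder is available both in browsers and in Node.js without any import.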

Surrogate Pairs (JavaScript's Headache)

JavaScript strings use UTF-16 encoding internally. Characters outside the Basic Multilingual Plane (BMP), specifically code points above U+FFFF including most emoji, are represented as two 16-bit units called a Surrogate Pair. This causes the infamous .length bug:

// The "Pile of Poo" emoji (U+1F4A9)
"💩".length === 2   // true! Not 1.
"💩".charCodeAt(0)  // 0xD83D (High Surrogate)
"💩".charCodeAt(1)  // 0xDCA9 (Low Surrogate)

// ES2015+ fix: use spread or codePointAt
[..."💩"].length === 1  // true

4. HTML Entities & Security (XSS)

HTML Entities ensure that reserved characters are treated as text content rather than markup. This distinction is the first line of defense against Cross-Site Scripting (XSS), one of the most common web security vulnerabilities (OWASP Top 10).

The Attack Vector

If a user inputs <script>alert('hacked')</script> into a form field and your application renders that string directly as HTML, the browser executes the JavaScript. This allows an attacker to steal cookies, hijack sessions, or redirect users.

The Defense: Output Encoding

By encoding the input to &lt;script&gt;alert('hacked')&lt;/script&gt; before rendering, the browser displays the characters literally without executing any code. This is called output encoding: always encode data for the context it will appear in (HTML, URL, JavaScript, CSS).

Types of HTML Entities

  • Named Entities: &copy; (©), &euro; (€), &lt; (<). Human-readable but not all characters have names.
  • Decimal Entities: &#169; (©), &#60; (<). Universally supported in all browsers.
  • Hexadecimal Entities: &#xA9; (©), &#x3C; (<). Useful when working with Unicode code points directly.

Key Principle: The five critical characters to always encode in HTML contexts are: & → &amp;, < → &lt;, > → &gt;, " → &quot;, and ' → &#x27;. Modern frameworks like React do this automatically, but raw DOM manipulation or server-side templates require explicit encoding.
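A minimal encoder for exactly those five characters might look like this (escapeHtml is an illustrative name, not a standard API; real projects should prefer their framework's built-in escaping):

```javascript
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#x27;",
};

// Replace each of the five critical characters with its entity
// in a single pass over the string.
function escapeHtml(input) {
  return input.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

escapeHtml("<script>alert('hacked')</script>");
// "&lt;script&gt;alert(&#x27;hacked&#x27;)&lt;/script&gt;"
```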

5. Punycode & International Domain Names (IDN)

The Domain Name System (DNS) is historically ASCII-only; it was designed in 1983 when the internet was primarily English-speaking. To support International Domain Names (IDNs) containing Unicode characters (e.g., münchen.de, 日本語.jp), the IETF developed Punycode (RFC 3492).

The Bootstring Algorithm

Punycode uniquely and reversibly transforms Unicode strings into the limited ASCII character set allowed in DNS hostnames (letters, digits, and hyphens; the LDH rule).

  • ACE Prefix: All Punycode-encoded domain labels start with xn-- (the ASCII Compatible Encoding prefix). For example, münchen.de becomes xn--mnchen-3ya.de.
  • Separation: ASCII characters in the original string are copied first verbatim, followed by a hyphen delimiter.
  • Delta encoding: Non-ASCII characters are encoded as a compact delta sequence appended after the delimiter.

Homograph Attacks

Punycode enables a dangerous phishing vector. A malicious actor can register a domain using characters from other scripts that look visually identical to Latin letters. For example, the Cyrillic letter а (U+0430) is indistinguishable from the Latin a (U+0061) in most fonts. The domain аpple.com (Cyrillic а) would display identically to apple.com (Latin a) but resolve to a different server. Modern browsers apply heuristic IDN policies to display the raw Punycode form when mixed scripts are detected.

6. Base64 Encoding (RFC 4648)

Base64 is a binary-to-text encoding scheme that represents arbitrary binary data as an ASCII string. It was designed to transmit binary content (images, files, cryptographic keys) over channels that only support text, such as email (MIME), XML, and JSON.

How It Works

Base64 divides every three input bytes (24 bits) into four 6-bit groups. Each 6-bit value (0-63) is mapped to a character in the Base64 alphabet: A-Z (0-25), a-z (26-51), 0-9 (52-61), + (62), and / (63). This produces a 33% size increase: three bytes become four characters.

Padding (=)

If the input length is not a multiple of three, one or two = padding characters are appended to bring the output length to a multiple of four. Padding makes it trivial to determine the original data length.

Base64 vs. Base64URL

Standard Base64 uses + and /, which are reserved in URLs. Base64URL (used in JWTs, OAuth tokens, and URL-safe file names) replaces + with - and / with _, and typically omits the = padding. Always check which variant an API or standard expects.
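In Node.js, Buffer supports both alphabets (the "base64url" encoding name requires Node 15.7+):

```javascript
const data = Buffer.from("hello");

const standard = data.toString("base64");    // "aGVsbG8=" — standard alphabet, padded
const urlSafe  = data.toString("base64url"); // "aGVsbG8"  — URL-safe, no padding

// Expansion: every 3 input bytes become 4 output characters
const abc = Buffer.from("abc").toString("base64"); // "YWJj"
```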

Common Uses

  • Data URIs: src="data:image/png;base64,iVBORw0KGgo...", used for embedding images directly in HTML or CSS.
  • HTTP Basic Auth: Credentials are Base64-encoded in the Authorization: Basic ... header (note: this is not encryption).
  • JWT: Each section of a JSON Web Token is Base64URL-encoded.
  • SSH keys: Public and private key files use Base64 encoding within PEM format.

7. JSON Web Tokens (JWT)

JSON Web Token (JWT) is an open standard (RFC 7519) that defines a compact and self-contained way for securely transmitting claims between parties. JWTs are the dominant mechanism for stateless authentication in modern web APIs and microservices.

The Anatomy of a JWT

A JWT is three Base64URL-encoded segments joined by dots (.): a JSON header, a JSON payload, and a raw binary signature:

  1. Header: Metadata about the token: the signing algorithm (alg, e.g., HS256, RS256) and token type (typ: "JWT").
  2. Payload: The claims, including registered (iss, sub, exp, iat), public, and private claims carrying application data.
  3. Signature: for HS256, HMAC-SHA256(Base64URL(header) + "." + Base64URL(payload), secret); asymmetric algorithms like RS256 sign the same input with a private key instead. The signature verifies the token has not been tampered with and that it came from the expected issuer.

Security Note: Decoding a JWT is not the same as verifying it. The Base64URL encoding provides zero confidentiality; anyone holding the token can read its payload. Never store passwords, credit card numbers, or sensitive PII in a JWT payload unless the token is also encrypted (JWE, RFC 7516). Always validate the signature server-side before trusting any claim.

Common JWT Claims

  • iss: Issuer, who created the token
  • sub: Subject, the user or entity the token refers to
  • exp: Expiration, a Unix timestamp after which the token is invalid
  • iat: Issued At, a Unix timestamp recording when the token was created
  • aud: Audience, who the token is intended for

8. Modular Arithmetic & Classical Ciphers

Before the era of computer cryptography, encryption relied on simple mathematical substitutions applied to the alphabet. While these ciphers are entirely insecure today, they are excellent for understanding the core principle of substitution and the mathematical concept of modular arithmetic.

The Caesar Cipher

The Caesar cipher replaces each letter with the letter n positions forward in the alphabet, wrapping around at the end. Julius Caesar allegedly used a shift of 3. The encryption and decryption functions are:

E(x) = (x + n) mod 26
D(x) = (x - n + 26) mod 26
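A direct translation of those formulas to JavaScript (letters only; other characters pass through unchanged):

```javascript
function caesar(text, n) {
  return text.replace(/[a-z]/gi, (ch) => {
    const base = ch <= "Z" ? 65 : 97;  // 'A' or 'a'
    const x = ch.charCodeAt(0) - base; // letter index 0-25
    // The extra + 26 keeps the result positive for negative shifts
    return String.fromCharCode((((x + n) % 26) + 26) % 26 + base);
  });
}

caesar("HELLO", 3);                // "KHOOR"
caesar("KHOOR", -3);               // "HELLO" — decryption is the inverse shift
caesar(caesar("spoiler", 13), 13); // "spoiler" — ROT13 is its own inverse
```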

ROT13 (Rotate by 13)

ROT13 is the Caesar cipher with n = 13. Because the Latin alphabet has 26 characters, ROT13 is its own inverse; applying it twice returns the original text. It was widely used on early Usenet to hide spoilers.

ROT13(ROT13(x)) === x  // always true

ROT47

ROT47 extends rotation to the 94 printable ASCII characters (codes 33-126), rotating each character by 47 positions. Unlike ROT13, it affects digits and punctuation in addition to letters.

The Vigenère Cipher

The Vigenère cipher uses a repeating keyword to determine a different shift for each character, making it resistant to simple frequency analysis. Each letter of the key specifies a different Caesar shift for the corresponding plaintext letter.

Security Warning: All classical ciphers offer zero modern security. They are trivially broken by frequency analysis (comparing letter frequency in the ciphertext to natural language averages). Use these tools for learning, puzzles, or light obfuscation (like hiding spoilers); never use them for protecting sensitive data. For real encryption, use AES-256 or ChaCha20 via the Web Crypto API.

9. Binary, Octal & Number Base Theory

All digital computers store and process information as sequences of bits, binary digits that can be either 0 or 1. Understanding binary and other positional number systems is foundational to understanding how computers represent text, images, and any other data.

Positional Number Systems

In any base-n positional system, each digit position represents a power of n. The decimal system is base 10; binary is base 2; octal is base 8; and hexadecimal is base 16.

Decimal   Binary (base 2)   Octal (base 8)   Hex (base 16)
0         00000000          000              00
1         00000001          001              01
7         00000111          007              07
8         00001000          010              08
15        00001111          017              0F
64        01000000          100              40
255       11111111          377              FF

Binary in Practice

A single byte is 8 bits, capable of representing 256 values (0-255). This is why the ASCII character set has 128 characters (7 bits), and why hex pairs map so cleanly: one hex pair equals one byte. When you encode text to binary, each character is first converted to its ASCII or Unicode byte value, and then that byte is written out as 8 binary digits.
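In JavaScript, Number.prototype.toString and parseInt both take a radix argument, which makes these conversions one-liners:

```javascript
// Character → code point → binary
const code = "A".charCodeAt(0);                 // 65
const bits = code.toString(2).padStart(8, "0"); // "01000001"

// And back
const letter = String.fromCharCode(parseInt("01000001", 2)); // "A"

// The same value in every base from the table above
(255).toString(2);  // "11111111"
(255).toString(8);  // "377"
(255).toString(16); // "ff"
```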

Octal's Role in Unix

Octal (base 8) may seem obscure, but it has a critical role in Unix file system permissions. The permission string rwxr-xr-- maps directly to the octal value 754:

r w x   r - x   r - -
1 1 1   1 0 1   1 0 0
  7       5       4      → chmod 754

Each octal digit represents exactly three bits, one permission triplet (read, write, execute).
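The mapping can be computed mechanically (modeToOctal is an illustrative helper, not a standard API):

```javascript
// Convert a permission string like "rwxr-xr--" to its octal form.
function modeToOctal(mode) {
  let digits = "";
  for (let i = 0; i < 9; i += 3) {
    const [r, w, x] = mode.slice(i, i + 3);
    // Each triplet is three bits: r = 4, w = 2, x = 1
    digits += (r !== "-" ? 4 : 0) + (w !== "-" ? 2 : 0) + (x !== "-" ? 1 : 0);
  }
  return digits;
}

modeToOctal("rwxr-xr--"); // "754"
modeToOctal("rw-r--r--"); // "644"
```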

10. ASCII: The Original Character Set

ASCII (American Standard Code for Information Interchange) is the foundational character encoding for the English-speaking internet. First published in 1963 by the American Standards Association (the predecessor of ANSI), it maps 128 characters (7-bit values 0-127) to integers: the letters A-Z and a-z, digits 0-9, punctuation marks, and 33 non-printable control characters.

The ASCII Table

Key mappings every developer should know:

  • 48-57: Digits 0-9. The digit '5' is ASCII 53, not integer 5. This is why '5' - '0' === 5 works in C.
  • 65-90: Uppercase letters A-Z. The letter 'A' is ASCII 65.
  • 97-122: Lowercase letters a-z. Exactly 32 higher than their uppercase equivalents: flip bit 5 to toggle case.
  • 32: Space. The only whitespace character classified as printable; tab and newline are control characters.
  • 0: Null (\0). String terminator in C.
  • 10: Newline (\n). Line feed in Unix systems.
  • 13: Carriage return (\r). Combined with LF as \r\n on Windows.

ASCII as the Subset of UTF-8

One of UTF-8's most important design properties is that the first 128 code points (U+0000 to U+007F) are encoded identically to ASCII, stored as single bytes. This means any ASCII-encoded file is also a valid UTF-8 file. The reverse is not true: UTF-8 files with non-ASCII characters are not valid ASCII. This backward compatibility is a major reason UTF-8 achieved near-universal adoption on the web.

Why does ASCII encoding of text look like numbers? When you encode "Hello" in ASCII, you get the decimal values 72 101 108 108 111, or in hex: 48 65 6C 6C 6F. These are the numeric indices of each character in the ASCII table; that's all character encoding ever is.

11. Cryptographic Hash Functions

A cryptographic hash function is a deterministic algorithm that takes an input of any length and produces a fixed-length output (the digest or hash). Hash functions are one-way, meaning it is computationally infeasible to reconstruct the input from the output. They are the backbone of data integrity verification, password storage, digital signatures, and blockchain technology.

Key Properties

  • Deterministic: The same input always produces the same hash. SHA256("hello") is always 2cf24dba....
  • Pre-image resistance: Given the hash, you cannot determine the input (without brute force).
  • Avalanche effect: A single-bit change in the input produces a completely different hash. "hello" and "Hello" have entirely different SHA-256 outputs.
  • Collision resistance: It should be infeasible to find two different inputs that produce the same hash.

MD5

MD5 produces a 128-bit (32 hex characters) digest. It was widely used for file integrity checks and password hashing but is now considered cryptographically broken; researchers have demonstrated practical collision attacks, meaning two different inputs can produce the same MD5 hash. Do not use MD5 for security purposes. It is only safe for non-security purposes like detecting accidental file corruption.

SHA-1

SHA-1 (Secure Hash Algorithm 1) produces a 160-bit (40 hex characters) digest. It was the internet's dominant hash function for over a decade, used in SSL certificates and Git object IDs. In 2017, Google's Project Zero demonstrated the first practical SHA-1 collision (the SHAttered attack). SHA-1 is now deprecated for security use but remains in use in Git, where collision resistance is less critical.

SHA-256 and SHA-512

The SHA-2 family (including SHA-256 and SHA-512) was designed by the NSA and standardized by NIST. No practical collision attacks are known. SHA-256 produces a 256-bit (64 hex characters) digest; SHA-512 produces a 512-bit (128 hex characters) digest.

  • SHA-256 is used in Bitcoin proof-of-work, TLS certificates, code signing, and HMAC (e.g., in JWTs with HS256).
  • SHA-512 is preferred when higher security margins are required or when processing 64-bit data. On 64-bit hardware, SHA-512 can actually be faster than SHA-256.

Hashing is not Encryption: A hash is irreversible. Encryption is reversible given the key. Never confuse the two. Storing a password as MD5("password123") is not safe, since rainbow tables pre-compute hashes for all common passwords. For password storage, use a dedicated algorithm like bcrypt, Argon2, or scrypt, which are designed to be slow and add salt automatically.

12. Unix Timestamps & Epoch Time

A Unix timestamp (also called epoch time or POSIX time) is the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970, a point in time called the Unix epoch. It is the universal language for representing moments in time in computing systems, from database records to HTTP headers to JWT expiration claims.

Why Seconds Since 1970?

The Unix epoch date was chosen somewhat arbitrarily when Unix was developed in the early 1970s; it was simply a recent, round date. The key advantage of a single integer over a structured date string is simplicity: comparing, sorting, storing, and transmitting timestamps becomes trivial arithmetic.

The Year 2038 Problem

On 32-bit systems, Unix timestamps are stored as a signed 32-bit integer, giving a maximum value of 2,147,483,647. This value corresponds to 03:14:07 UTC on January 19, 2038, after which 32-bit systems will overflow and typically roll back to 1901. Modern 64-bit systems are not affected; a 64-bit signed timestamp won't overflow for approximately 292 billion years.
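JavaScript's Date works in milliseconds, so multiplying a Unix timestamp by 1000 lets you pinpoint both the epoch and the 32-bit limit:

```javascript
const epoch = new Date(0).toISOString();
// "1970-01-01T00:00:00.000Z" — the Unix epoch

const limit = new Date(2147483647 * 1000).toISOString();
// "2038-01-19T03:14:07.000Z" — the last second a signed 32-bit timestamp can hold

const now = Math.floor(Date.now() / 1000); // current Unix timestamp in seconds
```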

ISO 8601

ISO 8601 is the international standard for representing dates and times in a human-readable, unambiguous string format. The canonical form is:

2026-01-15T14:30:00.000Z
// YYYY-MM-DDThh:mm:ss.sssZ
// The 'T' separates date from time.
// The 'Z' denotes UTC (Zulu time).

ISO 8601 strings are lexicographically sortable; alphabetically sorting them also sorts them chronologically, a valuable property for log files and databases.

Timezones and UTC

Unix timestamps are always in UTC and contain no timezone information. The conversion from a Unix timestamp to a local time requires knowing the observer's timezone offset. This is why it is best practice to store timestamps in UTC and convert to local time only at the point of display. Never store "local time" in a database without explicitly storing the timezone alongside it.


Want to explore more? Visit the homepage for a full list of all available tools, or read more about the project on the About page.