The Engineering of Encoding

Encoding is the bridge between human-readable information and machine-parsable data. It is not encryption; it is a translation layer governed by strict standards (RFCs) that ensures interoperability across the global internet. This guide covers 12 encoding systems, from the fundamentals of binary and ASCII to the modern JWT standard, with interactive tool links throughout.


1. URL Encoding (Percent-Encoding)

URL encoding, officially known as Percent-Encoding, is defined in RFC 3986. It provides a mechanism for encoding information in a Uniform Resource Identifier (URI) so that any character can be safely transmitted over the web, regardless of its origin.

Why strictly ASCII?

The internet's infrastructure was built on ASCII. While modern protocols often support wider character sets, the URI syntax is strictly limited to a subset of ASCII. Any character that is not an "Unreserved Character" must be encoded before it appears in a URL. This includes spaces, Unicode letters, punctuation marks, and control characters.

The Mechanics

A percent-encoded octet is encoded as a character triplet: the percent character % followed by the two hexadecimal digits representing that octet's numeric value. For example, a space character (ASCII value 32, or 0x20 in hexadecimal) becomes %20.

Reserved vs. Unreserved Characters

  • Unreserved characters (always safe): A-Z, a-z, 0-9, -, _, ., ~. These never need encoding.
  • Reserved characters (structural delimiters): !, *, ', (, ), ;, :, @, &, =, +, $, ,, /, ?, #, [, ]. These must be encoded when used as data, not as structure.

Technical Note: The space character is a special case. In a URL path segment, it is encoded as %20. However, in application/x-www-form-urlencoded data (like HTML form query strings), it is historically encoded as +. This distinction causes frequent bugs in API integration, so always check which convention your framework uses.
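The two conventions can be seen side by side in JavaScript's built-ins: encodeURIComponent follows RFC 3986 percent-encoding, while URLSearchParams emits application/x-www-form-urlencoded output.

```javascript
// RFC 3986 percent-encoding: space becomes %20
const path = encodeURIComponent("hello world"); // "hello%20world"

// Form encoding: URLSearchParams uses + for space
const query = new URLSearchParams({ q: "hello world" }).toString(); // "q=hello+world"
```

Both are standard browser and Node.js APIs; the choice between them depends on whether you are building a path segment or a form-style query string.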

2. Hexadecimal (Base16)

Hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols: 0-9 to represent values zero to nine, and A-F (or a-f) to represent values ten to fifteen. Hex is pervasive in computing because it compactly represents raw binary data in a human-readable form.

Relation to Binary

Hexadecimal is the lingua franca of low-level computing because of its direct relationship to binary. One hexadecimal digit represents exactly four binary digits (a nibble). Two hex digits represent one byte (8 bits), which is why memory addresses, color codes, and cryptographic hashes are all displayed in hex.

Hex   Binary      Decimal   Description
0     0000        0         Zero
A     1010        10        First letter digit
F     1111        15        Max nibble value
FF    1111 1111   255       Max byte value (8 bits)
41    0100 0001   65        ASCII 'A'

Where You'll See Hex in Practice

  • Colors: A web color like #16A34A is three bytes: red 16 (22), green A3 (163), blue 4A (74).
  • Memory addresses: Debuggers and disassemblers display addresses like 0x7fff5fbff8a8.
  • Hashes: A SHA-256 hash is 32 bytes displayed as 64 hex characters.
  • Escape sequences: URL encoding uses hex digits, e.g., %20 for space.
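The color example above can be verified directly with parseInt and a radix of 16:

```javascript
// Split a hex color into its three byte values
const hex = "#16A34A";
const r = parseInt(hex.slice(1, 3), 16); // 0x16 = 22
const g = parseInt(hex.slice(3, 5), 16); // 0xA3 = 163
const b = parseInt(hex.slice(5, 7), 16); // 0x4A = 74

// And back: every byte is exactly two hex digits
const roundTrip = "#" + [r, g, b]
  .map((v) => v.toString(16).toUpperCase().padStart(2, "0"))
  .join(""); // "#16A34A"
```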

3. Unicode, UTF-8 & The Astral Planes

Unicode is the universal character set standard, maintained by the Unicode Consortium. As of version 15.0, it contains over 149,000 characters spanning 161 scripts, from Latin and Cyrillic to Emoji, Ancient Egyptian Hieroglyphs, and Mathematical symbols. Unicode solves the Babel problem of computing: every character on every system gets one unambiguous identity.

Code Points vs. Encoding Formats

It is crucial to distinguish between a Code Point (the abstract integer ID, e.g., U+1F600) and the Encoding (how that integer is stored as bytes on disk or in memory):

  • Code Point: U+1F600 (😀 Grinning Face), the abstract identity
  • UTF-8: F0 9F 98 80 (4 bytes). The dominant encoding on the web, variable width, backward compatible with ASCII.
  • UTF-16: Variable width (2 or 4 bytes). Used internally by JavaScript, Java, and Windows APIs.
  • UTF-32: Fixed 4 bytes. Simple to index but memory inefficient; rare in practice.

Why UTF-8 Won

UTF-8 encodes the 128 ASCII characters as single bytes identical to ASCII, making it backward compatible with all existing ASCII-only systems. Non-ASCII characters use 2-4 bytes with a specific bit pattern that makes the encoding self-synchronizing; you can find character boundaries even in the middle of a stream.
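The byte layouts described above are easy to inspect with TextEncoder, which always emits UTF-8:

```javascript
const utf8 = new TextEncoder();

// ASCII stays one byte, identical to ASCII
const ascii = utf8.encode("A");  // Uint8Array [ 0x41 ]

// U+1F600 Grinning Face takes four bytes: F0 9F 98 80
const emoji = utf8.encode("😀"); // Uint8Array [ 0xF0, 0x9F, 0x98, 0x80 ]
```

TextEncoder is available both in browsers and in Node.js without any import.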

Surrogate Pairs (JavaScript's Headache)

JavaScript strings use UTF-16 encoding internally. Characters outside the Basic Multilingual Plane (BMP), specifically code points above U+FFFF including most emoji, are represented as two 16-bit units called a Surrogate Pair. This causes the infamous .length bug:

// The "Pile of Poo" emoji (U+1F4A9)
"💩".length === 2   // true! Not 1.
"💩".charCodeAt(0)  // 0xD83D (High Surrogate)
"💩".charCodeAt(1)  // 0xDCA9 (Low Surrogate)

// ES2015+ fix: use spread or codePointAt
[..."💩"].length === 1  // true

4. HTML Entities & Security (XSS)

HTML Entities ensure that reserved characters are treated as text content rather than markup. This distinction is the first line of defense against Cross-Site Scripting (XSS), one of the most common web security vulnerabilities (OWASP Top 10).

The Attack Vector

If a user inputs <script>alert('hacked')</script> into a form field and your application renders that string directly as HTML, the browser executes the JavaScript. This allows an attacker to steal cookies, hijack sessions, or redirect users.

The Defense: Output Encoding

By encoding the input to &lt;script&gt;alert('hacked')&lt;/script&gt; before rendering, the browser displays the characters literally without executing any code. This is called output encoding: always encode data for the context it will appear in (HTML, URL, JavaScript, CSS).

Types of HTML Entities

  • Named Entities: &copy; (©), &euro; (€), &lt; (<). Human-readable but not all characters have names.
  • Decimal Entities: &#169; (©), &#60; (<). Universally supported in all browsers.
  • Hexadecimal Entities: &#xA9; (©), &#x3C; (<). Useful when working with Unicode code points directly.

Key Principle: The five critical characters to always encode in HTML contexts are: & → &amp;, < → &lt;, > → &gt;, " → &quot;, and ' → &#x27;. Modern frameworks like React do this automatically, but raw DOM manipulation or server-side templates require explicit encoding.
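A minimal encoder for exactly those five characters might look like this (escapeHtml is an illustrative name, not a standard API; real projects should prefer their framework's built-in escaping):

```javascript
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#x27;",
};

// Replace each of the five critical characters with its entity
// in a single pass over the string.
function escapeHtml(input) {
  return input.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

escapeHtml("<script>alert('hacked')</script>");
// "&lt;script&gt;alert(&#x27;hacked&#x27;)&lt;/script&gt;"
```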

5. Punycode & International Domain Names (IDN)

The Domain Name System (DNS) is historically ASCII-only; it was designed in 1983 when the internet was primarily English-speaking. To support International Domain Names (IDNs) containing Unicode characters (e.g., münchen.de, 日本語.jp), the IETF developed Punycode (RFC 3492).

The Bootstring Algorithm

Punycode uniquely and reversibly transforms Unicode strings into the limited ASCII character set allowed in DNS hostnames (letters, digits, and hyphens; the LDH rule).

  • ACE Prefix: All Punycode-encoded domain labels start with xn-- (the ASCII Compatible Encoding prefix). For example, münchen.de becomes xn--mnchen-3ya.de.
  • Separation: ASCII characters in the original string are copied first verbatim, followed by a hyphen delimiter.
  • Delta encoding: Non-ASCII characters are encoded as a compact delta sequence appended after the delimiter.

Homograph Attacks

Punycode enables a dangerous phishing vector. A malicious actor can register a domain using characters from other scripts that look visually identical to Latin letters. For example, the Cyrillic letter а (U+0430) is indistinguishable from the Latin a (U+0061) in most fonts. The domain аpple.com (Cyrillic а) would display identically to apple.com (Latin a) but resolve to a different server. Modern browsers apply heuristic IDN policies to display the raw Punycode form when mixed scripts are detected.

6. Base64 Encoding (RFC 4648)

Base64 is a binary-to-text encoding scheme that represents arbitrary binary data as an ASCII string. It was designed to transmit binary content (images, files, cryptographic keys) over channels that only support text, such as email (MIME), XML, and JSON.

How It Works

Base64 divides every three input bytes (24 bits) into four 6-bit groups. Each 6-bit value (0-63) is mapped to a character in the Base64 alphabet: A-Z (0-25), a-z (26-51), 0-9 (52-61), + (62), and / (63). This produces a 33% size increase: three bytes become four characters.

Padding (=)

If the input length is not a multiple of three, one or two = padding characters are appended to bring the output length to a multiple of four. Padding makes it trivial to determine the original data length.

Base64 vs. Base64URL

Standard Base64 uses + and /, which are reserved in URLs. Base64URL (used in JWTs, OAuth tokens, and URL-safe file names) replaces + with - and / with _, and typically omits the = padding. Always check which variant an API or standard expects.
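In Node.js, Buffer supports both alphabets (the "base64url" encoding name requires Node 15.7+):

```javascript
const data = Buffer.from("hello");

const standard = data.toString("base64");    // "aGVsbG8=" — standard alphabet, padded
const urlSafe  = data.toString("base64url"); // "aGVsbG8"  — URL-safe, no padding

// Expansion: every 3 input bytes become 4 output characters
const abc = Buffer.from("abc").toString("base64"); // "YWJj"
```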

Common Uses

  • Data URIs: src="data:image/png;base64,iVBORw0KGgo...", used for embedding images directly in HTML or CSS.
  • HTTP Basic Auth: Credentials are Base64-encoded in the Authorization: Basic ... header (note: this is not encryption).
  • JWT: Each section of a JSON Web Token is Base64URL-encoded.
  • SSH keys: Public and private key files use Base64 encoding within PEM format.

7. JSON Web Tokens (JWT)

JSON Web Token (JWT) is an open standard (RFC 7519) that defines a compact and self-contained way for securely transmitting claims between parties. JWTs are the dominant mechanism for stateless authentication in modern web APIs and microservices.

The Anatomy of a JWT

A JWT is three Base64URL-encoded segments joined by dots (.): a JSON header, a JSON payload, and a raw binary signature:

  1. Header: Metadata about the token: the signing algorithm (alg, e.g., HS256, RS256) and token type (typ: "JWT").
  2. Payload: The claims, including registered (iss, sub, exp, iat), public, and private claims carrying application data.
  3. Signature: for HS256, HMAC-SHA256(Base64URL(header) + "." + Base64URL(payload), secret); asymmetric algorithms like RS256 sign the same input with a private key instead. The signature verifies the token has not been tampered with and that it came from the expected issuer.

Security Note: Decoding a JWT is not the same as verifying it. The Base64URL encoding provides zero confidentiality; anyone holding the token can read its payload. Never store passwords, credit card numbers, or sensitive PII in a JWT payload unless the token is also encrypted (JWE, RFC 7516). Always validate the signature server-side before trusting any claim.

Common JWT Claims

  • iss: Issuer, who created the token
  • sub: Subject, the user or entity the token refers to
  • exp: Expiration, a Unix timestamp after which the token is invalid
  • iat: Issued At, a Unix timestamp recording when the token was created
  • aud: Audience, who the token is intended for

8. Modular Arithmetic & Classical Ciphers

Before the era of computer cryptography, encryption relied on simple mathematical substitutions applied to the alphabet. While these ciphers are entirely insecure today, they are excellent for understanding the core principle of substitution and the mathematical concept of modular arithmetic.

The Caesar Cipher

The Caesar cipher replaces each letter with the letter n positions forward in the alphabet, wrapping around at the end. Julius Caesar allegedly used a shift of 3. The encryption and decryption functions are:

E(x) = (x + n) mod 26
D(x) = (x - n + 26) mod 26
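A direct translation of those formulas to JavaScript (letters only; other characters pass through unchanged):

```javascript
function caesar(text, n) {
  return text.replace(/[a-z]/gi, (ch) => {
    const base = ch <= "Z" ? 65 : 97;  // 'A' or 'a'
    const x = ch.charCodeAt(0) - base; // letter index 0-25
    // The extra + 26 keeps the result positive for negative shifts
    return String.fromCharCode((((x + n) % 26) + 26) % 26 + base);
  });
}

caesar("HELLO", 3);                // "KHOOR"
caesar("KHOOR", -3);               // "HELLO" — decryption is the inverse shift
caesar(caesar("spoiler", 13), 13); // "spoiler" — ROT13 is its own inverse
```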

ROT13 (Rotate by 13)

ROT13 is the Caesar cipher with n = 13. Because the Latin alphabet has 26 characters, ROT13 is its own inverse; applying it twice returns the original text. It was widely used on early Usenet to hide spoilers.

ROT13(ROT13(x)) === x  // always true

ROT47

ROT47 extends rotation to the 94 printable ASCII characters (codes 33-126), rotating each character by 47 positions. Unlike ROT13, it affects digits and punctuation in addition to letters.

The Vigenère Cipher

The Vigenère cipher uses a repeating keyword to determine a different shift for each character, making it resistant to simple frequency analysis. Each letter of the key specifies a different Caesar shift for the corresponding plaintext letter.

Security Warning: All classical ciphers offer zero modern security. They are trivially broken by frequency analysis (comparing letter frequency in the ciphertext to natural language averages). Use these tools for learning, puzzles, or light obfuscation (like hiding spoilers); never use them for protecting sensitive data. For real encryption, use AES-256 or ChaCha20 via the Web Crypto API.

9. Binary, Octal & Number Base Theory

All digital computers store and process information as sequences of bits, binary digits that can be either 0 or 1. Understanding binary and other positional number systems is foundational to understanding how computers represent text, images, and any other data.

Positional Number Systems

In any base-n positional system, each digit position represents a power of n. The decimal system is base 10; binary is base 2; octal is base 8; and hexadecimal is base 16.

Decimal   Binary (base 2)   Octal (base 8)   Hex (base 16)
0         00000000          000              00
1         00000001          001              01
7         00000111          007              07
8         00001000          010              08
15        00001111          017              0F
64        01000000          100              40
255       11111111          377              FF

Binary in Practice

A single byte is 8 bits, capable of representing 256 values (0-255). This is why the ASCII character set has 128 characters (7 bits), and why hex pairs map so cleanly: one hex pair equals one byte. When you encode text to binary, each character is first converted to its ASCII or Unicode byte value, and then that byte is written out as 8 binary digits.
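In JavaScript, Number.prototype.toString and parseInt both take a radix argument, which makes these conversions one-liners:

```javascript
// Character → code point → binary
const code = "A".charCodeAt(0);                 // 65
const bits = code.toString(2).padStart(8, "0"); // "01000001"

// And back
const letter = String.fromCharCode(parseInt("01000001", 2)); // "A"

// The same value in every base from the table above
(255).toString(2);  // "11111111"
(255).toString(8);  // "377"
(255).toString(16); // "ff"
```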

Octal's Role in Unix

Octal (base 8) may seem obscure, but it has a critical role in Unix file system permissions. The permission string rwxr-xr-- maps directly to the octal value 754:

r w x   r - x   r - -
1 1 1   1 0 1   1 0 0
  7       5       4      → chmod 754

Each octal digit represents exactly three bits, one permission triplet (read, write, execute).
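The mapping can be computed mechanically (modeToOctal is an illustrative helper, not a standard API):

```javascript
// Convert a permission string like "rwxr-xr--" to its octal form.
function modeToOctal(mode) {
  let digits = "";
  for (let i = 0; i < 9; i += 3) {
    const [r, w, x] = mode.slice(i, i + 3);
    // Each triplet is three bits: r = 4, w = 2, x = 1
    digits += (r !== "-" ? 4 : 0) + (w !== "-" ? 2 : 0) + (x !== "-" ? 1 : 0);
  }
  return digits;
}

modeToOctal("rwxr-xr--"); // "754"
modeToOctal("rw-r--r--"); // "644"
```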

10. ASCII: The Original Character Set

ASCII (American Standard Code for Information Interchange) is the foundational character encoding for the English-speaking internet. First published in 1963 by the American Standards Association (the predecessor of ANSI), it maps 128 characters (7-bit values 0-127) to integers: the letters A-Z and a-z, digits 0-9, punctuation marks, and 33 non-printable control characters.

The ASCII Table

Key mappings every developer should know:

  • 48-57: Digits 0-9. The digit '5' is ASCII 53, not integer 5. This is why '5' - '0' === 5 works in C.
  • 65-90: Uppercase letters A-Z. The letter 'A' is ASCII 65.
  • 97-122: Lowercase letters a-z. Exactly 32 higher than their uppercase equivalents: flip bit 5 to toggle case.
  • 32: Space. The only whitespace character classified as printable; tab and newline are control characters.
  • 0: Null (\0). String terminator in C.
  • 10: Newline (\n). Line feed in Unix systems.
  • 13: Carriage return (\r). Combined with LF as \r\n on Windows.

ASCII as the Subset of UTF-8

One of UTF-8's most important design properties is that the first 128 code points (U+0000 to U+007F) are encoded identically to ASCII, stored as single bytes. This means any ASCII-encoded file is also a valid UTF-8 file. The reverse is not true: UTF-8 files with non-ASCII characters are not valid ASCII. This backward compatibility is a major reason UTF-8 achieved near-universal adoption on the web.

Why does ASCII encoding of text look like numbers? When you encode "Hello" in ASCII, you get the decimal values 72 101 108 108 111, or in hex: 48 65 6C 6C 6F. These are the numeric indices of each character in the ASCII table; that's all character encoding ever is.

11. Cryptographic Hash Functions

A cryptographic hash function is a deterministic algorithm that takes an input of any length and produces a fixed-length output (the digest or hash). Hash functions are one-way, meaning it is computationally infeasible to reconstruct the input from the output. They are the backbone of data integrity verification, password storage, digital signatures, and blockchain technology.

Key Properties

  • Deterministic: The same input always produces the same hash. SHA256("hello") is always 2cf24dba....
  • Pre-image resistance: Given the hash, you cannot determine the input (without brute force).
  • Avalanche effect: A single-bit change in the input produces a completely different hash. "hello" and "Hello" have entirely different SHA-256 outputs.
  • Collision resistance: It should be infeasible to find two different inputs that produce the same hash.

MD5

MD5 produces a 128-bit (32 hex characters) digest. It was widely used for file integrity checks and password hashing but is now considered cryptographically broken; researchers have demonstrated practical collision attacks, meaning two different inputs can produce the same MD5 hash. Do not use MD5 for security purposes. It is only safe for non-security purposes like detecting accidental file corruption.

SHA-1

SHA-1 (Secure Hash Algorithm 1) produces a 160-bit (40 hex characters) digest. It was the internet's dominant hash function for over a decade, used in SSL certificates and Git object IDs. In 2017, Google's Project Zero demonstrated the first practical SHA-1 collision (the SHAttered attack). SHA-1 is now deprecated for security use but remains in use in Git, where collision resistance is less critical.

SHA-256 and SHA-512

The SHA-2 family (including SHA-256 and SHA-512) was designed by the NSA and standardized by NIST. No practical collision attacks are known. SHA-256 produces a 256-bit (64 hex characters) digest; SHA-512 produces a 512-bit (128 hex characters) digest.

  • SHA-256 is used in Bitcoin proof-of-work, TLS certificates, code signing, and HMAC (e.g., in JWTs with HS256).
  • SHA-512 is preferred when higher security margins are required or when processing 64-bit data. On 64-bit hardware, SHA-512 can actually be faster than SHA-256.

Hashing is not Encryption: A hash is irreversible. Encryption is reversible given the key. Never confuse the two. Storing a password as MD5("password123") is not safe, since rainbow tables pre-compute hashes for all common passwords. For password storage, use a dedicated algorithm like bcrypt, Argon2, or scrypt, which are designed to be slow and add salt automatically.

12. Unix Timestamps & Epoch Time

A Unix timestamp (also called epoch time or POSIX time) is the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970, a point in time called the Unix epoch. It is the universal language for representing moments in time in computing systems, from database records to HTTP headers to JWT expiration claims.

Why Seconds Since 1970?

The Unix epoch date was chosen somewhat arbitrarily when Unix was developed in the early 1970s; it was simply a recent, round date. The key advantage of a single integer over a structured date string is simplicity: comparing, sorting, storing, and transmitting timestamps becomes trivial arithmetic.

The Year 2038 Problem

On 32-bit systems, Unix timestamps are stored as a signed 32-bit integer, giving a maximum value of 2,147,483,647. This value corresponds to 03:14:07 UTC on January 19, 2038, after which 32-bit systems will overflow and typically roll back to 1901. Modern 64-bit systems are not affected; a 64-bit signed timestamp won't overflow for approximately 292 billion years.
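JavaScript's Date works in milliseconds, so multiplying a Unix timestamp by 1000 lets you pinpoint both the epoch and the 32-bit limit:

```javascript
const epoch = new Date(0).toISOString();
// "1970-01-01T00:00:00.000Z" — the Unix epoch

const limit = new Date(2147483647 * 1000).toISOString();
// "2038-01-19T03:14:07.000Z" — the last second a signed 32-bit timestamp can hold

const now = Math.floor(Date.now() / 1000); // current Unix timestamp in seconds
```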

ISO 8601

ISO 8601 is the international standard for representing dates and times in a human-readable, unambiguous string format. The canonical form is:

2026-01-15T14:30:00.000Z
// YYYY-MM-DDThh:mm:ss.sssZ
// The 'T' separates date from time.
// The 'Z' denotes UTC (Zulu time).

ISO 8601 strings are lexicographically sortable; alphabetically sorting them also sorts them chronologically, a valuable property for log files and databases.

Timezones and UTC

Unix timestamps are always in UTC and contain no timezone information. The conversion from a Unix timestamp to a local time requires knowing the observer's timezone offset. This is why it is best practice to store timestamps in UTC and convert to local time only at the point of display. Never store "local time" in a database without explicitly storing the timezone alongside it.


Want to explore more? Visit the homepage for a full list of all available tools, or read more about the project on the About page.