URL Encoder / Decoder · 6 min read
Reserved vs Unreserved Characters: The Rules Behind URL Encoding
RFC 3986 splits URL characters into reserved, unreserved, and everything else. Knowing which group a character belongs to explains every percent-encoding rule.
Why does %20 mean space but %2F sometimes means slash and sometimes means "please don't parse this as a path separator"? Why is ~ safe in a URL but ! a maybe? The answers all come from RFC 3986, which divides every printable ASCII character into three buckets: reserved, unreserved, and everything else. Once you know the buckets, every percent-encoding rule falls into place.
The Three Buckets
RFC 3986 defines them precisely.
Unreserved characters
Always safe. Never need encoding. Encoding them is allowed but discouraged because %41 and A mean the same thing, and that ambiguity has historically caused security bugs.
A-Z a-z 0-9 - . _ ~
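One way to remove the %41-versus-A ambiguity is to decode any percent-escape that turns out to be an unreserved character and leave every other escape alone. The sketch below assumes that approach; the function name normalize_unreserved is made up for illustration and does not come from any library.

```python
import re
import string

# The unreserved set listed above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Matches one percent-escape such as %41 or %2f.
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(url: str) -> str:
    """Decode escapes of unreserved characters; keep all other escapes.

    Escapes that are kept get uppercase hex digits so that %2f and %2F
    compare equal afterwards.
    """
    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        return "%" + match.group(1).upper()

    return _PCT.sub(replace, url)


print(normalize_unreserved("/%41%2fb%7e"))  # /A%2Fb~
```

Comparing URLs only after a step like this means that two spellings of the same unreserved character cannot slip past a string comparison.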
Reserved characters
Have a syntactic role in URLs. Must be encoded if they appear as data rather than as a delimiter.
gen-delims: : / ? # [ ] @
sub-delims: ! $ & ' ( ) * + , ; =
Everything else
Spaces, control characters, non-ASCII bytes, <, >, ", {, }, |, \, ^, backtick. None of these are legal in a URL at all. They must always be percent-encoded, or the URL is technically invalid, even if your browser quietly fixes it.
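The three buckets are small enough to write down as a lookup. The following is a minimal sketch; the classify function and the constant names are illustrative, not part of any standard library.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")


def classify(ch: str) -> str:
    """Return the bucket a single character falls into."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS or ch in SUB_DELIMS:
        return "reserved"
    # Spaces, control characters, non-ASCII, and leftover ASCII punctuation.
    # Note: "%" also lands here, because on its own it is only legal as the
    # start of a percent-escape.
    return "other"


for ch in "~!/ <":
    print(repr(ch), classify(ch))
```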
Why Reserved Characters Are Conditional
Reserved characters are the structural punctuation of a URL. : separates the scheme from the rest of the URL, / separates path segments, ? introduces the query, # introduces the fragment, & and = structure query parameters. Each one is fine where it's structural and dangerous where it isn't.
Take /. In a path it's a separator. Inside a query parameter value, a literal / is harmless; most parsers accept it unencoded. Inside a path segment, an unencoded / is interpreted as splitting that segment into two, which is rarely what you wanted if you're trying to pass users/jane as a single ID.
The encoding decision is therefore context-sensitive. The same character flips from "leave alone" to "encode" based on which part of the URL it lives in.
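Here is the users/jane case as a short Python sketch. The /profile/ route and the id parameter are hypothetical, chosen only to show the same value encoded for two different positions.

```python
from urllib.parse import quote, unquote

user_id = "users/jane"  # hypothetical ID that happens to contain a slash

# As a single path segment, the slash is data and has to be escaped.
segment = quote(user_id, safe="")
path = "/profile/" + segment
print(path)  # /profile/users%2Fjane

# As a query value, a literal slash is allowed and can stay as it is.
query = "id=" + quote(user_id, safe="/")
print(query)  # id=users/jane

# Splitting the path on "/" now yields the ID as one piece.
parts = path.split("/")
print(unquote(parts[-1]))  # users/jane
```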
The Encoding Rule by URL Component
A simplified version of the RFC 3986 grammar:
- Scheme: only letters, digits, +, -, and . (nothing else, ever).
- Userinfo: unreserved + sub-delims + :. Encode @, /, ?, #.
- Host: unreserved + sub-delims. Non-ASCII hosts go through Punycode (xn--), not percent-encoding.
- Path: unreserved + sub-delims + :, @. Encode / if it's data, encode ? and # always.
- Query: unreserved + sub-delims + :, @, /, ?. Encode &, =, + if they're data, encode # always.
- Fragment: same set as query. Encode # if you somehow have one inside.
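The per-component rules can be turned into a table of "safe" sets and handed to Python's urllib.parse.quote, which always leaves unreserved characters alone. This is a rough sketch: the SAFE table, its keys, and encode_component are illustrative names, and the query entry assumes the value is data, so &, = and + are escaped.

```python
from urllib.parse import quote

SUB_DELIMS = "!$&'()*+,;="

# Characters, beyond the unreserved set, that may stay literal in each
# component. Delimiters that would be misread as structure are left out.
SAFE = {
    "userinfo": SUB_DELIMS + ":",
    "path_segment": SUB_DELIMS + ":@",
    # Query value treated as data: drop & = + from the sub-delims.
    "query_value": "".join(c for c in SUB_DELIMS if c not in "&=+") + ":@/?",
    "fragment": SUB_DELIMS + ":@/?",
}


def encode_component(value: str, component: str) -> str:
    """Percent-encode value for the named URL component."""
    return quote(value, safe=SAFE[component])


print(encode_component("a/b?c", "path_segment"))   # a%2Fb%3Fc
print(encode_component("a/b?c", "query_value"))    # a/b?c
print(encode_component("x&y=1+2", "query_value"))  # x%26y%3D1%2B2
```

Scheme and host are left out on purpose: a scheme is never percent-encoded, and non-ASCII hosts go through Punycode instead.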
Why Different Encoders Disagree
JavaScript ships two encoders that pick different sets:
- encodeURI(): assumes the input is already a complete URL. Leaves reserved characters alone (apart from [ and ], which it does encode).
- encodeURIComponent(): assumes the input is one component (a path segment, a query value). Encodes all reserved characters except a small subset of sub-delims: ! ' ( ) *.
Neither is wrong. They're solving different problems. encodeURIComponent is the right answer when you're assembling a URL from pieces; encodeURI is the right answer for the rare case of normalising a URL that's already structurally valid.
Python's urllib.parse.quote defaults to leaving / unencoded: useful for paths, dangerous for path segments. The safe="" argument forces it to encode everything reserved.
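The disagreement is easy to see side by side. The sketch below runs quote with and without safe="", then approximates the two JavaScript functions by passing their documented unescaped sets as safe. The encode_uri and encode_uri_component helpers are stand-ins written for this comparison, not real ports, and they may differ from JavaScript on edge cases such as lone surrogates.

```python
from urllib.parse import quote

raw = "a/b c?d=e&f"

# quote() keeps "/" by default; safe="" escapes it as well.
print(quote(raw))           # a/b%20c%3Fd%3De%26f
print(quote(raw, safe=""))  # a%2Fb%20c%3Fd%3De%26f

# Reserved characters each JavaScript function leaves unescaped.
ENCODE_URI_SAFE = ";/?:@&=+$,#!*'()"
ENCODE_URI_COMPONENT_SAFE = "!*'()"


def encode_uri(s: str) -> str:
    """Approximation of JavaScript's encodeURI()."""
    return quote(s, safe=ENCODE_URI_SAFE)


def encode_uri_component(s: str) -> str:
    """Approximation of JavaScript's encodeURIComponent()."""
    return quote(s, safe=ENCODE_URI_COMPONENT_SAFE)


print(encode_uri(raw))            # a/b%20c?d=e&f
print(encode_uri_component(raw))  # a%2Fb%20c%3Fd%3De%26f
```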
The Plus Sign Problem
+ is the only reserved character whose meaning depends on the URL section, not on the structural rule. In application/x-www-form-urlencoded (the format used by HTML form submissions and most query strings), + stands for a space. In every other part of a URL, + is just a plus sign.
Result: ?q=hello+world means "hello world" in a form-encoded query but "hello+world" in a path. Servers usually pick one interpretation per route, and clients have to match. When in doubt, encode a literal plus as %2B and a space as %20, and don't rely on the +-as-space convention.
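Python's standard library has both interpretations, which makes the difference easy to try out. This is a small sketch using only urllib.parse; the sample strings are arbitrary.

```python
from urllib.parse import parse_qs, quote, quote_plus, unquote, unquote_plus

# Decoding: the same text read under the two conventions.
print(unquote("hello+world"))       # hello+world  (plus is just a plus)
print(unquote_plus("hello+world"))  # hello world  (form-encoded reading)
print(parse_qs("q=hello+world"))    # {'q': ['hello world']}

# Encoding a value that contains both spaces and a real plus sign.
value = "1 + 1"
form_style = quote_plus(value)      # spaces become +, the plus becomes %2B
explicit = quote(value, safe="")    # spaces become %20, the plus becomes %2B
print(form_style)                   # 1+%2B+1
print(explicit)                     # 1%20%2B%201

# The explicit form decodes to the same value under either convention.
print(unquote(explicit) == unquote_plus(explicit) == value)  # True
```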
The WHATWG URL Standard
Browsers don't implement RFC 3986 strictly. They follow the WHATWG URL Standard, which is more permissive: it accepts a wider set of characters in paths, normalises some inputs, and includes IDN host handling. For most application code this is invisible. For URL parsers and security-sensitive code (open redirect filters, SSRF protection), the gap between RFC 3986 and WHATWG has historically been a source of bypasses.
The mental model: unreserved = always safe, reserved = depends where it sits, everything else = encode it. That's the entire rulebook in one sentence; the rest is figuring out which part of the URL you're currently building.
References
- Berners-Lee, T., Fielding, R., & Masinter, L. (2005). RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. Internet Engineering Task Force.
- WHATWG. (2024). URL Living Standard. Web Hypertext Application Technology Working Group.
- Berners-Lee, T., Masinter, L., & McCahill, M. (1994). RFC 1738: Uniform Resource Locators (URL). Internet Engineering Task Force.
- Duerst, M. & Suignard, M. (2005). RFC 3987: Internationalized Resource Identifiers (IRIs). Internet Engineering Task Force.