URL Encoder / Decoder · 6 min read

Reserved vs Unreserved Characters: The Rules Behind URL Encoding

RFC 3986 splits URL characters into reserved, unreserved, and everything else. Knowing which group a character belongs to explains every percent-encoding rule.

Why does %20 mean space but %2F sometimes means slash and sometimes means "please don't parse this as a path separator"? Why is ~ safe in a URL but ! a maybe? The answers all come from RFC 3986, which divides every printable ASCII character into three buckets: reserved, unreserved, and everything else. Once you know the buckets, every percent-encoding rule falls into place.

The Three Buckets

RFC 3986 defines them precisely.

Unreserved characters

Always safe. Never need encoding. Encoding them is allowed but discouraged because %41 and A mean the same thing, and that ambiguity has historically caused security bugs.

A-Z  a-z  0-9  -  .  _  ~
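One way to remove the %41-versus-A ambiguity is to decode escapes only when they stand for an unreserved character and leave every other escape in place. The sketch below assumes that approach; the function name normalize_unreserved is hypothetical, and it only handles this one normalisation step, not full URL normalisation.

```python
import re
import string

# The RFC 3986 unreserved set: letters, digits, and - . _ ~
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")

# Matches one percent-escape such as %41 or %2f.
_PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(url):
    """Decode %XX only when it encodes an unreserved character (hypothetical helper)."""

    def fix(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch  # %41 becomes A, %7E becomes ~
        # Any other escape is kept; only its hex digits are uppercased.
        return "%" + match.group(1).upper()

    return _PCT_ESCAPE.sub(fix, url)


print(normalize_unreserved("/%41dmin%2fpanel%7Euser"))  # /Admin%2Fpanel~user
```

Note that %2f stays encoded: a slash is reserved, so decoding it would change the structure of the path.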

Reserved characters

Have a syntactic role in URLs. Must be encoded if they appear as data rather than as a delimiter.

gen-delims:  :  /  ?  #  [  ]  @
sub-delims:  !  $  &  '  (  )  *  +  ,  ;  =

Everything else

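Spaces, control characters, non-ASCII bytes, <, >, ", {, }, |, \, ^, backtick. None of these may appear raw in a URL. They must always be percent-encoded; otherwise the URL is technically invalid, even if your browser quietly fixes it. The percent sign itself also lands here: it is only legal as the start of a %XX escape, so a literal % must be written as %25.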
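The three buckets can be written down as a small lookup. This is a minimal sketch built from the character lists above; the function name bucket and the label strings are illustrative choices, not anything defined by the RFC.

```python
import string

# Character sets copied from the lists above.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = frozenset(":/?#[]@")
SUB_DELIMS = frozenset("!$&'()*+,;=")


def bucket(ch):
    """Return which bucket a single character falls into (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS or ch in SUB_DELIMS:
        return "reserved"
    # Spaces, control characters, non-ASCII, and the rest of printable ASCII.
    return "other"


for ch in ["A", "~", "/", "!", " ", "<", "é"]:
    print(repr(ch), bucket(ch))
```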

Why Reserved Characters Are Conditional

Reserved characters are the structural punctuation of a URL. : ends the scheme (and separates host from port), / separates path segments, ? introduces the query, # introduces the fragment, & and = structure query parameters. Each one is fine where it's structural, and dangerous where it isn't.

Take /. In a path it's a separator. Inside a query parameter value, a literal / is harmless; most parsers accept it unencoded. Inside a path segment, an unencoded / is interpreted as splitting that segment into two, which is rarely what you wanted if you're trying to pass users/jane as a single ID.

The encoding decision is therefore context-sensitive. The same character flips from "leave alone" to "encode" based on which part of the URL it lives in.
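The users/jane case can be shown with Python's urllib.parse. This is a sketch, assuming a server that splits the path on / before decoding each segment; build_path and split_path are hypothetical helpers, and the /accounts/ prefix is made up for the example.

```python
from urllib.parse import quote, unquote


def build_path(*segments):
    """Join data values into a path, encoding each one as a single segment."""
    # safe="" makes quote() encode "/" as well, so it cannot act as a separator.
    return "/" + "/".join(quote(segment, safe="") for segment in segments)


def split_path(path):
    """Split on structural slashes first, then decode each segment."""
    return [unquote(segment) for segment in path.lstrip("/").split("/")]


encoded = build_path("accounts", "users/jane")
print(encoded)              # /accounts/users%2Fjane
print(split_path(encoded))  # ['accounts', 'users/jane']

# Concatenating the raw value lets its slash become structure.
naive = "/accounts/" + "users/jane"
print(split_path(naive))    # ['accounts', 'users', 'jane']
```

The order matters on the receiving side too: decoding before splitting would turn %2F back into a separator.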

The Encoding Rule by URL Component

A simplified version of the RFC 3986 grammar:

  • Scheme: only letters, digits, +, -, and ., starting with a letter. Nothing else, ever.
  • Userinfo: unreserved + sub-delims + :. Encode @, /, ?, #.
  • Host: unreserved + sub-delims. Non-ASCII hosts go through Punycode (xn--), not percent-encoding.
  • Path: unreserved + sub-delims + :, @. Encode / if it's data, encode ? and # always.
  • Query: unreserved + sub-delims + :, @, /, ?. Encode &, =, + if they're data, encode # always.
  • Fragment: same set as query. Encode # if you somehow have one inside.
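The per-component sets above map naturally onto the safe argument of urllib.parse.quote, which never encodes unreserved characters and otherwise leaves alone only what safe lists. The table and the encode_component helper below are an illustrative sketch, not a full RFC 3986 implementation; scheme and host are left out because they are not percent-encoded this way, and the "query_value" row is an added assumption for the common key=value&key=value layout.

```python
from urllib.parse import quote

SUB_DELIMS = "!$&'()*+,;="

# Characters allowed unencoded in each component, on top of the
# unreserved set that quote() always leaves alone.
SAFE_BY_COMPONENT = {
    "userinfo": SUB_DELIMS + ":",
    "path_segment": SUB_DELIMS + ":@",
    "query": SUB_DELIMS + ":@/?",
    "fragment": SUB_DELIMS + ":@/?",
    # Assumption: one key or value inside a key=value&key=value query,
    # where & = + count as data and therefore must be encoded.
    "query_value": "!$'()*,;:@/?",
}


def encode_component(value, component):
    """Percent-encode value for the named URL component (hypothetical helper)."""
    return quote(value, safe=SAFE_BY_COMPONENT[component])


print(encode_component("a/b?c#d", "path_segment"))  # a%2Fb%3Fc%23d
print(encode_component("a/b?c#d", "query"))         # a/b?c%23d
print(encode_component("x&y=1+2", "query_value"))   # x%26y%3D1%2B2
print(encode_component("user:pw@host", "userinfo")) # user:pw%40host
```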

Why Different Encoders Disagree

JavaScript ships two encoders that pick different sets:

  • encodeURI(): assumes the input is already a complete URL. Leaves reserved characters alone, with the exception of [ and ], which it encodes.
  • encodeURIComponent(): assumes the input is one component (a path segment, a query value). Encodes all reserved characters except a small subset of sub-delims: ! ' ( ) *.

Neither is wrong. They're solving different problems. encodeURIComponent is the right answer when you're assembling a URL from pieces; encodeURI is the right answer for the rare case of normalising a URL that's already structurally valid.

Python's urllib.parse.quote defaults to leaving / unencoded, which is useful for paths and dangerous for path segments. The safe="" argument forces it to encode everything reserved.
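The differences between these encoders come down to which characters each one treats as safe. The sketch below shows quote with and without safe="", plus two approximations of the JavaScript functions built on quote. The names encode_uri_like and encode_uri_component_like are hypothetical, the approximation assumes Python 3.7 or later (where ~ is never encoded), and edge cases such as malformed Unicode input are not covered.

```python
from urllib.parse import quote

# Reserved characters that JavaScript's encodeURI leaves alone
# (every reserved character except "[" and "]").
ENCODE_URI_SAFE = ";/?:@&=+$,#!'()*"

# encodeURIComponent leaves only these five sub-delims alone.
ENCODE_URI_COMPONENT_SAFE = "!'()*"


def encode_uri_like(text):
    """Rough stand-in for encodeURI (hypothetical helper)."""
    return quote(text, safe=ENCODE_URI_SAFE)


def encode_uri_component_like(text):
    """Rough stand-in for encodeURIComponent (hypothetical helper)."""
    return quote(text, safe=ENCODE_URI_COMPONENT_SAFE)


value = "a/b c&d=e"
print(quote(value))                      # a/b%20c%26d%3De  (default safe="/")
print(quote(value, safe=""))             # a%2Fb%20c%26d%3De
print(encode_uri_like(value))            # a/b%20c&d=e
print(encode_uri_component_like(value))  # a%2Fb%20c%26d%3De
```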

The Plus Sign Problem

+ is the one reserved character whose meaning depends on which encoding convention is in play rather than on the structural rules. In application/x-www-form-urlencoded (the format used by HTML form submissions and most query strings), + stands for a space. In every other part of a URL, + is just a plus sign.

Result: in ?q=hello+world a form-decoding server reads "hello world", while the same hello+world in a path segment stays "hello+world". Servers usually pick one interpretation per route, and clients have to match. When in doubt, encode a literal plus as %2B, encode a space as %20, and don't rely on the +-as-space convention.
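Python's standard library exposes both interpretations, which makes the difference easy to see. A short sketch using only urllib.parse; the sample strings are made up for illustration.

```python
from urllib.parse import parse_qs, quote, quote_plus, unquote, unquote_plus

# Decoding: the same text, two conventions.
print(unquote("hello+world"))       # hello+world  (generic rule: + is a plus)
print(unquote_plus("hello+world"))  # hello world  (form rule: + is a space)
print(parse_qs("q=hello+world"))    # {'q': ['hello world']}

# Encoding a value that contains both a space and a real plus sign.
print(quote("1 + 1", safe=""))      # 1%20%2B%201  (unambiguous everywhere)
print(quote_plus("1 + 1"))          # 1+%2B+1      (form-style)

# The %20 / %2B form survives a form-style decoder intact.
print(parse_qs("q=1%20%2B%201"))    # {'q': ['1 + 1']}
```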

The WHATWG URL Standard

Browsers don't implement RFC 3986 strictly. They follow the WHATWG URL Standard, which is more permissive: it accepts a wider set of characters in paths, normalises some inputs, and includes IDN host handling. For most application code this is invisible. For URL parsers and security-sensitive code (open redirect filters, SSRF protection), the gap between RFC 3986 and WHATWG has historically been a source of bypasses.

The mental model: unreserved = always safe, reserved = depends where it sits, everything else = encode it. That's the entire rulebook in one sentence; the rest is figuring out which part of the URL you're currently building.

References

  1. Berners-Lee, T., Fielding, R., & Masinter, L. (2005). RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. Internet Engineering Task Force.
  2. WHATWG. (2024). URL Living Standard. Web Hypertext Application Technology Working Group.
  3. Berners-Lee, T., Masinter, L., & McCahill, M. (1994). RFC 1738: Uniform Resource Locators (URL). Internet Engineering Task Force.
  4. Duerst, M. & Suignard, M. (2005). RFC 3987: Internationalized Resource Identifiers (IRIs). Internet Engineering Task Force.