faehnri.ch

Web Development

notes on the Information Super Highway


URLs

Breakdown of URLs [1].

<scheme>://<username>:<password>@<host>:<port>/<path>;<parameters>?<query>#<fragment>
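To see those pieces fall out of a real string, here's a minimal sketch using Python's standard urllib.parse. The URL and its credentials are made up for illustration; they aren't from the article.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises every part of the template above.
url = "http://joe:secret@www.example.com:8080/docs/index.html;type=a?item=12#top"
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.username)  # joe
print(parts.password)  # secret
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /docs/index.html
print(parts.params)    # type=a
print(parts.query)     # item=12
print(parts.fragment)  # top
```

Note that urlparse only splits the string; it doesn't decode any percent-escapes inside the parts.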

These characters are reserved, and need to be encoded whenever they aren't serving their usual purpose in a URL:

";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

These, along with letters and digits, are unreserved and can be used as-is:

"-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

These are 'unwise' characters to use. They're not strictly reserved, but a server might treat any of them as special, so just always encode them anyway:

"{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

This last set is the excluded set: all ASCII control characters, the space, and these delimiter characters:

"<" | ">" | "#" | "%" | '"'

To encode a character, take its two-digit ASCII hex value and put % in front of it, so a space is encoded as %20.
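A minimal sketch of that rule in Python. The function name percent_encode is just something I picked for illustration, and it only handles single ASCII characters.

```python
def percent_encode(ch: str) -> str:
    """Return '%' followed by the two-digit hex value of one ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        # Assumption: this sketch deliberately ignores non-ASCII input.
        raise ValueError("only ASCII characters are handled here")
    return "%{:02X}".format(code)

print(percent_encode(" "))  # %20
print(percent_encode("%"))  # %25
print(percent_encode("a"))  # %61
```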

Remember, you only want to encode the parts themselves before putting the URL together, not the whole URL. Once it's assembled you can't tell whether a character was a delimiter in the original or part of the data.
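Here's a rough sketch of the difference, using quote from Python's urllib.parse. The file name and parameter value are invented for the example.

```python
from urllib.parse import quote

# Hypothetical pieces of a URL, each containing characters that need encoding.
path_segment = "my file.html"
value = "a&b=c"

# Right: encode each part on its own, then assemble.
good = ("http://example.com/" + quote(path_segment, safe="")
        + "?param1=" + quote(value, safe=""))
print(good)  # http://example.com/my%20file.html?param1=a%26b%3Dc

# Wrong: encoding the assembled URL also mangles the real delimiters.
bad = quote("http://example.com/my file.html?param1=a&b=c", safe="")
print(bad)   # http%3A%2F%2Fexample.com%2Fmy%20file.html%3Fparam1%3Da%26b%3Dc
```

Passing safe="" makes quote encode "/" as well, which is what you want for a single path segment or query value.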

And don't double encode/decode. Consider this:

http://example.com/stuff.html?param1=abc%613

Encoded:

http://example.com/stuff.html?param1=abc%25613

Decode:

http://example.com/stuff.html?param1=abc%613

Decode again:

http://example.com/stuff.html?param1=abca3
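The same round trip can be reproduced on just the parameter value with quote and unquote from Python's urllib.parse, as a quick sanity check:

```python
from urllib.parse import quote, unquote

original = "abc%613"                 # the literal value we want to send

encoded = quote(original, safe="")   # the "%" itself becomes %25
print(encoded)                       # abc%25613

once = unquote(encoded)              # one decode gets the original back
print(once)                          # abc%613

twice = unquote(once)                # a second decode turns %61 into "a"
print(twice)                         # abca3
```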

There's an idea of absolute vs. relative URLs. If the URL contains the scheme (e.g. http), it's absolute.

A relative URL is interpreted relative to another URL (duh). That other URL is the base. To get the absolute URL from the relative one, we first need to figure out the base, then combine the relative URL with it in a way that depends on the relative URL's syntax.

The base URL may have been explicitly specified with the <base> tag.

If there's no <base> tag, then the URL of the document in which the relative URL is found should be the base.

If still none of those, then it really is relative, and you do the following:

Then append any query string or fragment from the relative URL to the absolute URL.

Some examples:

1)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: rel1
final absolute: http://www.blah.com/yadda1/yadda2/rel1

2)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: /rel1
final absolute: http://www.blah.com/rel1

3)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ../rel1
final absolute: http://www.blah.com/yadda1/rel1

4)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ./rel1?param2=baz#bar2
final absolute: http://www.blah.com/yadda1/yadda2/rel1?param2=baz#bar2

5)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ..
final absolute: http://www.blah.com/yadda1/
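
The five examples above can be checked with urljoin from Python's urllib.parse. This is only a sketch; urljoin follows the newer RFC 3986 rules rather than RFC 2396, but the two agree for these cases.

```python
from urllib.parse import urljoin

base = "http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar"

# Map each relative URL from the examples to its resolved absolute URL.
results = {rel: urljoin(base, rel)
           for rel in ["rel1", "/rel1", "../rel1", "./rel1?param2=baz#bar2", ".."]}

for rel, absolute in results.items():
    print(rel, "->", absolute)
# rel1 -> http://www.blah.com/yadda1/yadda2/rel1
# /rel1 -> http://www.blah.com/rel1
# ../rel1 -> http://www.blah.com/yadda1/rel1
# ./rel1?param2=baz#bar2 -> http://www.blah.com/yadda1/yadda2/rel1?param2=baz#bar2
# .. -> http://www.blah.com/yadda1/
```

In every case the base's own query string and fragment are dropped; only the relative URL's query and fragment survive.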
  1. From an article in Hacker Monthly #3, which itself came from someone's post that I don't have a reference for. That post, in turn, just got it from RFC 2396.