Web Development

notes on the Information Super Highway

URLs

Breakdown of a URL¹.

<scheme>://<username>:<password>@<host>:<port>/<path>;<parameters>?<query>#<fragment>

scheme: specifies the protocol used to access the resource. A scheme is official if it's registered with the IANA (http, ftp), or unofficial if it isn't (sftp, svn, although they might be registered by now?). The scheme must start with a letter and is separated from the rest of the URL by the first : (the // is not part of the separator; it is the beginning of the next part).
username: this, the password, host, and port form what's known as the authority of the URL
password: separated from the username by a :, and separated from the host by an @. You can supply just the username, or both username and password:
```
ftp://user@example.com/
ftp://user:pass@example.com/
```
If a username or password isn't supplied, the application (e.g. the browser) may supply defaults.
host: domain name or IP
port: network port for the application. If omitted, the scheme's default is assumed (80 for HTTP).
path: separated from the preceding parts by a /, and itself a sequence of segments separated by / characters. Usually tells where on the server the resource is. Each path segment can contain a parameter separated by a ;.
```
http://www.example.com/this;param1=foo/my;param2=bar/path.html
```
parameters: after the path, before the query string, separated by ;
```
http://www.example.com/this/my/path.html;param1=foo;param2=bar
```
query: like params, but these are more common. Separated from rest of URL by a ? and from each other by &, but also can be separated with ; like params
```
http://www.example.com/this/my/path.html?param1=foo&param2=bar
http://www.example.com/this/my/path.html?param1=foo;param2=bar
```
fragment: used to address a part of the resource, usually seen as a link to a section of an HTML document. Separated from the rest of the URL by a #. The client doesn't send this to the server; it just uses it on its own end.
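
To see the breakdown in practice, here is a small sketch using Python's standard `urllib.parse.urlparse`. The URL is a made-up one (not from the original article) chosen so that every component of the template is present.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises every component of the template above.
url = "http://user:pass@www.example.com:8080/this/my/path.html;param1=foo?q=1#frag"
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.username)  # user
print(parts.password)  # pass
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /this/my/path.html
print(parts.params)    # param1=foo
print(parts.query)     # q=1
print(parts.fragment)  # frag
```

Note that `urlparse` only splits off the parameters of the last path segment; parameters on earlier segments stay inside `path`.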

These are reserved characters; they need to be encoded whenever they aren't serving their usual purpose in a URL:

";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

These are the unreserved marks; like letters and digits, they can be used as-is:

"-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

These are 'unwise' characters to use. They're not strictly reserved, but a server might individually consider them special, so just always encode them anyway:

"{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

This last set is the excluded set, which are all ASCII control characters and these delimiter characters:

"<" | ">" | "#" | "%" | '"'
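
The four sets above can be written down as a lookup, which is handy for checking a single character. This is only a sketch: the names `RESERVED`, `UNRESERVED`, `UNWISE`, `DELIMS`, and `classify` are made up for illustration, and letters and digits are added to the unreserved set because RFC 2396 defines them that way.

```python
import string

# Character classes as listed in these notes (RFC 2396 terminology).
RESERVED = set(";/?:@&=+$,")
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
UNWISE = set("{}|\\^[]`")
DELIMS = set('<>#%"')


def classify(ch: str) -> str:
    """Return the class of a single character (hypothetical helper)."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    if ch in UNWISE:
        return "unwise"
    # Excluded: the delimiters above plus ASCII control characters.
    if ch in DELIMS or ord(ch) < 0x20 or ord(ch) == 0x7F:
        return "excluded"
    return "other"


print(classify("?"))  # reserved
print(classify("~"))  # unreserved
print(classify("|"))  # unwise
print(classify("#"))  # excluded
```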

To encode a character, append its two-digit ASCII hex value to %, so a space is encoded as %20.
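
As a quick illustration of that rule, here is a minimal sketch. `percent_encode` is a hypothetical helper that handles a single ASCII character only; in real code the standard library's `urllib.parse.quote` does the job.

```python
from urllib.parse import quote


def percent_encode(ch: str) -> str:
    """Encode one ASCII character as % followed by its two-digit hex value."""
    return f"%{ord(ch):02X}"


print(percent_encode(" "))  # %20
print(percent_encode("#"))  # %23

# The standard library agrees for the space character.
print(quote(" "))           # %20
```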

Remember, you only want to encode the individual parts before putting the URL together, not the whole URL afterwards, because at that point you can't tell which characters are original delimiters and which are data that needed encoding.
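
A sketch of encoding part by part, assuming a made-up file name and made-up query values that happen to contain reserved characters. Each piece is encoded on its own and only then joined with the real delimiters.

```python
from urllib.parse import quote, urlencode

# Hypothetical data: both values contain characters that are reserved in a URL.
segment = "my file?.html"
params = {"q": "a&b=c", "lang": "en"}

# Encode each part separately, then assemble with the real / and ? delimiters.
url = "http://www.example.com/" + quote(segment, safe="") + "?" + urlencode(params)

print(url)  # http://www.example.com/my%20file%3F.html?q=a%26b%3Dc&lang=en
```

Encoding the assembled string instead would either leave the data's `?` and `&` looking like delimiters or encode the real delimiters too.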

And don't double encode/decode. Consider this:

http://example.com/stuff.html?param1=abc%613

Encoded:

http://example.com/stuff.html?param1=abc%25613

Decode:

http://example.com/stuff.html?param1=abc%613

Decode again:

http://example.com/stuff.html?param1=abca3
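
The same round trip can be reproduced on just the parameter value with `urllib.parse.quote` and `unquote`. This is a sketch of the pitfall, not a recommendation to decode twice.

```python
from urllib.parse import quote, unquote

raw = "abc%613"                 # the literal parameter value from the example

encoded = quote(raw, safe="")   # the % itself becomes %25
print(encoded)                  # abc%25613

decoded_once = unquote(encoded)
print(decoded_once)             # abc%613  (back to the original)

decoded_twice = unquote(decoded_once)
print(decoded_twice)            # abca3    (%61 was wrongly read as "a")
```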

There's a distinction between absolute and relative URLs. If the URL contains the scheme (http), it's absolute.

But a relative URL is interpreted relative to another URL (duh). The other URL is the base. To figure out the absolute from the relative we need to figure out the base, then depending on the syntax we combine it with the base.

The base URL may have been explicitly specified with the <base> tag.

If no base tag, then the URL of the document in which the relative URL is found should be the base.

if no scheme, authority, or path, then the relative URL is a reference to the base URL
if there is a scheme, then it's actually an absolute URL
if no scheme but there's an authority (host, port), then the relative URL is likely a network path (it starts with //); take the scheme from the base URL and put it in front of the relative URL, followed by a :

If still none of those, then it really is relative, so do the following:

inherit the scheme and authority (host, port) from the base
if the relative begins with /, then it's an absolute path, append it to the scheme and authority
if it doesn't begin with /, then take the path of the base and discard everything after the last /
then append the relative URL to the resulting path
if there's a ./ anywhere in the resulting path we remove it (so if ./thing.html, remove the ./)
if there is a ../, we remove it and the preceding segment of the path. So all "<segment>/../" are removed, keep going until there are no more ../
if the path ends with .., remove it and the ending segment. (so remove "<segment>/..", and means our relative path was ..)
if the path ends with ., remove it (relative path was ., just the same page you're on?)

Then append any query string or fragment from the relative to the absolute url.
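
The steps above can be sketched by hand as follows. This is a rough illustration of the notes, not a complete RFC implementation; `resolve` and `remove_dot_segments` are made-up names, and the example base URL is the one used in the examples in these notes.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Drop "./" and "<segment>/../" pieces, including trailing "." and ".."."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")   # keep the trailing slash of the directory
        elif seg == "..":
            if len(out) > 1:
                out.pop()        # remove the preceding segment
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def resolve(base: str, relative: str) -> str:
    b, r = urlsplit(base), urlsplit(relative)
    if r.scheme:                 # has a scheme: already absolute
        return relative
    if r.netloc:                 # network path: inherit only the scheme
        return urlunsplit((b.scheme, r.netloc, r.path, r.query, r.fragment))
    if not r.path:               # reference to the base document itself
        return urlunsplit((b.scheme, b.netloc, b.path, r.query or b.query, r.fragment))
    if r.path.startswith("/"):   # absolute path: keep it as-is
        path = r.path
    else:                        # cut the base path after its last "/", then append
        path = remove_dot_segments(b.path[: b.path.rfind("/") + 1] + r.path)
    # The query and fragment come from the relative URL.
    return urlunsplit((b.scheme, b.netloc, path, r.query, r.fragment))


base = "http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar"
print(resolve(base, "../rel1"))  # http://www.blah.com/yadda1/rel1
```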

Some examples:

1)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: rel1
final absolute: http://www.blah.com/yadda1/yadda2/rel1

2)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: /rel1
final absolute: http://www.blah.com/rel1

3)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ../rel1
final absolute: http://www.blah.com/yadda1/rel1

4)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ./rel1?param2=baz#bar2
final absolute: http://www.blah.com/yadda1/yadda2/rel1?param2=baz#bar2

5)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ..
final absolute: http://www.blah.com/yadda1/
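
These five examples can be checked against Python's standard `urllib.parse.urljoin`, which performs the same kind of resolution. A small sketch (the `cases` table just restates the examples above):

```python
from urllib.parse import urljoin

base = "http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar"

# (relative URL, expected absolute URL) pairs taken from the examples above.
cases = [
    ("rel1", "http://www.blah.com/yadda1/yadda2/rel1"),
    ("/rel1", "http://www.blah.com/rel1"),
    ("../rel1", "http://www.blah.com/yadda1/rel1"),
    ("./rel1?param2=baz#bar2", "http://www.blah.com/yadda1/yadda2/rel1?param2=baz#bar2"),
    ("..", "http://www.blah.com/yadda1/"),
]

results = [(rel, urljoin(base, rel)) for rel, _ in cases]
for rel, absolute in results:
    print(f"{rel!r:28} -> {absolute}")
```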

¹ From an article in Hacker Monthly #3, which was itself taken from someone's post that I don't have a reference for. That post, in turn, just got it from RFC 2396.