Basic concept: URL to website

Sun
09/04/11

Introduction

The Snap! Websites system is driven by the URL used to access the site. Everything in the URL may play a role in deciding what is going to be displayed. Other browser parameters, such as the browser language, may also affect the data returned; however, only the URL is used to determine which website you are accessing.

Note: This is fully implemented and works as expected (domain & website determination and full discovery of the website concerned.)

Example

The following URL:

http://en.example.net/0.9/documentation

has the following parts defined in it:

http - this is the protocol; we support HTTP and HTTPS (see note 1)

en - this represents a sub-domain, in this case it represents a language too

example.net - this is the domain name

0.9 - this is part of the URL path

documentation - another part of the URL path

Each part is used to determine the website to be displayed. First, all the parts are used. Then the parts are removed one by one until a website is found. If no site can be found this way, we can either use a default site or return a 404 (see note 2).
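As a rough illustration of the breakdown above, here is a minimal Python sketch that splits the example URL into its parts. The function name split_url() and the dictionary keys are made up for this example; they are not part of Snap!.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into protocol, host labels, and path segments.

    Illustrative only: the key names are assumptions for this sketch.
    """
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        # hostname is already lowercased by urlsplit()
        "host_labels": (parts.hostname or "").split("."),
        # drop the empty strings produced by leading/trailing slashes
        "path_segments": [s for s in parts.path.split("/") if s],
    }


example = split_url("http://en.example.net/0.9/documentation")
```

Here example["host_labels"] holds en, example, and net; deciding which of those labels form the domain and which are sub-domains is a separate step.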

Canonicalization

First, we want to canonicalize the URL. This way we make sure that identical URLs written in several different ways all match the same entry. For example, the plus (+) character and the "%20" sequence are both changed to a space. All three forms are considered to be exactly the same character, and the space is the one used once the URL is canonicalized.

The canonicalization is done in such a way that the search for the website can be done with one read and the first result is the correct result. Once the characters are canonicalized, we canonicalize the different parts as follows:
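The character step can be sketched in Python with the standard unquote_plus() function, which decodes both the plus sign and percent escapes. The wrapper name canonicalize_characters() is an assumption for this example, and the real system may handle more cases than this.

```python
from urllib.parse import unquote_plus


def canonicalize_characters(segment):
    """Decode "+" and %XX escapes so equivalent spellings compare equal."""
    # unquote_plus() first turns "+" into a space, then decodes %XX escapes
    return unquote_plus(segment)


same = (
    canonicalize_characters("my+page")
    == canonicalize_characters("my%20page")
    == "my page"
)
```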

Protocol

The protocol is kept as in the URL. So it is set to "http" or "https". At this time we do not support any other protocols.

Domain

Using tables of possible extensions, just the domain is extracted from the request. For example, m2osw.co.uk is the complete domain name. Our system needs to know that .co.uk is a valid domain extension (see note 3).

Note that the sub-domains are separated because these may be removed from the search request, whereas the domain cannot be reduced at all.

Some systems (e.g. Google) invert the domain parts as well. I do not see the need here, especially since many domains will be .com and having them all together is not an advantage for us.
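To make the extraction concrete, here is a small Python sketch that separates the sub-domains from the domain. It uses a tiny hard-coded table as a stand-in for the real extension tables; KNOWN_EXTENSIONS and split_host() are hypothetical names used only for this illustration.

```python
# Tiny illustrative table; the real list of extensions is far larger.
KNOWN_EXTENSIONS = {"com", "net", "co.uk"}


def split_host(host):
    """Return (sub_domains, domain) for a host name.

    The longest matching extension wins, so "co.uk" is preferred
    over a shorter match.
    """
    labels = host.lower().split(".")
    # start == 1 tests the longest possible extension first
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in KNOWN_EXTENSIONS:
            # the label just before the extension belongs to the domain
            return labels[: start - 1], ".".join(labels[start - 1 :])
    raise ValueError(f"unknown domain extension in {host!r}")
```

With this table, split_host("www.m2osw.co.uk") gives the sub-domain list ["www"] and the domain "m2osw.co.uk".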

Sub-domains

To move in the right order in the search, the sub-domains need to be sorted in reverse order (i.e. the sub-domains in my.snap.website.m2osw.com become: website snap my.)
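A minimal Python sketch of that reordering, assuming the domain has already been identified. The function name sub_domains_for_search() is made up for this example.

```python
def sub_domains_for_search(host, domain):
    """Strip the domain from the host and reverse the remaining labels."""
    if host == domain:
        return []
    if not host.endswith("." + domain):
        raise ValueError(f"{host!r} is not part of {domain!r}")
    # remove ".<domain>" from the end, keep only the sub-domain labels
    prefix = host[: -len(domain) - 1]
    # reversed so the label closest to the domain comes first
    return prefix.split(".")[::-1]
```

For my.snap.website.m2osw.com with the domain m2osw.com, this returns website, snap, my, in that order.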

It will be the responsibility of the maintainer to not use invalid sub-domains here. For example, if sub-domain language support is offered, then no website can match the language sub-domain part (but at this point we cannot know that it is a language, so we have to include it...)

Question: Is the www extension a special case?

Path

Contrary to the Sub-domains, the path elements are already in the correct order so we do not need to change those.

When a full domain name is used, then the path is not used in the canonicalization result. It only becomes part of the site key (i.e. the row with the content for that one page.)

Options

We also support GET variables as is. These appear in the query string.

Note that we have to support one variable that is always ignored. In many cases that variable is called v (?v=123). It is used to bypass caches for files such as CSS files. You do so by setting that variable to a random value. This query string version feature has since been replaced by always using a version in the filename, as in:

http://www.example.com/css/editor/editor_1.2.3_ie.css

The version is automatically managed by the system when you upload a JavaScript or CSS file making it really easy for website developers to handle those files.

As you can see, that filename also includes a browser name. This avoids, as much as possible, mixing the specifics of one browser into shared CSS files. It allows CSS files to be smaller and thus load faster, and consoles to generate fewer warnings.
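The filename layout can be illustrated with a short Python sketch. The exact naming rules are not spelled out here, so the pattern below (name, underscore, dotted version, optional underscore and browser, extension) is an assumption derived from the single example above, and parse_versioned_filename() is a hypothetical helper.

```python
import re

# Assumed layout: <name>_<version>[_<browser>].<css|js>
VERSIONED_FILE = re.compile(
    r"^(?P<name>[a-z][a-z0-9-]*)"
    r"_(?P<version>\d+(?:\.\d+)*)"
    r"(?:_(?P<browser>[a-z]+))?"
    r"\.(?P<ext>css|js)$"
)


def parse_versioned_filename(filename):
    """Return the filename components as a dict, or None if it does not match."""
    match = VERSIONED_FILE.match(filename)
    return match.groupdict() if match else None


info = parse_versioned_filename("editor_1.2.3_ie.css")
```

Here info["version"] is "1.2.3" and info["browser"] is "ie"; a file without a browser part gets None for the browser.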

Synonyms

We should allow synonyms to allow users to use the "wrong" word and still get the right page. With the canonicalized URL in the header, search engines still know what the "correct" word is.

For example, you may want to use /journal/ in your path to define your blog. Using /blog/ should work as well; only the canonicalized URL will always say /journal/ (see note 4).
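One simple way to picture synonyms is a lookup table applied to each path segment. The table SYNONYMS and the function canonical_path() below are hypothetical; this is only a sketch of the idea, not how Snap! stores synonyms.

```python
# Hypothetical synonym table: "wrong" word -> canonical word
SYNONYMS = {"blog": "journal"}


def canonical_path(path):
    """Replace synonym segments with their canonical spelling.

    Every segment is checked, which is a simplification; a real
    system may only want to map specific positions in the path.
    """
    return "/".join(SYNONYMS.get(segment, segment) for segment in path.split("/"))
```

Both canonical_path("/blog/") and canonical_path("/journal/") return "/journal/", which is the value that would go in the canonical metadata.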

Default Option

Some parameters need to support a default value to know whether the option is being forced by the end user (i.e. en.m2osw.com/... for the English version,) or defined by some automatic selection (i.e. user browser language is set.)

Not only that, it is important to understand that the browser language definition will be ignored when the user specifies the language in the URL. We may use the same parameter name with a somewhat different syntax when it is set to the default as opposed to the forced value. We created the canonicalize_revision() function for that purpose.

The order for the language (and it should be followed for other options) is:

Language defined as a sub-domain (en.example.com)
Language defined as in the path (www.example.com/en/)
Language defined as a GET variable (www.example.com?lang=en)
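The order above can be sketched as follows in Python. The language table, the function name select_language(), and the returned "forced" flag are assumptions for this example. The sketch also only looks at the left-most host label and the first path segment, which is a simplification.

```python
from urllib.parse import parse_qs, urlsplit

LANGUAGES = {"en", "fr", "de"}  # hypothetical list of supported languages


def select_language(url, browser_language=None):
    """Return (language, forced) following sub-domain, path, GET order."""
    parts = urlsplit(url)

    # 1. language as a sub-domain (en.example.com)
    labels = (parts.hostname or "").split(".")
    if labels[0] in LANGUAGES:
        return labels[0], True

    # 2. language in the path (www.example.com/en/)
    segments = [s for s in parts.path.split("/") if s]
    if segments and segments[0] in LANGUAGES:
        return segments[0], True

    # 3. language as a GET variable (www.example.com?lang=en)
    lang = parse_qs(parts.query).get("lang", [None])[0]
    if lang in LANGUAGES:
        return lang, True

    # nothing in the URL: fall back to the browser setting (not forced)
    return browser_language, False
```

The forced flag distinguishes a language chosen in the URL from one picked automatically from the browser settings.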

Note that in most cases an option is not used to find the data in the database.

Search of the page

Now that we have all the elements canonicalized, we can search for the corresponding page.

Since each page has a reference back to the website that it was created in, we know which website it pertains to. Now we can check other things such as permissions to access the website, then the page, the elements in the page, etc.

Page Not Found?

Whenever a user enters a full path for a page, chances are it doesn't exist. The path plugin searches using the following mechanisms:

Check the database with the canonicalized URL: protocol, sub-domain, domain, path. If that works, then the plugin attached to that page has its execute() function called and that is the result. The execute() function may then make use of additional parameters such as the language, branch, and revision information.
Otherwise the path plugin calls the dynamic_path() event to let any one plugin capture the page. If that works, it is viewed as a dynamic page (i.e. a page without data saved in the database.) This process is extremely fast since it only happens in memory and only with the plugins that offer dynamic content which is fairly limited.
If no page was found yet, the path plugin sends a page_not_found() signal which another plugin can implement. For example, a search plugin could break up the words in the URL and search on those words.
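The three mechanisms above can be summarized with a short Python sketch. It is only a model of the control flow: the database is a plain dictionary, and the dynamic_path() event and page_not_found() signal are represented by lists of ordinary callables. The name find_page() and the result labels are made up for this example.

```python
def find_page(key, database, dynamic_handlers, not_found_handlers):
    """Try the database, then dynamic handlers, then not-found handlers."""
    # 1. a page saved in the database under the canonicalized URL
    if key in database:
        return "static", database[key]

    # 2. stand-in for the dynamic_path() event: first handler to answer wins
    for handler in dynamic_handlers:
        result = handler(key)
        if result is not None:
            return "dynamic", result

    # 3. stand-in for the page_not_found() signal (e.g. a search plugin)
    for handler in not_found_handlers:
        result = handler(key)
        if result is not None:
            return "recovered", result

    return "not_found", None
```

A handler returns None when it does not recognize the key, which lets the next mechanism have a try.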

The following represents that complete search, except for the protocol:

Protocol -- Domain -- Sub-Domain 1, 2, 3 -- Path 1, 2, 3
Protocol -- Domain -- Sub-Domain 1, 2, 3 -- Path 1, 2
Protocol -- Domain -- Sub-Domain 1, 2, 3 -- Path 1
Protocol -- Domain -- Sub-Domain 1, 2, 3 -- No Path (index)

Protocol -- Domain -- Sub-Domain 1, 2 -- Path 1, 2, 3
Protocol -- Domain -- Sub-Domain 1, 2 -- Path 1, 2
Protocol -- Domain -- Sub-Domain 1, 2 -- Path 1
Protocol -- Domain -- Sub-Domain 1, 2 -- No Path (index)

Protocol -- Domain -- Sub-Domain 1 -- Path 1, 2, 3
Protocol -- Domain -- Sub-Domain 1 -- Path 1, 2
Protocol -- Domain -- Sub-Domain 1 -- Path 1
Protocol -- Domain -- Sub-Domain 1 -- No Path (index)

Protocol -- Domain -- Path 1, 2, 3
Protocol -- Domain -- Path 1, 2
Protocol -- Domain -- Path 1
Protocol -- Domain -- No Path (index)

All those tests should be repeated switching the protocol from the current one (say HTTPS) to the other one (say HTTP).
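The search order listed above, including the second pass with the other protocol, can be generated mechanically. The following Python sketch assumes the sub-domains are already in search order (closest to the domain first); search_candidates() is a hypothetical name.

```python
def search_candidates(protocol, domain, sub_domains, path):
    """Yield (protocol, domain, sub_domains, path) tuples in search order."""
    other = "http" if protocol == "https" else "https"
    for proto in (protocol, other):
        # drop sub-domains one at a time, starting with all of them
        for n_sub in range(len(sub_domains), -1, -1):
            # for each sub-domain set, shorten the path down to the index
            for n_path in range(len(path), -1, -1):
                yield (
                    proto,
                    domain,
                    tuple(sub_domains[:n_sub]),
                    tuple(path[:n_path]),
                )


candidates = list(
    search_candidates("https", "m2osw.com", ["website", "snap", "my"], ["a", "b", "c"])
)
```

With three sub-domains and three path elements this produces the 16 combinations listed above for the current protocol, followed by the same 16 for the other protocol.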

This search only happens on Page not Found errors so it doesn't matter too much.

The result of a Page not Found search is to either redirect the user to an existing page (302) or to present a Page not Found error. What to do will be defined in that specific website definition.

Assuming we can create a clean way to properly sort out all of those definitions in the correct order each time (i.e. the "--" would always represent something smaller than whatever can appear in the path) then we can do the search with a single readRows() call requesting the cell defining the website row name.

Website Definition and what we return

In most CMSs, there is a back door that lets administrators set the current website in maintenance mode. In that case, we're expected to return a 503 and possibly a time when the server will be available again.

This means depending on cookies, we may either show the website normally (to admins) or not (to anyone else.)
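A minimal Python sketch of that decision follows. The site dictionary, its keys, and the is_admin flag (which would be derived from the cookies) are assumptions for this example. Retry-After is the standard HTTP header for announcing when a service should be available again.

```python
def maintenance_response(site, is_admin):
    """Return (status, headers) for a site that may be in maintenance mode."""
    if not site.get("maintenance") or is_admin:
        # normal display: site is live, or the visitor is an administrator
        return 200, {}
    headers = {}
    if "retry_after" in site:
        # number of seconds until the site is expected to be back
        headers["Retry-After"] = str(site["retry_after"])
    return 503, headers
```

For example, a site defined as {"maintenance": True, "retry_after": 3600} returns 503 with a Retry-After of 3600 to regular visitors and 200 to administrators.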

Sub-domains and Page not Found

When a page is found with fewer sub-domains than defined in the website domain name, we want to verify that each missing sub-domain is indeed a website parameter such as the language, user area, admin panels, etc.

If it is not, then the behavior will depend on the website definition, but by default it is expected that the website will not accept unknown sub-domains and instead a "website does not exist" error is generated (i.e. yes! this is not the same as a Page not Found error!). In this case, the only error code we can return is still 404, but the error message should be Website not Found.

Note that if the resource is found but not yet available, then we would not enter this very case (i.e. the user creates a page and saves it, but does not yet publish it.)
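The default check on missing sub-domains can be sketched like this in Python. The set KNOWN_PARAMETERS and the function check_missing_sub_domains() are hypothetical, and both lists are assumed to be in search order with the matched list being a prefix of the requested one.

```python
# Hypothetical sub-domains that act as website parameters
KNOWN_PARAMETERS = {"en", "fr", "admin", "user"}


def check_missing_sub_domains(requested, matched, known=KNOWN_PARAMETERS):
    """Return (status, message) for a page found with fewer sub-domains."""
    # the sub-domains that were dropped before a website matched
    missing = requested[len(matched):]
    unknown = [name for name in missing if name not in known]
    if unknown:
        # still a 404, but with a distinct message
        return 404, "Website not Found"
    return 200, "OK"
```

A request with the sub-domains www and en that matches on www alone is accepted, whereas one with www and bogus gets the Website not Found message.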

Drupal frivolous Page not Found non-errors

In Drupal, using extraneous path entries returns a page anyway. For example, the path /home exists and represents the home page. The path /home/plus/random/names will match the /home URL and thus will return the home page.

There is a huge problem with this scheme as it allows hackers to access your site with millions of URIs that do not exist, and each time you return a 200 code, so they can continue to kick your website like crazy.
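The difference can be shown with a simplified Python model. This is not Drupal's code, only a sketch of the behavior described above next to a strict lookup; PAGES, prefix_lookup(), and exact_lookup() are names made up for this example.

```python
PAGES = {"/home": "home page"}  # hypothetical site content


def normalize(path):
    """Rebuild the path without empty segments or a trailing slash."""
    return "/" + "/".join(s for s in path.split("/") if s)


def prefix_lookup(path):
    """Lenient lookup: drop trailing segments until some page matches."""
    segments = [s for s in path.split("/") if s]
    while segments:
        candidate = "/" + "/".join(segments)
        if candidate in PAGES:
            return 200, PAGES[candidate]
        segments.pop()
    return 404, None


def exact_lookup(path):
    """Strict lookup: only the exact path is accepted."""
    key = normalize(path)
    if key in PAGES:
        return 200, PAGES[key]
    return 404, None
```

With the lenient lookup, /home/plus/random/names returns 200 and the home page; with the strict lookup the same path returns 404.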

Old Search Documentation

The test uses, in that order:

1. The full URL
2. The URL without protocol
3. The URL with the last part of the URL path removed
4. Repeat point 3 until only the domain and sub-domains remain
5. Repeat point 1 to 5 after the left-most sub-domain was removed

In our example, version 0.9 of the website would certainly match with this URL:

example.net/0.9

The en. part will be handled internally to select the language.

Note that selectors such as the en. sub-domain can appear on either or both sides. For example, the URL http://www.example.net/0.9/en/documentation/ could be set up to work the same way. The handling of such selectors is explained in a separate page.

For added speed, and to avoid total craziness, we want to limit the sub-directory depth to two by default. Thus the system would ignore basic-url/protocol in a path such as 0.9/en/documentation/basic-url/protocol.

Remember that Apache is changing the path using a rewrite. This means FastCGI gives us a GET variable with the query (i.e. http://en.example.net/snap.cgi?path=0.9/documentation)
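As a sketch of what the receiving side sees, the following Python function rebuilds the protocol, host, and path from CGI-style environment variables. HTTPS, HTTP_HOST, and QUERY_STRING are the usual variable names under Apache; the function name request_from_cgi() and the returned tuple are assumptions for this example.

```python
from urllib.parse import parse_qs


def request_from_cgi(environ):
    """Return (protocol, host, path_segments) from CGI-style variables."""
    # Apache sets HTTPS=on for secure requests
    protocol = "https" if environ.get("HTTPS", "").lower() == "on" else "http"
    host = environ.get("HTTP_HOST", "")
    # the rewrite rule passes the original path in the "path" variable
    query = parse_qs(environ.get("QUERY_STRING", ""))
    path = query.get("path", [""])[0]
    return protocol, host, [s for s in path.split("/") if s]


request = request_from_cgi(
    {"HTTP_HOST": "en.example.net", "QUERY_STRING": "path=0.9/documentation"}
)
```

For the example above, request holds the protocol http, the host en.example.net, and the path elements 0.9 and documentation.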

Note:

This concept is taken from Drupal.

1. Note that with Apache2 this information comes as a separate variable: HTTPS=on as the HTTP URI does not include the protocol. The Apache server knows that it is secure when it receives the data via an SSL/TLS connection.
2. A 404 on totally invalid URLs is most certainly wiser than showing a default site. Note that some hackers will hit your website with a URL that is not managed by your server. Answering those requests at all is not wise. These should not even return 404, but one of the 500 errors.
3. This is done using the libtld library which knows of all the valid domain name extensions currently in use in the whole world.
4. The canonicalized URL is shown in the canonical meta data found in the header of the HTML returned by the server.