Snap! Websites
An Open Source CMS System in C++
The Snap! Websites system is driven by the URL used to access the site. Everything in the URL may play a role in deciding what is going to be displayed. Other browser parameters, such as the browser language, may also affect the data returned; however, only the URL is used to determine which website you are accessing.
Note: This is fully implemented and works as expected (domain & website determination and full discovery of the website concerned.)
The following URL:
http://en.example.net/0.9/documentation
has the following parts defined in it:
http - this is the protocol; we support HTTP and HTTPS
en - this represents a sub-domain, in this case it represents a language too
example.net - this is the domain name
0.9 - this is part of the URL path
documentation - another part of the URL path
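As a rough illustration of the breakdown above, the following sketch splits a URL into its protocol, host, and path segments. The `url_parts` structure and `split_url()` function are hypothetical names used only for this example; they are not part of the Snap! code base, and the sketch ignores ports, query strings, and anchors.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the pieces of a URL (illustration only).
struct url_parts
{
    std::string              protocol;  // e.g. "http"
    std::string              host;      // e.g. "en.example.net"
    std::vector<std::string> path;      // e.g. { "0.9", "documentation" }
};

// Split "protocol://host/seg1/seg2" into its parts.
// Returns false when the "://" separator is missing.
bool split_url(std::string const & url, url_parts & out)
{
    std::size_t const sep(url.find("://"));
    if(sep == std::string::npos)
    {
        return false;
    }
    out.protocol = url.substr(0, sep);

    // the host runs from after "://" up to the first '/' (or the end)
    std::size_t const host_start(sep + 3);
    std::size_t const slash(url.find('/', host_start));
    out.host = slash == std::string::npos
                    ? url.substr(host_start)
                    : url.substr(host_start, slash - host_start);

    // everything after the host is cut on '/'; empty segments are dropped
    out.path.clear();
    if(slash != std::string::npos)
    {
        std::istringstream in(url.substr(slash + 1));
        std::string segment;
        while(std::getline(in, segment, '/'))
        {
            if(!segment.empty())
            {
                out.path.push_back(segment);
            }
        }
    }
    return true;
}
```

With the example URL, this yields the protocol "http", the host "en.example.net", and the two path segments "0.9" and "documentation". Separating the sub-domain from the domain needs the table of extensions discussed further down.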
Each part is used to determine the website to be displayed. First all the parts are used. Then each part is removed one by one until a website is found. If no site can be found this way, we can either use a default site or return a 404.
First, we want to canonicalize the URL. This way we make sure that identical URLs written in several different ways all match the same entry as expected. For example the plus (+) character or the "%20" sequence are both changed to a space. All three are considered to be exactly the same character and we are expected to use the space once canonicalized.
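A minimal sketch of that character canonicalization could look as follows. The function name `decode_url_text()` is an assumption made for this example; it only handles the plus sign and %XX sequences, and leaves a malformed % sequence untouched.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: decode '+' and %XX sequences so that "a+b",
// "a%20b" and "a b" all end up as the same canonical text.
std::string decode_url_text(std::string const & in)
{
    // convert one hexadecimal digit to its value, -1 if not a digit
    auto hex = [](char c) -> int
    {
        if(c >= '0' && c <= '9') return c - '0';
        if(c >= 'a' && c <= 'f') return c - 'a' + 10;
        if(c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };

    std::string out;
    out.reserve(in.size());
    for(std::size_t i(0); i < in.size(); ++i)
    {
        if(in[i] == '+')
        {
            out += ' ';
        }
        else if(in[i] == '%'
             && i + 2 < in.size()
             && hex(in[i + 1]) >= 0
             && hex(in[i + 2]) >= 0)
        {
            // a valid %XX sequence becomes the corresponding byte
            out += static_cast<char>(hex(in[i + 1]) * 16 + hex(in[i + 2]));
            i += 2;
        }
        else
        {
            // anything else (including a stray '%') is kept as is
            out += in[i];
        }
    }
    return out;
}
```

Called with "my+page", "my%20page", or "my page", the function returns "my page" in all three cases.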
The canonicalization is done in such a way that the search of the website will be done with one read and the first result shall be the correct result. Once the characters are canonicalized, we then canonicalize the different parts as follows:
The protocol is kept as in the URL. So it is set to "http" or "https". At this time we do not support any other protocols.
Using tables of possible extensions, just the domain is extracted from the request. For example, m2osw.co.uk is the complete domain name. Our system needs to know that .co.uk is a valid domain extension.
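The extraction can be sketched as a longest-suffix match against the table of extensions. Everything named here (`host_parts`, `split_host()`, and the small `extensions` table passed by the caller) is hypothetical and only meant to show the idea; a real table would have to list every valid extension.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result: the sub-domains and the complete domain name.
struct host_parts
{
    std::string sub_domains;   // e.g. "www" (may be empty)
    std::string domain;        // e.g. "m2osw.co.uk"
};

// `extensions` is an illustrative table such as { ".co.uk", ".com" }.
host_parts split_host(std::string const & host
                    , std::vector<std::string> const & extensions)
{
    // pick the longest extension matching the end of the host name
    std::string best;
    for(auto const & ext : extensions)
    {
        if(host.size() > ext.size()
        && host.compare(host.size() - ext.size(), ext.size(), ext) == 0
        && ext.size() > best.size())
        {
            best = ext;
        }
    }
    if(best.empty())
    {
        // unknown extension: keep the host whole
        return { std::string(), host };
    }

    // the domain is the single label found just before the extension
    std::size_t const end(host.size() - best.size());
    std::size_t const dot(host.rfind('.', end - 1));
    if(dot == std::string::npos)
    {
        // no sub-domain at all
        return { std::string(), host };
    }
    return { host.substr(0, dot), host.substr(dot + 1) };
}
```

With a table holding ".uk", ".co.uk", and ".com", the host www.m2osw.co.uk gives the sub-domain "www" and the domain "m2osw.co.uk", because the longest matching extension wins over ".uk".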
Note that the sub-domains are kept separate because they may be removed from the search request, whereas the domain cannot be reduced any further.
Some systems (e.g. Google) invert the domain parts as well. I do not see the need here, especially since many domains will be .com and having them all together is not an advantage for us.
To move in the right order in the search, the sub-domains need to be sorted in reverse order (i.e. the sub-domains in my.snap.website.m2osw.com become: website snap my.)
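The reversal itself is simple; here is a small sketch, assuming the sub-domains arrive as one dot-separated string. The name `reverse_sub_domains()` is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: cut "my.snap.website" on the dots and return the
// labels in reverse order, i.e. { "website", "snap", "my" }.
std::vector<std::string> reverse_sub_domains(std::string const & sub_domains)
{
    std::vector<std::string> labels;
    std::istringstream in(sub_domains);
    std::string label;
    while(std::getline(in, label, '.'))
    {
        if(!label.empty())
        {
            labels.push_back(label);
        }
    }
    std::reverse(labels.begin(), labels.end());
    return labels;
}
```

The label closest to the domain comes first, so dropping sub-domains from the end of the list removes the least significant one first.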
It will be the responsibility of the maintainer to not use invalid sub-domains here. For example, if sub-domain language support is offered, then no website can match the language sub-domain part (but at this point we cannot know that it is a language, so we have to include it...)
Question: Is the www extension a special case?
Contrary to the Sub-domains, the path elements are already in the correct order so we do not need to change those.
When a full domain name is used, then the path is not used in the canonicalization result. It only becomes part of the site key (i.e. the row with the content for that one page.)
We also support GET variables as is. These appear in the query string.
Note that we have to support one variable that is always ignored. In many cases that variable is called v (?v=123). It is used to bypass caches for files such as CSS files, which you do by setting the variable to a random value. This version feature in the query string has been replaced: the system now always uses a version in the filename, as in:
http://www.example.com/css/editor/editor_1.2.3_ie.css
The version is automatically managed by the system when you upload a JavaScript or CSS file making it really easy for website developers to handle those files.
As you can see that filename also includes a browser name to avoid, as much as possible, specifics of one browser in CSS files. It allows CSS files to be smaller and thus load faster and consoles to not generate as many warnings.
We should allow synonyms to allow users to use the "wrong" word and still get the right page. With the canonicalized URL in the header, search engines still know what the "correct" word is.
For example, you may want to use /journal/ in your path to define your blog. Using /blog/ should work as well, only the canonicalized URL will always say /journal/.
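One way to picture the synonym support is a lookup table from the "wrong" word to the canonical one. The sketch below assumes such a table is available as a `std::map`; the function name `canonical_segment()` and the table contents are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: return the canonical word for one path segment,
// or the segment itself when it has no synonym entry.
std::string canonical_segment(
          std::string const & segment
        , std::map<std::string, std::string> const & synonyms)
{
    auto const it(synonyms.find(segment));
    return it == synonyms.end() ? segment : it->second;
}
```

With a table mapping "blog" to "journal", a request for /blog/ resolves to the /journal/ page, while /journal/ itself and any unrelated segment pass through unchanged.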
Some parameters need to support a default value to know whether the option is being forced by the end user (i.e. en.m2osw.com/... for the English version,) and not defined by some automatic selection (i.e. user browser language is set.)
Not only that, it is important to understand that the browser language definition will be ignored when the user specifies the language in the URL. We may use the same parameter name with a somewhat different syntax when set to the default as opposed to the forced value. We created the canonicalize_revision() function for that purpose.
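The distinction between a forced and an automatic value can be sketched as below. This only covers the two cases described above (language in the URL versus browser language); the `language_choice` structure and `choose_language()` function are assumptions made for this example and say nothing about how canonicalize_revision() works.

```cpp
#include <cassert>
#include <string>

// Hypothetical result: the selected language and how it was selected.
struct language_choice
{
    std::string code;     // e.g. "en"
    bool        forced;   // true when the user asked for it in the URL
};

// An empty `url_language` means the URL did not specify a language.
language_choice choose_language(
          std::string const & url_language
        , std::string const & browser_language)
{
    if(!url_language.empty())
    {
        // the URL wins; the browser setting is ignored
        return { url_language, true };
    }
    // automatic selection from the browser setting
    return { browser_language, false };
}
```

Keeping the `forced` flag next to the value lets later code tell en.m2osw.com (forced) apart from a visitor whose browser merely prefers English.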
The order for the language (and it should be followed for other options) is:
Note that in most cases an option is not used to find the data in the database.
Now that we have all the elements canonicalized, we can search for the corresponding page.
Since each page has a reference back to the website that it was created in, we know which website it pertains to. Now we can check other things such as permissions to access the website, then the page, the elements in the page, etc.
Whenever a user enters a full path for the page chances are it doesn't exist. The path plugin searches using the following mechanisms:
The following represents that complete search, except for the protocol:
Protocol -- Domain -- Sub-Domain 1, 2, 3 -- Path 1, 2, 3
Protocol -- Domain -- Sub-Domain 1, 2, 3 -- Path 1, 2
Protocol -- Domain -- Sub-Domain 1, 2, 3 -- Path 1
Protocol -- Domain -- Sub-Domain 1, 2, 3 -- No Path (index)
Protocol -- Domain -- Sub-Domain 1, 2 -- Path 1, 2, 3
Protocol -- Domain -- Sub-Domain 1, 2 -- Path 1, 2
Protocol -- Domain -- Sub-Domain 1, 2 -- Path 1
Protocol -- Domain -- Sub-Domain 1, 2 -- No Path (index)
Protocol -- Domain -- Sub-Domain 1 -- Path 1, 2, 3
Protocol -- Domain -- Sub-Domain 1 -- Path 1, 2
Protocol -- Domain -- Sub-Domain 1 -- Path 1
Protocol -- Domain -- Sub-Domain 1 -- No Path (index)
Protocol -- Domain -- Path 1, 2, 3
Protocol -- Domain -- Path 1, 2
Protocol -- Domain -- Path 1
Protocol -- Domain -- No Path (index)
All those tests should be repeated switching the protocol from the current one (say HTTPS) to the other one (say HTTP).
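The list above can be generated mechanically: for each number of sub-domains kept (from all down to none), try each number of path segments kept (from all down to none). The sketch below does this for one protocol. The key layout "protocol--domain--sub-domains--path" and the function names are assumptions made for this example, not the actual row key format.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Join the first `count` items of `items` with `sep` between them.
std::string join_first(std::vector<std::string> const & items
                     , std::size_t count
                     , char sep)
{
    std::string result;
    for(std::size_t i(0); i < count; ++i)
    {
        if(i != 0)
        {
            result += sep;
        }
        result += items[i];
    }
    return result;
}

// Hypothetical key layout: protocol--domain--sub-domains--path.
// `sub_domains` is expected in search order (i.e. already reversed).
std::vector<std::string> search_candidates(
          std::string const & protocol
        , std::string const & domain
        , std::vector<std::string> const & sub_domains
        , std::vector<std::string> const & path)
{
    std::vector<std::string> keys;
    // s = number of sub-domains dropped, p = number of path parts dropped
    for(std::size_t s(0); s <= sub_domains.size(); ++s)
    {
        for(std::size_t p(0); p <= path.size(); ++p)
        {
            keys.push_back(protocol + "--" + domain
                + "--" + join_first(sub_domains, sub_domains.size() - s, ' ')
                + "--" + join_first(path, path.size() - p, '/'));
        }
    }
    return keys;
}
```

For http://en.example.net/0.9/documentation this produces six keys, from the most specific (sub-domain "en" with the full path) to the least specific (domain only, no path). To cover the other protocol as described above, the same function can be called a second time with "https".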
This search only happens on Page not Found errors so it doesn't matter too much.
The result of a Page not Found search is to either redirect the user to an existing page (302) or to present a Page not Found error. What to do will be defined in that specific website definition.
Assuming we can create a clean way to properly sort out all of those definitions in the correct order each time (i.e. the "--" would always represent something smaller than whatever can appear in the path) then we can do the search with a single readRows() call requesting the cell defining the website row name.
In most CMS, there is a back door for administrators to be able to set the current website in a maintenance mode. In that case, we're expected to return a 503 and possibly a time when the server will be available again.
This means depending on cookies, we may either show the website normally (to admins) or not (to anyone else.)
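A minimal sketch of that decision, assuming the cookie check has already told us whether the visitor is an administrator. The `maintenance_reply` structure, the `check_maintenance()` function, and the `back_at` parameter are hypothetical; `back_at` would be sent in the standard HTTP Retry-After header, which accepts either a number of seconds or an HTTP date.

```cpp
#include <cassert>
#include <string>

// Hypothetical reply: status code plus an optional Retry-After value.
struct maintenance_reply
{
    int         status;        // HTTP status code
    std::string retry_after;   // Retry-After header value, may be empty
};

// `is_admin` stands for whatever the cookie/session check decided; how
// that check is performed is outside the scope of this sketch.
maintenance_reply check_maintenance(bool in_maintenance
                                  , bool is_admin
                                  , std::string const & back_at)
{
    if(in_maintenance && !is_admin)
    {
        // everyone but administrators gets a 503
        return { 503, back_at };
    }
    // administrators (or a site not in maintenance) proceed normally
    return { 200, std::string() };
}
```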
When a page is found with fewer sub-domains than defined in the website domain name, we want to verify that each missing sub-domain is indeed a website parameter such as the language, user area, admin panels, etc.
If it is not, then the behavior will depend on the website definition, but by default it is expected that the website will not accept unknown sub-domains and instead a "website does not exist" error is generated. (i.e. yes! this is not the same as a Page not Found error!) In this case, the only error code we can return is still 404. But the error message should be Website not Found.
Note that if the resource is found but not yet available, then we would not enter this very case (i.e. the user creates a page and Saves it, but does not yet publish it.)
In Drupal, using extraneous path entries returns a page anyway. For example, the path /home exists and represents the home page. The path /home/plus/random/names will match the /home URL and thus will return the home page.
There is a huge problem with this scheme: it allows hackers to access your site with millions of URIs that do not exist, and each time you return a 200 code, so they can continue to kick your website like crazy.
The test uses, in that order:
In our example, version 0.9 of the website would certainly match with this URL:
example.net/0.9
The en. part will be handled internally to select the language.
Note that selectors such as the en. sub-domain can appear on either or both sides. For example, the URL http://www.example.net/0.9/en/documentation/ could be set up to work the same way. The handling of such selectors is explained in a separate page.
For added speed, and to avoid total craziness, we want to limit the sub-directories to two by default. Thus the system would ignore basic-url/protocol in a path such as 0.9/en/documentation/basic-url/protocol.
Remember that Apache is changing the path using a rewrite. This means FastCGI gives us a GET variable with the query (i.e. http://en.example.net/snap.cgi?path=0.9/documentation)
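Retrieving the original path from such a rewritten request means reading one GET variable out of the query string. The sketch below shows one way to do it; `query_value()` is a hypothetical helper, and it performs no percent-decoding.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: return the value of GET variable `name` found in
// `query` (e.g. "a=1&b=2"), or an empty string when it is missing.
std::string query_value(std::string const & query, std::string const & name)
{
    std::size_t pos(0);
    while(pos <= query.size())
    {
        // isolate the next "name=value" pair
        std::size_t end(query.find('&', pos));
        if(end == std::string::npos)
        {
            end = query.size();
        }
        std::string const pair(query.substr(pos, end - pos));

        // compare the part before '=' with the requested name
        std::size_t const eq(pair.find('='));
        if(eq != std::string::npos && pair.compare(0, eq, name) == 0)
        {
            return pair.substr(eq + 1);
        }
        pos = end + 1;
    }
    return std::string();
}
```

For the example above, looking up "path" in the query string "path=0.9/documentation" returns "0.9/documentation", which can then be cut into path segments as usual.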
Note:
This concept is taken from Drupal.