URL Test

Sun
10/16/11

This page is a proof of concept more for myself than for the community although it can be useful for you to further understand the search mechanism offered by the URL.

URL Parts

The URL may include several different types of parts that are defined in detail below.

http://user:password@en.3.5.example.com:port/software/snap?page=1

The parts of this URL are:

http -- protocol
user:password -- authentication, always an option
en.3.5. -- sub-domain
port -- port, must also be defined in Apache
example.com -- domain and TLD
software/snap -- path
page=1 -- extra options
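
As a rough illustration, the parts listed above can be separated with Python's standard urllib.parse module. This is only a sketch: the port 8080 stands in for the "port" placeholder, split_url is a hypothetical helper, and taking the last two labels as the domain is an assumption (Snap! relies on tld() from libtld, which also handles multi-label TLDs).

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the parts named in this section."""
    u = urlsplit(url)
    labels = u.hostname.split(".")
    # Assumption: domain + TLD are the last two labels. The real code
    # uses libtld's tld() instead of this shortcut.
    domain = ".".join(labels[-2:])
    sub_domain = "".join(label + "." for label in labels[:-2])
    return {
        "protocol": u.scheme,
        "authentication": (u.username, u.password),
        "sub_domain": sub_domain,
        "domain": domain,
        "port": u.port,
        "path": u.path.lstrip("/"),
        "options": u.query,
    }

parts = split_url("http://user:password@en.3.5.example.com:8080/software/snap?page=1")
```

With this input, parts["sub_domain"] is "en.3.5." and parts["domain"] is "example.com", matching the list above.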

In the Cassandra database we make use of 3 layers:

Domain
Website
Content (generally speaking: pages)

The Domain makes use of (a) the domain and TLD part (example.com) as its key, where the TLD is extracted using the tld() function (from the libtld library); and (b) the sub-domain part to determine which website is being accessed. For example, a sub-domain may be used as a website name (e.g. foo.snap.m2osw.com, where "foo" defines a specific website.)

The Website makes use of all these parts to determine the base of the key to use to find a page of Content although in many cases the path. Of course, the website cannot remove required fields that were determined by the domain check process. (We'd have to determine how that can be verified as we create domains and websites. We may instead use the rewritten URL as the Domain sees it--with flags removed--and not use the original URI)

The Content makes use of the results from the Website computation and appends the path and extra options that were not marked as being flags (TBD--we may have to do that the other way around for the extra options, although it is a good thing to generate an error Page not Found if a parameter is wrong.)

The Content is used to determine which module needs to run to generate the output page.

Domain

The domain is the base part that was registered with a registrar. For example, snapwebsites.org is a domain.

The domain is extracted from the URL in order to use it as a search key to learn how the other parts of the URL need to be typed. We do not want to enforce one specific scheme and therefore we need to have such a distinction.

NOTE

The domain name (including all the sub-domain parts) is always case insensitive so we only deal with lowercase characters and assume that the input is UTF-8. This is part of the canonicalization.

Options

Different sections of the URL may be considered an option. Options may not be used with all the paths of the website, but they are defined in a symmetrical way for the entire site (there is no way around it when the options are part of the URL or you enter a can of worms, I think.)

Options do not take part in the search of the page to be displayed by the URL. Instead, they define parameters that control how the page is displayed. For example, the language may be specified as a sub-domain name (e.g. the "en." part in en.m2osw.com.)

Options are given names and as we find them we assign the URL value (i.e. in case of the language, we would have options["language"] = "en"). When the user does not specify a value, the website can define a default (including empty when it makes sense.)

Website Parts

What we call a website part is a static entry that is used to determine the website being hit.

Quite often, this is a sub-domain, but it can also be the first few names in the path. Such a path allows you to create one website that uses different technologies. For example, you may have written a book in HTML and you want to keep it as is.

Group Parts

Group parts are essentially the same as Website Parts except that they are dynamic in the sense that multiple websites can ensue from one group part.

The different terms are used because a website part is a static string such as "www." or "snap." and it is said to be well defined, whereas a group part is a regular expression such as "[a-z0-9]+\." which is said to be dynamic.

IMPORTANT NOTE

Note that a group part may use a regular expression and match different sub-domains or paths, but in the end it will be transformed to one static string. For example, the "www." could result from "w.", "ww.", "wwww." as well. Therefore we can support a certain amount of dynamism and still have a Website part.
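
A minimal sketch of that idea, assuming the "w{1,4}\." expression used later in this page: several spellings are accepted, yet the website part always ends up as the single static string "www.". The function name website_part is hypothetical.

```python
import re

# Any of "w.", "ww.", "www." or "wwww." is accepted...
WWW = re.compile(r"w{1,4}\.")

def website_part(sub_domain):
    """Return the static website part, or None when there is no match."""
    # ...but the result is always the one static string "www.".
    return "www." if WWW.fullmatch(sub_domain) else None
```

The regular expression gives some dynamism on input, while the lookup key stays unique.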

Path

The remainder of the path (opposed to the names defined in the path that represent options) is not used to determine the website. The path is used to determine the page to be displayed though.

Note that some people put the language of a page at the end of the path as in:

page.es.html

This concept can work with Snap! Websites, but in most cases the parts are cut at slashes, not any other character. Also, using path as options at the end of the path is likely to cause fork problems, where URLs don't match very well with a directory tree structure.

/here/page.es
/here/page/sub-page.es

In that example you can notice that the sub-page is not exactly at the right place. It should instead be:

/here/page.es/sub-page.es

On the other hand, having the language as the first element in the path looks like this:

/es/here/page
/es/here/page/sub-page

and that's a good directory tree.

Parameters (query string)

After the path, you may have parameters (i.e. the & separated variable=value entries after a question mark.) At this point, these are all considered options except for the path which is indicated with the q variable (as in [q]uery.) [1]

To Be Noted

The q parameter was not chosen by chance. It is the variable name used by Google Search and by Drupal to represent a path. This is important for proper search engine optimization if you do not want to use clean URLs.
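
As a sketch of how the query string could be separated into the path (the q variable) and the options, using Python's urllib.parse. The helper name split_query and the choice of keeping the last value of a repeated variable are assumptions made for this example.

```python
from urllib.parse import parse_qs

def split_query(query, path_var="q"):
    """Separate the path (the q variable) from the other parameters."""
    params = parse_qs(query, keep_blank_values=True)
    path = params.pop(path_var, [""])[0]
    # Every remaining variable is treated as an option.
    # Assumption: when a variable is repeated, the last value wins.
    options = {name: values[-1] for name, values in params.items()}
    return path, options

path, options = split_query("q=software/snap&page=1&lang=en")
```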

Final Canonicalizations

There are two canonicalizations:

Page Canonicalization

The page canonicalization is used to find the page in the database. This canonicalization strips out all the flags / options and we end up with a clean URL and a set of variables. For example, the URL http://en.m2osw.com/ may be transformed to www.m2osw.com and the language variable set to "en".

URL Canonicalization

The URL Canonicalization is used to tell search engines what the exact URL of a page is. Suppose a user accesses a page with http://en.m2osw.com/ when you actually want the language to appear in the path, as in http://www.m2osw.com/en/, and both URLs should work without a redirect. In that case you put the second URL in the page header to let search engines know that, even though they accessed the page with http://en.m2osw.com/, they are expected to index it as http://www.m2osw.com/en/. (This is important to avoid duplicate "errors" too!)

In order to properly canonicalize a URL we need to have all the parameters in a very specific order so as to find the exact same page whatever the original URL.

This means the system must support a way to put the parts together in one specific place. For example, if you support a language, you could have 4 places where it appears:

http://en.m2osw.com/
http://www.m2osw.com/en/
http://www.m2osw.com/index.html.en
http://www.m2osw.com/?lang=en

In these examples, the result is 100% the same. The system returns the exact same HTML page that includes a canonical META tag with one of these URLs.

For example, if the official language specification is right after the domain name, the canonicalized URL is: http://www.m2osw.com/en/.
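
The following sketch shows one way the four forms could be reduced to the same canonical URL, assuming the official position is the first path element. The language list, the "lang" variable name, and the treatment of index.html as the root page are assumptions for this example, not the Snap! implementation.

```python
from urllib.parse import urlsplit, parse_qs

LANGUAGES = {"en", "fr", "de", "es"}   # assumed set of supported languages

def canonical_url(url):
    """Move the language, wherever it was found, to the first path element."""
    u = urlsplit(url)
    labels = u.hostname.split(".")
    segments = [s for s in u.path.split("/") if s]
    lang_param = parse_qs(u.query).get("lang", [""])[0]
    language = None
    if labels[0] in LANGUAGES:                       # http://en.m2osw.com/
        language, labels[0] = labels[0], "www"
    elif segments and segments[0] in LANGUAGES:      # /en/
        language = segments.pop(0)
    elif (segments and "." in segments[-1]
          and segments[-1].rpartition(".")[2] in LANGUAGES):
        # index.html.en: the language is the last extension
        segments[-1], _, language = segments[-1].rpartition(".")
    elif lang_param in LANGUAGES:                    # ?lang=en
        language = lang_param
    if segments == ["index.html"]:
        segments = []            # assumption: index.html is the root page
    parts = ([language] if language else []) + segments
    return f"{u.scheme}://{'.'.join(labels)}/" + "".join(p + "/" for p in parts)

urls = [
    "http://en.m2osw.com/",
    "http://www.m2osw.com/en/",
    "http://www.m2osw.com/index.html.en",
    "http://www.m2osw.com/?lang=en",
]
results = {canonical_url(u) for u in urls}
```

All four inputs produce the single value http://www.m2osw.com/en/, which is the URL that would go in the page header.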

Now, our process needs to detect the language in any of these places and, once properly recognized, put it in the proper canonicalized location to find the corresponding data and create the META tag. To do so, our description of the domain name and sub-domains, the path and query strings, makes use of the namespace capability. Parts that are expected to be used for the URL canonicalization are made part of the outer canonalize namespace (as opposed to the domain, website, path, and options namespaces.)

Process

1) Break up the raw URL into groups:

Protocol (force uppercase)
Sub-domains (force lowercase)
Domain (force lowercase)
Path
Parameters

The Path and Parameters case is not touched by default. It can be forced on a per option or on the entire path/parameters if requested. Note that it is common to force the path to all lowercase in the Unix world, not so much under MS-Windows.

2) Search for Domain in the Domain table, if not found, return a very simple 404 Unknown Domain (see Error Pages feature [core]), and that should prevent useless caching. (Hackers do send you requests with invalid domain names! We may even add their IP in a Ban list that goes to our firewall.)

3) Using the domain settings, parse all the Sub-domains for options, websites, and groups. This creates a new URL which only includes website and group parameters. This new URL is used to search for the Website (it is the key in the Website table.)

4) Search for Website, if not found, return a very simple 404 Unknown Website. This error page, however, can include a choice to go to a default website (e.g. the user arrived on snap.m2osw.com which does not represent a website, so we offer to send them to www.m2osw.com instead.) However, we do not want to do an auto-redirect because if we're here the combination of sub-domains is wrong.

5) Now that we have access to a Website entry, check the Protocol, Path and Parameters (as mentioned, the path is the "q" parameter, this parameter is handled automatically.) One Website entry from the Website table may define a large number of websites (i.e. many paths.)

For details about this step, look at the URL Parts section described earlier.

6) Once the website and group parts were determined, reassemble them with the protocol, sub-domains, and domain (protocol only if the website options say so, i.e. if not marked as "either".)

7) Search for the website from the rebuilt simplified canonicalized URL as defined in (6).

7.1) If the search fails, try again with the input in lowercase against the lowercase version (some form of internal redirect to allow case insensitive URLs.)

8) Load the website info so we have it at hand.

9) Using the website unique identifier (we probably want to use the URI as defined in (7) although we may also want to do an MD5, not too sure what would be faster!), search for the page (i.e. website ID + path)

10) Execute the page and return the result.
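
The order of the lookups in this process can be sketched as follows. The three dictionaries are hypothetical in-memory stand-ins for the Cassandra tables, and serve() assumes the URL was already split and rewritten; none of this is the actual Snap! code.

```python
# Hypothetical stand-ins for the Domain, Website and Content tables.
DOMAIN_TABLE = {"m2osw.com"}
WEBSITE_TABLE = {"www.m2osw.com": "site-1"}
CONTENT_TABLE = {("site-1", "software/snap"): "<html>Snap!</html>"}

def serve(domain, website_key, path):
    """Return (status, body) following the lookup order described above."""
    if domain not in DOMAIN_TABLE:
        return 404, "Unknown Domain"
    site = WEBSITE_TABLE.get(website_key)
    if site is None:
        # Retry in lowercase to allow case insensitive URLs.
        site = WEBSITE_TABLE.get(website_key.lower())
    if site is None:
        return 404, "Unknown Website"
    # The page key is the website identifier plus the path.
    page = CONTENT_TABLE.get((site, path))
    if page is None:
        return 404, "Page Not Found"
    return 200, page
```

Failing early on the domain keeps requests for unknown domains cheap, which is the point of the very simple 404 page.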

Tables

Domain Table

Here is a simplified version of the domain table.

Domain Table
Column Name	Value Description
domains	Rules as defined below. (script of sub-domain rules)
domains_compiled	Compiled sub-domains rules. (binary variable size structures)

Keys to the data in the Domain Table are the domain name with the TLD, but no protocol, no sub-domain, no path (i.e. we use "example.com" and not "www.example.com").

The <rule #> parameter is used to order the sub_domains definitions (i.e. Column Name is an integer so we can sort by counting starting at 1. 0 is reserved for the rules definition saved as a string.) When saving a new script, we remove all the existing sub-domain rules, compile the script, and create new entries. It is important to understand that all the rules are defined in order, so the order is important because the first match is the one used.

Sub-Domain search

The Domain Table defines a list of sub_domains that it supports. It is not expected to be growing dynamically (especially because a corresponding DNS entry needs to exist,) only as users define new websites or groups of websites. The sub-domain definitions include a small structure that defines the different features each sub-domain or group supports.

For example, the w., ww., www., wwww. and <nothing> could be expressed as:

optional www = website("w{1,4}\.", "www");

This is one website (not a group) as defined by the other settings. Since multiple entries can be accepted, we always replace that entry, even when <nothing> is used, with the second parameter defined in the website function ("www" here.) The <nothing> is represented by the fact that the variable is set using the optional keyword.

Another example, m2osw.com supports hosting with URLs named "<sitename>.snap.m2osw.com". This can be represented as a group with:

required host = "[-a-z0-9]+\.snap\.";

In this case, it is marked as a group and thus each domain name that matches this expression is viewed as a different website. Note that this host name is required.

Of course you may have a mix of options, websites, and group matching. Assume that a set of regular expressions defines a rule and that each can be defined as an option, a website, or a group. In that case we can, for example, define the language and a version in the list of sub-domain names as in:

optional version = flag("([0-9]+\.[0-9]+)\.", "1.0"); — Version (option)
required host = "[a-zA-Z_][a-zA-Z0-9_]*\."; — Project name (group, defines a website)
optional language = flag("([a-z][a-z]\.)", ""); — 2 letter language (option)
optional www = website("w{1,4}\.", "www"); — www (defines a website)

To check on these, the system concatenates the regular expressions and adds a ^ at the very beginning and a $ at the very end then matches the result against all the sub-domains defined in the URL. If it matches, we found the options and website.

Note that the concatenation takes many more things into account:

For required variables, use the expression without adding a <nothing> part.
For optional variables, add "(|" at the start and ")" at the end.
For the options that are not marked with website(), make sure we capture the content (a regex part) and save the captured value in the corresponding variable. In case of a website() we always set the variable to the second string.
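
The concatenation described above can be sketched as follows. The tuple layout of RULES and the function names are assumptions for this example, and the trailing dot is stripped from captured values for readability; the compiled form used by Snap! is a binary structure, not Python tuples.

```python
import re

# (name, kind, required, regex, value): "value" is the website name for
# website() rules and the default for flag() rules. Hypothetical layout.
RULES = [
    ("version", "flag",    False, r"[0-9]+\.[0-9]+\.", "1.0"),
    ("host",    "group",   True,  r"[a-z][a-z0-9]*\.", None),
    ("www",     "website", False, r"w{1,4}\.",         "www"),
]

def build_pattern(rules):
    """Concatenate the expressions, wrapping optional ones in "(|" ... ")"."""
    parts = []
    for name, kind, required, regex, value in rules:
        group = f"(?P<{name}>{regex})"
        parts.append(group if required else f"(|{group})")
    return re.compile("^" + "".join(parts) + "$")

def parse_sub_domains(sub_domains, rules=RULES):
    """Return the variables found, or None if the sub-domains do not match."""
    m = build_pattern(rules).match(sub_domains)
    if m is None:
        return None
    variables = {}
    for name, kind, required, regex, value in rules:
        captured = m.group(name)
        if kind == "website" or not captured:
            variables[name] = value        # website name, or flag default
        else:
            variables[name] = captured.rstrip(".")
    return variables
```

For instance, parse_sub_domains("3.5.snap.ww.") yields the version "3.5", the host "snap" and the website part "www".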

URL Canonicalization

The domain, website, path, and options (query string) canonicalization works by gathering all the variables from each part. In some cases, some variables overlap (are defined multiple times.) For example, our previous example with a language specification defines the language variable 4 times:

domain
path
extension
option

Now our goal is to create a unique URL in the end. For that purpose, we must define ONE of these positions for the language as the official position.

In order to allow for so many definitions, we want to make use of namespaces. The main namespace is called canonalize and it includes any number of namespaces used to distinguish each part of the URL. At least we'll have the domain and website namespaces. Others can be added as required when a website (or even a domain) is to define multiple areas in the path or options.

Now assuming that the path position is the official position of the language, a URL defining the language in one of the other positions is converted to the path position. In other words, the URL:

http://en.m2osw.com/

is canonicalized into:

http://www.m2osw.com/en/

This works by indicating the preferred (official) language variable. To do so we use the namespace notation to move that specific definition to the global namespace (i.e. the canonalize namespace.)

A default variable definition goes like this:

optional language = flag('(en|fr|de)', 'en');

In this case the language variable appears in the innermost namespace (although unspecified, the default namespace is well defined: domain or website.)

optional canonalize::language = flag('(en|fr|de)', 'en');

Here the variable language is forced into the canonalize namespace, which makes it the official definition: the one that will be used to rebuild the canonicalized URL.

Problem: The language specification could appear more than once and be contradictory. For example, the URL: http://en.m2osw.com/fr/index.html.de?lang=es specifies 4 languages... What's the correct solution in this case?

IMPLEMENTATION

The implementation uses a string that the administrator can edit as required. That string is compiled at the time it is modified. The string is what the administrator sees. The compiled result is what the system sees and uses when receiving a hit.

The administrator string is a Domain Rules script using the following syntax:

start: rule_list

rule_list: rule
         | rule_list rule
rule: name '{' sub_domain_list '}' ';'

sub_domain_list: sub_domain
               | sub_domain_list sub_domain
               | namespace name '{' sub_domain_list '}'
sub_domain: optional sub_domain_var ';'
          | required sub_domain_var ';'
sub_domain_var: qualified_name '=' string
              | qualified_name '=' website '(' string ',' string ')'
              | qualified_name '=' flag '(' string [ ',' string ] ')'

qualified_name: name
              | qualified_name '::' name

name: [a-z_][a-z0-9_]*

string: '"' [^"]* '"'

*Note: The string definition is very simplified. We support the full C/C++ syntax.
       The namespace syntax is not yet supported as the qualified names are sufficient.

In other words, a rule is a list of variables set to a string or the result of a function call. The variable names are used later in the Snap! processing (you can add your own to interact with your own plug-ins.) For example, a variable named version can be used to determine which version of a file to present to the end user.

The keyword before the variable name determines whether the content is optional or required. Note that the fact that it is an option (i.e. FLAG) doesn't make a parameter optional in the URL.

The website() function is used for variables representing a website. The second parameter string is the exact content used to search for the website. For example, if you accept "<nothing>, w, ww, www, wwww", you need to change that into one specific value: "www". For example, you could write:

optional www = website("w{1,4}\.", "www.");

to represent that the www sub-domain variable matches any "www" sub-domain representation, but the result is always set to "www." so we can find a website whose name is "www.example.com" even when the user entered "ww.example.com" (and we are not forcing a redirect in Apache, which is not required when we define the canonical URL in the page header.)

The flag() function is used with variables that represent an option. Those are ignored when searching the website since these are options used later in the process (i.e. the language of the page.) In this case, the second string represents the default value which is used if the option is missing in the URL. The second parameter is used only if the OPTIONAL keyword was used and there was no match for that option.

When neither the website() nor the flag() functions are used then the parameter represents a group of websites. This means the resulting value of that variable is to be used as is in the search of the website.

Example:

long_form {
  required host = "[a-z0-9]+\.";
  optional version = flag("[0-9]+\.[0-9]+\.", "1.0.");
  required www = website("w{1,4}\.", "www");
  namespace canonalize {
    optional language = flag("[a-z][a-z]\.", "");
  }
};

Details about the long_form rule:

host — Define a set of hosts (a group of websites.) Whatever the variable host is set to will be used to search the corresponding website.

version — Offer an optional version with two numbers such as 3.12. When the version is not specified in the URI, use 1.0. as the default. Note that you could also define two variables: major and minor, instead of just one version parameter.

www — Accept 1 to 4 "w" in the following sub-domain (i.e. w., ww., www., or wwww.) In this example the www variable is mandatory and we do not allow the empty string.

language — Accept an optional language name. If the user doesn't specify a language on the URI, then set the parameter to an empty string which the language plugin will interpret as: Use the language specified in the user browser.

The URL canonicalization of the language string always happens in this location (because it is defined in the "canonalize" namespace.)

This example would match a URL such as test.3.59.w.en.example.org. Since the version and language parameters are options, it would transform the URI (Page Canonicalization) by removing those two parameters. Also we have one website entry which is always replaced with www. The resulting name of the website is test.www.example.org. Since the first sub-domain name matches any combination of letters and digits, the name could be changed to any word and still work: snap.6.22.wwww.fr.example.org would become snap.www.example.org. Notice how the "w" and "wwww" become "www" and thus make the website entry unique even though multiple entries are matched.
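
The transformation described for the long_form rule can be checked with a small sketch. The regular expression is a hand-written equivalent of the rule in which every sub-domain expression ends with "\." (an assumption about the intended rule), the function name is hypothetical, and the trailing dot is stripped from option values for readability.

```python
import re

# Hand-written equivalent of long_form; optional parts use "(|" ... ")".
LONG_FORM = re.compile(
    r"^(?P<host>[a-z0-9]+\.)"
    r"(|(?P<version>[0-9]+\.[0-9]+\.))"
    r"(?P<www>w{1,4}\.)"
    r"(|(?P<language>[a-z][a-z]\.))$"
)

def page_canonicalization(host_name, domain="example.org"):
    """Return (website name, options) or None when the rule does not match."""
    sub_domains = host_name[: -len(domain)]
    m = LONG_FORM.match(sub_domains)
    if m is None:
        return None
    options = {
        "version": (m.group("version") or "1.0.").rstrip("."),
        "language": (m.group("language") or "").rstrip("."),
    }
    # The website entry is always replaced with "www".
    return m.group("host") + "www." + domain, options
```

Both test.3.59.w.en.example.org and snap.6.22.wwww.fr.example.org go through this function and come out with a "www" website entry, as described above.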

Website Table

Here is a simplified version of the website table.

Website Table
Column Name	Value Description
websites	Set of rules as entered by the end user (i.e. original script)
websites_compiled	The binary compiled set of rules.

Keys in the Website table are the URI as defined while processing the domain. Therefore the key only includes a set of sub-domains and the domain with its TLD.

WARNING:

The list of options must be 100% understood. In other words, if a user adds an option such as ?test=123 and "test" is not an acceptable option, then Snap! must fail with a 404. This is important because otherwise the search engines may assume that ?test=123 exists and you would get the same data in the output with and without it (i.e. duplication of data! although the canonicalized path in the page header should solve this problem...) On the other hand, such options could be used with external products such as Google Analytics to track what's happening but not have any other side effects on the website.

It is also possible to emit a 301 or 302, but 404 is more likely the best option. We can always offer the user to select the result on a per website basis and possibly on a per option basis (i.e. if option blah was used before and is not available anymore, then the user should be allowed to use it but redirect to a URL without it.)
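
A minimal sketch of the strict check described in this warning, assuming a hypothetical set of options accepted by one website:

```python
from urllib.parse import parse_qsl

KNOWN_OPTIONS = {"page", "lang"}   # hypothetical options of one website

def check_options(query):
    """Return (status, options); any unknown option yields a 404."""
    options = dict(parse_qsl(query, keep_blank_values=True))
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        return 404, {}
    return 200, options
```

Returning a redirect instead of the 404 would only change the first branch, which is why the choice can be left to each website.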

This can be a problem for options such as a paging query string. For example, you present a list and want to show page 3 using the option: ?page=3. In this case the page parameter is not an option and it is not required.

Protocol search

The search of the protocol checks whether the protocol is HTTP or HTTPS. We may still want to support a regular expression in case some users want to include plugins that support other protocols (i.e. gopher, ftp, etc.) but at this point that would not do very well in Snap!

If either protocol can be used then we can write "HTTP|HTTPS" (note that the protocol is changed to uppercase.) To accept any protocol whatsoever, use the "any" keyword.

Note that at this point the canonicalized URL uses the protocol string as it appears when we get a match (i.e. if you used HTTP|HTTPS then the canonicalized result is HTTP|HTTPS.) However, if the any keyword or both protocols are included, then the protocol cannot be used to search the final page.

Note: For the canonical meta tag, we will use HTTP by default unless the page requires HTTPS or the website does not support HTTP.
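
The protocol rules above can be sketched as two small functions. The function names and the page_requires_https flag are assumptions for this example.

```python
def match_protocol(rule, protocol):
    """rule is "HTTP", "HTTPS", "HTTP|HTTPS" or "any"."""
    if rule == "any":
        return True
    # The protocol is changed to uppercase before the comparison.
    return protocol.upper() in rule.split("|")

def canonical_protocol(rule, page_requires_https=False):
    """HTTP by default, unless HTTPS is required or HTTP is unsupported."""
    supports_http = rule == "any" or "HTTP" in rule.split("|")
    if page_requires_https or not supports_http:
        return "HTTPS"
    return "HTTP"
```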

Port search

The port can be checked just like the path although obviously it's only a number. Each port can determine a different website or an option.

Note that even if a port is not specified in the original URL, a port is always defined for test purposes. The default is 80 as expected for HTTP and 443 when the protocol is HTTPS.

Usage example, the Network Administrator may be given a specific port such as 8080 to access the administration of the entire Snap! installation. In this case the port number represents a different website.

Similarly, you can force your administrators to use port 8888 to access their Snap! account (for each individual website.) Logging in to their account will fail if the port is not 8888. In this case, the port is an option required by the user login mechanism to be a certain value.

Path search

The path search is essentially the same as the sub-domain search. It just makes use of the paths instead of sub-domains (and / instead of . to separate each part.) The paths search can also be an option (i.e. /en/, /u/, /1.2/, etc.) or a website / group definition (i.e. you can create two completely different websites with the same domain, but two different folders, i.e. domain.com/blog and domain.com/company).

Since the order in which rules are defined is kept, it is possible to match domain.com/blog and only if not a match, check another rule: domain.com/ (i.e. no path) and get a match on that empty string.
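
A sketch of this first-match behavior, with a hypothetical ordered list of path rules for domain.com:

```python
import re

# Ordered rules: the first pattern that matches the path wins.
PATH_RULES = [
    (re.compile(r"blog(/|$)"), "domain.com/blog"),
    (re.compile(r"company(/|$)"), "domain.com/company"),
    (re.compile(r""), "domain.com/"),   # empty string: always matches
]

def find_website(path):
    """Return the website selected by the first matching rule."""
    for pattern, website in PATH_RULES:
        if pattern.match(path):
            return website
    return None
```

Placing the empty rule last makes it a fallback; placing it first would hide the two other websites.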

Query Strings search

At this point, we're thinking that whenever query strings are defined, they have to match entries defined for that website. This is however a problem because any plugin could add support for additional query strings without having access to this part of the website search.

The other possibility is to determine a set of query strings that we want to manage in a specific way and view all the other entries as optional options (i.e. how should we manage something such as: ?page=3&order=desc).

Also, paging is a real issue when more than one list may appear on your screen. The page option should apply to the main list, other lists need a different option... (probably the name of the list, underscore page such as todo_list_page).

Yet, a query string could determine a website:

http://snap.m2osw.com/?site=abc

In this example, you'd access the website "abc" through the snap.m2osw.com domain.

Therefore we need to have such a check.

Website search

Once we determine that such and such part of the URL was an option or a group / website, we can rewrite the URL eliminating the options to do the search of the website.

So, a URL as follows:

3.5.my-site.en.m2osw.com/u/123

Assuming that 3.5, .en. and /u/ are options, is rewritten:

my-site.m2osw.com/123

How the options are going to be used is not defined here, it is defined in the website itself (3.5 may be a version, .en. is likely the language, and the /u/ could be a size selection.)
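
The rewrite in this example can be sketched as follows. The regular expression, the option names (version, language, size) and the fixed domain are assumptions made for the illustration.

```python
import re

SUB_DOMAINS = re.compile(
    r"^(?:(?P<version>[0-9]+\.[0-9]+)\.)?"   # option, e.g. "3.5."
    r"(?P<site>[-a-z0-9]+\.)"                # group: names the website
    r"(?:(?P<language>[a-z]{2})\.)?$"        # option, e.g. "en."
)

def rewrite(url, domain="m2osw.com"):
    """Remove the options and return (website search URL, options)."""
    host, _, path = url.partition("/")
    m = SUB_DOMAINS.match(host[: -len(domain)])
    if m is None:
        return None
    options = {k: m.group(k) for k in ("version", "language") if m.group(k)}
    segments = path.split("/") if path else []
    if segments and segments[0] == "u":
        options["size"] = segments.pop(0)    # hypothetical option name
    return m.group("site") + domain + "/" + "/".join(segments), options
```

Calling rewrite("3.5.my-site.en.m2osw.com/u/123") returns "my-site.m2osw.com/123" together with the three options that were removed.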

IMPLEMENTATION

Assuming we reuse a similar syntax as for the Domain search, we get a BNF that looks like this:

start: rule_list

rule_list: rule
         | rule_list rule
rule: name '{' website_rule_list '}' ';'

website_rule_list: website_rule
                 | website_rule_list website_rule
website_rule: protocol_rule ';'
            | port_rule ';'
            | path_rule ';'
            | query_rule ';'

protocol_rule: protocol '=' protocol_choice
protocol_choice: "HTTP" | "HTTPS" | "any"

port_rule: port '=' string

path_rule: path website
         | website

query_rule: query website

website: optional website_var
       | required website_var
website_var: qualified_name '=' string
           | qualified_name '=' website '(' string ',' string ')'
           | qualified_name '=' flag '(' string [ ',' string ] ')'

qualified_name: name
              | qualified_name '::' name
name: [a-z_][a-z0-9_]*

string: '"' [^"]* '"'

*Note: the string definition is very simplified. We support the C/C++ syntax.

This is very similar to the definitions of the sub-domain search. To distinguish between the protocol, port, path, and query we added 4 keywords. The path keyword is optional. The different parts can be defined in any order. However, the order of multiple entries of one part is respected and kept as is since it can influence the result.

The port is followed by a string even though it is expected to be a number. At some point we may support a full regular expression.

Note that the protocol variable is limited to "HTTP", "HTTPS" or "any", although we can accept "HTTP|HTTPS" instead of "any". At this point other protocols won't be supported, although we are open to find ways to support more protocols as we move forward.

Content Table

Here is a simplified version of the Content table.

Content Table
Column Name	Value Description
website	The website this row is part of.
title	The title of this page shown in H1 and window title

Keys are the full URI except all the parts that were determined to be options. This means the complete path, since that determines the exact name of the page to be read.

Query

Test 1

Query A: http://staff.example.net/snap/hours/2011-10

This URL is expected to show staff hours for October 2011.

The /snap/ sub-folder is used because multiple website features are running, not just Snap! Websites (some backward compatibility.)

The staff. sub-domain defines a different website than the usual www.example.net domain name.

The canonicalization of the URL is like this:

The Website table would be defined like this:

Example.net Website
Column Name	Value
protocol	Any
sub_domain	staff
start_path	snap

The Example.net canonicalized URL used as the key would be:

"B" | "example.net" | "staff" | "snap"

Query B: http://www.example.net/snap/book/seo

The URL sends the user to a book on a page named SEO.

The /snap/ part is the same as for the other website.

The sub-domain is the default www. as you'd generally expect.

The canonicalization of the URL looks like this:

"N" | "example.net" | "www" | "snap" | "book" | "seo"

If the "www" can be ignored, then another canonicalization of this very URL would look like this:

If either canonicalization matches, then we have a good entry.

We may also always want to consider "www.example.net" == "example.net" because both URLs should never be used for different websites. Similarly, the "w.", "ww.", "wwww." can all react as if they all were "www.".

1. By recompiling Snap! C++ you can rename this parameter to something else. It is otherwise hard coded. TBD -- I think we can define that on a per website basis...