Core Features

The XML is dynamically generated, so we do not check it against a DTD at this time. We will probably add that capability at some point so that, in Debug mode, we can detect mistakes made when creating the XML.

XML File Header

Since it will generally be created in memory, the XML header will generally not be seen. However, for debug purposes, our XML data can be saved in a file in which case the XML header will appear. At this point, we do not include any information about the DTD.

<?xml version="1.0"?>

Root Tag (<snap>)

The root tag is <snap>. The entire set of XML data is saved inside this <snap> tag.

No attributes are expected.

<snap>
<!-- Content: (head | page)* -->
</snap>

Note that the DTD is relatively simplified as we're expecting to always have a head and a page and in that order: head, page.
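The skeleton described above can be sketched in Python. This is only an illustration: the standard xml.etree.ElementTree module stands in for whatever DOM the server really uses, and build_skeleton is a hypothetical helper name. Only the tag names and their order come from this specification.

```python
import xml.etree.ElementTree as ET

def build_skeleton():
    """Create <snap> with <head> first and <page> second (hypothetical helper)."""
    snap = ET.Element("snap")      # root tag, no attributes expected
    ET.SubElement(snap, "head")    # HTML header data, filled in by plugins
    ET.SubElement(snap, "page")    # page parts, filled in later
    return snap

snap = build_skeleton()
print(ET.tostring(snap, encoding="unicode"))
```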

Question: Do we want to include the HTTP header in the XML file? At this point we have a clean, separate vector, which is most certainly a lot faster, although it takes away the ability to let layout end users eventually transform the data. The main question is: would that ever be necessary?

HTML Header (<head>)

Different plugins will define information in the HTML header. Snap! is expected to prevent duplication when not allowed (e.g. two plugins defining a favicon URL.)

The HTML header is defined with the <head> tag as in HTML.

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<!-- Content: (metadata | style | script)* -->
</head>

As you can see, we want to include the Content-Type immediately. All the data in Snap! will always be UTF-8 so we can safely make use of a default Content-Type tag. It will always fit all Snap! types.

Page Descriptions (<metadata>)

The <metadata> tag includes a sequence of <desc> tags that describe the page. These are transformed by the XSLT files into whatever entry fits the output format. In some cases, the data may be transformed into multiple HTML tags, as shown below for the title information.

<metadata>
<!-- Content: (desc)* -->
</metadata>

By default, a layout defines the XSLT to be used, so it would also define the set of metadata used by the page. However, it seems that this would be a repeat of the exact same data in all layouts so instead we want to have an internal XSLT that generates the content of the HTML <head> tag.

Note that some tags should always be defined in the output. (For sure the title of a page since it is required by HTML.) Other meta and link tags are optional and we can give the user a way to include them or not.

Some of the data generated in the header may actually be taken from other parts of the content. For example, the creation date of a page appears in /snap/page/body/created. There is no reason to have to repeat that information in a metadata description.

Examples of HTML tags that appear in the <head> tag:

A specification for the favorite icon would make use of a link in HTML:

<link rel="shortcut icon" href="/images/ico/snap-favicon.ico" type="image/x-icon"/>

The page title makes use of the <title> tag in HTML as well as two or three meta tags that different systems read. Notice that the <title> tag includes the name of the website, whereas the other two entries do not.

<title>The Snap! XSLT Specification | My Super Website</title>
<meta property="og:title" content="The Snap! XSLT Specification"/>
<meta name="dc.title" content="The Snap! XSLT Specification"/>
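The real transformation is done in XSLT. As a rough Python sketch of the same mapping, assuming the site name is appended with " | " as in the example above (it could also be prepended), the three tags could be produced like this. The title_tags function is a hypothetical name used only for illustration.

```python
import xml.etree.ElementTree as ET

def title_tags(page_title, site_name):
    """Return the <title> element and the two meta elements shown above."""
    title = ET.Element("title")
    # only <title> carries the website name (appending is an assumption)
    title.text = f"{page_title} | {site_name}"
    og = ET.Element("meta", {"property": "og:title", "content": page_title})
    dc = ET.Element("meta", {"name": "dc.title", "content": page_title})
    return [title, og, dc]

for tag in title_tags("The Snap! XSLT Specification", "My Super Website"):
    print(ET.tostring(tag, encoding="unicode"))
```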

Metadata Descriptions (<desc>)

The <desc> tag includes a set of sub-tags that describe the entry. It makes use of two attributes: its type (mandatory) and its name (optional.) The type is a token since it is limited to a set of supported types (although you are welcome to add your own types, of course.) Types known by the XSLT script will be converted. Unknown types are simply ignored.

<desc
  type = token
  name? = string>
<!-- Content: (data | long-data | short-data)* -->
</desc>

The name is defined only for a few cases such as User Meta Tags (i.e. meta tags defined to prove that you are the owner of the website.)

The <data> tag defines the data entry of this metadata. In case a short or long form is useful, they may also be defined with <short-data> and <long-data> respectively. For example, a user may provide a regular title (around 60 characters), a short form (20 characters or fewer), and a long form (120 characters or so.)
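To make the three forms concrete, here is a small Python sketch that reads a <desc> entry and picks one of its forms. The sample values and the pick_data helper are made up for illustration, and falling back to <data> when a form is missing is an assumption of this sketch, not something this specification states.

```python
import xml.etree.ElementTree as ET

# Sample entry; the text values are invented for this example.
DESC_XML = """<desc type="name">
  <data>Snap! A C++ Open Source CMS</data>
  <short-data>Snap!</short-data>
</desc>"""

def pick_data(desc, form="data"):
    """Return the text of <short-data>, <data>, or <long-data>."""
    tag = {"short": "short-data", "long": "long-data"}.get(form, "data")
    text = desc.findtext(tag)
    if text is None:
        # assumed fallback: use the regular form when the requested one is absent
        text = desc.findtext("data")
    return text

desc = ET.fromstring(DESC_XML)
print(desc.get("type"), "/", pick_data(desc, "short"), "/", pick_data(desc, "long"))
```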

Details about all the generated metadata can be found here: Meta Tags and Links supported by Core.

The following is a list of supported description types. Note that there aren't very many description tags supported because most of the metadata is generated from each page's data and not the website (global) data.

base_uri

The base URI of the page. This represents the folder in which the current page is defined. It is equal to website_uri when no subfolder appears in the URI. The path always ends with a slash.

The base_uri is used to build full URIs from relative href paths.

For example, the page http://www.example.com/parent-folder/page-name has a base_uri of http://www.example.com/parent-folder/.
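As a sketch of that rule, assuming the base is simply the page URI with its last path segment removed, the computation could look like this in Python. The base_uri function below is a hypothetical helper, not the server's actual code.

```python
from urllib.parse import urlsplit, urlunsplit

def base_uri(page_uri):
    """Return the folder part of page_uri, always ending with a slash."""
    parts = urlsplit(page_uri)
    # drop the last path segment (the page name) and keep its folder
    folder = parts.path.rsplit("/", 1)[0] + "/"
    return urlunsplit((parts.scheme, parts.netloc, folder, "", ""))

print(base_uri("http://www.example.com/parent-folder/page-name"))
# prints http://www.example.com/parent-folder/
```

With no subfolder, as in http://www.example.com/page-name, the result is http://www.example.com/, which matches the website_uri case described above.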

name

The global name of the website. This name is often appended or prepended to the name of the page when assigned to the title (i.e. "Page XSLT definitions | Snap! A C++ Open Source CMS".)

The name can include the short, normal, and long version of the website name.

page_uri

This URI represents the canonical URI of the page. This means using this URI you can get back to the exact same page every time.

This URI is primarily used to fill the canonical link tag (rel="canonical") in the HTML header.

user

Define a user metadata tag. This special description requires a name which is the name used in the HTML <meta> tag. Only the <data> tag is used for the content of user tags. The long and short entries are ignored.

website_uri

The URI of the website without any folder unless it is included in the base of the website. This URI always ends with a slash.

The website_uri is used to build full URIs from root-based paths (paths starting with a slash.)
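The two URIs can be combined as in the following Python sketch. It reflects our reading of the descriptions above: relative paths are appended to base_uri and root-based paths are appended to website_uri. The full_uri helper is hypothetical and deliberately naive (it does not handle ".." segments or query strings).

```python
def full_uri(href, base_uri, website_uri):
    """Build a full URI from an href found in the page (naive sketch)."""
    if "://" in href:
        return href                      # already a full URI
    if href.startswith("/"):
        # root-based path: relative to the website, which may live in a folder
        return website_uri + href[1:]
    return base_uri + href               # relative path: relative to the page folder

base = "http://www.example.com/parent-folder/"
site = "http://www.example.com/"
print(full_uri("picture.png", base, site))
print(full_uri("/images/ico/snap-favicon.ico", base, site))
```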

CSS Data (<style>)

The style tags define CSS data that is used in the output file. The data may either be inline, a URL, or a path in the Cassandra database. URLs are used for resources outside of our realm and should generally be rare.

A path in the Cassandra database may be transformed to inline data if the file content is very small (i.e. under 1Kb once compressed.) External URLs could also be made inline if small. This can be done by the backend as it can take the time to discover the content of those CSS files and bring it to the Cassandra database. However, the backend is then responsible for constantly checking the file for changes and bringing in newer versions as they become available. (Note that if the XML passed to the XSLT parser includes the url attribute, it means it was not cached in the Cassandra database yet or the cache was out of date.)

The style tag may include a path attribute in which case the specified path is data available in Cassandra. When no path attribute is found, we check the href attribute. If an href attribute is defined, then the style references a file on another server.

If neither the path nor the href attribute is defined, then the css-script (the text inside the tag) is the inline content of the CSS. Otherwise that text is ignored.
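The precedence between path, href, and inline data can be sketched as follows. This Python version only illustrates the order of the checks; the style_source name and the sample paths are made up, and the real work is expected to happen before or inside the XSLT.

```python
import xml.etree.ElementTree as ET

def style_source(style):
    """Classify a <style> element as ("path" | "href" | "inline", value)."""
    path = style.get("path")
    if path is not None:
        return ("path", path)              # data available in Cassandra
    href = style.get("href")
    if href is not None:
        return ("href", href)              # file on another server
    return ("inline", style.text or "")    # the tag text is the CSS itself

# Hypothetical sample: the text is ignored because a path is present.
sample = ET.fromstring('<style path="/css/main.css">ignored</style>')
print(style_source(sample))
```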

Script Data (<script>)

The script tags are defined as in HTML. They get copied verbatim to the output.

As with the CSS code, scripts can be compressed (variable names reduced in length, useless spaces removed, etc.) and smaller scripts can be included inline instead of being loaded with a src="..." URL. All of this is done before the XSLT transformations are applied, though.

Page (<page>)

This is similar to the HTML <body> tag with one extra layer. The extra layer defines the different parts found in the page. These parts are defined by the layout and thus will vary from layout to layout. For example, the main layout will eventually have a place to show a search form. A box layout would probably never have such a part.

A main page is expected to have at least a header, body, and footer. The body often includes left and right columns. The header often includes a menu area (for the login/register buttons) and space for a search form. The footer will generally include a legal menu or reduced sitemap along with a copyright notice and details about the business.

The other parts in a page each have a layout. These layouts generally include a content part and often have a title; some also have a footer-like area. A box on the side of the content will generally include those three things: title, content, footer (most often empty.) The title appearing in the header of the page generally defines two parts: title, sub-title.

Note

A menu is most often shown in a box. Each item can itself be laid out using a layout to know how each link is to be displayed. That sub-layout may just be a content part. However, at this point I'm thinking that the system will include all the menu items in a sequence and that the sequence will be parsed at once by one XSLT. There should be no need to offer sub-layouts for such things since each link can be defined using a <link> tag and many sub-tags to define all the parameters of the link instead of directly using a complex anchor tag.

The concept of the layouts in our environment is a tree of layouts. Each layout is an XSLT file. Each layout is applied as the part being worked on is defined and the result is passed as a new tag to the next level up until the main page layout is run and its output is saved in the snap_child object output. At that point we're ready to send the resulting data to the user.

Page Body (<body>)

I still have to determine whether this is wise, but before calling and generating all the parts, it is likely that we first want to generate the main body. The reason is that this way all the other parts have access to the information saved in the body. For example, the creation date, author name, revision information, etc. could be displayed in a box on the side without having to read that information a second time from the database (although we have a cache so it would still be fast, but there is just no reason for doing anything like that.) If possible, we may even want to get all the data ready before running any layout. This may, however, make things a little more complicated.

TBD — could it be that when running the main layout we'd also want to have the detailed information of all the other parts available? If so, we could generate all the data at once, then run each layout one by one. However, each layout would need to be capable of extracting exactly the data it needs and it may make that a tad bit more complicated. Although, a part P could have a <result> tag which is the result of applying its layout against it. That way we run QXmlQuery and append the result to the input XML for further processing.

<abstract>

A small description of the page (should fit in about 80 characters.)

<accepted>

Date when the content of this resource was accepted for publication.

Note that multiple people may have to accept the content before this date gets fully defined.

<accrual-method>

The method most often used to update this page. This can be one of:

ONLINE -- the user generally edits the page online
XML-RPC -- the user creates and updates the content offline and then uploads it using the Snap! XML-RPC protocol
EMAIL -- the user sends emails to our mail server; each email includes an identifier that attaches it to the user's website
AGGREGATED -- the content was taken from another website via RSS or some similar method
LIST -- this page is a list of some other data on the website
GENERATED -- the Snap! system generated this page

<accrual-periodicity>

How often this page is being updated. This is similar to the values given in the XML sitemap and the revisitAfter meta tag.

<accrual-policy>

This entry defines the policy used to manage this page. In most cases, an entire website uses one specific policy throughout. Snap! lets you define a different policy for the website, each content type, and each page.

<author href? = uri>

Specify the name of an author. Any number of <author> tags can be defined in the source. The href attribute is optional. If present, it can point to a page about the author. The page (and hence the URI) can be external or local.

<bookmarks>

Add links to the navigation system of advanced browsers such as SeaMonkey (and very useful for browsers used by blind people.)

This tag includes a set of <link> tags similar to the link tags defined in HTML <head>.

The rel attribute is expected to be one of the known bookmark relations, as follows:

help -- a link to the Help page for your website
glossary -- a link to your website glossary
relation -- another resource talking about the same subject
source -- the source you used to create your page of content

The href attribute is a link to the given resource.

The title is optional although often recommended. The help and glossary relations do not require it; most other bookmarks do.

<content>

This tag defines the raw content of the page. This is what is found in the database after applying static filters (filters that would not change the content dynamically when used.)

This tag is used to generate the <output> tag which is later added to the <body> tag of HTML output.

The content is expected to be valid HTML data.

<contributor href? = uri>

Specify the name of a contributor to this resource. Any number of <contributor> tags can be defined. The href attribute is optional; if present, a <link> is generated that points to a page describing the contributor. The path (and hence the URI) can be external or local.

<copyrighted>

Defines the date when this resource was officially copyrighted.

<created>

The date and time when the page was created. By default, this is the date when the user saves the page for the first time. In most cases, we allow users to change this date in case it was created earlier with another medium.

<description>

One or two sentences briefly describing the content of the resource.

This is often used in search results when no other content properly matches a search. For example, a page that's just a picture would not have other content so its description is likely going to be used in search results.

<formats>

All the formats in which this page can be offered. This generally includes a printable version, a text version, and a PDF version. Many other formats will be added with time (a LibreOffice Writer document, for example.)

The formats are defined using <f> tags defined as follows:

<f format = token type = mime-type>title</f>

The format is one of the supported formats such as "txt".

The type is the MIME type of the proposed format such as "text/plain".

The title is what is likely presented to the end users.

<identifier>

Define a unique identifier for the page.

The uniqueness may be within a page group rather than the entire website. For example, all the pages of the same book could have the same identifier (e.g. the book ISBN.) Similarly, a product page and a blog page could both talk about the same product and thus both make use of the same identifier.

<image idref = id>

Defines references to the images included in the document. In particular, it gives a reference to the main image that the user wants to share with websites such as Facebook.

The idref attribute is a reference to an image defined in the <content> tag.

The tag can also include <shortcut> tags to define the favorite icon and Apple Touch icons. Each <shortcut> tag carries the type, href, width, and height attributes described below.

All the attributes are expected to be defined.

The type defines the MIME type such as "image/x-icon" for the favicon. Nearly any image format will work here.

The href attribute points to the given image resource.

The width and height attributes define the size of the image in pixels. This is used to know whether the image is a favicon (16x16) or not (any other size and we mark the image as an Apple Touch icon.) If the favicon has multiple sizes, including 16x16, then use 16x16 for the width and height otherwise it won't be selected as a favicon.
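The size rule can be sketched in Python as follows. The <shortcut> attributes come from the description above. The shortcut_to_link helper, the sample values, and the choice of rel="apple-touch-icon" for the non-favicon case are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def shortcut_to_link(shortcut):
    """Turn a <shortcut> element into an HTML <link> element (sketch)."""
    size = (int(shortcut.get("width")), int(shortcut.get("height")))
    # 16x16 means favicon; any other size is marked as an Apple Touch icon
    rel = "shortcut icon" if size == (16, 16) else "apple-touch-icon"
    return ET.Element("link", {
        "rel": rel,
        "href": shortcut.get("href"),
        "type": shortcut.get("type"),
    })

sample = ET.fromstring(
    '<shortcut type="image/x-icon" href="/images/ico/snap-favicon.ico"'
    ' width="16" height="16"/>'
)
print(ET.tostring(shortcut_to_link(sample), encoding="unicode"))
```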

<instructions href? = uri>

Instructions on how to use one thing or another. Possibly a link to a document to learn more about the content of this resource.

<issued href? = uri>

A date when the resource was issued. This is generally used when the content of the page was officially published in a formal magazine or book online or offline.

The optional href attribute can be used to indicate the official publisher or a place where the official publication can be ordered or is described with words or pictures.

<lang>

This tag defines the language of the page (or box.)

The page language is copied to the <html> tag as the lang and xml:lang attributes. It is also copied to the language and dcterms.language meta tags.

The language of a page is always expected to be defined. If somehow it is not defined, it is set to "en" by default.

<license href? = uri>

The license covering this web-page. All the pages are assigned the same license. Use proprietary if the license does not allow copying the data. Two other common licenses are free to share, and public domain.

The optional href is used to generate a link to a page with the website detailed license when one is available.

<location href? = uri>

This information has to be worked on. The basic location is expected to be a longitude and a latitude or an address. This can be used to find the location on a map such as Google Maps. The optional href is a URI pointing to a map system using the address defined below (using the street, city, province, postal code, and country.)

The <location> tag comes with sub-tags:

<business> -- The name of the business at that location.
<city> -- The name of the city.
<country> -- The name of the country.
<jurisdiction> -- The name of the legal jurisdiction, in most cases the closest large city in the same province.
<longlat href? = uri> -- The longitude and latitude separated by a comma. The optional URI points to a map where the longitude and latitude can be used to pinpoint the location.
<postal-code> -- The postal code.
<province> -- The name of the province or state.
<street> -- The street address (usually a number, street name, and optional suite or apartment number.)

<medium>

The list of media this resource is available on. By default it is certainly Internet (i.e. that page is only available on the Internet.) A product may use Physical. Software or a video could use CD or DVD. A book could use Paper.

<modified>

The date when the page was last modified. This date generally indicates that something, whatever it is, changed on this page. However, if a box on the side has dynamically changing content, it is unlikely that this date will reflect such changes.

Note that if you change the content then the <updated> tag changes too.

<navigation href? = uri>

Define a set of navigation links. The navigation links specifically define what page comes next, what was the previous page, where the top (home) page is, etc. It is expected to be a way to navigate your entire website in a logical manner assuming your site is built as a valid top to bottom tree.

The href attribute of the navigation tag points to a website sitemap. This can be an automatic or manually created page.

The tag is composed of <link> tags which are defined pretty much the same way as the link tags found in the HTML <head>.

The rel attribute defines the relation that this link represents. It is expected to be one of "top", "up", "first", "prev", "next", or "last".

The href attribute is a URI that points to the given page.

The title attribute is what is shown to the end user. It is optional since in most cases the relation defines what the title will be.

Other similar relations are defined in the <bookmarks> tag instead. They live in a separate tag to clearly distinguish the current position in the document from links that are more global to the website as a whole.

<output>

The <output> tag is created from the <content> tag. This process occurs last as all the boxes are generated first.

<owner href? = uri>

Name of the owner of this website. This is generally the person or entity that pays for the hosting, although not always.

The href attribute is optional. It points to a resource page describing the owner of the web page.

<permission href? = uri>

This is generated by the system to describe the current permission scheme giving the user rights to see, or not see, the page. The following is a list of permissions we are thinking about using:

PUBLIC — Anyone can view this page
ANONYMOUS — Only anonymous users (as in: non-logged in) can access this page; this is similar to PUBLIC except that since it prevents logged in users from viewing the content, it can be described slightly differently
UNPUBLISHED — The page was not published yet (or got unpublished)
PASSWORD — The page requires a password for the contents to be viewed
ACCOUNT — A regular account is required to access this content
MODERATOR — A moderator account is required to view the content of the page
EDITOR — Only an editor can access the content of this page
ADMINISTRATOR — This page is reserved to administrators

If dcterms.accessRights comes to define a better list of acceptable values, we may change our scheme to theirs. This list will most certainly change over time.

The href link can be used to send users to a page describing permissions in detail.

<provenance href? = uri>

Defines where this product or idea comes from. This should only be used if you bought this website / resource and the company you bought it from still exists after your purchase (i.e. you did not buy the whole company.)

Without a uri, the system generates a meta tag, otherwise it generates a link.

<publisher href? = uri>

The name of the publisher of this resource. This is the person who is responsible for the content. The href attribute is optional and points to a resource page describing the publisher of the content.

<robots>

Defines information targeted at robots.

The tag includes two sub-tags:

The <tracking user-agent = token>flags</tracking> tag is a set of tracking flags such as NOINDEX and NOFOLLOW. Although positive flags can be included, we only output negative flags (i.e. we do not need to indicate INDEX since that's the default.)

The <changefreq>daily</changefreq> tag defines how often this page changes. By default it is set to daily, and on each run of the XML sitemap plugin against a site, the frequency is updated to hourly, monthly, yearly, or never. This value is reproduced in this tag.
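The "negative flags only" rule for <tracking> could be sketched like this. How the flags are separated inside the tag is not specified here, so the sketch takes an already split list. Treating every flag that starts with "NO" as negative is a simplifying assumption, and negative_flags is a hypothetical helper name.

```python
def negative_flags(flags):
    """Keep negative flags such as NOINDEX; drop positive ones such as INDEX."""
    # assumption: a flag is negative when its name starts with "NO"
    return [flag.upper() for flag in flags if flag.upper().startswith("NO")]

print(negative_flags(["INDEX", "NOFOLLOW"]))
# prints ['NOFOLLOW']
```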

<since>

The date from which the data is publicly visible. Note that publicly means people other than those responsible for the resource (authors, contributors, editors, administrators.)

The date can be entered by the end user to make the page available on a specific date and time.

<submitted>

The date when the content of this resource was submitted to the publisher.

By default, this date is generally the same as the creation date. In some cases, however, content could be submitted earlier than when added to the database.

<tags>

This entry includes a set of <tag> tags which describe different types of keywords associated with this page.

We support many entries such as "category", "tag", "keyword", and "glossary". These keywords are found in the taxonomy used to categorize this page. Some taxonomies are hidden and thus do not make it to the <tags> entry. For example, the permission tags are probably never shown, except to administrators who have enough permissions to see such tags.

The <tag> definition:

<tag type = token href = uri>definition</tag>

The type attribute is expected to be a valid category and thus is viewed as a token although owners of a website can add new categories as required by their website.

The href attribute is a link to the page representing this tag. Remember that all the categories are also pages, although some are secret (part of core and cannot be modified and many cannot be viewed by everyone.)

<titles>

Defines a set of titles: short, normal, and long. The normal title is used as the page title. It is also likely the one used in the page itself. In most cases, the short title is used when creating a (small) box. The long title is rarely used.

<short-title> -- a smaller title, usually will fit a box on the side of the website

<title> -- the normal title, should be as close as possible to 63 characters

<long-title> -- a possibly very long/complete title for this page

<toc href? = uri>

Defines links to the table of contents.

This tag can include a set of <entry> tags that define the chapter, section, sub-section, and appendix entries of your table of contents. The entry tag is defined as follows:

<entry element = "<type>" href = uri>title</entry>

The <type> is one of "chapter", "section", "subsection", or "appendix". The href points to the page representing that table of contents entry. The title is used as the title of the link.

TBD, maybe we'll support a <header type = token id = idref> tag which defines all the H1 to H6 tags appearing in this page. This can quickly be used to build a table of contents.

<translations mode = token>

Define a set of languages in which this page is available. This generates a set of alternate links in the HTML header.

The <translations> tag includes a set of <l> tags defined as follows:

<l lang = token>Language Name</l>

The lang attribute defines the two-letter language code such as "en" or "fr". This is used to define the language of the resource (hreflang attribute of the HTML <link> tag.) It is also used to generate the URI to that page. For example, depending on the mode attribute, the page http://www.example.com/best-friend/enjoy.html could have a French version at one of these URLs:

[sub-domain] http://fr.example.com/best-friend/enjoy.html
[extension] http://www.example.com/best-friend/enjoy.html.fr
[path] http://www.example.com/fr/best-friend/enjoy.html

The default mode is "path" and that's the one we prefer using.

The text of the tag is the name of the language this link references. The name of the language may be "English Version" or "Version française".
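The three modes can be sketched in Python as follows. The mode names are the ones shown in brackets above. The translated_uri helper is hypothetical, and replacing the first host label (the "www") in sub-domain mode is an assumption drawn from the single example given.

```python
from urllib.parse import urlsplit, urlunsplit

def translated_uri(uri, lang, mode="path"):
    """Return the URI of the lang version of a page for the given mode."""
    parts = urlsplit(uri)
    host, path = parts.netloc, parts.path
    if mode == "sub-domain":
        labels = host.split(".")
        labels[0] = lang                  # assumed: the first label is replaced
        host = ".".join(labels)
    elif mode == "extension":
        path = f"{path}.{lang}"           # language appended as an extension
    elif mode == "path":
        path = f"/{lang}{path}"           # language inserted as the first folder
    else:
        raise ValueError(f"unknown mode: {mode}")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

page = "http://www.example.com/best-friend/enjoy.html"
for mode in ("sub-domain", "extension", "path"):
    print(mode, translated_uri(page, "fr", mode))
```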

<until>

The date and time when the page gets hidden. This can be used to make a page disappear after a given date. This is useful whenever a product, event, or service has a relatively short time to live.

A page that gets hidden is automatically unpublished so only the authors, contributors, editors, and administrators have access to the content.
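Taken together with the <since> tag, the visibility window can be sketched as below. Whether each boundary is inclusive is not stated in this document, so the comparison operators are an assumption, as is the is_public helper name.

```python
from datetime import datetime

def is_public(now, since=None, until=None):
    """Tell whether a page is publicly visible at the given time (sketch)."""
    if since is not None and now < since:
        return False      # not yet made public
    if until is not None and now >= until:
        return False      # hidden again, so treated as unpublished
    return True

start = datetime(2024, 1, 1)
end = datetime(2024, 2, 1)
print(is_public(datetime(2024, 1, 15), since=start, until=end))
```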

<updated>

Indicate the last time the user edited the content of the page. This value does not represent the last modification to the page. See the <modified> tag instead.

This date is limited to the content which means the HTML that a user can edit to change the text and images appearing in the page.

Note that on a click on the Publish button, the system compares the new content to the existing content. If it is the same, the system does NOT update the <updated> timestamp.

Page Boxes (<boxes>)

Inside the page, we also add the boxes. Boxes are part of the page. They are displayed to the side of the main content (in most cases at least...)

A box is located in a specific area defined by the theme. The placement of boxes is done on a per-theme basis and it can also be complemented by the page and other plugins.

The XML defines <boxes>, which then includes one <area-name> per area, and finally one <box> per box. The <area-name> is a tag named after the area where it goes (this may change in the future.) For example, you can define a Left area so you'll get a <left> tag for that area.

The content of the <box> tag is very similar to the content of the <page> tag. It includes a set of titles and a body with the actual HTML for the box. There will be other tags we'll add later (maybe <links> and <author> for example.)

<snap>
...
<page>
    ...
    <boxes>
      <left>
        <box>
          <titles><title>Box Name</title></titles>
          <body>HTML content</body>
        </box>
      </left>
    </boxes>
</page>
</snap>

This is still preliminary but it is already functional.
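As an illustration of how the <boxes> structure can be walked, the following Python sketch lists the box titles found in each area of the sample above (with the "..." placeholders left out). The boxes_by_area helper is hypothetical and uses the standard xml.etree.ElementTree module rather than the XSLT that really processes this data.

```python
import xml.etree.ElementTree as ET

SNAP_XML = """<snap>
  <page>
    <boxes>
      <left>
        <box>
          <titles><title>Box Name</title></titles>
          <body>HTML content</body>
        </box>
      </left>
    </boxes>
  </page>
</snap>"""

def boxes_by_area(snap):
    """Map each area name (the tag under <boxes>) to its list of box titles."""
    result = {}
    for area in snap.findall("./page/boxes/*"):
        result[area.tag] = [box.findtext("titles/title") for box in area.findall("box")]
    return result

print(boxes_by_area(ET.fromstring(SNAP_XML)))
# prints {'left': ['Box Name']}
```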