We want to generate an XML sitemap with all the pages that the system offers.
Since the pages are not all clearly defined (e.g. a feature may generate hundreds of pages), it is important to allow plug-ins to define their own XML sitemap entries.
For example, the taxonomy feature allows you to categorize a page using tags (taxonomy page titles). Pages that were categorized can then be listed by the taxonomy plug-in on generated pages. Such generated pages would not automatically appear in the XML sitemap since they do not represent a page per se.
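A minimal sketch of what such a plug-in hook could look like; the interface name, the entry structure, and the generate_sitemap_entries() call are illustrative assumptions, not the actual Snap! API.

```cpp
// Hypothetical interface a plug-in could implement to contribute its own
// entries to the XML sitemap; names are illustrative only.
#include <string>
#include <vector>

struct sitemap_entry
{
    std::string loc;         // absolute URL of the page
    std::string lastmod;     // last modification date (W3C format, e.g. 2013-05-01)
    std::string changefreq;  // always, hourly, daily, weekly, monthly, yearly, never
    double      priority;    // 0.0 to 1.0, the protocol default being 0.5
};

class sitemap_provider
{
public:
    virtual ~sitemap_provider() {}

    // Each plug-in appends the entries it is responsible for (e.g. the
    // taxonomy plug-in would add its generated tag listing pages here.)
    virtual void generate_sitemap_entries(std::vector<sitemap_entry> & entries) = 0;
};
```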
The XML sitemap format was first developed by Google. It now has its own website: sitemaps.org.
The XML sitemap can include page change frequency information, which can also be saved in the metadata of the page.
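A sketch of how that frequency could travel with the page; the key/value metadata accessor shown here is a stand-in for the real content storage, not the actual Snap! API.

```cpp
#include <map>
#include <string>

// Hypothetical page metadata: a simple key/value map standing in for the
// CMS content storage.
typedef std::map<std::string, std::string> page_metadata;

// Save the user-chosen (or computed) change frequency with the page so the
// sitemap generator can pick it up later.
void save_changefreq(page_metadata & meta, std::string const & freq)
{
    meta["sitemap::changefreq"] = freq;
}

// Read it back, falling back to a default when the page never defined one.
std::string load_changefreq(page_metadata const & meta)
{
    page_metadata::const_iterator it(meta.find("sitemap::changefreq"));
    return it == meta.end() ? "daily" : it->second;
}
```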
It should be possible to choose what is going to be published in the XML sitemap. This includes ways to mark pages, page types, terms, etc. as published or not published in the XML sitemap.
It is to be noted that all the URLs listed in one sitemap must fall within the scope of that sitemap. In other words, if you access a sitemap under http://www.example.com/snap/sitemap.xml, then you cannot list a URL such as https://www.example.com/ or http://www.example.com/blog, because they do not share the sitemap's scheme and path: a sitemap can only reference URLs that live under the directory where the sitemap itself resides.
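A hedged sketch of that scope check: a URL may only appear in a sitemap if it starts with the sitemap's own directory. The helper name is made up and the string handling is deliberately naive.

```cpp
#include <iostream>
#include <string>

// Return true if `url` is allowed in a sitemap located at `sitemap_url`:
// same scheme, same host, and a path under the sitemap's own directory.
// (Very naive parsing, for illustration only.)
bool allowed_in_sitemap(std::string const & sitemap_url, std::string const & url)
{
    // the sitemap scope is everything up to and including the last '/'
    std::string::size_type const slash(sitemap_url.rfind('/'));
    if(slash == std::string::npos)
    {
        return false;
    }
    std::string const scope(sitemap_url.substr(0, slash + 1));
    return url.compare(0, scope.length(), scope) == 0;
}

int main()
{
    std::string const sitemap("http://www.example.com/snap/sitemap.xml");
    std::cout << allowed_in_sitemap(sitemap, "http://www.example.com/snap/page1") << "\n"; // 1
    std::cout << allowed_in_sitemap(sitemap, "https://www.example.com/")          << "\n"; // 0
    std::cout << allowed_in_sitemap(sitemap, "http://www.example.com/blog")       << "\n"; // 0
    return 0;
}
```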
We certainly want users to be able to specify their own frequencies, but we should also keep an eye on how often a page actually gets modified and update our frequency accordingly. For example, we could set the frequency to 1 day by default. If a page has not changed after 1 week, its frequency is set to 1 week. Similarly, after 1 month, 3 months, 6 months, or 1 year without changes, set the frequency to those respective values. If the page starts changing more often again, make sure to lower the frequency back so search engines pick up the updates.
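A sketch of that aging rule, assuming we track the number of days since the last modification; the thresholds follow the steps mentioned above and the function name is illustrative.

```cpp
#include <string>

// Map the time elapsed since the last modification to a sitemap change
// frequency: start at daily and age the page toward yearly the longer it
// stays untouched. A fresh change resets the clock, which automatically
// brings the frequency back down.
std::string compute_changefreq(int days_since_last_change)
{
    if(days_since_last_change < 7)
    {
        return "daily";
    }
    if(days_since_last_change < 30)
    {
        return "weekly";
    }
    if(days_since_last_change < 365)
    {
        // covers the 1 month, 3 months, and 6 months steps; the sitemap
        // protocol only offers "monthly" between "weekly" and "yearly"
        return "monthly";
    }
    return "yearly";
}
```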
A combination of a counter of how often a page is changed and the time it was last changed can be used to define a page priority. When changed recently, the page should certainly have a much higher priority (i.e. be checked again by search engines sooner rather than later). A page that was changed many times is probably more important than a page that was created and then left alone.
For blogs, the last 10 or so posts should have a higher priority and the mechanism described would give us what's expected.
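A possible formula for the priority heuristic just described; the weights and thresholds are assumptions picked for illustration, the sitemap protocol only mandates a 0.0 to 1.0 range with a default of 0.5.

```cpp
#include <algorithm>

// Combine how often a page changed with how recently it changed to obtain
// a sitemap priority between 0.0 and 1.0 (0.5 being the protocol default).
// The weights below are arbitrary starting points, not tuned values.
double compute_priority(int change_count, int days_since_last_change)
{
    double priority(0.5);

    // frequently edited pages are probably more important
    priority += std::min(change_count, 20) * 0.01;        // up to +0.2

    // recently edited pages should be revisited sooner
    if(days_since_last_change <= 7)
    {
        priority += 0.2;
    }
    else if(days_since_last_change <= 30)
    {
        priority += 0.1;
    }

    // clamp to the valid range
    return std::max(0.0, std::min(1.0, priority));
}
```

With this scheme the last 10 or so posts of a blog, being both recent and frequently touched, naturally end up with the highest priorities.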
It is best not to resubmit the XML sitemap to search engines every time a user saves content. It is better to wait until the user stops making updates (at least for a while), then submit the XML sitemap to the search engines, and most certainly prevent a new submission for one whole day.
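A rough sketch of that throttling: only ping the search engines once the site has been quiet for a while, and never more than once a day. The quiet period of one hour is an assumption, and the timestamps are placeholders for whatever the CMS actually records.

```cpp
#include <ctime>

// Decide whether the sitemap should be (re)submitted to the search engines.
// `last_edit` is the time of the most recent content change, `last_submit`
// the time of the previous submission.
bool should_submit_sitemap(std::time_t last_edit, std::time_t last_submit, std::time_t now)
{
    int const quiet_period(60 * 60);          // wait for 1 hour without edits (assumption)
    int const min_interval(24 * 60 * 60);     // never submit more than once a day

    bool const site_is_quiet(now - last_edit >= quiet_period);
    bool const not_too_soon(now - last_submit >= min_interval);

    return site_is_quiet && not_too_soon;
}
```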
Unpublishing, deleting, or otherwise marking a page as not available to anonymous users should quickly be reported to the XML sitemap plug-in so the page gets removed from the map. This is particularly important since inviting a search engine to index a page that it cannot access is not the best idea ever.
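One way the removal could be wired, assuming the sitemap plug-in keeps a cached list of entries and other plug-ins notify it when a page stops being publicly visible; the class and its members are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical cached list of sitemap URLs kept by the sitemap plug-in.
class sitemap_cache
{
public:
    void add(std::string const & url)
    {
        f_urls.push_back(url);
    }

    // Called when a page is unpublished, deleted, or hidden from anonymous
    // visitors so it immediately disappears from the generated sitemap.
    void on_page_hidden(std::string const & url)
    {
        f_urls.erase(std::remove(f_urls.begin(), f_urls.end(), url), f_urls.end());
        f_dirty = true;  // regenerate (and eventually resubmit) the sitemap
    }

    bool dirty() const { return f_dirty; }

private:
    std::vector<std::string> f_urls;
    bool f_dirty = false;
};
```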
A long term goal will be to look into optimizations for larger websites. It may be a good idea to create two sitemaps and an index to avoid having robots re-download all the sitemap data all the time. Robots will tend to ignore the second sitemap if it is marked as much less important than the first (especially if all the pages in that second sitemap are marked as unmodified for a very long time...)
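For reference, a minimal sketch of how a sitemap index tying several sitemap files together could be emitted; the URLs are examples only.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Emit a sitemap index referencing several sitemap files, so robots can
// fetch only the parts that changed instead of one huge sitemap every time.
void write_sitemap_index(std::ostream & out, std::vector<std::string> const & sitemap_urls)
{
    out << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    out << "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
    for(std::string const & url : sitemap_urls)
    {
        out << "  <sitemap><loc>" << url << "</loc></sitemap>\n";
    }
    out << "</sitemapindex>\n";
}

int main()
{
    std::vector<std::string> sitemaps;
    sitemaps.push_back("http://www.example.com/sitemap1.xml");  // frequently changing pages
    sitemaps.push_back("http://www.example.com/sitemap2.xml");  // rarely changing pages
    write_sitemap_index(std::cout, sitemaps);
    return 0;
}
```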
The sitemap can also be returned as a text file with one URL per line. The file must be encoded in UTF-8. Using an XSLT file, we can generate that text file from our existing XML files. This format is not useful for general search engines, but it can be helpful for some people to get a complete list of all their website pages in an easy to use format.
We want to keep information about what the XML sitemap plug-in does, such as when the sitemap is submitted, when it is accessed by robots, when it is updated, etc.
Sitemaps can be indicated in your robots.txt file (which is already done in Snap!) and a sitemap URL defined in robots.txt is trusted. This means it can point anywhere in the world.
As a result, we can offload the sitemaps to a different server (e.g. http://sitemap.example.com/), which would alleviate the load on the standard website servers.