robots.txt feature

Robots.txt file

We want to support a robots.txt file on a per-domain basis.

This means that if one website is visible from multiple domains, we need to be able to serve a different robots.txt for each domain, even though it is a single site.

The robots.txt should be easy to edit. That means not giving users direct access to the technical text file, but instead offering a flag on each file and folder that lets them mark whether that file or folder can be indexed by robots.

The system always provides a default version of the robots.txt. Third-party plug-ins that have specific needs can also hide their folders when the robots.txt plug-in signals them.
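As an illustration of the flag-based approach, here is a minimal sketch of how the robots.txt text could be generated for one domain. The names `path_flag`, `domain_flags` and `generate_robots_txt` are hypothetical and are not part of the actual plug-in; the sketch assumes the flags are already available in memory and that a domain without any flag receives the default version, which allows everything.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record: one file or folder and its "can be indexed" flag.
struct path_flag
{
    std::string path;       // for example "/admin/"
    bool        indexable;  // false means robots should stay away
};

// Hypothetical per-domain storage: domain name -> flagged paths.
using domain_flags = std::map<std::string, std::vector<path_flag>>;

// Build the robots.txt text for one domain. A domain without any
// non-indexable path gets the default version, which allows everything.
std::string generate_robots_txt(domain_flags const & flags, std::string const & domain)
{
    std::string result("User-agent: *\n");
    bool has_rule(false);
    auto const it(flags.find(domain));
    if(it != flags.end())
    {
        for(auto const & f : it->second)
        {
            if(!f.indexable)
            {
                result += "Disallow: " + f.path + "\n";
                has_rule = true;
            }
        }
    }
    if(!has_rule)
    {
        // an empty Disallow value means "nothing is forbidden"
        result += "Disallow:\n";
    }
    return result;
}
```

A third-party plug-in could take part in the same way by appending its own `path_flag` entries before the text is generated.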

Note that when the XML sitemap plug-in is installed, we want the robots.txt to indicate where it is found in the HTML header of the home page.

Spam

Alongside the standard usage of the robots.txt file, we want to record information about the robots that read it: at a minimum the IP address and the date and time when the robot accessed our robots.txt. The data should live long enough to still be available when the robot accesses our website again.
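The record keeping described above could look like the following sketch. The class `robot_log` and its member names are assumptions made for illustration only, and the retention period is left to the caller since the text does not specify one. The clock value is passed in explicitly so the behavior is easy to test.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

using time_point = std::chrono::system_clock::time_point;

// Hypothetical in-memory log of the clients that requested robots.txt.
class robot_log
{
public:
    // ttl is the retention period (an assumption; pick what fits the site)
    explicit robot_log(std::chrono::seconds ttl) : f_ttl(ttl) {}

    // Call whenever a client requests /robots.txt.
    void record(std::string const & ip, time_point now)
    {
        f_last_seen[ip] = now;
    }

    // True when this IP address read robots.txt within the retention period.
    bool is_known_robot(std::string const & ip, time_point now) const
    {
        auto const it(f_last_seen.find(ip));
        return it != f_last_seen.end() && now - it->second <= f_ttl;
    }

    // Drop the entries that are older than the retention period.
    void purge(time_point now)
    {
        for(auto it(f_last_seen.begin()); it != f_last_seen.end(); )
        {
            if(now - it->second > f_ttl)
            {
                it = f_last_seen.erase(it);
            }
            else
            {
                ++it;
            }
        }
    }

    std::size_t size() const { return f_last_seen.size(); }

private:
    std::chrono::seconds              f_ttl;
    std::map<std::string, time_point> f_last_seen;
};
```

A real implementation would most likely store this data in the site database rather than in memory, so that it survives restarts and is shared between servers.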

This data will help in two ways:

(1) We can detect when a robot attempts to read a page that is protected by the robots.txt file. Such behavior suggests a hacking robot rather than a nice robot. In response, we can first hide a few pages (in case the robot made a mistake) and then block that IP address (once more than a few forbidden pages were accessed).

(2) We can tell that hits from those IP addresses are hits from robots, and thus use that data in our statistics to better determine which hits are from robots and which are from people.
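The two-stage reaction in point (1) can be sketched as a small counter per IP address. The names `robot_action` and `violation_tracker` and the `tolerated` threshold are hypothetical; the text only says "a few" pages, so the exact number is an assumption left to the caller. The sketch also assumes the caller only invokes `check()` for IP addresses already known to have read robots.txt.

```cpp
#include <map>
#include <string>

enum class robot_action
{
    ALLOW,  // page is not protected, serve it
    HIDE,   // protected page requested: hide it, but tolerate the mistake
    BLOCK   // too many protected pages requested: block the IP address
};

// Hypothetical tracker counting protected page requests per IP address.
class violation_tracker
{
public:
    // tolerated: how many protected page requests are treated as mistakes
    explicit violation_tracker(int tolerated) : f_tolerated(tolerated) {}

    robot_action check(std::string const & ip, bool page_is_protected)
    {
        auto const it(f_count.find(ip));
        int const previous(it == f_count.end() ? 0 : it->second);
        if(previous > f_tolerated)
        {
            // this IP address was already blocked
            return robot_action::BLOCK;
        }
        if(!page_is_protected)
        {
            return robot_action::ALLOW;
        }
        int const count(++f_count[ip]);
        return count > f_tolerated ? robot_action::BLOCK : robot_action::HIDE;
    }

private:
    int                        f_tolerated;
    std::map<std::string, int> f_count;
};
```

With `tolerated` set to 3, the first three requests for protected pages return `HIDE` and the fourth returns `BLOCK`, after which every request from that IP address is blocked.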

Warning: Note that pages can be forbidden to a specific robot only. This means the test must include the necessary processing of the robots.txt data, for the robot in question, to make sure that we do it right.
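To show what the per-robot processing involves, here is a simplified sketch that parses robots.txt groups and checks a path for a given robot name. The function names are hypothetical. The sketch assumes exact, case-insensitive robot names, plain prefix matching, and it ignores `Allow` lines and wildcards, so it is not a complete implementation of the robots exclusion rules.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// robot name (lowercase, "*" for the default group) -> forbidden path prefixes
using robot_rules = std::map<std::string, std::vector<std::string>>;

std::string to_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

std::string trim(std::string const & s)
{
    auto const first(s.find_first_not_of(" \t\r"));
    if(first == std::string::npos) return std::string();
    auto const last(s.find_last_not_of(" \t\r"));
    return s.substr(first, last - first + 1);
}

robot_rules parse_robots_txt(std::string const & text)
{
    robot_rules rules;
    std::vector<std::string> agents;  // robots named by the current group
    bool in_rules(false);             // true once the group has a rule line
    std::istringstream in(text);
    std::string line;
    while(std::getline(in, line))
    {
        line = line.substr(0, line.find('#'));  // strip comments
        auto const colon(line.find(':'));
        if(colon == std::string::npos) continue;
        std::string const field(to_lower(trim(line.substr(0, colon))));
        std::string const value(trim(line.substr(colon + 1)));
        if(field == "user-agent")
        {
            // a User-agent line after rule lines starts a new group
            if(in_rules) { agents.clear(); in_rules = false; }
            agents.push_back(to_lower(value));
            rules[agents.back()];  // create the group, even if it stays empty
        }
        else if(field == "disallow" || field == "allow")
        {
            in_rules = true;
            // Allow lines are ignored; an empty Disallow forbids nothing
            if(field == "allow" || value.empty()) continue;
            for(auto const & a : agents) rules[a].push_back(value);
        }
    }
    return rules;
}

// A group naming the robot takes precedence over the default "*" group.
bool is_forbidden(robot_rules const & rules, std::string const & robot, std::string const & path)
{
    auto it(rules.find(to_lower(robot)));
    if(it == rules.end()) it = rules.find("*");
    if(it == rules.end()) return false;
    for(auto const & prefix : it->second)
    {
        if(path.compare(0, prefix.size(), prefix) == 0) return true;
    }
    return false;
}
```

A page would then count as a violation only when `is_forbidden()` returns true for the robot that made the request, not merely because some other robot is forbidden from reading it.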

References: Wikipedia, The Web Robots Page

Snap! Websites
An Open Source CMS System in C++