Anti-Spam feature

Whenever someone posts a message on any one of our sites (sharing data between sites is great here!), we can check whether the post looks like spam. This can be done in all sorts of ways, some of which are highlighted below.

Spammers are then added to a blacklist (IP address, plus whatever other information we can easily gather).

We can also check for spammy external links. These are links created by a website owner to another site that later reveals itself to be a bad site. For example, some websites redirect search engines to a completely different site, or the external site owner may quickly remove your backlink.

Who Does What

One of the main problems with spam detection is determining who does what. Administrators, editors, and trustworthy authors should not be marked as spammers. As on Stack Overflow, we can use a point system: someone who first comes on board can write some data, and the more points the person earns, the more that person is allowed to do.

It is also important to use what the administrators write to adjust the spam filters. You want commenters to be able to use the same or similar words as the administrators without getting banned.

A dictionary of common words can also be included to better detect whether a message is likely spam (spammers tend to write incorrectly, when they write at all).

Blocking

Block a set of entries completely. In this case, we can also block the IP address from even viewing the website by adding it to the firewall.

For example, a spammer posts on your site with a very specific URL each time (or at least a specific domain name). Add that URL to your blacklist and get that problem out of your hair immediately. We can also offer a full regular expression mechanism (e.g. I tend to block IP addresses on all non-technical sites, so posting an actual IP address blocks your message, especially in comments!)

Note that the blocking itself makes use of the security feature (see Security Considerations). How the block gets generated is what is laid out on this page.

Duplicate

When two or more posts look alike, especially across multiple websites, they are viewed as duplicates. Once one post is marked as spam, all posts that are alike are automatically discarded (not even shown as spam messages; who cares about those?!)

Note that duplicate detection should work retroactively too. If a user posts duplicates, especially on different pages or even different websites, then the very first message was most certainly spam (a 99.999% chance), and that can happen over a long period of time (some spammers do it right and post once a day instead of 1,000 times a minute). See the Quarantine option too.

Comparing two messages

It is certainly a good idea to support a way to compare messages so that small changes are not considered part of the message. This way a duplicate matches more messages. For example, all the small words can generally be removed from messages: the messages still contain what's important and will likely match better. That said, we have to watch out for a message that is an answer with a quoted message where the answer is just "Yes!" on top of the quote. That could very well be viewed as a duplicate, assuming we remove all the words of 3 letters or less.

Once we have removed the small words and cleaned up the spaces (i.e. replaced all non-breaking spaces, as well as tabs and newline characters, with a standard 0x20 space, then collapsed any run of two or more spaces into a single space), we can generate the md5 sum of the result and use that as a key in our database. Now we can (nearly) instantaneously find a reference to any duplicate, see how old it is, whether it is a straight answer to another message, etc.
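A minimal sketch of that normalization and keying. std::hash stands in for the md5 sum (a real deployment would use md5 or stronger), and lowercasing is a small extra normalization step added here:

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <sstream>
#include <string>

// Normalize a message as described above: split on any whitespace
// (tabs, newlines, and runs of spaces collapse naturally), lowercase,
// drop words of 3 letters or less, rejoin with single 0x20 spaces.
std::string normalize_message(const std::string& message)
{
    std::istringstream in(message);
    std::string word, result;
    while (in >> word)
    {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        if (word.size() <= 3)
            continue;           // small words carry little signal
        if (!result.empty())
            result += ' ';
        result += word;
    }
    return result;
}

// Key used to look up duplicates in the database; std::hash is only a
// stand-in for the md5 sum mentioned above.
std::size_t duplicate_key(const std::string& message)
{
    return std::hash<std::string>()(normalize_message(message));
}
```

With the key stored in the database, finding every message that reduces to the same normalized text becomes a single lookup.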

Quarantine

We want to have an easy way to quarantine posts. Duplicate detection lets us know when two of the same posts are found. If those two are on the same page, it's probably because the user either clicked the Send button twice or got impatient. However, when the same (or a very similar) message is posted twice on two completely different pages, there's a 99.9% chance it's spam.

This works especially well if you have many websites (hundreds) and they all receive the same message!

The Quarantine capability allows us to keep the message hidden from others for a little while. Once a certain amount of time has elapsed, we can release the message, unless we detected one or more duplicates.
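A sketch of that release decision; the record layout and the caller-supplied delay are illustrative assumptions:

```cpp
#include <cstdint>

// Hypothetical quarantine record: when the post arrived and whether a
// duplicate was detected somewhere while the post was held.
struct quarantined_post
{
    std::int64_t posted_at;       // seconds since epoch
    bool         duplicate_found; // set when a duplicate shows up
};

enum class quarantine_decision { hold, release, discard };

// Decide what to do with a held post at time `now`: discard on any
// duplicate, release once the delay has elapsed, otherwise keep holding.
quarantine_decision check_quarantine(const quarantined_post& post,
                                     std::int64_t now,
                                     std::int64_t delay_seconds)
{
    if (post.duplicate_found)
        return quarantine_decision::discard;
    if (now - post.posted_at >= delay_seconds)
        return quarantine_decision::release;
    return quarantine_decision::hold;
}
```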

Word Counting

Detect what is and is not spam by counting words. Good words contribute a favorable weight and bad words an unfavorable one.

For example, on a programmers' website, the word Finance is probably always a bad word. Similarly, the word recursive is probably not used very often on a finance website. Such words will appear in the "spam" column, whereas all the other words will appear in the "good word" column. Balancing these numbers, one can calculate the likelihood that a message is actually a spam message.
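One possible shape for that balancing, with hypothetical weights (positive counts toward "good", negative toward "spam"):

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical per-site word weights, e.g. "finance" is heavily
// negative on a programmers' site while "recursive" is positive.
using word_weights = std::map<std::string, double>;

// Sum the weights of the words found in a message into a single
// score; a score below zero leans toward spam.
double message_score(const std::string& message, const word_weights& weights)
{
    std::istringstream in(message);
    std::string word;
    double score = 0.0;
    while (in >> word)
    {
        auto it = weights.find(word);
        if (it != weights.end())
            score += it->second;
    }
    return score;
}
```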

Word Trap

Certain words will definitively be viewed as part of a spam message. A set of words can be defined to get the system to mark posts as spam. For example, Viagra could be such a word.

A certain number of words, especially misspelled ones, should be available by default, along with a very easy way to click on words in a spam message to add them to the word trap list.
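A minimal word-trap check; any hit marks the whole post as spam, and the trap words shown in the test are illustrative defaults only:

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Return true when any word of the message, lowercased, appears in
// the trap set (e.g. "viagra" and its common misspellings).
bool hits_word_trap(const std::string& message,
                    const std::set<std::string>& traps)
{
    std::istringstream in(message);
    std::string word;
    while (in >> word)
    {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        if (traps.count(word) != 0)
            return true;
    }
    return false;
}
```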

Spoofing Detection

It is possible to check words for spoofing attempts that mix characters from different Unicode scripts in the same word. This is done a lot in comments and URIs.

The ICU library comes with an interface that allows one to check for Unicode spoofing. We want to make use of that library to detect possible spam (comment content) and scams (URIs that include code or go to a hacked website).
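Until the ICU spoof-checking interface is wired in, a very small stand-in shows the idea for one script pair: flag words that mix basic Latin letters with Cyrillic codepoints (a favorite trick, e.g. a Cyrillic 'а', U+0430, inside an otherwise Latin word):

```cpp
#include <string>

// Simplified stand-in for ICU's spoof checker: detect words mixing
// ASCII Latin letters with codepoints from the Cyrillic block
// (U+0400..U+04FF). The real implementation would use ICU instead.
bool mixes_latin_and_cyrillic(const std::u32string& word)
{
    bool has_latin = false;
    bool has_cyrillic = false;
    for (char32_t c : word)
    {
        if ((c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z'))
            has_latin = true;
        else if (c >= 0x0400 && c <= 0x04FF)
            has_cyrillic = true;
    }
    return has_latin && has_cyrillic;
}
```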

Constructs

Many spam messages include something that can be viewed as a well-known construct. For example, if you forbid people from posting URLs, then the system can view URLs as spam. Similarly, IP addresses, email addresses, etc. can all be detected.

Age

It is common to increase the spam probability on older content, especially if that content is a forum or a similar type of content. It is generally very unlikely that someone comes back to a website, reads a very old thread, and decides to post a comment on it, although it happens.

The age test should take into account the different dates available to the system:

  • Date when the website was created
  • Date when the post was created
  • Date when the post was last modified
  • Date when the first comment was posted
  • Date when the last comment was posted
  • Average elapsed time between comments

Counters

Some posts may generate more spammy comments than others. Counters of the number of comments marked as spam within a single thread can be used to increase the probability that a new post is also spam. (i.e. say you define a default threshold of 90% before messages are marked as spam; having received 10 messages marked as spam in that very thread can increase that to 92%, and each additional 10 messages continues to increase the chances, up to 100%.)
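One reading of that example, sketched as a probability bump of 2% per 10 spam messages already seen in the thread, capped at 100% (the exact step size would be configurable):

```cpp
#include <algorithm>

// Bump a message's computed spam probability based on how many posts
// in the same thread were already marked as spam: +0.02 per full 10
// spam messages, never exceeding 1.0.
double adjusted_spam_probability(double base_probability,
                                 int thread_spam_count)
{
    double const bump = 0.02 * (thread_spam_count / 10);
    return std::min(1.0, base_probability + bump);
}
```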

Hidden Fields

Fields that are hidden from end users and must either include a very specific entry or be left empty are a good way to trick robots into revealing themselves.

For example, you can ask a silly question: What color is the white horse of King Arthur? and pre-fill the reply with "Got You!". If the field comes back with "Got You!", then it's not a robot, because the field was hidden and thus users couldn't have changed it. If it comes back with random data, then the post is blocked.

To make this more effective, these fields should be named and generated randomly so robots have even less of a clue about whether a field is hidden, and get caught pretty much every time.
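A sketch of the submission check, with hypothetical randomized field names; every hidden field must come back exactly as served (either the pre-filled value or empty):

```cpp
#include <map>
#include <string>

// `served` maps each hidden field name (randomly generated per form)
// to the value it was served with; a submission passes only when
// every hidden field comes back unmodified.
bool hidden_fields_ok(const std::map<std::string, std::string>& served,
                      const std::map<std::string, std::string>& submitted)
{
    for (const auto& field : served)
    {
        auto it = submitted.find(field.first);
        if (it == submitted.end() || it->second != field.second)
            return false;   // a robot dropped or rewrote the field
    }
    return true;
}
```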

Important Note

The hidden fields must still function for users who click the Back button. In other words, if we also tie a session to this field, we want to allow the same user (same IP/cookie) to use the same session multiple times.

I worked on a module called Hidden CAPTCHA for Drupal 6.x/7.x. There is now a newer module called BOTCHA that uses a similar approach with many more fields. The interesting thing is that, since I already created the form facility in Snap! C++, it will be really easy for us to add such a feature.

Bayesian Filter

As people mark messages as spam, the content of those messages can be added to a table and counted (i.e. a counter table where each column is a word found in the spammy messages).

Each time a message is posted and not viewed as spam, the counters of the corresponding words can be decremented instead.

The result is a table of probabilities, or evidential probabilities as discovered by Bayes. These are also called Bayesian probabilities.

New messages can be matched against the Bayesian table; if a message matches many of the words previously found in spammy messages, it gets marked as likely spam. The owner of the website can then decide whether he/she thinks the message is actually not spam.
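The matching step can be sketched with the classic naive Bayes combination of per-word spam probabilities; words never seen before count as a neutral 0.5 (see Erroneous Results of Young Bayesian Statistics below):

```cpp
#include <map>
#include <sstream>
#include <string>

// Combine the per-word spam probabilities from the counter table into
// one message-level probability: P = prod(p) / (prod(p) + prod(1-p)).
double bayesian_spam_probability(
    const std::string& message,
    const std::map<std::string, double>& word_spam_probability)
{
    std::istringstream in(message);
    std::string word;
    double spam = 1.0;
    double ham = 1.0;
    while (in >> word)
    {
        auto it = word_spam_probability.find(word);
        double const p =
            (it == word_spam_probability.end()) ? 0.5 : it->second;
        spam *= p;          // evidence the message is spam
        ham *= 1.0 - p;     // evidence the message is legitimate
    }
    return spam / (spam + ham);
}
```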

Along with the Quarantine module, we can add the message's words to the table once the quarantine's allowed time has elapsed.

More info about Bayesian spam filtering.

Word Matching

It is to be noted that many people misspell words. It's very common, especially in comments on a website. So, to get around that problem, we want to look into ways to find words even when misspelled.

One way is to sort the letters and remove duplicates. So a word such as "test" would be viewed as "est".
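As a sketch, reduce each word to its sorted, deduplicated letters so that common misspellings collapse to the same key:

```cpp
#include <algorithm>
#include <string>

// "test" -> sorted "estt" -> deduplicated "est"; "tset" and "teest"
// reduce to the same key, so simple misspellings still match.
std::string letter_key(std::string word)
{
    std::sort(word.begin(), word.end());
    word.erase(std::unique(word.begin(), word.end()), word.end());
    return word;
}
```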

Also, to avoid unimportant noise, we want to remove small words (3 letters or less). This is helpful in most cases, although a language like English has many meaningful words of 3 letters or less...

Note that languages such as Chinese and Japanese, which use one glyph for a whole word, will not benefit in this respect.

Sentence Matching

Although word matching can do wonders, we may also want to look into a way to count sentences with a simplified structure. A certain structure will eventually match spam a lot more. (TBD)

Language

It is important to keep each Bayesian table language specific. Actually, a comment received in the wrong language is itself likely to be spam. Of course, many people will post in English on any website; however, over time it should still be better this way.

A better way would be to detect the language the message is written in and use that language instead of the destination website's language. With time, we may very well be able to do that, since words will be saved in our database for all languages.

Erroneous Results of Young Bayesian Statistics

Bayesian statistics have the very bad habit of marking new messages as spam (or not spam) while the filter is still learning. We may want to look into a good trade-off and count how many messages are necessary before we can tell that the filter is indeed working well.

Also, at any one time, a message with a majority of brand new words should certainly be marked as "don't know" or at least "not sure".

The usual solution is to ignore such words (their probability is set to 0.5).

Sharing Bayesian statistics

With our customers' approval, we can share their spam data and, vice versa, share ours with them.

Since spam messages are unwanted and were, in most cases, expected to become public anyway (otherwise those messages are not likely useful, are they?!), sharing them is not a bad idea. This means we can build a much better Bayesian table, helping us determine whether a message is spam with a better likelihood of a good outcome.

According to what I read, this is useful only for similar websites. However, I've dealt with many websites of all sorts of different types and they all received a certain set of spam, no matter what.

Cloud Spam

What we call cloud spam is cloud systems being used to send spam messages to other systems. This is new. In general it happens when a service gets hacked and the entire cloud is accessed by the hacker to send website spam or perform similar activity. The obvious problem with this kind of spam is that it is always very similar, yet distinct enough to be hard to catch with the usual means (the IP address changes with each message, each computer generally hits a different page, and the messages are different enough that it is difficult to identify them as the same message).

However, clouds have one drawback that I have noticed so far: all their IP addresses are in the same realm (only the last 8 or so bits change), and the access pattern tends to show many computers working in unison. The few attacks I have seen showed that all the computers had closely matching hostnames. It may take a little while until we find a really good answer to those attacks...
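Given IPv4 addresses as 32-bit integers, the "same realm" observation can be sketched as a cluster test on the top 24 bits (the /24 prefix is an assumption based on the attacks seen so far, not a fixed rule):

```cpp
#include <cstdint>
#include <vector>

// Return true when every address shares the same 24-bit prefix, i.e.
// only the last 8 bits differ, the pattern observed in cloud attacks.
bool same_slash_24(const std::vector<std::uint32_t>& addresses)
{
    if (addresses.empty())
        return false;
    std::uint32_t const network = addresses[0] >> 8;
    for (std::uint32_t ip : addresses)
    {
        if ((ip >> 8) != network)
            return false;
    }
    return true;
}
```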

Robots.txt

The robots.txt file is used to tell robots that a certain number of files should not be read. This is particularly useful to prevent robots from reading pages such as a login page, which should never be indexed.

The robots.txt feature can thus (1) detect whether it got read; and (2) detect that the robot then read a page forbidden to robots.

See robots.txt feature

CAPTCHA

As a last resort, human CAPTCHA can be used.

Note that the Back button must be supported properly, as in: a new CAPTCHA needs to be loaded at that point. This also means we may have to use JavaScript, and if we manage a session ID for the CAPTCHA, we must accept multiple submissions, considering that the Back button may have been used.

Note that the CAPTCHA could also be used when the spam filter is suspicious (the post looks way too much like spam.)

See Anti-hammering feature

Invalid Email Address

There are different ways to verify whether an email address is valid. The first step is offered by libtld, which verifies the syntax of the email address. If the syntax is wrong, then it sure isn't a valid email address.

The next step is to find the MX record for the specified domain name. If it cannot be found (i.e. the domain does not exist or does not offer an MX record), then the email address is definitively wrong too.

When an MX record exists, the next step is to attempt sending an email. The HELO, MAIL FROM, and RCPT TO commands can be used. If the mail server has greylist protection, then the MAIL FROM will be rejected; we can then test again later (wait 5 to 15 minutes and try again). Otherwise, the RCPT TO gives you the answer.

Most email address checkers work that way. We may also check using the finger tool. Note that an invalid address may be invalid simply because it was mistyped. A successful check looks like this:

MX record for example.com exists.
Connection succeeded to ....example.com SMTP.
220 mx.example.com ESMTP ... - gsmtp

> HELO snapwebsites.com
250 mx.example.com at your service

> MAIL FROM: <robot@snapwebsites.com>
250 2.1.0 OK ... - gsmtp

> RCPT TO: <alexis@example.com>
250 2.1.5 OK ... - gsmtp

External Links

Many webmasters, in order to increase their Google ranking, look around for well-placed links. These webmasters generally end up with a certain number of broken or unwanted external links.

We already plan to have a way to check for broken links. This is done by the Links feature (Menus, Management, Filters, Anchors) [core].

As far as anti-spam goes, we also want to add a few tests to avoid bad destinations.

Destination Ranking Dropped

Assume that the destination page had a good Google ranking, and detect a large drop. Links to pages whose ranking dropped sharply can indicate that Google detected web spam there, so the destination is not a good one anymore. At the very least, the webmaster should have a look.

Destination Various Behavior

One thing hackers like to do is redirect a user depending on who he is.

One of the main types of users getting redirected is search engines. So we can check the link while simulating a search engine. The one problem here is that we cannot simulate their IP addresses; however, we can easily simulate their User-Agent. So any website that redirects us differently, or not at all, depending on the User-Agent gets flagged. A hacked site may do something like this:

if(preg_match("#(google|slurp|msnbot)#si", $_SERVER['HTTP_USER_AGENT']))
{
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: http://www.your-main-site.com/");
    exit;
}

Note: if they offer cookies, we may also have to handle those, because browsers would do so as expected.

Finally, we want to save the new 301 URL and follow it up to the final destination. If we get over a certain threshold of 301s or 302s, we flag the link. If the redirections change too often, we also flag the link (changing once a month is probably okay; if we check the same URL twice in a row and get a different set of URLs, it's almost certainly spam).
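A sketch of that flagging rule; the redirect threshold is illustrative, and a real check would compare the chains recorded by successive crawls:

```cpp
#include <string>
#include <vector>

// Flag an external link when its redirect chain has too many 301/302
// hops, or when the chain changed since the previous check (a sign
// the destination is being swapped under us).
bool flag_external_link(const std::vector<std::string>& current_chain,
                        const std::vector<std::string>& previous_chain,
                        std::size_t max_redirects = 3)
{
    if (current_chain.size() > max_redirects)
        return true;
    if (!previous_chain.empty() && current_chain != previous_chain)
        return true;    // destination changed between two checks
    return false;
}
```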

Destination Page Content

A destination page can be marked as mainly static. In that case, heavy changes between checks may indicate a problem worth looking into. This will be difficult on pages that have many ads, large "latest posts" sections, comments, etc. Yet it is still quite valid, and we should be able to determine what the main block of content is and only check that block for heavy changes.

3rd Party Safe Browsing Information

There are several services, generally free, that help with knowing whether the destination website includes viruses, trojans, etc. Using these to either completely hide a link or determine whether a post can be considered spam is also a very good way to protect our users.

Google offers what they call Safe Browsing. Mozilla has a database as well. I'm sure there are other similar services available, and each one could be implemented, because each may discover different pages as unsafe and no single one will find all unsafe pages.

Other Plug-ins in need of Anti-Spam

The main data to check is the data saved by the Page feature [core]. It is used by many other plug-ins such as the Blog and Comments, and that ripples into plug-ins such as the Chat, since it makes use of the Comments feature.

The Spam feature works by offering a function that determines whether a piece of data is spam. The function returns a number between 0.0 and 1.0, where 0.0 means definitively not spam and 1.0 definitively spam.

The only way to automate the process would be to mark a field as spam prone; the system would then automatically check whether the field contains spam. However, I'm not so sure this adds anything compared to having each plug-in call the spam function, as long as it is easy to make that call and deal with the reply (i.e. call with one string; the result is to mark the object as spam).

However, we need to think about objects that are composed of multiple data blocks and how we'll be handling them.
