Sunday, April 17, 2011

Spam! How to avoid it?


Spam is the use of electronic messaging systems (including most broadcast media and digital delivery systems) to send unsolicited bulk messages indiscriminately. While the most widely recognized form of spam is e-mail spam, the term is applied to similar abuses in other media: instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, wiki spam, online classified ads spam, mobile phone messaging spam, Internet forum spam, junk fax transmissions, social networking spam, television advertising and file sharing network spam.

Spamming remains economically viable because advertisers have no operating costs beyond the management of their mailing lists, and it is difficult to hold senders accountable for their mass mailings. Because the barrier to entry is so low, spammers are numerous, and the volume of unsolicited mail has become very high. For 2011, the estimated number of spam messages is around seven trillion. The costs, such as lost productivity and fraud, are borne by the public and by Internet service providers, which have been forced to add extra capacity to cope with the deluge. Spamming has been the subject of legislation in many jurisdictions.

Spam in blogs (also called simply blog spam or comment spam) is a form of spamdexing. (Note that blogspam has another, more common meaning, namely a blogger's no-value-added posts submitted to other sites.) It is done by automatically posting random comments or promoting commercial services to blogs, wikis, guestbooks, or other publicly accessible online discussion boards. Any web application that accepts and displays hyperlinks submitted by visitors may be a target.

Adding links that point to the spammer's web site artificially increases the site's search engine ranking. An increased ranking often results in the spammer's commercial site being listed ahead of other sites for certain searches, increasing the number of potential visitors and paying customers.


Possible solutions


Disallowing multiple consecutive submissions

It is rare that a legitimate user replies to their own comment, yet spammers typically do. Rejecting a comment that comes from the same IP address as the comment immediately before it will significantly reduce flooding. This, however, proves problematic when multiple users behind the same proxy wish to comment on the same entry.
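A minimal sketch of this check in Python; the storage shown (a list of dicts with an "ip" key per entry) is a hypothetical stand-in for whatever the blog software actually uses:

    def allow_submission(entry_comments, new_ip):
        # Crude flood check: reject a comment when the previous comment
        # on the same entry came from the same IP address.
        # Caveat (see above): distinct users behind one proxy or NAT
        # share an IP address and will be blocked as well.
        if entry_comments and entry_comments[-1]["ip"] == new_ip:
            return False
        return True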


Blocking by keyword

Blocking specific words in posts is one of the simplest and most effective ways to reduce spam. Much spam can be blocked simply by banning the names of popular pharmaceuticals and casino games.
This is also a good long-term solution: it does not pay for spammers to obfuscate keywords as "vi@gra" or such, because the keywords must stay readable and indexable by search engine bots to be effective.
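A sketch of such a keyword filter in Python; the blocklist below is a short hypothetical sample, and real lists are far longer:

    import re

    # hypothetical sample blocklist; production lists are much longer
    BLOCKED_KEYWORDS = ["viagra", "cialis", "casino", "poker", "roulette"]

    # \b anchors avoid matching innocent substrings; re.IGNORECASE
    # catches trivial case variations (but not "vi@gra"-style tricks,
    # which, as noted above, spammers cannot afford anyway)
    BLOCKED_RE = re.compile(
        r"\b(" + "|".join(map(re.escape, BLOCKED_KEYWORDS)) + r")\b",
        re.IGNORECASE,
    )

    def is_keyword_spam(text):
        return BLOCKED_RE.search(text) is not None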


nofollow

Google announced in early 2005 that hyperlinks with the rel="nofollow" attribute would not be crawled or influence the link target's ranking in the search engine's index. The Yahoo and MSN search engines also respect this attribute.

Using rel="nofollow" is a much easier solution that makes the improvised techniques above irrelevant. Most weblog software now marks reader-submitted links this way by default (often with no option to disable it short of modifying the code). More sophisticated server software could spare the nofollow for links submitted by trusted users, such as those registered for a long time, on a whitelist, or with high karma. Some server software adds rel="nofollow" to pages that have been recently edited but omits it from stable pages, under the theory that stable pages will have had offending links removed by human editors.
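A regex-based sketch of what such software does; a robust implementation would use a real HTML parser, but the idea is simply to inject the attribute into submitted anchor tags that lack a rel attribute:

    import re

    # matches "<a" tags that do not already carry a rel attribute
    ANCHOR_RE = re.compile(r"<a\b(?![^>]*\brel=)", re.IGNORECASE)

    def add_nofollow(comment_html):
        return ANCHOR_RE.sub('<a rel="nofollow"', comment_html)

    # add_nofollow('<a href="http://example.com/">x</a>')
    # -> '<a rel="nofollow" href="http://example.com/">x</a>'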
Some weblog authors object to the use of rel="nofollow", arguing, for example, that
  • Link spammers will continue to spam everyone to reach the sites that do not use rel="nofollow".
  • Link spammers will continue to place links for surfers to click, even if those links are ignored by search engines.
  • Google is advocating the use of rel="nofollow" in order to reduce the effect of heavy inter-blog linking on page ranking.
  • Google is advocating the use of rel="nofollow" only to minimize its own filtering effort, and the attribute would more accurately have been called rel="nopagerank".
  • Nofollow may reduce the value of legitimate comments.
  • Nofollow links may still carry value in terms of search engine rankings.
Other websites with high user participation, such as Slashdot, use selective nofollow implementations, adding rel="nofollow" only to links from potentially misbehaving users. Potential spammers posting as users can be identified through heuristics such as the age of the registered account. Slashdot also uses the poster's karma as a determinant in attaching a nofollow tag to user-submitted links; a sketch of such a heuristic follows.
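The 90-day and 50-karma thresholds below are made up for illustration and are not Slashdot's actual values:

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class User:
        registered: datetime   # account creation time
        karma: int             # site-specific reputation score

    def needs_nofollow(user, now=None):
        # young or low-karma accounts get rel="nofollow" on their links
        now = now or datetime.now()
        young = now - user.registered < timedelta(days=90)
        return young or user.karma < 50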
rel="nofollow" has come to be regarded as a microformat.


Validation (reverse Turing test)

One method of blocking automated spam comments is to require validation before the contents of the reply form are published. The goal is to verify that the form is being submitted by a real human being and not by a spam tool, which is why this has been described as a reverse Turing test. The test should be of such a nature that a human being can easily pass it while an automated tool would most likely fail.

Many forms on websites take advantage of the CAPTCHA technique, displaying a combination of numbers and letters embedded in an image, which must be typed into the reply form exactly to pass the test. To keep out spam tools with built-in text recognition, the characters in the images are customarily misaligned, distorted, and noisy. A drawback of many older CAPTCHAs is that the expected answer is usually case-sensitive while the distorted images often make it impossible to distinguish capital from small letters; this should be taken into account when devising the challenges. Such systems can also prove problematic for blind people who rely on screen readers. Some more recent systems allow for this by providing an audio version of the characters.
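A sketch of such a CAPTCHA, assuming the Pillow imaging library is available; it restricts the challenge to capitals and digits and compares case-insensitively, sidestepping the case-sensitivity pitfall just described:

    import random
    import string
    from PIL import Image, ImageDraw, ImageFilter, ImageFont

    def make_captcha(length=5):
        # capitals and digits only, so case ambiguity cannot arise
        text = "".join(random.choices(string.ascii_uppercase + string.digits,
                                      k=length))
        img = Image.new("RGB", (40 * length, 60), "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()
        for i, ch in enumerate(text):
            # jitter each character vertically to hinder OCR
            draw.text((10 + 40 * i, 20 + random.randint(-8, 8)), ch,
                      fill="black", font=font)
        for _ in range(8):
            # random noise lines further confuse text recognition
            draw.line((random.randint(0, img.width), random.randint(0, img.height),
                       random.randint(0, img.width), random.randint(0, img.height)),
                      fill="gray")
        return text, img.filter(ImageFilter.SMOOTH)

    def check_captcha(expected, submitted):
        # case-insensitive comparison, per the drawback discussed above
        return submitted.strip().upper() == expected.upper()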

A simple alternative to CAPTCHAs is validation in the form of a password question, providing a hint to human visitors that the password is the answer to a simple question like "The Earth revolves around the... [Sun]".
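A minimal sketch of such a password-question check; the question list is hypothetical, and answers are compared case-insensitively:

    import random

    # hypothetical question/answer pairs; any site would use its own
    QUESTIONS = {
        "The Earth revolves around the...": "sun",
        "How many legs does a spider have?": "8",
    }

    def pick_question():
        return random.choice(list(QUESTIONS))

    def check_answer(question, answer):
        return answer.strip().lower() == QUESTIONS[question].lower()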

One drawback to take into consideration is that any validation in the form of an additional form field may become a nuisance, especially to regular posters. Many bloggers and guestbook owners notice a significant decrease in the number of comments once such a validation is in place.

Disallowing links in posts

There is negligible gain from spam that does not contain links, so currently virtually all spam posts contain an excessive number of links. It is therefore reasonable to require passing the reverse Turing test only when a post contains links and to let all other posts through. While this is highly effective, spammers do frequently send gibberish posts (such as "ajliabisadf ljibia aeriqoj") to test the spam filter. These gibberish posts will not be labeled as spam; they do the spammer no good, but they still clog up comments sections.
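A sketch of the link test used to decide whether validation is required; the pattern is deliberately simple and would be tuned in practice:

    import re

    # bare URLs, "www." shorthand, and BBCode-style [url] tags
    LINK_RE = re.compile(r"https?://|www\.|\[url", re.IGNORECASE)

    def needs_turing_test(post_body):
        # a post without links is worthless to a spammer, so only
        # link-bearing posts need the extra validation step
        return LINK_RE.search(post_body) is not None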

Gibberish submissions may, however, also come from "level 0" spambots, which do not parse the attacked HTML form at all but simply send generic POST requests against pages. A guessed POST variable such as "content" or "forum_post" may happen to match a real field and be received by the blog or forum software, while the guessed URL field name (such as "uri") does not exist in the form, so the spam link is never saved.
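This suggests a cheap countermeasure: reject submissions whose field names do not match the actual form, since a level 0 bot that never fetched the form can only guess them. A sketch, with hypothetical field names:

    # field names actually rendered in the comment form (hypothetical)
    EXPECTED_FIELDS = {"author", "email", "content"}

    def looks_like_level0_bot(form_data):
        # a level 0 bot posts guessed names like "uri" or "forum_post"
        # that the real form never contains
        return bool(set(form_data) - EXPECTED_FIELDS)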
