Social Media

Why web clippers should obey the same conventions as bots


A web clipper is a bookmarklet, app, or browser extension that copies some or all of the content of a webpage into a new webpage or app. Clippers appear to be exploding in number. Pinterest, Evernote, SpringPad, Clipboard, JustAPinch, and KeyIngredient are just a few examples of web and mobile apps that offer clippers. If the amount of content clipped is significant and is made publicly available as a copy on a new website, it might constitute copyright infringement. But regardless of the legality, “clipping” is essentially programmatic access to a website, much like web spidering. It may not be automated and recursive the way a bot is, but it is access by a non-human agent, usually for the purpose of copying content.

Site owners need a standard way to control which clippers can access their sites, much as they can already control well-behaved bots via robots.txt and meta codes. If a clipping site has benevolent intentions, its clipper should obey the same conventions that well-behaved bots obey, both to be transparent about its actions and to respect a site owner’s wishes. These conventions include:

  1. Identifying itself. Set a custom user agent that identifies the clipper and includes a link to an information page. Clippers such as the KeyIngredient and JustAPinch bookmarklets currently identify themselves only by their underlying web access frameworks. Not only does this give away the structure of the code used, but it also gives a site owner reading the server log files no useful information about who is accessing their site programmatically and why.
  2. Explaining itself. Provide a page about your clipper’s user agent explaining why you are accessing web pages programmatically and what the benefit to a site owner might be. Major search engines like Google do this. So should every site with a web clipper.
  3. Providing instructions for how to block itself. Provide instructions on how to block your clipper via robots.txt or other means such as the BadBehavior plugin, .htaccess file, or a meta code. Again, all major search engines do this. Clippers need to give site owners a way to opt out.
  4. Obeying robots.txt and meta codes. If a site owner has taken steps to block your clipper, respect their wishes and do not access their site programmatically. The NOINDEX and NOARCHIVE meta codes should imply that the site should not be clipped either. Pinterest invented its own NOPIN meta tag. For site owners who wish to allow their pages to be indexed or archived but not clipped, I propose that we adopt a NOCLIP meta tag and set the expectation that clipping apps obey it.
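
As a rough illustration of conventions 1 and 4, here is a minimal Python sketch of a server-side clipper that sends its own user agent and consults robots.txt before fetching a page. The name ExampleClipper, its version, and the information-page URL are all made up for the example. The sketch uses only the standard library and is one possible approach, not a description of how any existing clipper works.

```python
from urllib import request, robotparser

# Hypothetical identity: a product token plus a link to an information page.
CLIPPER_TOKEN = "ExampleClipper"
CLIPPER_USER_AGENT = "ExampleClipper/1.0 (+https://clipper.example/about)"


def build_request(page_url: str) -> request.Request:
    """Build a request that announces the clipper instead of its HTTP library."""
    return request.Request(page_url, headers={"User-Agent": CLIPPER_USER_AGENT})


def robots_allows_clipping(page_url: str, robots_txt: str) -> bool:
    """Return True if the site's robots.txt rules permit this clipper to fetch page_url.

    robots_txt is the text of the site's robots.txt file, fetched separately.
    """
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(CLIPPER_TOKEN, page_url)


if __name__ == "__main__":
    # A site owner blocks the clipper by name while leaving other agents alone.
    rules = "User-agent: ExampleClipper\nDisallow: /\n"
    print(robots_allows_clipping("https://site.example/recipes/1", rules))
```

With this in place, a site owner who reads the clipper's information page can block it with two lines in robots.txt, and the clipper shows up under its own name in the server logs.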

A note on JavaScript: It’s possible to access the data of a web page through a bookmarklet entirely on the client side without ever hitting the server of the target web page. (This, of course, has the downside of not working in a mobile or tablet app.) In this case, the user agent and robots.txt conventions do not apply. This may make the case that this type of access should be regarded as inherently Black Hat. That aside, such access should still obey meta tags if anything more than a page title, URL, and meta description is grabbed.
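
The meta tag check is the same wherever it runs, since it only needs the page's markup. Below is a minimal sketch of it in Python rather than bookmarklet JavaScript. It assumes the proposed NOCLIP directive would appear in the content of a robots meta tag alongside NOINDEX and NOARCHIVE; that placement is my assumption for the example, not an established standard.

```python
from html.parser import HTMLParser

# Directives treated as "do not clip". NOCLIP is the proposal from this post;
# placing it inside <meta name="robots"> is an assumption made for this sketch.
BLOCKING_DIRECTIVES = {"noindex", "noarchive", "noclip"}


class RobotsMetaCollector(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when written without a value.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def page_allows_clipping(html: str) -> bool:
    """Return False if the page carries any directive that should block clipping."""
    collector = RobotsMetaCollector()
    collector.feed(html)
    collector.close()
    return not (collector.directives & BLOCKING_DIRECTIVES)
```

A clipper would run this check before copying anything beyond the page title, URL, and meta description.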

Photo Credit: wiredforlego on Flickr