Drupal's Robots.txt
How to Control (Somewhat) Search Engine Indexing Bots
By Michael Ross
This article was published as a cover story in the print magazine Drupal Watchdog, Volume 6 Issue 3, Winter 2016, on pages 40-43, by Linux New Media.
Every new Drupal installation, by default, contains a text file named robots.txt as one of the files in its root directory. The majority of site builders pay little or no attention to this particular file, especially if they are put off by the somewhat arcane commands contained within it. It is understandable that the typical Drupal site builder could easily conclude that the robots.txt mysteries are best left to hard-core techies who enjoy the dark arts of web server configuration and optimization and who are happy to run the risk of inadvertently breaking their websites, making them invisible to search engines or even human visitors.
Yet if you were to willfully ignore the robots.txt file, you would be missing out on the ability to exert better control over the search engine indexing programs that periodically visit your site. Fortunately, as you will learn in this article, the syntax of the file is not difficult to comprehend, and once you have learned it, you should feel comfortable examining the versions of that file that ship with Drupal 7 and Drupal 8.
Robot, Obey!
The primary purpose of the robots.txt file is to provide information to the online programs, developed and operated by search engine companies, that automatically visit one web page after another, following links to pages not yet examined or whose contents may have changed since they were last checked. This process of "crawling" from one page to another is a critical part of how search engines index the Web and later determine which pages to display for any particular search query.
Considering that these indexing programs are written by engineers who know far more about search engines than most of us will ever know, it is safe to assume that the programs used by the top-tier companies (e.g., Google) are well behaved and do not intentionally cause problems, including excessive consumption of the valuable bandwidth allotted to your site by your web hosting company. (However, there is a limit to how well any web bot can anticipate the havoc it might cause when a web programmer does something foolish, like the fellow who added a "delete this file" link for each one of a collection of valuable files stored on his site, along with instructions to visitors to use those links only selectively. Naturally, the first indexing bot to come along did not understand or comply with those instructions to humans, and it dutifully visited each link, thereby nuking his entire collection. Don't assume that all of your site's visitors are human and humane!)
The consumption of bandwidth can become excessive if your site contains a huge number of pages, if the indexing bots are visiting those pages at a faster pace than you had budgeted for, or both. Fortunately, you can ask those bots to slow down their rate of page consumption, shifting from the level of a hyperactive Pac-Man to something more leisurely. The first two lines of code in the Drupal 7 robots.txt show how to do it:
User-agent: *
Crawl-delay: 10
The first line declares that the subsequent adjacent lines are intended for those user agents (i.e., search engine bots) whose names match the wildcard pattern "*" (i.e., all of them). The second line asks the bots to slow their crawl rate to one page every 10 seconds. Note that the Drupal 8 robots.txt does not include this command.
You can further help out these bots by telling them ahead of time which resources they do not need to bother indexing. For instance, consider the third line of code from the Drupal 7 file:
Disallow: /includes/
It tells the bots to ignore all pages found in Drupal's includes directory — or, more precisely, all those pages whose URLs have file paths beginning with the string /includes/. That is not to say that a human or robotic visitor to a well-built Drupal site would ever encounter such a URL, but it is still safe to exclude all of the core .inc files in that directory from any sort of indexing, just in case any such URLs are ever exposed to the public. The default Drupal 7 robots.txt file then does the same exclusion for the misc, modules, profiles, scripts, and themes directories. The Drupal 8 version needs to do the same for only the core and profiles directories.
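Taken together, these directory exclusions amount to nothing more than a short run of Disallow lines. Here is a sketch of the Drupal 7 set, based on the directories just listed (check the file shipped with your own installation for the exact contents):
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/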
Such exclusion commands can be applied not only to directories, as shown above, but also to individual files. The Drupal 7 robots.txt file includes Disallow commands for three PHP files (cron.php, update.php, and xmlrpc.php), along with eight text files. Drupal 8 does the same for README.txt and web.config.
The aforementioned statement about the Disallow commands applying to URLs and not necessarily directories is well illustrated by the robots.txt commands that Drupal 7 uses to warn the indexing bots away from such generated paths as /admin/, /node/add/, and several others — in both their clean URL forms and otherwise (/?q=admin/, /?q=node/add/, etc.). The same is true for Drupal 8, whose non-clean URLs employ an intriguingly different and non-query format (e.g., /index.php/admin/ and /index.php/node/add/).
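In other words, each such path ends up blocked in both of its forms. For the two Drupal 7 paths mentioned above, the pairs would look something like this (a sketch based on those paths, not a verbatim excerpt from the file):
Disallow: /admin/
Disallow: /?q=admin/
Disallow: /node/add/
Disallow: /?q=node/add/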
Another difference in the Drupal 8 version of the file is that it also includes explicit Allow commands for CSS, JavaScript, and image files found at /core/ and /profiles/ URLs. The comments in the robots.txt file do not indicate why any site builder would want those sorts of files to be indexed by search engines.
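For the curious, those Allow lines are wildcard patterns keyed to file extensions, roughly along these lines (an illustrative excerpt rather than the complete list):
Allow: /core/*.css$
Allow: /core/*.js$
Allow: /core/*.png
Allow: /profiles/*.css$
In practice, the usual motivation for rules like these is that modern search engines render pages much as a browser would, so blocking the CSS and JavaScript needed for that rendering can hurt how the pages are evaluated.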
Last but not least, you can even use the robots.txt file to inform search engines of any XML sitemap that you have handcrafted or that your site generates automatically (with the XML Sitemap module, for instance). Simply include a command such as:
Sitemap: http://www.example.com/sitemap.xml
Suspicious Spiders
Perhaps most if not all of your website's human visitors use only the better-known search engines, such as Google and Bing, and never use — or at least, never find your website via — any of the lesser-known search engines. Some bots that index sites are not intended to provide search results to the public at all, but instead have their own nefarious purposes. In all such cases, do you even want those search engines' spiders to be indexing your site and consuming resources? If not, you should consider using your robots.txt file to block them. Here is a short list of bots that website owners often exclude:
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /

User-agent: moget
User-agent: ichiro
Disallow: /

User-agent: NaverBot
User-agent: Yeti
Disallow: /

User-agent: sogou spider
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: YoudaoBot
Disallow: /

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: psbot
Disallow: /
The first six groups are for Asian and Russian bots, and the remaining four are for various questionable companies, none of which provides any benefit to your website. Be sure to include at least one blank line between groups; otherwise, the commands will be considered invalid.
This next option is debatable, depending upon whether or not you want your website's older versions to be archived for future access. If you don't, then consider also blocking the Internet Archive:
User-agent: ia_archiver
Disallow: /
Bad Robot!
Unfortunately, not all search index bots are polite and respectful of your resources. In fact, you have no guarantee that any particular web bot will even bother to read the rules that you have posted in your robots.txt, much less comply with them. One well-known bad actor is Copyscape, which searches your web pages looking for any content that could be utilized by your competition to hit you with copyright infringement suits. Its bot ignores robots.txt entries, HTML meta-tags, and nofollow attributes in hyperlinks.
If their bot won't abide by your instructions, do you have a way to stop it? Fortunately, yes: by blocking the IP addresses it uses, which are listed at http://www.copyscape.com/faqs.php#password. Simply add the following commands to your Apache hypertext access file (.htaccess, also found in the root directory of your Drupal site):
Deny from 162.13.83.46
Deny from 212.100.254.105
Deny from 212.100.239.219
Deny from 212.100.243.112-116
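Note that Deny from belongs to Apache's older access-control syntax (provided by mod_access_compat in Apache 2.4). If your server runs Apache 2.4 without that compatibility module, the equivalent blocking would look roughly like this, assuming the same addresses (a sketch; adapt it to your own server's configuration):
<RequireAll>
    Require all granted
    Require not ip 162.13.83.46
    Require not ip 212.100.254.105
    Require not ip 212.100.239.219
    # Repeat "Require not ip" for each remaining address in Copyscape's list.
</RequireAll>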
Robotic Resources
If you want to learn more about the robots.txt file, a limited number of resources are available online. Perhaps start with the Web Robots Pages, which offer both general information about these files and specific commands and techniques. Google offers a detailed Robots.txt Specifications document, as well.
To test a specific robots.txt file and see whether it contains any errors that might prevent the desired search engines from properly indexing your site, log in to your Google Webmaster Tools account, go to the "Crawl" section of the "Search Console" page, and click on the link for "robots.txt Tester". This section lists the entire content of the chosen site's robots.txt, including comment lines. Any errors and warnings will be flagged. To get a terse explanation of each error or warning, mouse over the corresponding red or yellow symbol, and a tooltip will pop up showing the explanation. In the case of the Drupal 7 robots.txt file, no errors are indicated, but Google does warn that it ignores the command Crawl-delay: 10. Just below the large text area showing your robots.txt file is a field where you can enter a specific URL and have Google test whether it will be indexed on the basis of the rules you have specified.
If you are building a new website without Drupal (and why in heaven's name would you do that?!), then you might need to create a new robots.txt file that incorporates all of the search engine bot blocking that you specifically desire for the new site. In that case, consider using the Robot Control Code Generation Tool, which does exactly what the name implies. For each of the major search engines, you can use the default value or specify that the particular indexing bot be allowed or refused, and those choices will be reflected in the robots.txt file generated by the website.
This is probably more information than you ever wanted or even needed regarding the relentless indexers that will periodically visit your site, but at least now you can gain more control over what they are doing — especially those that may be overstaying their welcome and abusing your site's resources.
Copyright © 2016 Michael J. Ross. All rights reserved.