ROBOTS.TXT DISALLOW: 20 Years of Mistakes To Avoid

The robots.txt standard was first officially rolled out 20 years ago today! Even though 20 years have passed, some folks continue to use robots.txt disallow like it's 1994.

Before jumping right into common robots.txt mistakes, it's important to understand why standards and protocols for robots exclusion were developed in the first place. In the early 1990s, websites had far less bandwidth available than they do today, and it was not uncommon for automated robots to accidentally crash websites by overwhelming a web server and consuming all available bandwidth. That is why the Standard for Robot Exclusion was created by consensus on June 30, 1994. The Robots Exclusion Protocol allows site owners to ask automated robots not to crawl certain portions of their website. By reducing robot traffic, site owners can free up bandwidth for human visitors, reduce downtime and help keep their sites accessible. In the early 1990s, site owners were far more concerned about bandwidth and accessibility than about URLs appearing in search results.

Throughout internet history, sites like WhiteHouse.gov, the Library of Congress, Nissan, Metallica and the California DMV have disallowed portions of their websites from being crawled by automated robots. By leveraging the robots.txt disallow directive, webmasters of sites like these reduced downtime, conserved bandwidth and helped ensure accessibility for human visitors. Over the past 20 years this practice has proved quite successful for a number of websites, especially during peak traffic periods.

Using robots.txt disallow proved to be a helpful tool for webmasters; however, it created problems for search engines. Any good search engine has to return quality results for queries like [white house], [metallica], [nissan] and [CA DMV], and returning quality results for a page is tricky if you cannot crawl that page. To address this, Google extracts text about disallowed URLs from sources that are not disallowed, compiles that text and associates it with the disallowed URLs. As a result, Google is able to return URLs disallowed with robots.txt in search results. One side effect of using robots.txt disallow is that rankings for disallowed URLs typically decline for some queries over time, because Google cannot crawl or detect the content at those URLs.

Here are some of the most common robots.txt mistakes I encounter:

Implementing a robots.txt file. - Google has stated that you only need a robots.txt file if "your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one)." Most situations are best resolved without a robots.txt file or disallowed URLs; think of robots.txt disallow as your last option. Consider returning a 410 HTTP response, using noindex meta tags and rel=canonical, among other options, first.
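
For example, a page you want crawled but kept out of the index, or consolidated onto a preferred URL, might carry tags like these (a minimal sketch; the URL is hypothetical):

    <!-- In the <head> of the page: keep it out of the index but leave it crawlable -->
    <meta name="robots" content="noindex">

    <!-- Or point a duplicate page at the preferred version -->
    <link rel="canonical" href="http://www.example.com/preferred-page/">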

Not disallowing URLs 24 hours in advance. - In 2000 Google started checking robots.txt files once a day; before that, Google only checked robots.txt files once a week. As a result, URLs disallowed via robots.txt were often crawled and indexed during the week-long gap between robots.txt checks. Today, Google usually checks robots.txt files every 24 hours, but not always: Google may lengthen or shorten the cache lifetime based on max-age Cache-Control HTTP headers, and other search engines may take longer than 24 hours between checks. Either way, it is entirely possible for newly disallowed content to be crawled during the gap before the next robots.txt check. To prevent pages from being crawled at all, add their URLs to robots.txt at least 24 hours in advance.
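
For illustration, a robots.txt response can advertise how long it may be cached with a Cache-Control header like the one below (the values are hypothetical; 86400 seconds is 24 hours):

    HTTP/1.1 200 OK
    Content-Type: text/plain
    Cache-Control: max-age=86400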

Disallowing a URL with robots.txt to prevent it from appearing in search results. - Disallowing a URL via robots.txt will not keep it out of search results pages, because crawling and indexing are two independent processes. URLs disallowed via robots.txt get indexed when they appear as links on pages that are not disallowed; Google then associates text from those other sources with the disallowed URLs and can return them in search results without ever crawling them. To keep a URL out of Google search results, the URL must be crawlable and not disallowed with robots.txt. Once the URL is crawlable, noindex meta tags, password protection, X-Robots-Tag HTTP headers and/or other options can be implemented.
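
As a quick sketch (the path is hypothetical), a rule like this only blocks crawling; it does nothing to keep the URL itself out of search results:

    # Blocks crawling of /private-page/ but does NOT prevent the URL from being indexed
    User-agent: *
    Disallow: /private-page/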

Using robots.txt disallow to remove URLs of pages that no longer exist from search results. - Again, the robots.txt file will not remove content from Google, and Google does not assume that content no longer exists just because it is no longer accessible to search engines. Disallowing URLs of pages that have been indexed but no longer exist prevents Google from detecting that those pages have been removed, so the URLs are treated like any other disallowed URLs and will probably linger in search results for some time. For Google to remove old pages from search results quickly, Googlebot must be able to crawl them, which means they must not be disallowed with robots.txt. Until Google detects that the content has been removed, keyword and link data for these pages will continue to appear in Google Webmaster Tools. When pages have been removed from a website and should be removed from search results pages, allow search engines to crawl them and return a 410 HTTP response. I was recently able to have 150,000 pages removed from search results in 7 days using this method.
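
A minimal sketch of returning a 410, assuming an Apache server with mod_alias and a hypothetical path:

    # .htaccess: return 410 Gone for a page that has been permanently removed
    Redirect gone /discontinued-product/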

Disallowing URLs that redirect with robots.txt. - Disallowing a URL that redirects (returns a 301 or 302 HTTP response or a meta refresh) to another URL "disallows" search engines from detecting the redirect. Because robots.txt does not remove content from search engine indexes, disallowing a URL that redirects typically results in the wrong URL appearing in search results, which in turn corrupts analytics data even further. For search engines to handle redirects correctly and to keep analytics data clean, redirected URLs must be accessible to search engines and not disallowed via robots.txt.
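
For example, a redirect like the one below (an Apache .htaccess sketch with hypothetical paths, assuming mod_alias) only helps if /old-page/ is not disallowed in robots.txt, so search engines can actually fetch it and see the 301:

    # .htaccess: permanently redirect the old URL to the new one
    Redirect 301 /old-page/ http://www.example.com/new-page/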

Using robots.txt to disallow URLs of pages with noindex meta tags. - Disallowing URLs of pages with noindex meta tags will "disallow" engines from seeing the noindex meta tag, so, as mentioned earlier, the disallowed URLs can still appear in search results. If you do not want the URL of a content page to be seen by users in search results, use the noindex meta tag in the page and allow the URL to be crawled.

Trying to communicate with Google through comments in robots.txt. - Googlebot essentially ignores comments in robots.txt like the ones you see at nike.com/robots.txt, yelp.com/robots.txt and etsy.com/robots.txt.

Using robots.txt to disallow URLs of pages with rel=canonical or nofollow meta tags and X-Robots-Tags. - Disallowing a URL prevents search engines from seeing its HTTP headers and meta tags, so none of them will be honored. For search engines to honor HTTP response headers or meta tags, the URLs must not be disallowed with robots.txt.
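
For instance, directives delivered as HTTP headers, like the hypothetical response below, are only honored if the URL can be crawled (the canonical URL shown is made up):

    HTTP/1.1 200 OK
    X-Robots-Tag: noindex
    Link: <http://www.example.com/preferred-page/>; rel="canonical"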

Disallowing Confidential Information via robots.txt. - Anyone who understands robots.txt can access the robots.txt file for a website. For instance, google.com/robots.txt and apple.com/robots.txt. Clearly, the robots.txt was never intended as a mechanism for hiding information. The only way to prevent search engines from accessing confidential information online and displaying it to users in search results pages is to place that content behind a login.
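
A minimal sketch of what "behind a login" can look like, assuming an Apache server with basic authentication enabled (the password file path is hypothetical):

    # .htaccess: require a username and password before serving anything in this directory
    AuthType Basic
    AuthName "Restricted"
    AuthUserFile /home/example/.htpasswd
    Require valid-user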

WHOA NELLY robots.txt. - Even though most sites do not need a robots.txt file, many look like https://www.google.com/robots.txt. I call these "WHOA NELLY robots.txt files." Complex robots.txt files invite mistakes, both on your end and by search engines. For example, the maximum file size for a robots.txt file is 500 KB, and text beyond that limit is ignored by Google. Robots.txt files should be like Snickers Mini bars: short and sweet.
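
Something along these lines is usually plenty (the paths and sitemap URL are hypothetical):

    User-agent: *
    Disallow: /search/
    Disallow: /checkout/

    Sitemap: http://www.example.com/sitemap.xml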

Robots.txt postpone. - If Google tries to access a robots.txt file but does not receive a conclusive response (for example, the request times out or returns a 5xx server error), Google will postpone crawling until a later time. For that reason it is important to ensure that the robots.txt URL always returns a 200, 403 or 404 HTTP response.

403 robots.txt. - Returning a 403 HTTP response for robots.txt indicates that no robots.txt file exists, so Googlebot assumes it is safe to crawl any URL on the site. If your robots.txt returns a 403 and you actually want its directives honored, change the response so the file is served with a 200; if you genuinely have no robots.txt, a 404 is fine.

User-Agent directive override. - When a robots.txt file contains both a generic User-agent: * group and a more specific group such as User-agent: Googlebot, Googlebot follows only the most specific group that matches it, so directives in the generic group are effectively overridden. This is why it is best to test your robots.txt in Google Webmaster Tools.
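
A sketch of how that plays out (hypothetical paths): because a Googlebot-specific group exists, Googlebot obeys only that group, so /archive/ remains crawlable to Googlebot:

    User-agent: *
    Disallow: /archive/

    # Googlebot matches this more specific group and ignores the group above
    User-agent: Googlebot
    Disallow: /beta/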

Robots.txt case sensitivity. - The URL of the robots.txt file and the URL paths listed inside it are case-sensitive. Expect issues if your file is named ROBOTS.TXT, or if the paths you disallow do not match the casing of the URLs they are meant to cover.
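
For example (hypothetical path), the rule below blocks only the capitalized version of the path:

    User-agent: *
    # Blocks /Private/ but not /private/ or /PRIVATE/
    Disallow: /Private/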

Removing robots.txt file URLs from search results. - To prevent a robots.txt file from appearing in Google search results, webmasters can disallow robots.txt via robots.txt and then remove it via Google Webmaster Tools. Another way is to serve an X-Robots-Tag: noindex HTTP header with the robots.txt file itself, as sketched below.
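
A minimal sketch of the second approach, assuming an Apache server with mod_headers enabled:

    # Serve robots.txt with an X-Robots-Tag: noindex header
    <Files "robots.txt">
        Header set X-Robots-Tag "noindex"
    </Files>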

robots.txt Crawl-delay. - Sites like http://cs.stanford.edu/robots.txt include a Crawl-delay directive in robots.txt, but Google ignores it. To control how quickly Google crawls your site, use the crawl rate settings in Google Webmaster Tools.

