There seems to have been a lot of confusion recently about robots.txt and why URLs disallowed by robots.txt appear in Google search results.
The Disallow directives in robots.txt were originally intended to help site owners preserve bandwidth and prevent outages caused by robot traffic. Robots can consume a lot of bandwidth, and in the early years it was not uncommon for Google to crash websites or make them inaccessible to users during a crawl cycle. By disallowing robots, site owners could limit the bandwidth robots used and keep their websites available for humans.
At that time, even though many site owners did not want search engines using up bandwidth, they did not mind the traffic from search engines. Search engines, on the other hand, wanted to return relevant results whether or not a URL was disallowed. For example, instead of returning no results for a query like [amazon.com] when that site was disallowed by robots.txt, Google returned search results based on data collected from other websites. This is why disallowed URLs appear indexed in search results: Google can build uncrawled URL references for disallowed URLs by combining anchor text and description data extracted from other sites.
Disallowing a URL via robots.txt will not prevent it from appearing indexed in search results pages. To prevent URLs from appearing in search results pages, webmasters should add a noindex robots meta tag to the page and/or use password protection. To remove URLs that are disallowed by robots.txt but still indexed in Google SERPs, webmasters should use the URL removal tool in Google Webmaster Tools.
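To make the noindex option concrete, here is a minimal Python sketch that checks whether a page's HTML carries a robots meta tag with a noindex directive. The class name `RobotsMetaParser`, the helper `has_noindex`, and the sample markup are hypothetical names chosen for illustration; the sketch only inspects `<meta name="robots">` tags and is not a description of how any search engine parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical sample page used only for this example.
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```

A check like this could be run against pages you want kept out of search results, to confirm the tag is actually present in the served HTML.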
Here are some other tips for success with robots.txt:
- Search engines do not check robots.txt for every page request. Many search engines update robots.txt data once every 24 hours. For that reason, disallowed URLs added between updates may be accidentally crawled and indexed. To ensure pages aren’t crawled, be sure to add future URLs to your robots.txt file 24 to 36 hours in advance of adding actual content.
- URLs in robots.txt are case-sensitive. For that reason, blocking aboutus.html will not prevent ABOUTUS.html, Aboutus.html, AbOuTUs.html and/or AboutUs.html from being crawled.
- “When you add the +1 button to a page, Google assumes that you want that page to be publicly available and visible in Google Search results. As a result, we may fetch and show that page even if it is disallowed in robots.txt or includes a meta noindex tag.” (http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1634172)
- Disallowing URLs that 301 redirect will prevent search engines from “seeing” the redirect. As a result, engines may continue to index the incorrect URL.
- Disallowing URLs whose pages contain a noindex meta tag will prevent search engines from “seeing” the tag. As a result, noindex pages may still appear indexed in Google search results.
- When 301 redirecting www or non-www versions of URLs to the preferred version, don’t forget to redirect your robots.txt file as well.
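The case-sensitivity tip above can be checked locally with a short Python sketch. It uses the standard library's `urllib.robotparser`, which is only one implementation of robots.txt matching, so treat it as an approximation of how a crawler might read the rules. The rule set and the helper `is_blocked` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this example.
rules = """\
User-agent: *
Disallow: /aboutus.html
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)


def is_blocked(path):
    """Return True if the parsed rules disallow this path for all robots."""
    return not parser.can_fetch("*", path)


# Only the exact-case path matches the Disallow rule.
for path in ["/aboutus.html", "/ABOUTUS.html", "/AboutUs.html"]:
    print(path, "blocked" if is_blocked(path) else "crawlable")
```

If mixed-case variants of a URL exist on your site, each variant needs its own Disallow line, or the variants should be redirected to a single canonical form.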