
There seems to be a lot of confusion recently about robots.txt and why URLs disallowed by robots.txt appear in Google search results.

The robots.txt directives for disallowing URLs were originally intended to help site owners preserve bandwidth and prevent outages caused by robot traffic. Robots can consume a lot of bandwidth, and in the early years it was not uncommon for Google to crash websites or make them inaccessible to users during a crawl cycle. By disallowing robots, site owners could limit the bandwidth robots consumed and help ensure their websites stayed available for human visitors.
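
For reference, a minimal robots.txt pairs a User-agent line naming the crawler with one or more Disallow lines naming the paths it should not fetch (the /private/ path below is just an illustrative placeholder):

    # Apply to all compliant crawlers
    User-agent: *
    # Block crawling of everything under /private/
    Disallow: /private/
    # A bare "Disallow: /" would block crawling of the entire site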

At the time, many site owners who did not want search engines consuming their bandwidth still wanted the visitor traffic that search engines sent them. Search engines, on the other hand, wanted to return relevant results whether a URL was disallowed or not. For a query like [amazon.com], if that site were disallowed by robots.txt, Google did not simply return nothing; it returned results based on data collected from other websites. Google builds these uncrawled URL references for disallowed URLs by combining anchor text and description data extracted from other sites that link to them, which is why disallowed URLs can still appear indexed in Google search results pages.

Disallowing a URL via robots.txt will not prevent it from appearing indexed in search results pages. To keep URLs out of search results, webmasters should use the robots noindex meta tag and/or password protection. To remove URLs that are disallowed by robots.txt but already indexed in Google SERPs, webmasters should use the URL removal tool in Google Webmaster Tools.
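
For reference, the noindex directive is a meta tag placed in the page's <head>; keep in mind that a search engine has to be able to crawl the page in order to see it (more on that in the tips below):

    <!-- Tell compliant search engines not to index this page -->
    <meta name="robots" content="noindex">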

Here are some other tips for success with robots.txt:

- Search engines do not check robots.txt before every page request. Many search engines refresh their copy of robots.txt only once every 24 hours, so disallowed URLs added between refreshes may be accidentally crawled and indexed. To ensure pages aren't crawled, add Disallow rules for future URLs to your robots.txt file 24 to 36 hours before publishing the actual content.

- URLs in robots.txt are case-sensitive. For that reason, blocking aboutus.html will not prevent ABOUTUS.html, Aboutus.html, AbOuTUs.html and/or AboutUs.html from being crawled (see the sketch after this list).

- "When you add the +1 button to a page, Google assumes that you want that page to be publicly available and visible in Google Search results. As a result, we may fetch and show that page even if it is disallowed in robots.txt or includes a meta noindex tag." (http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1634172)

- Disallowing URLs that 301 redirect will prevent search engines from "seeing" the redirect. As a result, engines may continue to index the incorrect URL.

- Disallowing URLs for pages that contain a noindex meta tag will prevent search engines from "seeing" the noindex tag, and as a result those pages may still appear indexed in Google search results.

- When 301 redirecting www or non-www versions of URLs to the preferred version, don't forget to redirect your robots.txt file as well.
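
To illustrate the case-sensitivity point above (the file names are hypothetical), a rule like the one below blocks only the exact lowercase path; each variant you want blocked needs its own Disallow line:

    User-agent: *
    # Blocks only /aboutus.html, exactly as written
    Disallow: /aboutus.html
    # /AboutUs.html, /ABOUTUS.html, etc. remain crawlable unless listed too
    Disallow: /AboutUs.html
    Disallow: /ABOUTUS.html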

[Screenshot: Google Drive code embedded in Google Docs]

According to a number of sources, Google is about to launch a new product called "Google Drive". The code above, which is embedded in Google Docs, does seem to support that idea. In addition to the embedded code (above), Google Docs also includes an "Add to My Drive" button (below) that is currently in place but not yet visible to users. Having first come across a description of GDrive back in 2009, I'm always a little skeptical of Google actually launching this service, but this evidence seems pretty clear.

[Screenshot: the "Add to My Drive" button in Google Docs]

Earlier today Google launched www.wesolveforx.com. The site is reported to be the new home of Google's top-secret "X-Lab" project, but one secret is already out.

The content seen by users at www.wesolveforx.com actually resides on an internet marketing agency's website, http://www.thinkbelieveact.com/solveforx/, meaning that Google's new site, www.wesolveforx.com, is currently little more than a domain name. To make things worse, this content is indexed on the agency's website under both the www and non-www versions of the URL.

Background information aside, what interests me is that my +1 for www.wesolveforx.com was actually credited to Google's agency website, www.thinkbelieveact.com, instead of the page I intended. While this makes sense given the technical issue at hand, Google's agency does appear to be benefiting in some ways from the situation. To be fair, the situation may not have been easily avoidable, because adding the +1 button to a page causes Google to ignore disallow directives in robots.txt and meta noindex tags. For that reason, maybe some trade-offs had to be made; I'm not sure.

Either way, it seems like preventing +1 buttons from appearing in framed content might be a good idea?
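
As a rough sketch of that idea (not an official Google feature), a page could check whether it is being framed before rendering the button, since the standard +1 snippet is just a placeholder element plus the plusone.js script:

    <!-- Placeholder that will hold the +1 button -->
    <div id="plusone-slot"></div>
    <script type="text/javascript">
      // Only render the +1 button when this page is NOT inside a frame
      if (window.self === window.top) {
        document.getElementById('plusone-slot').innerHTML =
            '<div class="g-plusone"></div>';
        var po = document.createElement('script');
        po.type = 'text/javascript';
        po.async = true;
        po.src = 'https://apis.google.com/js/plusone.js';
        document.getElementsByTagName('head')[0].appendChild(po);
      }
    </script>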

UPDATE: Larry Page mentioned earlier today that the new site is now live. It is, but as of the time of this update, the pages on Google's agency site are also still live.