
The robots.txt standard was first officially rolled out 20 years ago today! Even though 20 years have passed, some folks continue to use robots.txt disallow like it's 1994.

Before jumping right into common robots.txt mistakes, it's important to understand why standards and protocols for robots exclusion were developed in the first place. In the early 1990s, websites were far more limited in terms of available bandwidth than they are today. Back then it was not uncommon for automated robots to accidentally crash websites by overwhelming a web server and consuming all available bandwidth. That is why the Standard for Robot Exclusion was created by consensus on June 30, 1994. The Robots Exclusion Protocol allows site owners to ask automated robots not to crawl certain portions of their website. By reducing robot traffic, site owners can free up bandwidth, reduce downtime and help ensure the site remains accessible to human users. In the early 1990s, site owners were far more concerned about bandwidth and accessibility than about URLs appearing in search results.

Throughout internet history, sites like WhiteHouse.gov, the Library of Congress, Nissan, Metallica and the California DMV have disallowed portions of their websites from being crawled by automated robots. By leveraging robots.txt and the disallow directive, webmasters of sites like these reduced downtime, freed up bandwidth and helped ensure accessibility for human visitors. Over the past 20 years this practice has proved quite successful for a number of websites, especially during peak traffic periods.

Using robots.txt disallow proved to be a helpful tool for webmasters; however, it spelled trouble for search engines. For instance, any good search engine had to be able to return quality results for queries like [white house], [metallica], [nissan] and [CA DMV]. Returning quality results for a page is tricky if you cannot crawl the page. To address this issue, Google extracts text about disallowed URLs from sources that are not themselves disallowed with robots.txt, compiles that text and associates it with the disallowed URLs. As a result, Google is able to return URLs disallowed with robots.txt in search results. One side effect of using robots.txt disallow is that rankings for disallowed URLs typically decline for some queries over time, because Google cannot crawl or detect the content at those URLs.

Here are some of the most common robots.txt mistakes I encounter:

Implementing a robots.txt file. - Google has stated that you only need a robots.txt file if "your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one)." Most situations are best resolved without using a robots.txt file or disallowing URLs. When you think about using robots.txt to disallow URLs, think of it as your last option. Consider things like returning a 410 HTTP response, using noindex meta tags and rel=canonical, among other options, first.
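
For reference, if you do publish a robots.txt file but do not want to block anything, this minimal file allows all compliant robots to crawl everything (an empty Disallow value disallows nothing):

    User-agent: *
    Disallow: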

Not disallowing URLs 24 hours in advance. - In 2000 Google started checking robots.txt files once a day. Before 2000, Google only checked robots.txt files once a week. As a result, URLs disallowed via robots.txt were usually crawled and indexed during the weeklong gap between robots.txt updates. Today, Google usually checks robots.txt files every 24 hours, but not always; Google may increase or decrease the cache lifetime based on max-age Cache-Control HTTP headers. Other search engines may take longer than 24 hours to check robots.txt files. Either way, it is entirely possible for content disallowed via robots.txt to be crawled during the gap before the next robots.txt check. To prevent pages from being crawled, their URLs must be added to robots.txt at least 24 hours in advance.
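
For example, a robots.txt response can signal how long it may be cached with a Cache-Control header; the six-hour value below is only an illustration:

    HTTP/1.1 200 OK
    Content-Type: text/plain
    Cache-Control: max-age=21600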

Disallowing a URL with robots.txt to prevent it from appearing in search results. - Disallowing a URL via robots.txt will not prevent it from being seen by searchers in search results pages. Crawling and indexing are two independent processes. URLs disallowed via robots.txt become indexed by search engines when they appear as links in pages that are not disallowed via robots.txt. Google then associates text from those other sources with the disallowed URLs and returns them in search results pages, all without crawling the disallowed pages. To prevent URLs from appearing in Google search results, the URLs must be crawlable and not disallowed with robots.txt. Once a URL is crawlable, noindex meta tags, password protection, X-Robots-Tag HTTP headers and/or other options can be implemented.
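
For example, a crawlable page can be kept out of search results with a robots meta tag in the page, or with the equivalent HTTP response header:

    <meta name="robots" content="noindex">

    X-Robots-Tag: noindex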

Using robots.txt disallow to remove URLs of pages that no longer exist from search results. - Again, the robots.txt file will not remove content from Google. Google does not assume that content no longer exists just because it is no longer accessible to search engines. Using robots.txt to disallow URLs of pages that have been indexed but no longer exist prevents Google from detecting that the pages have been removed. As a result, these URLs will be treated just like any other disallowed URLs and will probably linger in search results for some time. In order for Google to remove old pages from search results quickly, Googlebot must be able to crawl the pages, which means they must not be disallowed with robots.txt. Until Google detects that content has been removed, keyword and link data for these pages will continue to appear in Google Webmaster Tools. When pages have been removed from a website and should be removed from search results pages, allow search engines to crawl the URLs and return a 410 HTTP response. I was recently able to have 150,000 pages removed from search results in 7 days using this method.
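
As a rough sketch, on an Apache server (assuming Apache is your web server, and using a hypothetical path) mod_alias can return that 410 for a removed page:

    # Tell browsers and crawlers this page is gone for good (410)
    Redirect gone /old-page/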

Disallowing URLs that redirect with robots.txt. - Disallowing a URL that redirects (returns a 301 or 302 HTTP response or uses a meta refresh) to another URL "disallows" search engines from detecting the redirect. Because the robots.txt file does not remove content from search engine indexes, disallowing a URL that redirects to another URL typically results in the wrong URL appearing in search results, which in turn corrupts analytics data even further. For search engines to handle redirects correctly and keep analytics clean, redirecting URLs should be accessible to search engines and not disallowed via robots.txt.
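
For example, if /old/ returns a 301 redirect to /new/ (hypothetical paths), a rule like the one below hides that redirect from every compliant crawler:

    User-agent: *
    Disallow: /old/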

Using robots.txt to disallow URLs of pages with noindex meta tags - Disallowing URLs of pages with noindex meta tags will "disallow" engines from seeing the noindex meta tag. As a result and as mentioned earlier, disallowed URLs can appear indexed in search results. If you do not want the URL of a content page to be seen by users in search results, use the noindex meta tag in the page and allow the URL to be crawled.

Some sites try to communicate with Google through comments in robots.txt - Googlebot essentially ignores comments in robots.txt like you see at nike.com/robots.txt, yelp.com/robots.txt and etsy.com/robots.txt.

Using robots.txt to disallow URLs of pages with rel=canonical or nofollow meta tags and X-Robots-Tags. - Disallowing a URL prevents search engines from seeing its HTTP headers and meta tags, so none of these will be honored. In order for engines to honor HTTP response headers or meta tags, URLs must not be disallowed with robots.txt.
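
For instance, a canonical hint delivered in an HTTP Link header goes unseen if the URL is disallowed (hypothetical URL):

    Link: <https://www.example.com/preferred-page/>; rel="canonical"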

Disallowing Confidential Information via robots.txt. - Anyone who understands robots.txt can access the robots.txt file for a website. For instance, google.com/robots.txt and apple.com/robots.txt. Clearly, the robots.txt was never intended as a mechanism for hiding information. The only way to prevent search engines from accessing confidential information online and displaying it to users in search results pages is to place that content behind a login.

WHOA NELLY robots.txt. - Even though most sites do not need a robots.txt file, many look like https://www.google.com/robots.txt. I call these "WHOA NELLY robots.txt files." Complex robots.txt files invite mistakes, both on your end and by search engines. For example, the maximum file size for a robots.txt file is 500 KB; text beyond that limit is ignored by Google. Robots.txt files should be like Snickers Mini bars, short and sweet.

Robots.txt postpone. - If Google requests a robots.txt file and does not receive a conclusive response (for example, the request times out or returns a 5xx server error), Google will postpone crawling the site until a later time. For that reason it is important to ensure that robots.txt URLs always return a 200, 403 or 404 HTTP response.

403 robots.txt. - Returning a 403 HTTP response for robots.txt indicates that no file exists. As a result, Googlebot can assume that it is safe to crawl any URL. If your robots.txt returns a 403 HTTP response and this is an issue, simply change the response to a 200 or 404.

User-Agent directive override - Googlebot follows only the most specific user-agent group that matches it in robots.txt and ignores the rest, so a specific group can effectively override a generic one regardless of order. This is why it is best to test robots.txt in Google Webmaster Tools.
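
For example, in the file below Googlebot obeys only the Googlebot group, so /private/ (a hypothetical path) is not disallowed for Googlebot even though the generic group blocks it for everyone else:

    User-agent: *
    Disallow: /private/

    User-agent: Googlebot
    Disallow: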

Robots.txt case sensitivity - The URL of the robots.txt file and the URL paths listed in the robots.txt file are case-sensitive. As a result, you can expect issues if your file is named ROBOTS.TXT or if the paths it lists do not match the case of the URLs actually being served.
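
For example, the rule below blocks /Private/ but not /private/ or /PRIVATE/ (hypothetical paths):

    User-agent: *
    Disallow: /Private/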

Removing robots.txt file URLs from search results. - To prevent a robots.txt file from appearing in Google search results, webmasters can disallow robots.txt via robots.txt and then remove it via Google Webmaster Tools. Another option is to serve the robots.txt file with an X-Robots-Tag noindex HTTP header.
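
For example, the robots.txt file's own HTTP response could include:

    HTTP/1.1 200 OK
    Content-Type: text/plain
    X-Robots-Tag: noindex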

robots.txt Crawl-delay - Sites like http://cs.stanford.edu/robots.txt include a Crawl-delay directive in robots.txt, but Google ignores it. In order to control how Google crawls your site, use the crawl rate settings in Google Webmaster Tools.
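
For reference, a Crawl-delay directive looks like the line below; Googlebot does not honor it, though some other crawlers may:

    User-agent: *
    Crawl-delay: 10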

 

For those who are not aware, Google Glass is a wearable computer with an optical head-mounted display (OHMD) being developed by Google under Project Glass, a research and development project with a mission of producing a mass-market ubiquitous computer. Glass is a truly amazing technology and years ahead of its time.

Google Glass 2

After unknowingly trying out Google Glass 2 several weeks before it was announced, I was sold and could not wait to get my hands on a pair.  I jumped at the opportunity to be part of Google’s Glass Explorer program and have been using the latest version of Google Glass for several days.

New Google Glass Rear

Unlike the first iteration of Google Glass, which had to be picked up in person from one of three locations, the latest version can be shipped or picked up. According to the information provided during the checkout process, Glass orders may take up to 48 hours to process. I ordered Glass at 7:03 PM EST on Monday; it shipped at 11:10 PM EST Monday and arrived at my front door at 9:56 AM EST on Tuesday. I could not believe how fast my shipment arrived. I don’t think anything has ever arrived so fast. You can bet Glass fulfillment is something currently being tested and monitored.

Google Glass 2 in box

In addition to new shipping options and awesome packaging, the latest iteration of Glass includes several new accessories. According to Google, “Each Glass comes with a protective pouch to store your Glass. To store, simply slide Glass in so that your display is snug within the hard protective housing at the bottom of the pouch.”

Glass Pouch
Google Glass Pouch

The pouch is soft, high quality and well designed, except that my Glass does not seem to fit inside with the Shade accessory attached.

Glass is larger than pouch with Shade attached

Google Glass 2 also includes a detachable glass “Shade” by Maui Jim and Zeal Optics. Shades appear to be very high quality, are custom made for Glass and come with a custom protective case bearing the Glass logo.

Google Glass Shade
Google Glass Shade

As shown in the video below, attaching Shades to Glass could not be easier:

In addition to the Pouch and Shade, Glass 2 also includes an earpiece which attaches to the same micro-USB port used for charging. I can hear Glass fine without it and have not used this accessory yet. Contrary to a number of reports I have read, using the new earpiece is optional.

Google Glass Earpiece
Google Glass Earpiece

In addition to these new accessories, Google Glass 2 also comes with a charger. The charger is solid, high quality and features the Glass logo.

Glass Earpiece

Setting up Glass 2 and syncing it with my Motorola Moto X only took a few minutes. That said, I did accidentally take a photo and somehow post it online without having any idea that I had done so while setting up Glass. Note to self: use caution when setting up Glass. It probably is not a good idea to set up Glass in the restroom or a similar situation.

Photo taken with Glass

Contrary to claims I have heard in the past, Glass does not get in the way of seeing the world around you. Walking into poles, parking meters or other things is not really a problem. In reality, Glass is actually really small and in some ways even difficult to look at for extended periods of time. People don’t look at you weird on the street when you are wearing Glass. Some folks will try to stop you to ask questions about Glass. Several strangers have even asked to try on my Glass. I suspect that it is only a matter of time before thieves start using this approach to steal Glass. As a result, it is probably a good idea not to wear your Glass in some locations and/or to have a good excuse ready.

All in all Google Glass is an amazing piece of technology and seems to have almost unlimited potential in terms of apps, covers, cases and other accessories. I can’t wait to see proximity-based Glass apps or apps like night vision, thermal imaging, range finders, compasses, altimeters, pedometers, activity monitors and the like. I have never used a Bluetooth headset before, but anyone who does will absolutely love Glass.

Designed By Google

As much as I hate to say it, so far I’m more than a little mixed on the latest version of Google Glass and I have even considered returning it. In the past when I have tried out Glass, it was always a Google test device that I was told was “not fully enabled.” As a result my expectations appear to have been a little out of line with the reality of the current device.

Google Glass Explorer Card

While trying out Glass in the past, I don’t remember ever seeing a phone. When you see folks jumping out of airplanes wearing Glass but no phone, you don’t realize how dependent Glass is on your phone. My expectation was that Glass would be more of a replacement for my phone than an amazingly souped-up Bluetooth headset. I can’t wait to see 4G connectivity integrated into Glass.

Battery life is another issue. According to a number of reports, battery life was supposed to be addressed in the latest version. I am not sure it has been addressed. My new Glass takes 2 hours to fully charge but the battery dies after an hour and a half of use. Because of the position of the charger, wearing Glass while it is charging is not really an option.

Google Glass Battery

From a search marketing / web design perspective, Glass clearly illustrates the importance of responsive design. Sites (like this one) that are not responsive are an absolute nightmare to navigate on Glass. Instead of scrolling through search results like you can on most mobile devices, Glass uses black and white cards that remind me of Unix, and you scroll from side to side rather than up and down. Interestingly, there are no ads in search results on Glass.
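
For what it's worth, a responsive page generally starts with a viewport meta tag like the one below so mobile browsers can scale the layout sensibly:

    <meta name="viewport" content="width=device-width, initial-scale=1">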

New Google Glass

For those who are interested, the user-agent for the new version of Google Glass is "Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; Glass 1 Build/IMM76L; XE10) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30."

More photos of Google Glass...

According to the White House, search engine optimization is a priority for US Government websites. Given the White House mandate and the staggering number of Americans who search for health-related information, you would think the new HealthCare.Gov website would be search engine friendly. Unfortunately the NEW site is not search friendly, due in part to the OLD site not being cleaned up properly. Until issues with the new version of the site are resolved and the old version is cleaned up, users will continue to experience problems. Expert developers are usually focused on development rather than technical search issues. As a result, technical search issues usually go unnoticed and continue to frustrate users.

Technical SEO site assessment is difficult to teach in a public setting because of the risk of potentially offending site owners. Since we all own HealthCare.Gov, offending someone is not a problem. As a result, I took a few minutes to check out the site and have documented a few critical issues below. Please note, the list of issues outlined here is by no means comprehensive and only took a few minutes to compile. Please feel free to post additional search-related issues in the comment section below. The objective of this post is to educate others and lend an extra set of eyes to the “A-Team.”

Security:

It is widely known that HealthCare.Gov has a number of potential security issues and several of these are search related.

HealthCare.Gov security issues

Findings: Without going into detail for security reasons, it is currently possible to search and get results for “public and secure content” at HealthCare.Gov. Please note, this is an internal HealthCare.Gov IT issue, not a web search issue and has already been reported to HealthCare.gov.

Recommendation: Ensure access to content not intended for public consumption is password protected.

Accessibility:

"If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site."

According to Google and Bing, websites should be tested with a text browser. Text browsers make it possible for webmasters to "see" sites more the way search engine crawlers do. This kind of testing will also reveal issues experienced by individuals with disabilities when accessing the site with an assistive device.
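
For example, with the Lynx text browser installed, a command like this shows roughly the text a crawler has to work with:

    lynx -dump https://www.healthcare.gov/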

HealthCare.Gov accessibility issues

Findings: Individuals with disabilities and search engines may have difficulty accessing and interacting with portions of the new website.

Recommendation: Ensure that fancy site elements don't interfere with the delivery of important textual content across platforms, even when images and JavaScript are disabled.

Searcher Intent:

When users search for [healthcare.gov], chances are they want to navigate to HealthCare.Gov, the US Government health insurance marketplace.

HealthCare.Gov search results

Findings: Currently when users search for [healthcare.gov] they are returned the Google search results above. Clicking on the top result in the sitelinks section takes searchers to finder.healthcare.gov, which "is not the Health Insurance Marketplace."

Recommendation: Demote the sitelink in question via Google Webmaster Tools.

Version Issues:

When the same text content appears on different webpages, it is considered duplicate content by search engines. There is no penalty for duplicate content, but it can dilute certain ranking signals. As a result, search engines recommend that webmasters specify the preferred version of each page.

HealthCare.Gov search results

Findings: The same content, and different combinations of content from various versions of both the old and new website, appears across multiple subdomains, for example Spa.HealthCare.Gov, www.HealthCare.Gov, Finder.HealthCare.Gov and LocalHelp.HealthCare.Gov, just to name a few. As a result, it is possible that searchers will arrive at an unintended subdomain and the site will appear not to work.

Recommendation: Use rel=canonical attributes to specify which page version is preferred and return 410 HTTP responses for pages at additional subdomains.
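
As a sketch, a duplicate page on a secondary subdomain could point search engines at the preferred www version with a link element like this (hypothetical URL path):

    <link rel="canonical" href="https://www.healthcare.gov/some-page/">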

Soft 404 Pages:

"Usually, when someone requests a page that doesn’t exist, a server will return a 404 (not found) error. This HTTP response code clearly tells both browsers and search engines that the page doesn’t exist. As a result, the content of the page (if any) won’t be crawled or indexed by search engines." (https://support.google.com/webmasters/answer/181708?hl=en)

HealthCare.Gov soft 404

Findings: HealthCare.Gov errors do not redirect to a dedicated 404 landing page or return a 404 HTTP response. As a result, URLs for pages without content will be indexed by search engines when posted online. In addition, versions of older pages like http://finder.healthcare.gov/404.html return a 302 HTTP response, which is a temporary redirect. As a result, site error pages will continue to be indexed and frustrate users.

Recommendation: Create a dedicated 404 page which returns a 404 HTTP response and redirect error requests to the dedicated 404 URL.
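
As a sketch, on an Apache server (an assumption about the stack, with a hypothetical error page path) a custom error page can be configured so the 404 status is preserved:

    # Serve /404.html for missing pages while still returning a 404 status
    ErrorDocument 404 /404.html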

Development Platform Indexing:

Findings: The new HealthCare.Gov website appears to have been developed at the subdomain Test.HealthCare.Gov. This subdomain does not appear to have been password protected and, as a result, was crawled and indexed by search engines. Currently hundreds of pages from this subdomain are indexed in search results. In order to help prevent searchers from going to the developer version of the site, Test.HealthCare.Gov now returns a 503. Disallowing via robots.txt or returning a 503 will not prevent pages from appearing in search results. The only way to prevent content from appearing in search results is to add the noindex meta tag or password protection.

HealthCare.Gov development platform

Recommendation: To have this content removed from search results, return a 401 HTTP response.
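
For example, a password-protected test subdomain would answer unauthenticated requests with something like:

    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: Basic realm="Restricted"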

Breadcrumbs:

"A breadcrumb trail is a set of links (breadcrumbs) that can help a user understand and navigate your site's hierarchy." In order to understand information in a page, searchers need to know where they have landed in the site architecture.

HealthCare.Gov development platform

Findings: When users arrive at the page above from search results, there is currently nothing to indicate where they are within the site architecture. For example, if a user arrives at the page above from search, there is nothing to indicate whether the information applies to business or individual health care plans.

Recommendation: Implement breadcrumb navigational elements on each page.
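
A minimal sketch of what a breadcrumb trail might look like in plain HTML (hypothetical paths and labels):

    <nav class="breadcrumbs">
      <a href="/">Home</a> &gt;
      <a href="/small-businesses/">Small Businesses</a> &gt;
      <span>Coverage Options</span>
    </nav>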

Webmasters need to take action to resolve any of the manual action notification messages listed below if they appear in Google Webmaster Tools under Search Traffic > Manual Actions.

Cloaking and/or sneaky redirects

Some pages on this site appear to be cloaking (displaying different content to human users than is shown to search engines) or redirecting users to a different page than Google saw. Learn more.

Hacked site

Some pages on this site may have been hacked by a third party to display spammy content or links. You should take immediate action to clean your site and fix any security vulnerabilities. Learn more.

Pure spam

Pages on this site appear to use aggressive spam techniques such as automatically generated gibberish, cloaking, scraping content from other websites, and/or repeated or egregious violations of Google’s Webmaster Guidelines. Learn more.

Thin content with little or no added value

This site appears to contain a significant percentage of low-quality or shallow pages which do not provide users with much added value (such as thin affiliate pages, cookie-cutter sites, doorway pages, automatically generated content, or copied content). Learn more.

Unnatural links from your site

Google detected a pattern of unnatural, artificial, deceptive, or manipulative outbound links on pages on this site. This may be the result of selling links that pass PageRank or participating in link schemes. Learn more.

Unnatural links to your site

Google has detected a pattern of unnatural, artificial, deceptive, or manipulative links pointing to pages on this site. These may be the result of buying links that pass PageRank or participating in link schemes. Learn more.

Unnatural links to your site—impacts links

Google has detected a pattern of unnatural, artificial, deceptive, or manipulative links pointing to pages on this site. Some links may be outside of the webmaster’s control, so for this incident we are taking targeted action on the unnatural links instead of on the site’s ranking as a whole. Learn more.

User-generated spam

Pages from this site appear to contain spammy user-generated content. The problematic content may appear on forum pages, guestbook pages, or user profiles. Learn more.

If you have received the Spammy freehosts message, the Hidden text and/or keyword stuffing message, or any other messages from the Google Webmaster Tools Manual Action viewer since August 8, 2013, please forward them, send screenshots and/or post them in the comments below.