Tag Archives: Googlebot

Does Google's removal of the following paragraph mean that creating static copies of dynamic pages is no longer necessary?

"Consider creating static copies of dynamic pages. Although the Google index includes dynamic pages, they comprise a small portion of our index. If you suspect that your dynamically generated pages (such as URLs containing question marks) are causing problems for our crawler, you might create static copies of these pages. If you create static copies, don't forget to add your dynamic pages to your robots.txt file to prevent us from treating them as duplicates."

http://www.google.com/support.....py?answer=40349&ctx=sibling

Until today, Google suggested creating static versions of dynamic pages. The reason: Googlebot had difficulty crawling dynamic URLs, especially URLs containing question marks and/or other symbols. To prevent the duplicate content issues caused by sites having both static and dynamic versions, Google suggested "disallowing" the dynamic version via robots.txt. While this tactic helped the engines, some would say maintaining both versions thinned PageRank as well as the relevancy of anchor text in inbound links. Either way, it will be interesting to see how this move impacts Flash sites with a "static" version. Safe to say, Google is quickly advancing in its ability to crawl content!
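For anyone who never set this up, here is a minimal sketch of the arrangement Google described, with hypothetical URLs and a simple prefix check written in TypeScript (Googlebot's real matching also honors wildcard patterns, which this sketch doesn't attempt): the dynamic, question-mark URL is disallowed while the static copy stays crawlable.

  // Hypothetical rule from robots.txt: Disallow: /catalog.php
  const disallowRules: string[] = ["/catalog.php"];

  // Basic robots.txt semantics: a rule blocks any path that starts with it.
  function isCrawlable(path: string): boolean {
    return !disallowRules.some((rule) => path.startsWith(rule));
  }

  console.log(isCrawlable("/catalog.php?item=42"));  // false - dynamic version disallowed
  console.log(isCrawlable("/catalog/item-42.html")); // true  - static copy still crawlable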

By now you probably know Google indexes text content within Flash thanks to Google's new Algorithm for Flash. In case you missed it, Google recently updated their original announcement to include additional details about how Google handles Flash files.

SWFObject - Google confirms that, as of the July 1st launch of the new algorithm, Googlebot did not execute JavaScript such as the type used with SWFObject.

SWFObject - Google confirms it is "now" rolling out an update that enables the execution of JavaScript in order to support sites using SWFObject and SWFObject 2.
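For context, a rough sketch of the SWFObject 2 pattern in question (the movie URL and element id are placeholders, and the declaration merely stands in for the real swfobject.js library): JavaScript swaps a block of alternative HTML for the Flash movie, which is why Googlebot has to execute JavaScript before it can see the SWF at all.

  // Stand-in declaration for the global provided by swfobject.js.
  declare const swfobject: {
    embedSWF(
      swfUrl: string,
      replaceElemId: string,
      width: string,
      height: string,
      swfVersion: string
    ): void;
  };

  // "altContent" holds the alternative HTML that non-Flash visitors (and,
  // until this update, Googlebot) would see; embedSWF replaces it with the
  // movie when a suitable player version is detected.
  swfobject.embedSWF("movie.swf", "altContent", "550", "400", "9.0.0");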

According to Google, "If the Flash file is embedded in HTML (as many of the Flash files we find are), its content is associated with the parent URL and indexed as a single entity." I found this isn't the case using a variation of the example used by Google. The following query finds the same content indexed at three URLs (two SWF and one HTML):
http://www.google.com/search?q=%22NASA%27s+Hubble,+...

http://www.jpl.nasa.gov/multimedia/deep-impact/index.swf
http://www.nasa.gov/externalflash/deepimpact_flash/index.swf
http://www.jpl.nasa.gov/multimedia/deep-impact/index-flash.html

Additional:

Deep Linking - Google doesn't support deep linking. "In the case of Flash, the ability to deep link will require additional functionality in Flash with which we integrate."

Non-Malicious Duplicate content - Flash sites containing "alternative" content in HTML might be detected as having duplicate content.

Googlebot, it seems, still ignores #anchors but will soon crawl SWFObject. Given that Googlebot can, or soon will, crawl SWFObject sites, major reworks should be considered for "deep linking" sites where correlating "alternative" HTML content pages contain the same Flash file and are accessible via multiple URLs.
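To make the #anchor point concrete, here is a hedged, modern-browser TypeScript sketch of the fragment-based deep linking many Flash sites use (often via libraries such as SWFAddress); because the crawler ignores everything after the #, every "deep" state resolves to the same URL.

  // Navigating inside the movie only changes the fragment; no new page loads.
  function deepLinkTo(state: string): void {
    window.location.hash = state; // e.g. "#/photos/42"
  }

  // A shared link or back/forward restores the movie state from the fragment.
  // Googlebot ignores the fragment, so /gallery#/photos/42 and /gallery look
  // like the same URL to the crawler.
  window.addEventListener("hashchange", () => {
    const state = window.location.hash.slice(1);
    console.log("restore Flash movie state:", state);
  });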

ActionScript - Google confirms indexing ActionScript 1, ActionScript 2 and ActionScript 3, while noting that the ActionScript itself shouldn't be exposed to users.

External Text (XML) - Google confirms that content loaded dynamically into Flash from external resources isn't associated with the parent URL.
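This isn't Flash, but the same runtime-loading pattern sketched in TypeScript for illustration (the content.xml URL is hypothetical): the text lives at the external resource's own URL rather than on the parent page, which is presumably why Google doesn't associate it with the parent URL.

  // Fetch external XML at runtime and pull out its text, the way a Flash
  // movie might load its copy from a separate content file.
  async function loadExternalCopy(url: string): Promise<string> {
    const response = await fetch(url); // e.g. "content.xml"
    const doc = new DOMParser().parseFromString(await response.text(), "application/xml");
    return doc.documentElement.textContent ?? "";
  }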

While this is a great development for Flash developers moving forward, lots of education may be required.

A new and updated version of Google's "Spam Recognition Guide for Quality Raters" surfaced recently. At first I was a little skeptical as to the document's authenticity, but after a little "forensic" analysis, I feel reasonably certain the document is at least partially legitimate. I'm still going through it, but a few sections seemed worth mentioning. Big hat tip to vizualbod.com.

"Revised Rating:
Vital
Useful
Relevant
Not Relevant
Off-topic
Didn't Load
Foreign Language
Unratable"

Interesting confirmation that being "relevant" isn't always the most important issue.

"Some individuals have more than one blog and/or more than one homepage on a social networking site (e.g. myspace, facebook, friendster, mixi). When these pages are maintained by the individual (or an authorized representative of the individual), they are all considered to be Vital."

Hmmm... think social networks are a total waste of time, do you?

"Relevant
"A rating of Relevant is assigned to pages that have fewer valuable attributes than were listed for Useful pages. Relevant pages might be less comprehensive, come from a less authoritative source, or cover only one important aspect of the query."

I've always suspected that the total number of "valuable attributes" is important. As in, more information is better. This factor also comes into play when sites use formats or technology that prevent Google from extracting information used as signals.

"Recognizing true merchants:
Features that will help you determine if a website is a true merchant include:

  • a "view your shopping cart" link that stays on the same site and updates when you add items to it,
  • a return policy with a physical address,
  • a shipping charge calculator,
  • a "wish list" link, or a link to postpone purchase of an item until later,
  • a way to track FedEx orders,
  • a user forum, the ability to register or login,
  • a gift registry, or
  • an invitation to become an affiliate of that site"

Confirmation that even "quality sites" could be mistaken for something other than a true merchant, or in some way devalued, if all bases aren't covered.

- beu