Monthly Archives: June 2012

Search engines have focused on simply "matching keywords to queries" for years. This approach is slightly problematic however, because it disassociates keyword meanings for multiple keyword queries. For example, search engines might interpret the query [Paris Hilton] (a proper noun and named entity) as simply a request for instances where the words "hilton" and "paris" appear within a page. With a large enough set of data, fortunately it is possible to make statistical inferences about the intent of a user's query. As a result, Google has relied on statistical inference for uncertain data queries like [Paris Hilton] and [b&b ab] (bed & breakfast in Alberta) for years.

In 2010 Google purchased Metaweb Technologies, Inc. which was the company behind Freebase. Freebase was/is an "open, shared database of the world's knowledge". Before being acquired by Google, Metaweb was in the process of identifying millions of "entities and mapping out how they're related" via Freebase. In addition to entity mapping, Freebase also looks at what words other sites use to refer to entities. In May 2012 Google launched "Knowledge Graph," a “graph” which is built in part on Freebase. According to Google, Knowledge Graph can "understand real-world entities and their relationships to one another." Google hopes Knowledge Graph will improve search results and provide more immediate answers to user's questions in search results pages.

The concept behind Freebase and Google's use of graphed entities is pretty interesting but, I would like to know more about what is really going on under the hood of Google Knowledge Graph. Since Knowledge Graph launched, I have spent hours trying to break it, find bugs, discover issues and/or to identify abnormalities. Remarkably I must say, until last week I had found very little. Then as they say, "it happened!" Last Thursday, while looking for a good example of Google Knowledge Graph results to use in a presentation, I got the search result below.

SERP for Matt Cutts

 

Suddenly it dawned on me, Matt did not go to UNC Law School!

 

Matt Cutts SERP

I clicked on "University of North Carolina School of Law" in Matt's Google's Knowledge Graph result under his bio from Wikipedia but, it returned search results for another entity [university of north carolina at chapel hill]. From that result, I searched for [unc] and was returned this result.

Just to be sure what I was seeing was correct, I deleted all cookies, signed out of Google and restarted my browser. After refreshing all of my settings, I searched for [unc founded] and was returned this search result.

At that point, I realized UNC's founding date even seemed off? I checked and according to the University of North Carolina Planning Department, UNC was founded in 1793 not 1789. To be sure this was not the date UNC's Law School was founded, I checked the UNC School of Law website. According to the site, the first law professor did not arrive at UNC until 1845. Then went back and checked Wikipedia's page for UNC and it did not contain any text being displayed in Google's Knowledge Graph search results either.

With the suspected smoking gun already in hand, I went to Freebase.com and searched for [UNC]. You guessed it, Freebase.com's first result for [UNC] was exactly what had appeared in Knowledge Graph results "University of North Carolina School of Law". It turns out Matt is not alone, all UNC graduates listed in Freebase.com are listed as UNC School of Law graduates even if they did not attend the UNC School of Law. At that point it was clear, Google Knowledge Graph "thinks" UNC and UNC's School of Law are a the same or a single entity because that is what Freebase.com is "telling" Google Knowledge Graph.

Because Freebase data appears in Google Knowledge Graph search results and Google's main search results this issue also means results for 100+ notable figures are potentially incorrect. For instance, according to Google Knowledge Graph results US President James K Polk graduated from UNC's School of Law but UNC's School of Law was founded when he was already in office.

Knowledge Graph Results for James K Polk

In addition to Matt Cutts and President Polk, search results for [Michael Jordan college] in Google's main search results are also incorrect due to this issue.

Knowledge Graph Results for Michael Jordan

Other UNC School of Law alumni according to Freebase and potentially Google Knowledge Graph, include Alge Crumpler, Lawrence Taylor, Andy Griffith, Rick Dees, Roger Mudd, Vince Carter, Jerry Stackhouse and even Thomas Layton, the former CEO of Metaweb.

This issue is potentially due at least in part to the fact that only a shell page for UNC (UNC being the parent University of UNC Law School) existed in Freebase.com until yesterday. To hopefully help improve the quality of Google Knowledge Graph results, I added an image, description, UNC's correct founding date and other information from UNC.edu to UNC's Freebase page yesterday.

With fingers crossed that Matt's wild and crazy UNC Law School days are not his best kept secret, that my site won't vanish from Google tomorrow and that the US Secret Service won't show up at my door, I removed "Law School" from both Matt's and President Polk's profiles in Freebase. As a result, Matt Cutts and President Polk are now the only non-Law School students / graduates in UNC's Freebase page. It will be interesting to see how long these changes take to appear in Google's Knowledge Graph search results.

Google Knowledge Graph is really interesting and seems to be working pretty well despite a few bugs. This is yet another edge case but a situation you should know about. Instances where different entities have the same or similar names are problematic. Instances were multiple keywords are similar to multiple keyword entities are also problematic. Google may already be using Knowledge Graph data based on Freebase.com to determine whether on not content falls in or out of scope. For all of these reasons and others, it is important to ensure you keep an eye on Knowledge Graph results that relate to you. If you notice issues, click on "feedback" just below Knowledge Graph results on the right hand site of Google search results pages.

If you missed it earlier, be sure to check out Barry Schwartz's live blog coverage of SMX Live: You&A with Matt Cutts, head of Google's Webspam team. For those interested, I have posted my notes from the session below.

Penguin:

  • The lead engineer for what would come to be known as "Penguin" picked that name.
  • Penguin addresses spam.
  • Impacted sites are believed to be in current violation of Google Webmaster Guidelines.
  • The only way to recover from Penguin is to properly address guidelines violations.
  • Impacted site experience an algorithmic demotion in search results but not a penalty.
  • Penguin is not a "penalty" or "manual action" because it is algorithmic and therefore not manual.
  • There is no whitelist for Penguin.
  • Google uses 200+ signals for rankings and Penguin is the latest.
  • Sites hit by Penguin can fully recover once guidelines violations are resolved.

Panda:

  • Panda was named after the lead engineer who's last name is Panda.
  • Addresses thin and/or low quality content.
  • Prior to Panda low quality content fell between the Search Quality team and Web Spam.
  • Since Panda, Search Quality and WebSpam teams at Google work closer together.
  • Sites hit by Panda can fully recover once content issues are resolved.

"Manual Actions" the new "Penalty"

  • According to Matt, "We don't use "penalty" anymore, we use "manual action" vs an algorithmic thing."
  • Manual reviews result in a manual actions whereas algorithmic detection results in a demotion.
  • 99 percent of manual actions are accompanied by webmaster notifications via Google Webmaster Tools.
  • Algorithmic issues do not result in a notification via Google Webmaster Tools.

Unnatural Link Notifications:

  • Unnatural link messages imply a manual action (penalty).
  • For unnatural link notifications webmasters should submit a reconsideration request.
  • According to Matt, "typically if you get a notification you will see a downgrade in rankings."
  • Google wants "to see a real effort" on the part of webmasters when it comes to removing unnatural links. Some webmasters have gone so far as to scan in images of letters sent to domain owners requesting links be removed.
  • When reinclusion requests are submitted for unnatural link notifications, "Google reviews a random sample to see if those links are removed."
  • Webmasters should attempt to remove at least 90% of unnatural links pointing to their site.
  • Google understands it is difficult to remove links and is working on alternative solutions.
  • Google is working on a new feature which will allow webmasters to "disavow" links pointing to their website.
  • If you cannot remove some links, it may be possible to remove the entire page if it is not the homepage or similar.

Paid Links that pass PageRank:

  • Despite the fact that Google is able to detect paid links passing PageRank and does not count these links, Google recently started taking manual action by penalizing these sites.
  • According to Matt, Google is taking manual action and penalizing sites with links passing PageRank because companies continue to profit off of these practices.
  • Google wants people to understand that PageRank passing paid links are a link scheme, a waste of time and money.

Affiliate links:

  • Google handles affiliate links well but including rel=nofollow never hurts.
  • "Nofollowed links account for less than 1% of all links on the internet."
  • Negative SEO

    • The recent reaction to "Negative SEO" has been interesting.
    • Negative SEO has been around a long time.
    • According to Matt, "It is possible for people to do things to sites, like steal domains."
    • Matt pointed out that Google changed the wording of Google Webmaster Guidelines some time ago to address negative SEO. It says, "Practices that violate our guidelines may result in a negative adjustment of your site's presence in Google, or even the removal of your site from our index."

    Bounce Rate:

    • According to Matt, Google Analytics data is not used for rankings.
    • Bounce rates from search results is noisy because of redirects, spam and/or other issues.
    • Bounce rates do not accurately measure quick answers.
    • Because users often get the answer they want and then leave, bounce rate is not a good metric for Google to use.