In May 2024, the Google Leak broke: a momentous event that confirmed many assumptions long held by SEO practitioners, often in contradiction with Google’s official statements. Internal API documentation for the search engine became public and was analyzed by the world’s top SEO experts. In total, the roughly 2,500 pages of documentation from Google’s Content API Warehouse describe 2,596 modules with 14,014 attributes, many of which appear to be ranking signals.
Be careful: the documents do not show how much weight each factor carries in the ranking. Based on the attributes, however, we can identify which elements Google’s algorithm takes into account. Google spokesperson Davis Thompson confirmed the authenticity of the documents to The Verge, but cautioned readers against drawing inaccurate conclusions. Michael King, founder and CEO of iPullRank, and Rand Fishkin, founder of SparkToro and former CEO of Moz, analyzed the documents, comparing them with information that emerged from the antitrust trial against Google. From their work we can draw important confirmations about ranking factors.
Where to find the full Google Leak document
The documentation, created on March 27, 2024, was hosted on GitHub, with access meant to be reserved for employees working on Google’s Content API Warehouse. Inadvertently, the folder became publicly accessible and remained exposed until May 7. In early May, an anonymous source shared the files with Rand Fishkin, who analyzed the contents together with Michael King.
Today the Google Leak files can be examined in detail on the Hexdocs platform. The documentation is very long and technical, so below I summarize the key elements.
The most important ranking factors revealed by the Google Leak
Not all official claims about Google’s algorithm were 100% true: this is what emerges from the Google Leak files. Below, I report the evidence on ranking factors that emerged from the confidential documents.
The concept of entities has now overtaken that of keywords. Google has classified the information on the web and identified entities composed of attributes; these entities are connected to one another, forming a dense network of concept maps.
According to the information in the Google Leak documents, entities are linked to the authors of web content (which reinforces theories about the E-E-A-T model) and to brands, which gain authority in relation to specific topics.
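As a purely hypothetical sketch (the leak names modules and attributes, not Google’s actual data structures), the entity-to-topic connections described above can be pictured as a small graph. Every entity name and edge below is invented for illustration:

```python
# Purely illustrative: a tiny entity graph linking an author and a brand
# to topics, in the spirit of the connections the leak describes.
# All entity names and edges here are invented.
entity_graph: dict[str, set[str]] = {
    "author:Jane Doe": {"topic:SEO", "brand:ExampleCo"},
    "brand:ExampleCo": {"topic:SEO", "topic:digital marketing"},
}

def related_topics(entity: str) -> set[str]:
    """Return the topics directly connected to an entity in the graph."""
    return {node for node in entity_graph.get(entity, set())
            if node.startswith("topic:")}

print(related_topics("brand:ExampleCo"))
```

In a model like this, an author who consistently publishes on a topic accumulates connections to it, which is one way the authority effects described above could arise.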
User clicks and data collection from Google Chrome
Google analyzes user clicks and collects information from the Google Chrome browser. The files reveal an open secret (in Italian, “il segreto di Pulcinella”): everyone suspected it, no one could prove it, and Google always denied it. Attributes such as “badClicks”, “goodClicks”, “lastLongestClicks” and “unsquashedClicks” suggest that the user’s clickstream on a site is measured and affects ranking.
This is the job of the re-ranking “twiddler” NavBoost, which analyzes users’ click logs, distinguishing natural clicks from forced or spam clicks and attributing value to actions such as session duration and time on page. Google Chrome helps supply useful information to the algorithm; this shouldn’t surprise anyone used to working with the PageSpeed Insights tool. From this data, for example, Google gathers the information needed to create sitelinks to a website’s most viewed pages.
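The click attributes above can be pictured with a toy model. To be clear, the names “goodClicks”, “badClicks”, “lastLongestClicks” and “unsquashedClicks” come from the leak, but their exact semantics and any scoring formula are undisclosed; the sketch below is an assumption, not NavBoost’s logic:

```python
from dataclasses import dataclass

@dataclass
class ClickSignals:
    """Click attributes named in the leaked docs; semantics are assumed."""
    good_clicks: int          # "goodClicks": clicks judged satisfying
    bad_clicks: int           # "badClicks": quick bounces back to results
    last_longest_clicks: int  # "lastLongestClicks": final, longest-dwell clicks
    unsquashed_clicks: int    # "unsquashedClicks": raw count before spam squashing

    def toy_boost(self) -> float:
        """Invented formula: reward good and long clicks, penalize bad ones."""
        total = self.good_clicks + self.bad_clicks
        if total == 0:
            return 0.0
        return (self.good_clicks + 2 * self.last_longest_clicks - self.bad_clicks) / total

signals = ClickSignals(good_clicks=80, bad_clicks=20,
                       last_longest_clicks=10, unsquashed_clicks=95)
print(round(signals.toy_boost(), 2))  # (80 + 20 - 20) / 100 = 0.8
```

The point of the sketch is the shape of the signal, not the numbers: satisfying clicks and long dwell times push a result up, pogo-sticking pushes it down.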
In 2011, after the Panda update was released, people started talking about Site Authority, and many tools such as Semrush, SEOZoom, and Moz included this variable in their reports (trying to guess which attributes the algorithm might consider), while Googlers continued to deny its existence for a long time. Today, the attribute “siteAuthority” appears in the confidential documents. This element collects information used to define the relevance and authority of a domain. Age is one of its inputs, as the “hostAge” attribute suggests. So the sandbox effect, which holds back new websites on highly competitive keywords, does exist!
In addition, we know for sure that Google doesn’t forget. The “CrawlerChangerateUrlHistory” attribute tells us that the search engine stores copies of previous versions of pages, which can be used to penalize fraudulent SEO practices.
It also classifies websites by type. This is the case for personal blogs, which have different goals from a corporate or informational site and are labeled “smallPersonalSite”. For sensitive topics tied to current affairs, such as Covid and political elections, it has instead added exception lists, through attributes such as “isCovidLocalAuthority” and “isElectionAuthority”, that subject pages to stricter control.
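A hypothetical way to picture these labels: the flag names “smallPersonalSite”, “isCovidLocalAuthority” and “isElectionAuthority” appear in the leak, but how Google combines them is unknown, so the toy rule below is invented:

```python
# Flag names come from the leaked attributes; the rule itself is invented.
SENSITIVE_AUTHORITY_FLAGS = {"isCovidLocalAuthority", "isElectionAuthority"}

def on_sensitive_topic_list(labels: dict[str, bool]) -> bool:
    """Toy check: is the page on one of the sensitive-topic exception lists?"""
    return any(labels.get(flag, False) for flag in SENSITIVE_AUTHORITY_FLAGS)

page_labels = {"smallPersonalSite": True, "isCovidLocalAuthority": False}
print(on_sensitive_topic_list(page_labels))  # False
```

The takeaway is simply that per-site boolean labels like these let the algorithm route different kinds of sites through different treatment.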
Link building affects SEO ranking
A hugely divisive topic among SEO experts is link building. The secret documents tell us that backlinks are a ranking factor on Google. We don’t know how much weight they carry, but we have confirmation that they have value. Several of the aspects discussed above affect the value of links, including click counts collected via Google Chrome and relevance according to the entity map.
Quality Raters’ feedback determines the relevance of content
Google periodically updates its guidelines for quality raters and has long used the EWOK quality-rating platform (named like the furry creatures of the “Star Wars” saga). The “GoogleApi.ContentWarehouse.V1.Model.RepositoryWebrefDocLevelRelevanceRatings” module reveals that human evaluations by quality raters are actually used: in particular, it indicates whether content is considered relevant and understandable.
SEO optimization: what Google Leak documents tell us
We’ve seen the most important ranking factors that emerge from the Google Leak documents. We still don’t know how much weight each of them carries, and the files contain many others of lesser scope. From all this, we can derive 5 best practices for those involved in SEO, content, and digital marketing:
- take care of the brand and how it is perceived online;
- do internal and external link building and digital PR activities;
- try to get traffic from different channels: organic search, social, advertising, newsletters, and referrals;
- foster an internal navigation path with a clear menu, relevant touchpoints, and high-value related resources;
- create relevant and quality content.