In this chapter, we will discuss how a search engine, specifically Google, works to provide users with information that matches their query in seconds. Search engines operate using a three-step process: web crawling, indexing, and ranking. While we already touched on Google's ranking algorithm, this chapter will narrow in on the first two steps of that process; in order to rank, you need a database of relevant information to parse through. The first section of this chapter will cover web crawling, discuss a variety of issues that may hinder a web page's ability to be crawled, and then explain how you can address those issues. Next, we will discuss what indexing is and why some web pages are better off not being indexed.
By the end of this chapter, you should know how a search engine operates and the factors that play a role in ensuring that your web pages follow crawling and indexing best practices.
KEY TERMS DISCUSSED Bots, Crawl Budget, Crawl Demand, Crawl Rate Limit, Disallow Directive, Fresh Rank, Indexing, Noindex, Robots.txt, Sitemap, User-Agent, Web Crawling
Now that we’ve covered the basics, let’s discuss the technology behind the search engine, and how it actually works. We know that the Google algorithm ranks web pages that are relevant to the query, but what is the technical process involved in getting this information?
Have you ever noticed how quickly Google is able to provide you with a SERP after you’ve hit the search button? Well, the precise amount of time is always listed on the page, as seen in the image below.
As you can see, this particular search provided the user with 222 million results in a mere 0.83 seconds. That’s crazy!
What’s the secret behind being able to pull all this information so quickly?
Well, you see, the content of every one of these 222 million web pages has already been analyzed. Google knows what information is available, where it can be found, how relevant it is, and how trustworthy it is. All of this is stored in the search index.
To fully understand how this is done, and what a search index is, let’s first discuss what’s involved in a Google search before you hit the “search” button.
As discussed, there are billions of web pages available on the world wide web, and web crawling is how search engines discover the content of those pages. Google (like every other search engine) deploys its "bots," or "spiders," to browse web pages across the web and gather information about their content; this process is commonly known as "crawling."
Each search engine has a specific name for its bots (called user-agents); the names of the bots are important to note because when writing a robots.txt for your website, you might want to call out specific user-agents and give them rules to follow when crawling. We will talk about robots.txt later in this section.
Here is a list of some of the better-known search engines and their user-agents:
- Google: Googlebot
- Bing: Bingbot
- Yahoo: Slurp
- DuckDuckGo: DuckDuckBot
- Baidu: Baiduspider
- Yandex: YandexBot
As previously mentioned, we will be focusing on the user-agent Googlebot, since Google is by far the most popular search engine on the internet. Googlebots collect a ton of information from each web page they crawl, beyond what is simply written on it. The information that the bots collect includes:
- The URL of the web page
- Meta tag information
- The content on the web page
- The links in the web page, and their destinations
- Web page title
- Headings in the web page
As each web page is crawled, the bots travel from one page to another using the links they find. Aside from browsing new URLs, the bots also revisit existing URLs so that any new information found on those pages can be "read." Everything they find is compiled in a list.
Next, the information on the list is “evaluated” and assigned a weighted value. The value is ultimately determined by a number of different factors, such as which words are most commonly used on the web page, and where they are found. The higher the value of your web page, the higher it will rank on the SERP.
It's obvious that you want your pages crawled by Googlebots so that they can be indexed and subsequently made searchable on Google.
As we previously mentioned, there are literally billions of web pages on the world wide web. It would be impossible for bots to crawl all of them all the time. In some circumstances, especially if you're a larger organization, the "crawl budget" is something you'll want to keep in mind.
We know that bots access web pages through the URL. If your website only has a few hundred or so URLs, the bots can easily crawl all of them to find new or updated content. If, however, a website has upward of 2,000 web pages, the bots might not crawl all of them immediately. This means that new or updated content might take longer than expected to be indexed and made searchable.
In cases like this, the crawl budget is something you want to know about. Google allocates a crawl budget to each website; it determines how often the Googlebots crawl the site and how many resources the server hosting the website can devote to crawling. Google determines the crawl budget for each website based on two factors:
- Crawl Rate Limit
- Crawl Demand
Let me explain.
The Can: Crawl Rate Limit
“Crawl rate limit” means exactly what it sounds like: the limit at which bots are able to crawl your website. In simple terms, it’s the rules that tell Googlebots how many web pages they can crawl at one time on one site and how long they have to wait before moving to the next URL.
These limits are prescribed so that the bots don’t overload your server. If there were no limits, people viewing the web page at the same time as the bots would experience very slow loading times on their web browsers.
As the Google experts themselves put it, "Googlebot is designed to be a good citizen of the web. Crawling is its main priority, while making sure it doesn't degrade the experience of users visiting the site. We call this the 'crawl rate limit,' which limits the maximum fetching rate for a given site."
Google also notes that there are two major factors that contribute to the crawl rate limit:
- Crawl health: This is essentially your website's loading speed. If your pages respond to the bots quickly, the limit is raised, stopping just short of the point where crawling would risk slow loading times for users.
- Search Console Settings: Website owners can reduce the crawl rate through the Search Console settings, though setting higher limits doesn't automatically increase crawling.
The Want: Crawl Demand
Unfortunately, especially for smaller companies that aren't well known or popular, Google may choose to limit the amount of crawling the bots do on your website even if the crawl rate limit hasn't been reached.
This is where the want factor comes in. The bots have to want to crawl your page – and if you want your content searchable on Google, you want them to crawl your pages too!
The more popular and in demand your web pages and content are, the higher your crawl demand will be – it means that people want your stuff to be searchable.
There are essentially three factors that contribute to determining the crawl demand:
- Popularity: The more popular a URL is, the more often it tends to be crawled so that it stays fresh and up-to-date in the index
- Fresh Rank: The more fresh content you put out there, the more the bots will want to crawl your pages. Providing them with fresh new content gives them a reason to keep coming back. If you can’t think of new content to publish, older content can always be refurbished and published again.
- Level of Staleness: The more "stale" a URL is (meaning it hasn't been crawled in a while), the more of an effort the bots make to crawl it so that any updates are indexed
Factors that Affect Your Budget
The factors that lower your budget mostly have to do with having bad or low-value URLs on your website. These URLs waste your server resources on crawling and indexing poor web pages, which in turn makes your site less popular. The less popular your site is, the lower your crawl demand will be, and the lower your crawl demand, the lower your crawl budget will ultimately be.
Of the factors that lower your budget, Google points to six that have the largest impact:
- Faceted navigation: This is when you have a web page that allows users to filter results. Each filtered (or faceted) result tends to have a different URL, which leads to duplicate content and can quickly eat through your budget (see the example URL pattern after this list).
- Duplicate content: Having duplicate content essentially means that Googlebots will use up your budget crawling content they've seen before. This takes time away from the important pages on your website, as they will be crawled less frequently.
- Soft error pages: Soft errors (which we will discuss later in Chapter 7), particularly soft 404 errors, occur when a URL on your website leads to a page that tells the user the content doesn't exist. These are obviously not good for your budget; Googlebots shouldn't be wasting their time connecting to URLs that don't lead to any real content.
- Hacked pages: Websites that have been hacked will impact your crawl budget.
- Infinite spaces: Infinite spaces refer to a large number of links that provide little to no new or relevant content for bots to crawl. An example would be a web page with a calendar that has a "next month" link; in essence, you could click "next month" forever. This is bad for your budget because these URLs have no end for the bots to reach.
- Low quality and spam content: Simply put, if the quality of your webpages isn’t good, your budget will see a decrease.
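To make faceted navigation concrete, here is a hypothetical example (the domain and filter parameters are placeholders): a single product listing page might be reachable at all of the following URLs, each showing essentially the same content:

https://www.example.com/shoes
https://www.example.com/shoes?color=red
https://www.example.com/shoes?size=9
https://www.example.com/shoes?color=red&size=9
https://www.example.com/shoes?size=9&color=red

From Googlebot's perspective, those are five separate URLs to crawl, even though a visitor would see them as one page. Multiply that by every filter combination on every category page and the budget disappears quickly.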
Check Your Crawl Budget
Google offers a free tool for checking your website's crawl activity through the Google Search Console. You just need to add your site's URL and you'll be able to see crawl statistics, in addition to any server errors you may be experiencing.
I'd recommend doing this before trying to optimize your website, since it will not only tell you which errors need to be addressed, but you might also find that your budget is already at an optimal level and there's no need to waste your time. See Chapter 7 for information on crawl errors.
How to Optimize Your Crawl Budget
Essentially, to optimize your crawl budget, the first thing you want to do is ensure that you aren’t making any of the mistakes listed above. If you’re able to avoid all the things that negatively impact crawl budget, you’ll already be in a good place.
In any case, here are some tips on how to make sure that your crawl budget is as healthy as it can be, so that all your content is crawled and indexed quickly. The list below covers additional things that can increase your crawl budget.
- Server Speed – Select a hosting provider that responds quickly to server requests. To see the speed of your website, check out PageSpeed Insights.
- Search Console Settings – Ensure that your crawl rate isn't being limited and is set to let Google optimize crawling for your site.
- Original and rich content – Make sure that you aren’t duplicating content and that every new URL you create leads to new, useful and rich content.
- Sitemap – Make sure you've got a sitemap. The sitemap is an .xml file that lists the web pages on your website. It helps the bots understand your content and navigate your site. Make sure it is up-to-date and doesn't contain any broken links.
Let’s go a little more in depth about what sitemaps are and how they help Googlebots do their job more easily.
As mentioned, a sitemap is essentially a list that tells Googlebots which pages you think are important in order to help them more intelligently crawl your website. The sitemap also provides bots with other information, such as when the page was last updated.
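To make this concrete, here is a minimal sketch of a sitemap following the standard sitemaps.org XML format (the domain, paths, and dates are placeholders, not real pages):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2021-06-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/latest-post</loc>
    <lastmod>2021-06-15</lastmod>
  </url>
</urlset>

Each <url> entry lists one page you want crawled, and the optional <lastmod> tag tells the bots when that page last changed.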
Technically speaking, you don't necessarily need a sitemap, but it's highly recommended as an integral part of SEO.
It lets the bots know which pages you want them to crawl. Further, having a sitemap helps Google to identify whether or not your site is the original publisher of content. If you don’t have a sitemap, Googlebots might think that you have duplicate content, which isn’t good, especially if your original content is inspiring and in demand!
There are a ton of benefits to having a sitemap:
- Identifies updated content: Your pages will rank higher in the SERP if you keep your content fresh and relevant. A sitemap will let the bots know about new or updated content, which could contribute to more favourable rankings.
- Prioritizes important content: You need to be mindful of your crawl budget and make sure that bots aren’t wasting their time on pages that aren’t useful. A sitemap will let them know what pages are the most important to crawl, which will help your best pages get crawled and indexed much more quickly.
- Gets content indexed fast: You definitely want your content to be indexed and searchable as quickly as possible, especially if you are writing about a news story, or a story that is relevant immediately. This is when the sitemap comes in handy – the bots will know what content is of priority to crawl and subsequently index.
- Helps make your content discoverable to Googlebots: When you submit your sitemap, you are making your URLs known to Google. This means you rely less on other webpages linking to yours (external linking) for the bots to find and crawl your pages. We will talk more about external links later in Chapter 4.
How to Create a Sitemap
Creating a sitemap is actually quite simple when using a third-party tool. If you have fewer than 500 web pages, you could use XML-Sitemaps.com or Screaming Frog, both of which I've heard good things about. Google also provides webmasters with a list of other sitemap generator sites that can get the job done.
The first thing you need to do is decide what web pages you want to mark as priority and include in your sitemap. Once you’ve collected a list of the pages you want, the hardest part is done.
From here, you employ the third party tool of your choice, and depending on how many web pages you’ve included, it could take from minutes to a few hours to generate.
Next, once your sitemap.xml has been generated, move it to the main directory (the root) of your own domain. From here, you'll always be able to access your sitemap through your domain.
Lastly, sign into Google Search Console and submit your sitemap's URL so Google knows where to find it.
It's important to note, however, that one sitemap can only contain up to 50,000 URLs. Obviously this isn't a problem for smaller organizations, but some larger ones might need to create multiple sitemaps so that all of their relevant pages can be mapped out for the Googlebots.
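For those larger sites, the sitemaps.org protocol allows a sitemap index file that simply points to several individual sitemaps. A minimal sketch (the file names and domain are placeholders) looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>

You submit the index file to Google Search Console the same way you would a single sitemap.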
While a sitemap provides mere guidelines for the bots, who may eventually crawl URLs that aren't on it, robots.txt gives webmasters a way to make actual rules about what bots can and can't crawl.
In short, robots.txt is defined as a set of instructions that tell bots what they can or can’t crawl on your website.
This is incredibly useful because it helps ensure that the bots aren't wasting any of their resources on web pages that lower your crawl budget, such as pages with infinite spaces or faceted navigation.
It's important to note, however, that disallowing a web page in robots.txt doesn't guarantee it will stay out of Google. If the URL appears as a link on a web page of another website, Google may still treat the page as relevant and include it in the search results. If you want to truly block a page from being crawled or indexed, you need to use the "noindex" and "nofollow" tags, which we will discuss later.
Robots.txt files can contain multiple directives (instructions) for the user-agents to follow. The three most common directives used are:
- Disallow Directive: The standard directive, readable by all user-agents. It tells them which parts of the website they can and can't access.
- Allow Directive: This directive is readable by Google's user-agents, the Googlebots, and is used to indicate which pages can be accessed even if their parent directory is disallowed.
- Crawl-delay Directive: This directive isn’t readable by Googlebots, but is readable by some other user-agents. It’s used to indicate how many seconds a crawler should wait before loading and crawling page content. For Googlebots, this can simply be set in the Google Search Console.
The basic format of a robots.txt, using the standard "disallow" directive, looks like this:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
You can find the robots.txt of any website by simply typing in the root URL of the website and adding /robots.txt at the end (for example, https://www.example.com/robots.txt).
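Putting the directives together, here is a hypothetical robots.txt (the paths and domain are placeholders chosen purely for illustration):

User-agent: *
Disallow: /calendar/
Disallow: /*?color=

User-agent: Googlebot
Disallow: /calendar/
Allow: /calendar/this-month/

User-agent: Bingbot
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml

The first group tells every bot to stay out of the endless calendar pages and the colour-filtered (faceted) URLs. The Googlebot group repeats the calendar rule but allows one specific calendar page back in, the allow directive overriding the disallow as described above; the rule is repeated because a bot that finds a group naming its own user-agent follows only that group. The Bingbot group asks Bing's crawler to wait ten seconds between requests, a directive Googlebot ignores, and the final line tells all bots where to find your sitemap.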
How to Create Robots.txt
Much like sitemaps, websites aren’t required to have a robots.txt, though it’s SEO best practice to have one. Having a robots.txt is also a sign to Google that your website is of quality.
If you are using WordPress for your website, simply log in to your WordPress backend and select SEO > File Editor. From here, you'll be able to create or edit your robots.txt.
If you aren't using WordPress, you can create your robots.txt using a text editor such as Notepad. Once complete, upload the file to the root directory of your domain.
After the bots have completed their crawl, they store all the content they've deemed valuable in what's known as the search index; "indexing" is simply the process of doing this. Google describes this massive database as being similar to an "index in the back of a book — with an entry for every word seen on every web page [that has been indexed]."
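As a toy illustration of the book-index analogy, imagine three indexed pages: A, B and C. For every word, the index stores the pages on which that word appears (along with where and how prominently it appears):

running → A, C
shoes → A, B
marathon → C

When someone searches for "running shoes," Google doesn't go out and scan the web; it looks up "running" and "shoes" in this pre-built index, intersects the lists, and ranks the matching pages.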
When a user hits the search button, the search engine retrieves the data from the search index, which is ready and waiting with all the information from every indexed site. It’s estimated that the contents in this database amount to over 100,000,000 gigabytes in size.
To see how many of your company's web pages have been indexed, simply use the Google Search Console. Further, if you are using a sitemap, you can submit it to the Search Console to see how many of your recommended pages have actually been crawled and indexed.
As we've previously discussed, there are a number of reasons why you might not want Googlebots to crawl certain web pages, such as infinite spaces or faceted navigation, both of which impact your crawl budget. We've also determined that using a robots.txt isn't a foolproof way to guarantee that a web page won't show up in the SERP.
This is where "noindex" tags come into play.
The first thing to understand is that "noindex" is a robots meta tag. Robots tags are one of the four members of the "meta family" (which we discuss in Chapter 3); however, unlike the other members, robots tags provide instructions for the crawlers rather than just giving them information about the web page. This is why we are discussing this aspect of the robots meta tag here: it impacts crawling and indexing.
As we know, robots.txt isn't a foolproof way to guarantee that your page won't be indexed, as Googlebots can still find a way to do so. If, however, you add a noindex tag, Google will not index the page, regardless of any other factors.
By definition, “noindex” simply tells the Googlebots not to index the page. This is a “meta” tag because when in use, it will be found in the HTML code of the web page, in the <head> section. Here is an example of what it would look like:
<!DOCTYPE html>
<html>
<head>
<meta name="robots" content="noindex" />
(…)
</head>
<body>
(…)
</body>
</html>
"Robots" refers to all user-agents. If you wanted to prevent a specific search engine from indexing your web page, you'd simply write its user-agent name in place of "robots," as seen below:
<!DOCTYPE html>
<html>
<head>
<meta name="googlebot" content="noindex" />
(…)
</head>
<body>
(…)
</body>
</html>
In this case, the noindex tag would prevent the web page from being indexed in Google search, but it could still be indexed by Yahoo, for example.
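You can also combine robots meta directives in a single tag. As a minimal example, the following tells all crawlers neither to index the page nor to follow the links on it:

<meta name="robots" content="noindex, nofollow" />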
It's important to note, however, that if you are going to use a noindex tag, you should make sure the web page isn't also disallowed in your robots.txt. This is because Googlebots have to be able to access the web page in order to "read" the noindex tag in the HTML.
This will ensure that even if Googlebots find their way to your web page via a link from another website, it won’t be indexed.
KEY CHAPTER TAKEAWAYS
Web crawling and indexing are the first parts of the search engine process; Googlebots crawl web pages and then index any pages that (a) they are allowed to, based on your robots.txt and noindex directives, and (b) they consider relevant and of good quality (for example, they might not index duplicate content). Having your website crawled is the first step in ensuring that your pages get indexed so that they can be ranked and found via the SERP. Make sure that you are providing Googlebots with a sitemap to make their job as easy as possible. By simply nudging them in the right direction, you'll be able to optimize your crawl budget so that all the pages you want indexed will be crawled as a priority.