Anyone concerned with their website’s search engine optimisation would do well to understand how the search engine ‘crawlers’ work. There’s a basic introduction from Google here but, as you might expect, the company doesn’t give away too much technical detail.
Many people have spent time trying to reverse-engineer what the ‘Googlebots’ do, and one way to do this is by analysing sites’ log files. In What I Learnt from Analysing 7 Million Log File Events on the Screaming Frog blog, author Roman Adamita reports on an analysis of the log files of a 100,000-page eCommerce site. There are some interesting observations.
In the case of that website, Googlebot crawled the robots.txt file between 6 and 60 times per day! Our websites might not get as many visits, but I just checked the server log of a small blog I run that posts once a week, and on a randomly selected day Googlebot checked 23 different pages, some of them several times. It’s a mind-boggling data collection operation.
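You can get a similar picture from your own server log with a few lines of scripting. Here’s a minimal sketch in Python, assuming your log is in the common Apache/NGINX combined format; note that the user-agent string can be spoofed, so a serious audit should also verify the visitor really is Googlebot (Google documents a reverse-DNS check for this).

```python
import re
from collections import Counter

# Matches the request and status portion of a combined-format log line, e.g.:
# 66.249.66.1 - - [10/May/2024:06:25:01 +0000] "GET /robots.txt HTTP/1.1" 200 150 "-" "...Googlebot..."
LOG_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def googlebot_hits(lines):
    """Count requests per URL path for log lines whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:  # crude filter on the user-agent string
            continue
        m = LOG_RE.search(line)
        if m:
            counts[m.group("path")] += 1
    return counts
```

Feed it the lines of your access log (for example via `open("access.log")`) and `googlebot_hits(...).most_common(10)` gives the pages Googlebot visits most often.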
That said, don’t assume the search engine crawlers will be along within minutes. If you’re adding something time-sensitive to your site, use the URL Inspection tool in Google Search Console to request indexing and get the page crawled more quickly.
Some of the most interesting observations in the article concern pages that have gone away. It seems that ‘404’ (not found) pages may be crawled indefinitely, perhaps because there are internal or external links to those pages, including in the sitemap. If we don’t want to waste Googlebot visits, it’s important to identify and remove these links to nonexistent URLs. Also, ‘301-redirected’ URLs can be crawled for up to a year, so keep those redirects in place.
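The same log-scanning approach can surface those wasted visits. A small sketch, again assuming a combined-format access log: it lists the URLs Googlebot requested that came back 404, which is a good starting list of dead links to hunt down and remove.

```python
import re
from collections import Counter

# Same combined-log pattern as before: capture the requested path and the HTTP status code.
LOG_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def googlebot_404s(lines):
    """Count, per URL path, the 404 responses served to requests mentioning Googlebot."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LOG_RE.search(line)
        if m and m.group("status") == "404":
            counts[m.group("path")] += 1
    return counts
```

A URL that appears here repeatedly is almost certainly still linked from somewhere — your own pages, your sitemap, or an external site — and is worth either fixing the link or adding a 301 redirect.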