Every Wednesday from now on (until I run out of subjects!), I shall be addressing a search engine optimisation topic under the imaginative title of “SEO Wednesdays”. Today: What’s your robots.txt up to?
First of all, what the heck is a robots.txt file? Well, search engines such as Google keep up to date with your website by visiting your domain and running a program – variously called a crawler, spider or robot – which follows all the internal links and hoovers up everything it can find. How often the search engines visit your site to do this depends on how often they’ve found changes in the past (which is another reason to add stuff to your site frequently).
The first thing the crawler looks for is the existence of two files: a “sitemap” and/or a “robots.txt file”. The sitemap is an optional list you can make of everything on your website. This is a mutually beneficial resource which lists all the pages you’d like the search engine to find and index. The robots.txt file can contain a number of instructions for the search engines, but its main one is to do the opposite of the sitemap: to tell the search engines which pages you don’t want them to spend time examining.
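To make that concrete, here's what a simple robots.txt file might look like (the paths and domain are invented for the example, not taken from any real site):

```
User-agent: *
Disallow: /admin/
Disallow: /search-results/

Sitemap: http://www.example.com/sitemap.xml
```

The "User-agent: *" line means the rules that follow apply to all crawlers; each "Disallow" line names a path you're asking them to skip; and the optional "Sitemap" line tells them where to find your sitemap.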
Note that the robots.txt file is not designed to keep pages, documents and images out of the search engine results – there are other ways to do that. It’s just designed to guide the search engine crawlers, which will have a limited time on your site, not to waste time on certain pages. If Google finds an external link to a page on your website, it will probably include that page in its results, even if the page is “disallowed” in the site’s robots.txt file. An external link can even be created by somebody clicking on Google’s “+1” button when visiting the page.
Now, at this point, you might be saying: “That all sounds a bit pointless, and anyway, I just want everything on my website indexed – so do I need this?” The answer is probably no. Google specifically says: “If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one).” However, what you should do is check that you don’t have one sitting there and stopping search engines from visiting pages on your site which you would like indexed. I’ve seen this on more than one occasion. One reason for it happening was that during the website’s initial development, the designers didn’t want the search engines visiting pages which hadn’t been completed, so they put in a robots.txt file to stop this from happening. Unfortunately, they didn’t remember to delete the file afterwards.
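A leftover development file of that kind typically looks like this – just two lines, which between them ask every crawler to stay away from the entire site:

```
User-agent: *
Disallow: /
```

A lone “/” after “Disallow:” matches every path on the domain, so if you find this on a live site, it’s almost certainly blocking pages you want indexed.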
The best way to examine your website’s robots.txt file is to use Google Webmaster Tools (see screenshot below). There you can see whether the file exists, whether the search engines understood it, and get an indication of what it’s doing. You can even test out changes. However, the simplest way of all is just to look at the file (if it exists) in your web browser. The URL is simply:
http://[your domain name]/robots.txt
If you see any “disallow” lines, try to work out what they mean, and if their effect is important to you. If the file is disallowing access to parts of your website which seem to be active, get someone to take a closer look at what’s going on. Insider Programme members are welcome to email me for any advice.
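If you’d rather not puzzle out the disallow lines by hand, Python’s standard library includes a parser for robots.txt rules, and a short script can tell you whether a given URL is blocked. The rules and URLs below are invented for illustration:

```python
# Sketch: check which URLs a set of robots.txt rules would block,
# using Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

# Example rules (made up for this illustration)
rules = """\
User-agent: *
Disallow: /drafts/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) returns True if the URL is allowed
print(parser.can_fetch("*", "http://example.com/about/"))    # allowed
print(parser.can_fetch("*", "http://example.com/drafts/x"))  # disallowed
```

You can also point the parser at a live file with `set_url()` followed by `read()`, which fetches the robots.txt over the network instead of parsing a local string.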