How many pages are there on your website?

How many pages are there on your website? Are you presenting them all to search engines? Are you hiding any from visitors? It’s worth auditing what you’ve got, and what’s accessible.

There are four lists of website pages that I like to compile and compare. The first should be the definitive list – an export from your content management system. I assume any CMS can do this, although some may only provide it in a slightly geeky format designed for export/import purposes, so it may take a bit of effort to extract just the page URLs.

The second list is a ‘crawl’ of the site, following all the links, just as a search engine would do. The pro tool for this is Screaming Frog, although its free version is limited to 500 URLs; otherwise, the evergreen Xenu’s Link Sleuth is still available and very effective.
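
If you’re curious what a crawl actually involves, here is a minimal sketch in Python, not a replacement for either tool. It starts at one page, collects every link on it, and repeats for each new page on the same host. The names `crawl`, `LinkParser` and the `fetch` argument are my own for illustration; `fetch` is assumed to be any function that returns a page’s HTML, or `None` if the page can’t be loaded.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, limit=500):
    """Breadth-first crawl of one host.

    `fetch(url)` is a hypothetical callable supplied by the caller; it
    should return the page's HTML as a string, or None on failure.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # linked to, but could not be loaded
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen and len(seen) < limit:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

Note that a page which is linked to but fails to load still ends up in the result, which is exactly the sort of thing you want the crawl list to expose. A real crawler would also respect robots.txt, follow redirects and pace its requests; this sketch does none of that.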

List number three is your ‘sitemap’ – a file which is usually at yoursite.com/sitemap.xml (if you can’t find it, see if its location is listed in yoursite.com/robots.txt). This is the index you’re providing for search engines, and is something that every site should have.
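
Getting the URLs out of a sitemap by hand is tedious, so here is a rough Python sketch of one way to do it, assuming the file follows the standard sitemaps.org format and that you have already downloaded its contents as text. The function names `sitemap_urls` and `sitemap_locations` are mine, invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value from a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def sitemap_locations(robots_text):
    """Return any sitemap locations declared in the text of a robots.txt file."""
    locations = []
    for line in robots_text.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            locations.append(value.strip())
    return locations
```

If the sitemap turns out to be an index pointing at several smaller sitemaps, the same function returns their locations, and each of those would need fetching and parsing in turn.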

Finally, a useful list is that of the pages which people have been looking at, from your website visitor analytics. Set the date range to at least a year, then export all the pages viewed. You won’t need any of the view data, just the URLs.

Once you’ve got all of these, put them side by side in a spreadsheet in alphabetical order, and play spot the difference. Then try to work out why the differences exist. Maybe your sitemap is out of date. Maybe you have pages which can’t be reached through conventional crawling or site navigation, which people have nevertheless accessed directly. Maybe some pages appear many times over in a list because they’ve been displaying with various ‘query strings’ after the URL – it’s worth working out if this is a problem or not.
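
To see how much of a list is really the same page repeated with different query strings, something like the following Python sketch can help. It groups URLs by their address without the query string and shows which query strings each one appeared with. The function names `strip_query` and `query_variants` are made up for this example, and it assumes your list is a plain sequence of URL strings.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def query_variants(urls):
    """Map each base URL to the distinct query strings it appeared with.

    URLs that never carried a query string are left out of the result.
    """
    variants = defaultdict(set)
    for url in urls:
        query = urlsplit(url).query
        if query:
            variants[strip_query(url)].add(query)
    return {base: sorted(queries) for base, queries in variants.items()}
```

It works equally well on the path-only entries that analytics exports often contain, such as `/shop?page=2`. Whether a given variant matters is still a judgement call: a tracking parameter usually doesn’t change the page, whereas a page number or product ID usually does.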

Things you might discover, which need fixing, include:

  • Pages which aren’t in the sitemap;
  • Pages which are in the sitemap but don’t exist;
  • Pages which are in the crawl but don’t exist;
  • Pages which your website analytics says have been accessed that aren’t in the CMS, and maybe shouldn’t exist;
  • Pages which have been accessed that weren’t crawled, and maybe shouldn’t exist;
  • Pages which have never been accessed (why?);
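
If the lists are too long to compare by eye, the checks above can be sketched in a few lines of Python. This is only an illustration: the function name `compare_lists` and the labels in the result are my own, and it assumes the four lists have already been cleaned up so that the same page is written the same way in each (same domain, same trailing slash, query strings dealt with). It treats the CMS export as the definitive list of pages that exist.

```python
def compare_lists(cms, crawled, sitemap, analytics):
    """Return the differences between four collections of page URLs.

    Assumes the CMS export is the definitive list of pages that exist.
    """
    cms, crawled, sitemap, analytics = map(set, (cms, crawled, sitemap, analytics))
    return {
        "missing_from_sitemap": sorted(cms - sitemap),
        "in_sitemap_not_in_cms": sorted(sitemap - cms),
        "crawled_not_in_cms": sorted(crawled - cms),
        "viewed_not_in_cms": sorted(analytics - cms),
        "viewed_not_crawled": sorted(analytics - crawled),
        "never_viewed": sorted(cms - analytics),
    }
```

Each entry in the result is a starting point for investigation rather than a verdict; a page that shows up under “never viewed”, for instance, may simply be newer than your analytics date range.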