Getting Your Webpage Found by the Search Engines
A Radio Bracknell listener wrote in asking for advice, saying that he has a “hidden webpage” on his website and wants to ensure that it is found by the search engines. I dug a little deeper and answered the question during yesterday’s “Business In Berkshire” radio slot.
It transpires that the listener’s webpage isn’t so much “hidden” as simply “not obvious”, but it does contain information that he wants the search engines to find and, hopefully, add to their index.
The first thing to say here is that search engines (and the “bots”, “spiders” and “crawlers” that roam the web doing their work for them) aren’t miracle workers; they are simply algorithms running on big servers somewhere. To put it another way, if you want them to help you (by crawling and indexing your web page), it’s a good idea to help them by making that page reasonably easy to find! If it’s a really important page to you, include a link to it on your homepage. If it’s not that important, put a link to it on a second- or third-level page or similar, but bear in mind that if you can’t find it through the navigation structure you provide, there’s a good chance that the search engines won’t either.
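To make the “can you reach it through your own navigation?” test concrete, here is a minimal sketch in Python. The site structure and page names are entirely made up for illustration, and real crawlers can also discover pages through links from other websites and through sitemaps, so treat it as a rough self-check rather than a model of how any particular search engine works.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/products.html", "/about.html"],
    "/products.html": ["/products/widgets.html"],
    "/about.html": [],
    "/products/widgets.html": [],
    "/special-offer.html": [],  # exists on the server, but nothing links to it
}

def click_depths(links, start="/"):
    """Breadth-first walk from the homepage; returns clicks needed to reach each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

depths = click_depths(SITE_LINKS)
# Pages that exist but cannot be reached by following links from the homepage.
unreachable = sorted(set(SITE_LINKS) - set(depths))

print(depths)
print(unreachable)
```

Any page that ends up in the unreachable list is effectively “hidden” from a crawler that only follows your internal links.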
You can go further than that, though, and provide a sitemap to help the search engines too. Don’t mistake this for a sitemap designed to help human visitors: the one I’m talking about is a machine-readable file designed for search engines. By convention it’s called sitemap.xml and sits in the root directory of your website, i.e. the same directory as your homepage. If you want to read all about sitemaps, here’s a useful URL: http://www.sitemaps.org/protocol.php
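If you’d rather not write the file by hand, a short script can produce a basic sitemap.xml. The sketch below is only one way of doing it: it uses Python’s standard library, the example.com addresses are placeholders, and it includes only the required “loc” entry for each page, leaving out the optional fields described in the protocol.

```python
import xml.etree.ElementTree as ET

# Namespace given in the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return the text of a minimal sitemap.xml listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Placeholder addresses; substitute your own pages.
pages = [
    "https://www.example.com/",
    "https://www.example.com/special-offer.html",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Save the output as sitemap.xml in your root directory, next to your homepage.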
But what if you don’t want the search engines to crawl a certain area of your website? One way is to create a file called robots.txt and put it in the root directory of your website, alongside your sitemap.xml and homepage. Using robots.txt you can manually specify certain directories, and even individual files, that you do or do not want to be crawled. You can learn more about robots.txt here: http://www.robotstxt.org/robotstxt.html. This works well for the ‘respectable’ search engines.
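Before you upload a robots.txt it’s worth checking that it says what you think it says. Here is a small sketch using the robots.txt parser that ships with Python. The rules, the paths and the “ExampleBot” name are all invented for illustration, and the result only tells you what a well-behaved crawler should do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one directory and one single file for all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/notes.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(path, agent="ExampleBot"):
    """True if a crawler that obeys robots.txt is allowed to fetch this path."""
    return parser.can_fetch(agent, "https://www.example.com" + path)

for path in ["/about.html", "/private/report.html", "/drafts/notes.html"]:
    print(path, may_crawl(path))
```

With these rules the first path is allowed and the other two are blocked.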
The only problem, however, is that following robots.txt is optional, not mandatory, and any spider that fancies ignoring it can and will. Indeed, some rogue spiders may even look for things that are expressly disallowed and then try to find ways to crawl them, on the basis that they may be more interesting! If you have sensitive data you need to take steps to protect it, and that is outside the scope of this article. Phone me, though, and I’ll try to help.
You can also try to stop individual webpages being indexed, once the bot gets there, by including a special instruction in the page’s meta-data, up in the HEAD where humans don’t see it. It looks like this:
<meta name="robots" content="noindex">
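If you want to confirm that a page really carries that instruction, you can check it with a few lines of code. This is a rough sketch using Python’s built-in HTML parser. The sample page is made up, and the check only looks for a “robots” meta tag, not for instructions aimed at one specific search engine.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())

def is_noindex(html):
    """True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

# Made-up page for illustration.
SAMPLE_PAGE = """<html><head>
<title>Staff rota</title>
<meta name="robots" content="noindex">
</head><body><p>Internal page.</p></body></html>"""

print(is_noindex(SAMPLE_PAGE))
print(is_noindex("<html><head><title>Home</title></head></html>"))
```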
Sounds complex? It can be, just like robots.txt, especially if you get it wrong, but 99% of webmasters get by without knowing about things like this.
Finally, it’s important to realise that just because a spider has crawled your website (i.e. wandered around and taken a look at the contents), it doesn’t mean that your pages will appear in that search engine’s index (a decision made later), let alone be retrieved in the SERPs (search engine results pages, i.e. what people see after typing in their query). And this, of course, is what the vast majority of website owners want: to be found.
The inter-relationships here, and ways to manipulate them, are complex at best and far outside the scope of a book let alone this article, so if you need help with your Search Engine Optimisation (SEO) or Search Engine Local Optimisation get in touch and we’ll be happy to help you out.
If you have enjoyed this article and would like to share it with others, please use the social media links below. If you want to be notified of more articles like it when they are published, please register using the links on the right. And if you have any comments, please use the comment facility provided.