Fundamental to ensuring that your web site appears in the search engine listings is making sure that search engine spiders can successfully crawl it. After all, if the spider can't reach your pages, the search engine can't include them in its listings.
Unfortunately, many web sites use technologies or architectures that make them hostile to search engine spiders. A search engine spider is really just an automated web browser that must interpret your page's underlying HTML code, just as a regular browser does.
But search engine spiders are surprisingly unsophisticated web browsers. The most advanced spiders are arguably the equivalent of a version 2.0 web browser. That means the spider can't understand many web technologies and can't read parts of your web page. That's especially damaging if those parts include some or all of your page's links. If a spider can't read your links, then it won't crawl your site.
As a search engine marketing consultant, I'm often asked to evaluate new web sites soon after their launch. Search engine optimization is often neglected during the design process, when designers are focused on navigation, usability, and branding. As a result, many sites launch with built-in problems, and it's much harder to correct these issues once the site is complete.
Yet it's often only when their site fails to appear in the search engine listings that many companies call in an SEO.
That's a shame, because for small businesses, search engines are by far the most important source of traffic. Fully 85% of Internet users find sites through search engines. A web site that isn't friendly to search engines loses much of its value.
In this article, I'll give an overview of the issues that can keep a search engine spider from indexing your site. This list is by no means exhaustive, but it will highlight the most common issues that will keep spiders from crawling your pages.
JavaScript Links
JavaScript is a wonderful technology, but it's invisible to all of the search engines. If you use JavaScript to control your site's navigation, spiders may have serious problems crawling your site.
Links contained in your JavaScript code are likely to be ignored by search engine spiders. That's especially true if your script builds links by combining several script variables into a fully formed URL.
For example, suppose you have the following script that sends the browser to a specific page in your site:
<script language="JavaScript">
function goToPage(page) {
  window.location = "http://www.mysite.com" + page + "?tracking=" + trackingCode;
}
</script>
This script uses a function called goToPage() to add a tracking code onto the end of the URL before sending visitors to the page.
I've seen sites where every link on every page ran through a JavaScript function like this one. In some cases the JavaScript was used to append a tracking code; in others it was used to send users to different domains depending on the page. But in every one of these cases, the site's home page was the only page listed in the search engines.
None of the spiders includes a JavaScript parsing engine that would allow it to interpret this type of link. Even if a spider could interpret the script, it would be difficult for it to simulate the mouse clicks that trigger goToPage() with different values of the page variable, and it has no way of knowing what value to use for trackingCode.
Instead, spiders will either ignore the contents of your SCRIPT tag entirely or read the script content as if it were visible text.
As a rule of thumb, it's best to avoid JavaScript navigation.
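One workaround, sketched below, is to keep a standard A HREF link in the page and let JavaScript add the tracking only when it actually runs. The page name and trackingCode value here are made up for illustration:

<script language="JavaScript">
var trackingCode = "12345"; // hypothetical tracking value
function trackAndGo(url) {
  // append the tracking code, then send the visitor on their way
  window.location = url + "?tracking=" + trackingCode;
  return false; // cancel the default link behavior in JavaScript-capable browsers
}
</script>

<a href="/products.html" onclick="return trackAndGo(this.href);">Products</a>

A spider (or any browser with JavaScript turned off) ignores the onclick handler and simply follows the plain href, so the page still gets crawled; visitors with JavaScript get the tracked URL.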
DHTML Menus
DHTML drop-down menus are extremely popular for site navigation. Unfortunately, they're also hostile to search engine spiders, since the spiders again have problems finding links in the JavaScript code used to create these menus.
DHTML menus have the added problem that their code is often placed in external JavaScript files. While there are good reasons to put your script code into external files, some spiders won't fetch these pure JavaScript files.
If you use DHTML menus on your site and want to see the effect they have on search engines, try turning JavaScript off in your browser. The drop-down portions of your menus will disappear, and there's a good chance the top-level menus will disappear too. Yikes! Suddenly most of the pages in your site are unreachable. And that's exactly how they appear to the search engines.
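One common workaround is to repeat the menu's top-level links as ordinary HTML somewhere on the page, such as a footer, so there's always a crawlable path even when the script never runs. A minimal sketch, with made-up page names:

<!-- plain text links in the page footer, for spiders and for
     visitors who have JavaScript turned off -->
<p>
  <a href="/products.html">Products</a> |
  <a href="/services.html">Services</a> |
  <a href="/about.html">About Us</a> |
  <a href="/contact.html">Contact</a>
</p>

The DHTML menu still works for everyone else; the footer links simply give the spider a second route to the same pages.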
Query Strings
If you have a database-driven site that uses server-side technologies such as ASP, PHP, Cold Fusion, or JSP, there's a good chance your URLs include a query string. For example, you might have a URL like this one:
http://www.mysite.com/catalog.asp?item=320&category=23
That's a problem, because many search engine spiders won't follow links that include a query string. This is true even if the page that the link points to contains nothing but standard HTML. The URL itself is a barrier to the spider.
Why? Most search engines have made a conscious design decision not to follow query string links, because such links require additional record keeping by the spider. Spiders keep a list of all the pages they've crawled and try to avoid indexing the same page twice in a single crawl. They do this by comparing every new URL to the list of URLs they've already seen.
Now, suppose the spider sees a URL like this one in your site:
http://www.mysite.com/catalog.asp?category=23&item=320
This URL leads to the same page as our first query string URL, even though the two URLs aren't identical (notice that the name/value pairs in the query string appear in a different order).
To recognize that this URL leads to the same page, the spider would have to decompose the query string and store each name/value pair. Then, each time it saw a URL with the same root page, it would need to compare the name/value pairs in that query string to all the ones it already has on file.
Keep in mind that our example query string is fairly short. I've seen query strings that are 200 characters long, and reference a dozen different name/value pairs.
So indexing query string pages can mean a great deal of extra work for the spider.
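To get a feel for that extra work, here's a rough JavaScript sketch (purely illustrative, not anything a real spider uses) of the bookkeeping needed just to recognize our two catalog URLs as the same page:

// decompose a URL's query string, sort the name/value pairs,
// and rebuild a canonical form that can be compared against the
// spider's list of URLs it has already seen
function normalizeURL(url) {
  var parts = url.split("?");
  if (parts.length < 2) return url;   // no query string: nothing to do
  var pairs = parts[1].split("&");
  pairs.sort();                       // order no longer matters
  return parts[0] + "?" + pairs.join("&");
}

// both of our example URLs reduce to the same canonical string
normalizeURL("http://www.mysite.com/catalog.asp?item=320&category=23");
normalizeURL("http://www.mysite.com/catalog.asp?category=23&item=320");

Multiply that comparison by every query string URL the spider has ever seen, and it's easy to understand why many of them simply don't bother.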
Some spiders, such as Googlebot, will handle URLs with a limited number of name/value pairs in the query string. Other spiders will ignore all URLs containing query strings.
Flash
Flash is cool; in fact, it's much cooler than HTML. It's dynamic and cutting edge. Unfortunately, search engine spiders use trailing-edge technology. Remember: a search engine spider is roughly equivalent to a version 2.0 web browser. Spiders simply can't interpret newer technologies such as Flash.
So even though that Flash animation may amaze your visitors, it's invisible to the search engines. If you're using Flash to add a bit of spice to your site, but most of your pages are written in standard HTML, this shouldn't be a problem.
But if you've created your entire site using Flash, you've got a serious problem getting your site into the engines.
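If you do use Flash for key content, one hedge is to put ordinary HTML text and links inside the OBJECT tag as alternate content; browsers that can play the movie ignore it, while spiders and Flash-less browsers read the HTML instead. A rough sketch, with a made-up movie name and page names:

<object type="application/x-shockwave-flash" data="intro.swf"
        width="600" height="400">
  <!-- alternate content for spiders and browsers without the Flash plug-in -->
  <p>Welcome to MySite. Read <a href="/about.html">about our company</a>
     or browse our <a href="/products.html">product catalog</a>.</p>
</object>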
Frames
Did I mention that search engine spiders are low-tech? That's right: they're so low-tech they don't understand frames either. If you use frames, a search engine will be able to crawl your home page, which contains the FRAMESET tags, but it won't be able to reach the individual frame pages that make up the rest of your site.
In this case, at least, you can work around the problem by including a NOFRAMES section on your home page. This section of your page will be invisible to anyone using a frames-capable browser, but allows you to place content that is visible to the search engines and other frame-blind browsers.
If you do include a NOFRAMES section, be sure to put real content in it. At a minimum, you should include standard hypertext links (A HREF) pointing to your individual frame pages.
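A minimal sketch, with hypothetical frame page names:

<frameset cols="200,*">
  <frame src="navigation.html" name="nav">
  <frame src="content.html" name="main">
  <noframes>
    <body>
      <p>MySite sells widgets, gadgets, and gizmos.</p>
      <p><a href="navigation.html">Site navigation</a> |
         <a href="content.html">Main content</a></p>
    </body>
  </noframes>
</frameset>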
It's surprising how often people include a NOFRAMES section that simply says "This site requires frames. Please upgrade your browser." If you'd like to try an experiment, run a query on Google for the phrase "requires frames." You'll see somewhere in the neighborhood of 160,000 pages returned, all of which include the text "this site requires frames." Each of these sites has limited search engine visibility.
With www or without www?
The address of this web site is www.keyrelevance.com, but can people reach it if they leave off the initial "www"? For most server configurations the answer is yes, but for some the answer is no. Make sure your site resolves gracefully both with and without the www.
This list presents some of the most common reasons why a search engine may not be indexing your site. Other factors, such as the way you structure the hierarchy of your web pages, will also affect how much of your site a spider will crawl.
Each of these problems has a solution, and in future articles I'll expand on each one to help you get more of your pages indexed.
If you're currently redesigning your web site, I'd encourage you to consider these issues before the site goes live. While each of these search engine barriers can be removed, it's better to start with a search engine friendly design than to fix hundreds of pages after launch.