Learn IT Know IT Use IT: Turning Over Search Engines

We've all been there. You've been given an assignment you need help with, or maybe you're researching a new product or looking up a competitor's spec sheet. Regardless of the reason, at some point you will opt for the dreaded Internet search engine only to find ... more frustration. Internet search engines all work differently and provide varying results. Learning how they work can help you choose the right path and be more productive in your effort.

How Do Search Engines Work?

The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways.

Crawler-based search engines. Crawler-based search engines, such as HotBot, create their listings automatically. They "crawl" or "spider" the Web and build their listings from what they find. If a webpage changes, crawler-based search engines eventually find the change, and that can affect how the page is listed. Page titles, body copy, and other elements all play a role.

Human-powered directories. A human-powered directory, such as Yahoo, depends on humans for its listings. A short description is submitted to the directory for an entire site, or editors write one for sites they review. A search looks for matches only in the descriptions submitted. Changing a webpage has no effect on the listing. The only exception is that a good site, with good content, might be more likely to get reviewed for free than a poor site.

"Hybrid search engines" or mixed results. In the Web's early days, a search engine either presented crawler-based results or human-powered listings. Today, a hybrid approach is common. Usually, a hybrid search engine will favor one type of listing over another.

The Parts of a Crawler-Based Search Engine

Crawler-based search engines have three major elements. First is the spider, also called the crawler. The spider visits a webpage, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes.
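To make the spider's job concrete, here is a minimal sketch in Python. It is an illustration, not how any particular search engine is built: the three-page site lives in a dictionary named SITE (a made-up stand-in for real Web pages, so nothing is fetched over the network), and the names LinkCollector and crawl are invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical three-page site held in memory so the sketch runs offline.
SITE = {
    "/index.html": '<title>Home</title><a href="/about.html">About</a> '
                   '<a href="/products.html">Products</a>',
    "/about.html": '<title>About</title><a href="/index.html">Home</a>',
    "/products.html": '<title>Products</title><a href="/about.html">About</a> '
                      '<a href="http://example.com/">Partner</a>',
}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Visit a page, read it, and follow its links to other pages in the site."""
    visited = []
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in visited or url not in SITE:
            continue  # already read, or a link that leads outside the site
        visited.append(url)
        collector = LinkCollector()
        collector.feed(SITE[url])      # "read" the page
        queue.extend(collector.links)  # follow its links later
    return visited


pages = crawl("/index.html")
print(pages)  # ['/index.html', '/about.html', '/products.html']
```

A real spider would also have to fetch pages over the network, respect each site's crawling rules, and schedule its return visits; the sketch leaves all of that out.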

Everything the spider finds goes into the second part of the search engine: the index. The index, sometimes called the catalog, is like a giant book containing a copy of every webpage the spider finds. If a webpage changes, then this book is updated with the new information.
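The "giant book" can be sketched in a few lines of Python. The version below keeps a copy of each page and, as an assumption of this example rather than something described above, also keeps a word-to-pages lookup table so that matches can be found quickly. The class name Index and the sample pages are invented for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    def __init__(self):
        self.pages = {}     # url -> stored copy of the page text
        self.postings = {}  # word -> set of urls containing that word

    def add(self, url, text):
        """Add a page, or replace the stored copy if the page has changed."""
        if url in self.pages:
            # Remove the old copy's words before recording the new ones.
            for word in set(tokenize(self.pages[url])):
                self.postings[word].discard(url)
        self.pages[url] = text
        for word in set(tokenize(text)):
            self.postings.setdefault(word, set()).add(url)

    def lookup(self, word):
        """Return the pages that contain the given word."""
        return sorted(self.postings.get(word.lower(), set()))


index = Index()
index.add("/a.html", "Used cars for sale")
index.add("/b.html", "Travel deals and rental cars")
before = index.lookup("cars")
print(before)  # ['/a.html', '/b.html']

# The spider returns and finds that /a.html has changed.
index.add("/a.html", "Used trucks for sale")
after = index.lookup("cars")
print(after)   # ['/b.html']
```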

Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. Thus, a webpage may have been "spidered" but not yet "indexed." Until it is indexed (added to the index), it is not available to those searching with the search engine.

Search engine software is the third part of a search engine. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.
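As a rough illustration of what the search software does, the Python sketch below finds every indexed page that matches a query and ranks the matches. The scoring rule (count how often the query words appear) is a deliberately simple assumption made for this example; real engines weigh many more signals and keep their formulas to themselves. The INDEX contents and the function names are hypothetical.

```python
import re

# Hypothetical index contents: url -> page text.
INDEX = {
    "/cars.html": "Cars, cars, and more cars: reviews of new cars",
    "/travel.html": "Travel by train or rent cars abroad",
    "/recipes.html": "Soup recipes for winter",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, index):
    """Return matching pages, most relevant first."""
    terms = tokenize(query)
    scored = []
    for url, text in index.items():
        words = tokenize(text)
        # Assumed relevance score: total occurrences of the query words.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken alphabetically by url.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for score, url in scored]


results = search("cars", INDEX)
print(results)  # ['/cars.html', '/travel.html']
```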

Around the Web: The Best Search Engines

AltaVista. AltaVista is one of the largest search engines on the Web, in terms of pages indexed. Its comprehensive coverage and wide range of searching commands make it a particular favorite among researchers. In addition to crawler-based webpage matches, it also offers news search, shopping search, multimedia search, and human-powered directory results.

Excite. Excite offers a medium-sized crawler-based webpage index, as well as access to human-powered directory results.

Google. This is my favorite. Google is a search engine that makes heavy use of link popularity as a primary way to rank websites. This can be especially helpful in finding good sites in response to general searches such as "cars" and "travel," because users across the Web have in essence voted for good sites by linking to them. The system works so well that Google has gained widespread praise for its high relevancy. Google also has a huge index of the Web and provides some results to Yahoo and Netscape Search.
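The "voting" idea can be sketched in Python as a simple count of inbound links. This is only an illustration of link popularity in general and is not Google's actual ranking method, which is more elaborate. The link graph and the function names are made up for the example.

```python
# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "a.example": ["cars.example", "travel.example"],
    "b.example": ["cars.example"],
    "c.example": ["cars.example", "c.example"],
    "cars.example": ["travel.example"],
}


def link_popularity(links):
    """Count one vote for a page from each other page that links to it."""
    votes = {}
    for source, targets in links.items():
        for target in set(targets):  # repeated links from one page count once
            if target != source:     # a page cannot vote for itself
                votes[target] = votes.get(target, 0) + 1
    return votes


def rank_by_popularity(candidates, links):
    """Order candidate pages by votes, most-linked first."""
    votes = link_popularity(links)
    return sorted(candidates, key=lambda page: (-votes.get(page, 0), page))


votes = link_popularity(LINKS)
ranking = rank_by_popularity(
    ["travel.example", "cars.example", "c.example"], LINKS
)
print(votes["cars.example"], votes["travel.example"])  # 3 2
print(ranking)  # ['cars.example', 'travel.example', 'c.example']
```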

Northern Light. Northern Light is another favorite search engine among researchers. It features a large index of the Web, along with the ability to cluster documents by topic. Northern Light also has a set of "special collection" documents that are not readily accessible to search engine spiders. There are documents from thousands of sources, including newswires, magazines, and databases. Searching these documents is free, but there is a charge of up to $4 to view them.

Yahoo. An ancient seven years old, Yahoo is the Web's most popular search service and has a well-deserved reputation for helping people find information easily. The secret to Yahoo's success is human beings. It is the largest human-compiled guide to the Web, with well over 1 million sites listed. Yahoo also supplements its results with those from Google (as of July 2001) if an initial search fails.