The truth is out there – if you can only find it. Jack Schofield looks at the crawlers and indexers which trawl the Web for us
Are you using a search engine that’s ignoring 97,5% of the pages on the World Wide Web? Would you be shocked to discover that even the best search site fails to index about 84% of all the pages of information available?
The figures, produced by Dr Steve Lawrence and Dr C Lee Giles for the NEC Research Institute and published recently in the scientific journal Nature, shocked many users. If the Web currently has about 800-million pages, as the researchers estimate, then AltaVista is indexing only 15,5%, HotBot 11,3%, Yahoo! 7,4%, Excite 5,6%, and Lycos a feeble 2,5%.
The search engine with the best coverage is the little-known Northern Light, which still covers only 16% of the Web. (Alltheweb, a relatively new search engine, was not included in the research.) However, Northern Light also had the highest proportion of invalid links, at 9,8%, compared with HotBot’s 2,2%.
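As a rough guide to what those percentages mean in absolute terms – assuming the researchers’ estimate of about 800-million pages is right – the coverage figures work out to page counts like this (a simple back-of-the-envelope calculation, not figures from the study itself):

```python
# Turning the Nature study's coverage percentages into rough page counts,
# assuming the researchers' estimate of about 800-million Web pages.
WEB_PAGES = 800_000_000

coverage = {            # percentages reported in the study
    "Northern Light": 16.0,
    "AltaVista": 15.5,
    "HotBot": 11.3,
    "Yahoo!": 7.4,
    "Excite": 5.6,
    "Lycos": 2.5,
}

for engine, pct in coverage.items():
    print(f"{engine}: roughly {int(WEB_PAGES * pct / 100):,} pages indexed")
# AltaVista, for example, works out to roughly 124-million pages.
```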
Even combining the indexes of all the major search engines – which is what “metasearch engines” like Mamma and Dogpile do – covers only 42% of the Web, though this now looks like the best way to search for less popular sites. NEC, the Japanese electronics giant behind the United States-based NEC Research Institute, is developing a metasearch engine called Inquirus.
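The metasearch idea itself is simple: send one query to several engines and merge whatever comes back, dropping duplicates. A toy sketch of that idea is below – it is not how Mamma, Dogpile or Inquirus actually work, and the stubbed query function and engine names are invented for illustration:

```python
# A toy sketch of the metasearch idea: send one query to several engines
# and merge the result lists, dropping duplicates. Illustrative only --
# fetch_results() is a stub standing in for real per-engine queries.

def fetch_results(engine, query):
    """Return a list of result URLs from one engine (stubbed with fake data)."""
    canned = {
        "engine_a": ["http://siteA.example.com/", "http://siteC.example.com/"],
        "engine_b": ["http://siteC.example.com/", "http://siteB.example.com/"],
    }
    return canned.get(engine, [])

def metasearch(query, engines=("engine_a", "engine_b")):
    """Combine results from all engines, keeping first-seen order."""
    seen, merged = set(), []
    for engine in engines:
        for url in fetch_results(engine, query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("hotmail"))
# ['http://siteA.example.com/', 'http://siteC.example.com/', 'http://siteB.example.com/']
```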
The authors say: “Larger indexes cost more to create, and slow the response time on average. There are diminishing returns to indexing all of the Web, because most queries made to the search engines can be satisfied with a relatively small database. Search engines might generate more revenues by putting resources into other areas [for example, free e-mail].”
Over the past two years, many sites that became popular by offering search facilities – Infoseek, Lycos, Yahoo! – have put more effort into becoming “portals” or Web gateways by adding personalisation and other features. Ironically, sites designed as portals may now be better places to search. Snap! – started by Halsey Minor, who first conceptualised the portal idea – covers 15,5% of the Web, while the Microsoft Network portal, which was originally going to be called Start, manages 8,5%.
Search engine expert Danny Sullivan of Calafia Consulting says: “Being bigger isn’t necessarily better. Users want relevancy, not just another 100-million documents to drive them nuts when they do searches.”
Sullivan – who edits two newsletters about search engines and runs a website, SearchEngineWatch – says one of the most popular searches is for the word “Hotmail”, and a lot of people search on Hotmail’s address itself! “There are a lot of new users out there who don’t understand how the Web works,” he says, “and even when they know the address of a site, they don’t know how to enter the information.”
Also, rather than searching for obscure information, many people are using search engines just to find sites for companies, says Sullivan.
“That’s why Google.com is becoming so popular: if you just want to look up sites, it finds them for you,” he says. Google does well on “relevancy” by ranking sites according to the number of links that point to them from other sites, though there are other approaches, such as DirectHit and Centraal’s Real Names.
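In outline, link-based ranking means scoring each page by how many other pages point to it, and listing the most-linked pages first. The sketch below illustrates that idea only – it is far cruder than Google’s actual PageRank algorithm, and the link graph is invented:

```python
# A toy illustration of link-based ranking: score each page by how many
# other pages link to it, then list the most-linked pages first. Far
# cruder than Google's real PageRank; the link graph below is invented.
from collections import Counter

# page -> pages it links to (hypothetical data)
links = {
    "siteA.example.com": ["siteC.example.com", "siteD.example.com"],
    "siteB.example.com": ["siteC.example.com"],
    "siteD.example.com": ["siteC.example.com", "siteA.example.com"],
}

inbound = Counter(target for targets in links.values() for target in targets)
ranking = sorted(inbound, key=inbound.get, reverse=True)

print(ranking)  # siteC.example.com comes first, with three inbound links
```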
Most search engine companies have a “minister of defence” who is trying to keep a lot of Web addresses out of the database, says Sullivan. Someone running pornographic sites once submitted a million addresses to AltaVista, and an online bookstore with 700 000 books could easily have 700 000 addresses, one for each book. “The search engines are under a great deal of pressure from people who are trying to manipulate them,” he adds. “There will always be a group of people who try to take advantage of them, and push the limits as far as they can.”
Carl Howe, research director with Forrester Research in Cambridge, Massachusetts, says another problem is that search engine “spiders” – the software robots that “crawl” the Web to find pages – “consume bandwidth”. Companies with commercial websites pay for the bandwidth (the amount of data traffic) their sites consume. More search engines using more robots to search more websites more often would create better indexes, “but I don’t think people would put up with the bandwidth consumption”, he says. Some sites already ban spiders.
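The usual way a site bans spiders is the robots exclusion convention: a robots.txt file at the site’s root that well-behaved crawlers check before fetching anything. A minimal sketch of a polite spider doing that check is below; the site address and spider name are hypothetical:

```python
# A minimal sketch of how a polite spider honours the robots exclusion
# convention (a robots.txt file at the site's root) before fetching pages.
# The site address and user-agent name here are hypothetical.
from urllib.robotparser import RobotFileParser

SITE = "http://www.example.com"

robots = RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

page = SITE + "/catalogue/index.html"
if robots.can_fetch("ExampleSpider/1.0", page):
    print("Allowed to crawl", page)
else:
    print("This site bans our spider from", page)
```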
“I think we’re running into a fundamental problem, which is that the more human knowledge spans the Web, the less likely we are to be able to find things,” says Howe. “Perhaps it’s better just to accept that there’s going to be specialisation. Perhaps we should think about dividing the Web up between search engines, by subject matter, by target audience, by content or whatever. Why should Americans index European sites? Why not have a European search engine do it?”
That’s an approach hundreds of specialised search engine providers have already adopted. Ed Burt, managing director of SearchUK, says: “Ever since we launched, we’ve been saying that attempts to provide a global index are doomed to fail. The Web will always grow faster than a global search engine, but if you only do, in our case, United Kingdom sites, the situation suddenly becomes a damn sight easier to manage.”
SearchUK indexes sites with a .uk or similar address, and already covers more than six million pages. What about UK sites with American .com or .net addresses? “We check them by hand,” says Burt.
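A crawler scoped to one national domain can do most of that filtering with a simple hostname check, along the lines of the sketch below – not SearchUK’s actual code, with an invented whitelist standing in for the .com and .net sites Burt says are checked by hand:

```python
# A sketch of domain-scoped crawling in the spirit of SearchUK: keep URLs
# whose hostname ends in .uk, plus a hand-maintained list standing in for
# the .com and .net UK sites checked by hand. All entries are invented.
from urllib.parse import urlparse

HAND_CHECKED = {"www.example-ukfirm.com", "www.example-ukshop.net"}

def in_scope(url):
    host = urlparse(url).hostname or ""
    return host.endswith(".uk") or host in HAND_CHECKED

print(in_scope("http://www.bbc.co.uk/news/"))        # True: a .uk address
print(in_scope("http://www.example-ukshop.net/"))    # True: hand-checked
print(in_scope("http://www.whitehouse.gov/"))        # False: out of scope
```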
But he is wary of the idea of trying to index everything, even within a limited domain. “That may be a bit of a misunderstanding of the role of a search engine,” Burt says. “Our job is to point you in the right direction. It may be that we can get you to the shop, but after that you have to use the shop’s own search engine to find things within the site.”
As the Nature article says: “It is not necessary for a page to be indexed for it to be accessible: a suitable query might locate another page that links to the desired page.”
Barry Parr, California-based director of the Internet and e-commerce programme at International Data Corporation, an industry research company, says he was startled to find that the leading search engine covered only 16% of the Web “and certainly startled to see it was Northern Light: they must be ecstatic”.
He adds: “But it’s not as alarming as [the Nature article] might lead you to believe. When good information becomes available, people share it very quickly in the old-fashioned way, by ‘word of mouth’, and the Internet accelerates that process. If something sparks people’s imagination, it’s pretty hard to keep a lid on it. Look at the Dancing Hamster site: probably everybody in the world saw that within a week or two!”
That may work for something humorous, but for commercial suppliers, “building a website is not enough”, says Parr, and they can’t rely on search engines to provide customers.
“The people we talk to spend at least as much promoting their websites as they do building them, and spending tens of millions of dollars is pretty much penny ante to get into the game.”