Nutch: The Free Search Alternative to Google

An open, deployable search algorithm is set to allow many webmasters to launch their own search engines and to bring more transparency to the maturing business

The following article was published before 2021. Our editorial team has since adopted a new mission statement and new editorial standards.

Google has come under increasing criticism for being bombed by self-proclaimed search engine optimizers and other spammers. Doug Cutting, an expert who has been working in the field of information retrieval for over fifteen years, wants to do better. Together with hundreds of developers in the Nutch project, he is working toward a more transparent way of searching with an open source web search engine.

His goal is to open up access to search technologies that are now mostly kept proprietary, to facilitate research, and to improve web searching. As the maintainer of the project, he works from his home office in Silicon Valley and is partly funded by one of Google's main competitors, Yahoo. His prior experience comes from jobs at Xerox PARC, Apple, and Excite, among others. Stefan Krempl asked him to give some details about the upcoming Nutch "revolution" and to show off the early steps of Free Search.

Could you briefly explain how Nutch is basically supposed to work?

Doug Cutting: Nutch is software that one can download in order to deploy a web search engine. After downloading, you need to first specify a few things, like where to start crawling, what domains to crawl or not crawl, etc. Then you run Nutch's crawler for a while. How long depends on what sort of a search site you're trying to build. An intranet or niche search engine might only take a single machine a few hours to crawl, while a whole-web crawl might take many machines several weeks or longer. Once you've crawled then you use Nutch to index the pages you've fetched and launch your search site. The goal is for Nutch to be both easy to use for intranets and niches, while at the same time scaling to complex whole-web deployments.
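The workflow described in this answer (choose seeds, restrict the domains, crawl, index, then serve searches) can be illustrated with a toy sketch. This is not Nutch's code or API: the names `TOY_WEB`, `crawl`, `build_index`, and `search` are hypothetical, and the "web" is a small in-memory dictionary standing in for real HTTP fetching so that the example stays self-contained.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP instead.
TOY_WEB = {
    "http://example.org/": (
        "welcome to the example site",
        ["http://example.org/docs", "http://other.test/"],
    ),
    "http://example.org/docs": (
        "nutch crawl index search docs",
        ["http://example.org/"],
    ),
    "http://other.test/": ("an unrelated site", []),
}


def crawl(seeds, allowed_domains, fetch, max_pages=100):
    """Breadth-first crawl from the seed URLs, restricted to allowed domains."""
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if urlparse(url).hostname not in allowed_domains:
            continue  # outside the configured crawl scope
        result = fetch(url)
        if result is None:
            continue  # page could not be fetched
        text, links = result
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Build an inverted index: word -> set of URLs containing that word."""
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


pages = crawl(["http://example.org/"], {"example.org"}, TOY_WEB.get)
index = build_index(pages)
print(sorted(search(index, "nutch index")))
```

Restricting the crawl to `example.org` corresponds to the niche or intranet case mentioned above; widening `allowed_domains` and raising `max_pages` is where the cost of a whole-web crawl would start to show.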

What does Open Source mean in the context of a search engine? Can anybody join the Nutch development team? Or is it more about producing a transparent search algorithm?

Doug Cutting: Open source permits more folks to launch web search engines, encouraging competition and different viewpoints. It also provides transparency: folks can see how the engine decides to rank pages, reviewing the algorithm for inappropriate bias. As with most open-source projects, Nutch permits anyone to make contributions. These are reviewed by other developers, and, if they are found to have merit, are incorporated into the code. After making a number of high-quality contributions, a developer earns a reputation and can be invited to become a committer who can change the code directly.

If I have full access to the page-ranking technologies, will it be easier for me to push my site higher in Nutch's results? How can Search Engine Optimizers (SEOs) make use of the Nutch principle?

Doug Cutting: I think this is less of an issue than people fear. SEOs already understand ranking algorithms and manipulate sites to make them rank higher. More knowledge of the ranking algorithm won't make this much easier: it's already easy. Sites that are found to be overly-optimized can be penalized for some number of months, just as with commercial engines. What's needed are trusted judgments of result quality, in order to train the ranking algorithm. Overly-optimized pages are one of many problems for a ranking algorithm. One should not focus excessively on this one problem. A well-trained algorithm will show little spam.

Google has tried to keep its search formulas a closely guarded secret. Even so, Google bombing and spamming have become a major sport within the internet community and especially in the blogosphere. How can you prevent unfair manipulation of Nutch's search results?

Doug Cutting: Google's PageRank algorithm was published, so it is not a secret. Google does keep its full ranking formula secret, and perhaps that does help it stop some spam, but, as anyone can see, it does not stop all spam. So how many effective spam-stopping secrets does Google really have? We don't know, but my guess is that most of the methods that Google uses have already been reverse engineered by spammers. Thus, effectively, Google has few ranking secrets. The best anti-spam measures are those that are difficult to defeat even when you know how they work. Links to a site from well-known sites are difficult to spam. Link farms are not that difficult to spot.
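For readers unfamiliar with the published PageRank idea mentioned here, the following is a minimal sketch of its basic form: a page's score is fed by the scores of the pages linking to it, computed by repeated iteration. It is neither Google's full ranking formula nor Nutch's ranking code. The function name `pagerank`, the toy link graph, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages without outgoing links spread their score evenly over all pages.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of the total score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its score over every page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three sites link to "hub", one of them also to "news".
toy_links = {
    "news": ["hub"],
    "blog": ["hub"],
    "wiki": ["hub", "news"],
    "hub": [],
}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page with the most incoming links ends up with the highest score, which reflects the intuition in the answer above that links from other sites are the signal being measured.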

Can Nutch's search algorithm lead to better results than Google's, Microsoft's or Yahoo's? How might it improve web searching in general?

Doug Cutting: In the long term, there's no reason that Nutch's algorithm cannot be competitive with Yahoo's, Microsoft's and Google's algorithms. Once this is the case, it will be more cost-effective for new search engines to use Nutch rather than to develop a proprietary implementation. For example, many search engines used to maintain their own proprietary directories of web sites. Now, with the advent of the Open Directory (http://dmoz.org/), many sites use it instead of employing a staff of directory editors. Nutch can create a similar economy for web search engines. Also, by providing a platform for research, Nutch permits more scientists to make advances in search technology.

The common search engines leave many places in cyberspace in the dark. Will Nutch be able to index more websites than its competitors?

Doug Cutting: Yes, Nutch enables folks to easily deploy niche search engines, searching these dark corners of the web. With tens, hundreds or even thousands of Nutch-powered, specialized web search engines, Nutch should indeed be able to index more than any single proprietary site.

How many sites do you have indexed so far and when will the public demo finally start?

Doug Cutting: We no longer think a big public demo is key to Nutch's success. Rather, the key is to build a lasting developer community. This is done by building something that lots of folks want to use. That is why we are now focusing on niche and intranet search engines. We still want to power whole-web search too, but there are far fewer developers who have the resources to work on that, and that goal alone cannot currently sustain the project. That said, we do intend to operate a large crawl, so that researchers do not have to all replicate this effort. Right now we have machines and hosting. We just need someone to run the operation.

In the end, will Nutch be more like a business-to-business provider for search engine technology or will it mostly work as a search interface itself? Do you want to turn Nutch into a viable business one day?

Doug Cutting: Nutch is not a business. Nutch is a provider of software and a coordinator of software development. Nutch is like the Apache foundation: we have no employees, and have a legal entity (a non-profit corporation) primarily to own the copyright, so that the project is independent from its individual developers.

You have also started the Lucene search project. How is it related to Nutch?

Doug Cutting: Nutch uses Lucene internally to power the search.

How did you get all these web and computer pioneers like Mitch Kapor, Brewster Kahle or Tim O'Reilly -- who are sitting with you on the Nutch board -- interested in Free Search?

Doug Cutting: I sent them email and they liked the idea. I think my background with Lucene gave me credibility.

Google seems to enjoy a kind of monopoly in the search market at the moment. Is it simply because of their clever marketing?

Doug Cutting: Google would tell you that they've done no marketing: it's all been word-of-mouth. But that's not quite true. They frequently take the moral high ground in public statements, disparaging things like pay-for-inclusion. That's marketing. For many years they provided an advertisement-free web search engine that worked better than anything else available at the time. Now they have ads, their result quality is not markedly superior, but folks still think of them as higher quality and less commercial. That's good marketing too. So yes, it is in part clever marketing, but also in large part because they've delivered a great product. Throw in measures of luck, timing, and respect for consumers, and you've almost got a recipe!

Apart from all the self-promotion as "the nice guys" in the business, do we have to fear Google's de facto monopoly power? Has it stalled innovation in the field of search technologies?

Doug Cutting: Innovation has slowed since Google launched, but I don't think that's because innovators are afraid of Google. Quite the contrary: Google's success has made innovators try even harder to find a Google-killer. But the fact that no Google-killer has yet appeared leads me to the conclusion that innovation has slowed because the technology has matured. The big innovations have been made. How fundamentally have cars changed since the Model T?

The German Bertelsmann Foundation has started a dialogue about self-regulation of the search industry. It promotes a code of conduct that would bind search providers to block or edit material on neo-Nazi sites, for example. What do you think about this approach in general, and how could open source searching play along with it?

Doug Cutting: A large commercial search company is unlikely to deploy a search engine that violates such laws, but Nutch permits folks to easily launch uncensored search engines.

I guess many people are googling for Nutch at the moment. When will they be nutching for Google?

Doug Cutting: Nutch does not wish to compete with Google. Rather we see what we're doing as complementary. We're facilitating research and we're permitting folks to get alternative search results. So long as Google keeps delivering high-quality, unbiased search results, Nutch should not affect Google's business much. Has Linux shifted the balance of power in PC desktop operating systems? Not yet. But Linux provides a free alternative to Windows that has enabled lots of other applications, like set-top boxes, web servers, routers, PDAs, etc. Just as the success of Linux has not required the defeat of Windows, neither does the success of Nutch in any way require the defeat of Google.