From kragen@dnaco.net Fri Sep 25 00:00:15 1998 Date: Fri, 25 Sep 1998 00:00:13 -0400 (EDT) From: Kragen X-Sender: kragen@pike To: systalk@ml.org, tech@clug.org Subject: distributed database Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Keywords: X-UID: 2147 Status: O X-Status: I'm a distribution freak. I don't like the idea that *anything* has to be centralized. Here's what I'm on about at the moment: search engines are very centralized. I'd like to remedy this situation. Trouble is, I don't know how to build a distributed index. It's plain enough how to build an index where all the index for a particular range of keys lives on one machine or group of machines, and all the index for another particular range of keys lives on another machine or group of machines. DNS does it, and does it well -- it's the most reliable service on the Internet. But DNS doesn't have to do joins. The thing is, if I want to search for pages containing both "Kragen" and "criminal", it's unlikely that the parts of the index for both of these will reside on the same machines. So I want to find a way to combine the index entries for the 13038 occurrences of "Kragen" and the 1,823,325 occurrences of "criminal" and find the 88 pages in which they both occur, and discover that none of them are about me. I don't know of a reasonably-fast way of doing this. The best method I've been able to find so far is a method of walking down binary tries described in Technical Report TR 94-18 from the University of New Hampshire, which proves that the average execution time of this method is something I won't understand without reading the whole paper. It seems apparent to me, though, that the amount of network traffic required to find the intersection will likely be rather large. Kragen -- Kragen Sitaker The sages do not believe that making no mistakes is a blessing. They believe, rather, that the great virtue of man lies in his ability to correct his mistakes and continually make a new man of himself. -- Wang Yang-Ming