From kragen@dnaco.net Tue Jun 30 22:45:27 1998
Date: Tue, 30 Jun 1998 22:45:26 -0400 (EDT)
From: Kragen <kragen@dnaco.net>
To: "Bradley M. Kuhn" <bkuhn@ebb.org>
cc: clug-user@clug.org
Subject: Re: Insomnia does strange things.
In-Reply-To: <19980630180534.56616@ebb.org>
Message-ID: <Pine.SUN.3.96.980630222440.9507c-100000@picard.dnaco.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Tue, 30 Jun 1998, Bradley M. Kuhn wrote:
> Thus spoke Kragen:
> > > Well, you can do the same search at the clug.org WWW site.  Why does it need
> > > to be in altavista too?
>  
> > Because, more than likely, I've never heard of the mailing list the person
> > sent their question to.  Even if I have, I'd rather do one search of all
> > mailing lists, rather than a hundred searches to cover all the ones I know
> > about.
>  
> Well, I don't think the archive of this list is such an important thing that
> it need be archived for this reason.  The answers given here are ones that
> are already available elsewhere.  

Both of these things could be said about most of the mailing lists in
whose archives I've found useful information through web searches.

> This list is primarily to help local
> people get to know each other and help each other and interact on a more
> personal level.  The archive, IMHO, is mainly for use by members
> (although it is public).

Well, that's as may be.  That might be a valid reason to hide the
mailing list from searchbots.

>   The answers we give here are never as good as one
> would get on a "better" mailing list.

I don't think that's necessarily true.

> > Hypermail-indexed messages have a paragraph in them that looks like this:
> > 
> > Messages indexed by: [date] [thread] [subject] [author]
> 
> Hmm...our mailing list to WWW gateway (mhonarc) does not do this, although
> it has a similar line.  Is this a de-facto standard of some sort, or just
> something you have discovered about hypermail (which I assume is a
> functional equivalent to mhonarc)?

Yes, hypermail is a functional equivalent to mhonarc; I don't think
this is a de-facto standard of some sort.  But when I see lots of
unwanted mail messages showing up in my AltaVista searches, I just
glance at one and exclude it with some such phrase.

If it's a single archive that's spitting up lots of hits, you can also
exclude it by URL.
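
To make the two kinds of exclusion concrete, here is a minimal sketch
that filters a list of hits locally, the way a search engine's exclude
terms would.  The hits, the example.* hosts, and the exclude_hits name
are all invented for illustration; none of this is AltaVista's actual
interface.

```python
# Hypothetical search hits as (url, page text) pairs, invented for illustration.
HITS = [
    ("http://lists.example.org/archive/0001.html",
     "Messages indexed by: [date] [thread] [subject] [author] ..."),
    ("http://www.example.com/linux-howto.html",
     "How to set up PPP on Linux ..."),
    ("http://archive.example.net/list/msg42.html",
     "Re: PPP trouble ..."),
]


def exclude_hits(hits, phrases=(), url_prefixes=()):
    """Drop hits whose text contains an excluded phrase or whose URL
    starts with an excluded prefix; keep everything else in order."""
    kept = []
    for url, text in hits:
        if any(p.lower() in text.lower() for p in phrases):
            continue  # boilerplate phrase marks this as an archive page
        if any(url.startswith(prefix) for prefix in url_prefixes):
            continue  # this whole archive has been excluded by URL
        kept.append((url, text))
    return kept


remaining = exclude_hits(
    HITS,
    phrases=["Messages indexed by:"],
    url_prefixes=["http://archive.example.net/"],
)
for url, _ in remaining:
    print(url)
```

The phrase test catches every archive built by the same software,
while the URL test catches one noisy archive whatever its boilerplate
looks like.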

> One of the main problems is that all the archived messages on the CLUG list
> have the following in them:
> 
> <a href="index.html"><strong><font size=+3>Cincinnati Linux Users
> Group</font></strong></a><br>
> <em>Making Linux Available to Everyone!</em>
> 
> As a result, *any* search on the keywords "cincinnati, linux, users" will
> return *every* message in the archive!

Yes, this is true.  However, search engines have gone to some
considerable trouble to make sure things like this don't make them
unusable.  Searching on AltaVista for cincinnati linux users, my top
ten hits include:

- three messages from the archive
- the directions to Castle page
- the directions to the UC Medical Science Building
- the University of Cincinnati LUG page
- the CLUG Site Statistics page
- the CLUG Resources page
- the CLUG Members page
- meeting minutes from 1997-07-05

It is true that AltaVista returns 1,081,050 matches, and (presumably)
that among those matches are all the mailing list messages.  Given that
people can easily exclude the messages (-thread-next-thread-prev) if
they want, why bother doing it for them?

> As a compromise, perhaps we could just index the main thread pages and not
> the individual article pages?

Well, you know, I'm not maintaining the web site.  In fact, I'm not
even a CLUG member, really.  Do what you like.  Just know that I have
benefited, in the past, from people not hiding their mailing-list
archives from crawlers, and other people might benefit from CLUG not
hiding CLUG-USER.
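
If you do go with the compromise, a robots.txt rule might be enough,
provided the article pages share a file-name prefix that the thread
index does not.  Here is a rough sketch that checks such a rule with
Python's standard robots.txt parser.  The /clug-user/ path, the "msg"
prefix, the threads.html name, and the example.org host are
assumptions about the archive layout, not the site's real paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The /clug-user/ path and the "msg" file-name
# prefix are assumptions about the archive layout, not the real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /clug-user/msg
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://www.example.org"  # placeholder host


def crawlable(path, agent="examplebot"):
    """Return True if a crawler that honors robots.txt may fetch the path."""
    return parser.can_fetch(agent, BASE + path)


# Disallow lines match by path prefix, so one rule covers every
# article page while leaving the thread index open to crawlers.
thread_index_ok = crawlable("/clug-user/threads.html")
article_ok = crawlable("/clug-user/msg00042.html")
print(thread_index_ok, article_ok)
```

This only works for crawlers that honor robots.txt, and only if the
prefix assumption holds; otherwise the article pages would have to
move to their own directory.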


Kragen


