From kragen@dnaco.net Wed Aug 26 12:04:05 1998
Date: Wed, 26 Aug 1998 12:04:03 -0400 (EDT)
From: Kragen <kragen@dnaco.net>
To: "Bradley M. Kuhn" <bkuhn@ebb.org>
cc: clug-user@clug.org
Subject: Re: another way to do it (was Re: Web Page Help)
In-Reply-To: <19980826111830.46777@ebb.org>
Message-ID: <Pine.SUN.3.96.980826114827.11646U-100000@picard.dnaco.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Wed, 26 Aug 1998, Bradley M. Kuhn wrote:
> Thus spoke Kragen:
> > while (<>) {
> 
> ARGH!  Why not use:
> 
> while ( defined(my $line = <>))
> 
> It handles the zero byte problem gracefully.  :)

Why not?  Because I'd never heard of the zero byte problem.

Are you saying that the string "\0" is false as a boolean?  Maybe we
should fix this problem here.
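
A quick way to check which strings Perl actually treats as false is
sketched below.  The sample strings are chosen purely for illustration
and do not come from Bradley's message.

```perl
use strict;
use warnings;

# Sample strings chosen for illustration only.
my @samples = ("", "0", "\0", "0\n", "00", "0.0");

# Map a printable form of each string to its boolean value.
my %truth;
for my $s (@samples) {
    (my $shown = $s) =~ s/\0/\\0/g;   # show NUL as \0
    $shown =~ s/\n/\\n/g;             # show newline as \n
    $truth{$shown} = $s ? "true" : "false";
    printf "%-6s is %s\n", "\"$shown\"", $truth{$shown};
}
```

Only the empty string and "0" come out false; "\0" is true.  So the
risky case would be a final line consisting of just "0" with no
trailing newline, and only in a loop that tests the assigned value
directly.  As far as perlop documents it, a bare while (<>) already
behaves like while (defined($_ = <>)).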

> 
> >         if (/^From.*\s+([A-Za-z][A-Za-z][A-Za-z])\s+\d+\s+[0-9:]+\s+(?:\S+\s+)?((19|20)\d\d)\s*$/)
> 
> That From.*\s+ is a bit dangerous, don't you think?  :)
> 
> It will slow you down severely on non-match....

I thought about its slowness on match (since it'll start trying to
match near the end of the line -- maybe .*?\s+ would be better), but I
didn't think about the fact that it backtracks heavily, taking time
polynomial rather than linear in the amount of whitespace on a false
From line, at least with a backtracking NFA implementation.

Doesn't Perl use a DFA, though?

I'm not really familiar with optimizing regexes (even Perl ones).

> How about this little rewrite:
> 
> ###############################################################################
> use strict;
> use Date::Manip;
> 
> my $lastOpened = "";

Optimization alert!

> while ( defined(my $line = <>)) {
>     if (my($date) = $line =~ /^From\s+\S+\s+(.+)$/) {
>         my $filename = &UnixDate($date, "%Y-%m");

I'm not familiar with UnixDate.  What does it do?  And why are you
explicitly &ing the routine?

Are you sure that \s+\S+\s+ will skip over everything before the date?
I wasn't, because I wasn't familiar with the UUCP-style "From" line's
standard.
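
One way to test this is to run Bradley's pattern against a real
envelope line.  The sketch below uses the "From " line from the top of
this very message; it shows only that this one sample works, not that
every UUCP-style line will.

```perl
use strict;
use warnings;

# The envelope line from the top of this message.
my $line = "From kragen\@dnaco.net Wed Aug 26 12:04:05 1998\n";

# Bradley's pattern: "From", one whitespace-free sender field,
# then everything up to the end of the line.
my ($date) = $line =~ /^From\s+\S+\s+(.+)$/;

print defined $date ? "date part: $date\n" : "no match\n";
```

The pattern relies on the sender field containing no whitespace.  If a
sender field did contain a space, the capture would begin partway
through the sender rather than at the date.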

>         if ($filename eq "") {
>             warn "Line $.: found a From line without a date!: $line";

Is good.

>             print OUTPUT $line unless ($lastOpened eq "");
>             next;
>         }  
>         unless ($lastOpened eq $filename) {
>              close(OUTPUT) unless ($lastOpened eq "");

You can close an unopened filehandle safely, can't you?

And forgetting to close a filehandle is safe, isn't it?
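
The first question is easy to check with a small sketch.  NEVER_OPENED
is a made-up handle name used only for this test.

```perl
use strict;
use warnings;
# Silence the warnings this deliberate misuse would otherwise trigger.
no warnings qw(unopened once);

# NEVER_OPENED is a hypothetical handle that has never been opened.
my $closed = close(NEVER_OPENED);

print "close returned ", ($closed ? "true" : "false"), "\n";
```

close simply returns false here rather than dying, so the guard is not
needed for safety, though it does avoid a warning under -w.  On the
second question, perlfunc says open closes the handle for you if you
immediately reopen it, so skipping the explicit close between files
should be harmless as well.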

>              open(OUTPUT, ">>$filename") || die "Cannot open  $filename: $!\n";
>         }
>         $lastOpened = $filename;
>     }
>     $lastOpened ? print OUTPUT $line :
>            warn "Line $.: precedes any valid From lines: $line";
> }
> ###############################################################################
> 
> The Date::Manip takes care of most of your worries with the date regex.
> Date::Manip can be found on CPAN.
> 
> There are also a variety of Mail:: handling packages as well, but using the
> regex is probably just as good and much faster.
> 
> My version has the following advantages over Kragen's (however, Kragen's was
> a fine start :):
>     - Does not reopen the same file as many times....keeps track to see
>       if the last file was the same

This ought to be a big win.

>     - has greater date handling functionality, using Date::Manip.

Is this a good thing?

>     - has more efficient regex for checking From line.

Definitely.

> BTW, I ran this on a 3.9MB mail file that split into 29 different months.
> It took 1.5 minutes on my Pentium 90.

That's interesting.  It took mine four seconds to do a 1.1MB file on a
5x86-133 -- which is approximately equal to a P75 -- but running yours
on a similar machine, with roughly four times as much mail, took 20 times
as long?  Are you sure it wasn't a 39MB mail file?

Maybe UnixDate() is drowning the win from the better regex and the
saved system calls.
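
For a more careful comparison, the quoted figures can be turned into
throughput.  This is only arithmetic on the numbers in this thread; the
two machines are similar but not identical, so treat the result as
rough.

```perl
use strict;
use warnings;

# Figures quoted in this thread (5x86-133 vs. Pentium 90).
my %runs = (
    kragen  => { megabytes => 1.1, seconds => 4  },
    bradley => { megabytes => 3.9, seconds => 90 },   # 1.5 minutes
);

# Throughput in megabytes per second for each run.
my %rate;
for my $who (sort keys %runs) {
    $rate{$who} = $runs{$who}{megabytes} / $runs{$who}{seconds};
    printf "%-8s %.3f MB/s\n", $who, $rate{$who};
}

printf "size ratio: %.1fx, time ratio: %.1fx\n",
    $runs{bradley}{megabytes} / $runs{kragen}{megabytes},
    $runs{bradley}{seconds}   / $runs{kragen}{seconds};
```

That works out to about 3.5 times the mail in 22.5 times the time, or
roughly a sixfold difference per megabyte.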

Kragen

-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
We are forming cells within a global brain and we are excited that we might
start to think collectively.  What becomes of us still hangs crucially on
how we think individually.  -- Tim Berners-Lee, inventor of the Web


