From kragen@dnaco.net Wed Aug 26 13:31:50 1998
Date: Wed, 26 Aug 1998 13:31:48 -0400 (EDT)
From: Kragen <kragen@dnaco.net>
To: "Bradley M. Kuhn" <bkuhn@ebb.org>
cc: clug-user@clug.org
Subject: Re: another way to do it (was Re: Web Page Help)
In-Reply-To: <19980826125747.27521@ebb.org>
Message-ID: <Pine.SUN.3.96.980826131531.11646c-100000@picard.dnaco.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Wed, 26 Aug 1998, Bradley M. Kuhn wrote:
> Thus spoke Kragen:
> > On Wed, 26 Aug 1998, Bradley M. Kuhn wrote:
> > > It handles the zero byte problem gracefully.  :)
> 
> If the file has a zero byte, and/or an ASCII 0, or ASCII 000 on the last
> line of the file with no newline, <> and other diamond operators
> return NULL or "0" or "000", etc.  All of which are false...thus ending
> the loop one line early.

It looks like it doesn't actually end the loop one line early.  See my
other reply.
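
For what it's worth, here is a minimal sketch of why the loop survives a
final "0" line: Perl wraps a bare while (<FH>) in an implicit defined()
test, so a line that is false but defined still counts.  The sample data
and the in-memory filehandle (which needs a much newer Perl than 5.005)
are assumptions for illustration, not taken from either script in this
thread.

```perl
use strict;
use warnings;

# Hypothetical sample data: the last line is "0" with no trailing newline.
my $data = "first\n0";

# Read from an in-memory filehandle (needs Perl 5.8 or later).
open(my $fh, '<', \$data) or die "can't open in-memory file: $!";

my @lines;
while (<$fh>) {          # Perl treats this as while (defined($_ = <$fh>))
    chomp;
    push @lines, $_;
}
close($fh);

print scalar(@lines), " lines read; last line is '$lines[-1]'\n";
```

Only a test that looks at the truth of the line itself, rather than
whether it is defined, would stop early on that last "0".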

> .*?\s+ is no better, optimization-wise, than .*\s+ -- I had this
> misconception too until I took Friedl's RE course at the Perl conference.

Is that because Perl is doing a breadth-first search?

> You can see my notes from the course if you are interested.

Indeed I am.

Who's Friedl?

> > Doesn't Perl use a DFA, though?
> 
> No, Perl is as NFA as they come.  In fact, in Perl 5.005, you can now match
> context-free languages with Perl's regular expressions.

Wowzers.

> > Are you sure that \s+\S+\s+ will skip over everything before the date?
> 
> I think so, but not sure.

I was just thinking about spaces in the envelope-sender -- I've never
seen 'em, but they're legal if properly quoted.

> > I wasn't, because I wasn't familiar with the UUCP-style "From" line's
> > standard.
> 
> I think it is correct.  You should have From <ADDRESS> <DATE>.  The sendmail
> >From line was made to be simple.

Look!  It ate your ^From!  :)

> > >              close(OUTPUT) unless ($lastOpened eq "");
> > 
> > You can close an unopened filehandle safely, can't you?
> 
> IIRC, yes, but the quick test doesn't hurt......(maybe minor optimization
> issue....)

Oh, I disagree 1000%!  Doing work you don't have to do is a minor
optimization issue, it's true (most of the time) but it's a *major*
impediment to maintenance.  Writing code that doesn't make sense makes
it impossible for someone later to understand your code, which is a
necessary prerequisite to maintenance.

>  
> > And forgetting to close a filehandle is safe, isn't it?
> 
> Yes, but I am not sure on the semantics of opening the same file handle
> again.
> 
> It is probably ok, but I wouldn't want to depend on it.

See the close FILEHANDLE entry in perlfunc.
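
A rough sketch of what that entry describes, as I read it: open() on a
handle that is already open closes it first, and close() on a handle that
was never opened just returns false.  The temporary directory and the two
file names below are made up for the example; only the OUTPUT handle name
comes from the script under discussion.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir   = tempdir(CLEANUP => 1);
my @names = ("$dir/1998-07", "$dir/1998-08");   # hypothetical output files

{
    no warnings 'unopened';
    close(OUTPUT);   # never opened: returns false, nothing else happens
}

for my $name (@names) {
    # No explicit close here: open() on a handle already in use
    # closes (and so flushes) the previous file first.
    open(OUTPUT, '>', $name) or die "can't write $name: $!";
    print OUTPUT "message for $name\n";
}
close(OUTPUT) or die "close failed: $!";

# Read both files back to show that nothing was lost.
for my $name (@names) {
    open(my $in, '<', $name) or die "can't read $name: $!";
    print <$in>;
    close($in);
}
```

So the "unless ($lastOpened eq "")" guard is not needed for correctness
here; at most it avoids a warning when warnings are turned on.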

> > > BTW, I ran this on a 3.9MB mail file that split into 29 different months.
> > > It took 1.5 minutes on my Pentium 90.
> > 
> > That's interesting.  It took mine four seconds to do a 1.1MB file on a
> > 5x86-133 -- which is approximately equal to a P75 -- but running yours
> > on a similar machine, with roughly four times as much mail, took 20 times
> > as long?  Are you sure it wasn't a 39MB mail file?
> 
> > Maybe UnixDate() is drowning the win from the better regex and the
> > saved system calls.
> 
> Yes, Date::Manip routines can be slow.  :(
> 
> That's probably what slows it down....It was a 3.9MB file.

A factor of 20, though?!

Maybe we should change to using a variant of your regex (to avoid the
exponential explosion of someone handing us "From                   ")
but using a simple regex instead of UnixDate().

btw, the following script completed immediately on a sun4c:
print "starting\n";
"X                              " =~ /X.*\s+X/;
print "done\n";

Why?  I'd think it would end up trying to do on the order of 2^29 or
2^30 operations.  Can you show me a script that does exhibit the
exponential regex behavior that normally cripples NFAs under some
circumstances?
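
One possible piece of the answer, offered as a guess rather than a
statement about Perl's internals: /X.*\s+X/ has two quantifiers in
sequence, not one nested inside the other, so even a completely naive
backtracker only has to try each way of dividing the spaces between .*
and \s+.  That grows quadratically, not exponentially.  The sketch below
just counts those divisions; split_attempts is a made-up helper, and the
real engine may well give up sooner (for example, by noticing there is no
second X at all).

```perl
use strict;
use warnings;

# Count the ways a naive backtracker could split $n trailing spaces
# between .* (zero or more characters) and \s+ (one or more spaces)
# before concluding that the final X is missing.
sub split_attempts {
    my ($n) = @_;
    my $attempts = 0;
    for my $dot_star (0 .. $n - 1) {            # characters taken by .*
        for my $spaces (1 .. $n - $dot_star) {  # characters taken by \s+
            $attempts++;
        }
    }
    return $attempts;
}

printf "%d spaces: %d attempts, versus 2**%d = %d\n",
    30, split_attempts(30), 30, 2**30;
```

For 30 spaces that is 465 attempts, which would explain finishing
immediately.  The exponential cases usually involve a quantifier applied
to something that itself contains a quantifier.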

Kragen

-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
We are forming cells within a global brain and we are excited that we might
start to think collectively.  What becomes of us still hangs crucially on
how we think individually.  -- Tim Berners-Lee, inventor of the Web