From kragen@dnaco.net Wed Aug 26 13:31:50 1998
Date: Wed, 26 Aug 1998 13:31:48 -0400 (EDT)
From: Kragen
To: "Bradley M. Kuhn"
cc: clug-user@clug.org
Subject: Re: another way to do it (was Re: Web Page Help)
In-Reply-To: <19980826125747.27521@ebb.org>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Keywords:
X-UID: 1462
Status: O
X-Status:

On Wed, 26 Aug 1998, Bradley M. Kuhn wrote:
> Thus spoke Kragen:
> > On Wed, 26 Aug 1998, Bradley M. Kuhn wrote:
> > > It handles the zero byte problem gracefully. :)
>
> If the file has a zero byte, and/or an ASCII 0, or ASCII 000 on the last
> last line of the file with no newline, <> and other diamond operators
> return NULL or "0" or "000", etc. All of which are false...thus ending
> the loop one line early.

It looks like it doesn't actually end the loop one line early. See my
other reply.

> .*?\s+ is no better, optimization wise, than .*\s+ I had this
> misconception too until I took Friedl's RE course at the Perl conference.

Is that because Perl is doing a breadth-first search?

> You can see my notes from the course if you are interested.

Indeed I am. Who's Friedl?

> > Doesn't Perl use a DFA, though?
>
> No, Perl is NFA as they come. In fact, in Perl 5.005, you can now match
> context free languages with Perl's regular expressions.

Wowzers.

> > Are you sure that \s+\S+\s+ will skip over everything before the date?
>
> I think so, but not sure.

I was just thinking about spaces in the envelope-sender -- I've never
seen 'em, but they're legal if properly quoted.

> > I wasn't, because I wasn't familiar with the UUCP-style "From" line's
> > standard.
>
> I think it is correct. You should have From . The sendmail
> >From line was made to be simple.

Look! It ate your ^From! :)

> > > close(OUTPUT) unless ($lastOpened eq "");
> >
> > You can close an unopened filehandle safely, can't you?
>
> IIRC, yes, but the quick test doesn't hurt......(maybe minor optimization
> issue....)

Oh, I disagree 1000%! Doing work you don't have to do is a minor
optimization issue, it's true (most of the time) but it's a *major*
impediment to maintenance. Writing code that doesn't make sense makes it
impossible for someone later to understand your code, which is a
necessary prerequisite to maintenance.

> > > And forgetting to close a filehandle is safe, isn't it?
>
> Yes, but I am not sure on the semantics of opening the same file handle
> again.
>
> It is probably ok, but I wouldn't want to depend on it.

See the close FILEHANDLE entry in perlfunc.

> > > BTW, I ran this on a 3.9MB mail file that split into 29 different months.
> > > It took 1.5 minutes on my Pentium 90.
> >
> > That's interesting. It took mine four seconds to do a 1.1MB file on a
> > 5x86-133 -- which is approximately equal to a P75 -- but running yours
> > on a similar machine, with roughly four times as much mail, took 20 times
> > as long? Are you sure it wasn't a 39MB mail file?
>
> > Maybe UnixDate() is drowning the win from the better regex and the
> > saved system calls.
>
> Yes, Date::Manip routines can be slow. :(
>
> That's probably what slows it down....It was a 3.9MB file.

A factor of 20, though?! Maybe we should change to using a variant of
your regex (to avoid the exponential explosion of someone handing us
"From ") but using a simple regex instead of UnixDate().

btw, the following script completed immediately on a sun4c:

    print "starting\n";
    "X " =~ /X.*\s+X/;
    print "done\n";

Why? I'd think it would end up trying to do on the order of 2^29 or 2^30
operations. Can you show me a script that does exhibit the exponential
regex behavior that normally cripples NFAs under some circumstances??
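
One hedged explanation for the instant finish: /X.*\s+X/ has no nested quantifier, so a backtracking matcher has only about n^2/2 ways to divide n spaces between .* and \s+, not 2^n. (Perl may also give up before matching at all because the string holds only one X; that is an assumption about its optimizer, not something measured here.) The sketch below is a counting model, not a benchmark of Perl's real engine. tries_flat and tries_nested are hypothetical helper names, and the 30 spaces are an assumption taken from the 2^29 to 2^30 estimate. Each function counts how often a naive backtracker would test the final X against an X followed by n spaces, first for /X.*\s+X/ and then for a nested pattern such as /X(\s+)*X/.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Model of /X.*\s+X/ against "X" followed by $n spaces.
# Every split of the spaces between .* and \s+ leads to one failed
# test of the final X.
sub tries_flat {
    my ($n) = @_;
    my $tries = 0;
    for my $k (0 .. $n) {              # spaces consumed by .*
        for my $m (1 .. $n - $k) {     # spaces consumed by \s+ (at least one)
            $tries++;                  # final X tested here, and it fails
        }
    }
    return $tries;
}

# Model of /X(\s+)*X/ against the same string. Each pass of the outer
# star may take any non-empty run of the spaces that remain, so the
# choices multiply instead of adding up.
sub tries_nested {
    my ($remaining) = @_;
    my $tries = 1;                     # stop looping here, test final X, fail
    for my $m (1 .. $remaining) {      # or let \s+ take $m more spaces
        $tries += tries_nested($remaining - $m);
    }
    return $tries;
}

printf "flat,   30 spaces: %d tries\n", tries_flat(30);
printf "nested, 16 spaces: %d tries\n", tries_nested(16);
```

Under this model the flat pattern needs 465 tries for 30 spaces, while the nested one needs 65536 tries for only 16 and doubles with each extra space. A real engine may do better than the model on either pattern, so treat the nested form as a candidate for exponential behavior rather than a guaranteed one.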

Kragen

-- 
Kragen Sitaker
We are forming cells within a global brain and we are excited that we might
start to think collectively. What becomes of us still hangs crucially on how
we think individually. -- Tim Berners-Lee, inventor of the Web