#!/usr/bin/env ruby
$usage = <<-EOF
`project`: remove all but specified columns from an input file.

Stdin and stdout are whitespace-separated columns.  You specify which
columns you want to keep, either numerically, counting from 0, or by
name, in which case the first input line must consist of
whitespace-separated column names.

So, for example:

    kragen@VOSTRO9:~/devel/misc$ df -B1
    Filesystem           1B-blocks      Used Available Use% Mounted on
    /dev/sda1            7181049856 6531735552 284528640  96% /
    tmpfs                520904704         0 520904704   0% /lib/init/rw
    varrun               520904704    135168 520769536   1% /var/run
    varlock              520904704         0 520904704   0% /var/lock
    udev                 520904704    147456 520757248   1% /dev
    tmpfs                520904704     90112 520814592   1% /dev/shm
    lrm                  520904704   2244608 518660096   1% /lib/modules/2.6.28-19-generic/volatile
    kragen@VOSTRO9:~/devel/misc$ df -B1 | project Mounted Available
    Mounted  Available
    /        284516352
    /lib/init/rw  520904704
    /var/run      520769536
    /var/lock     520904704
    /dev          520757248
    /dev/shm      520372224
    /lib/modules/2.6.28-19-generic/volatile  518660096

Or, alternatively:

    kragen@VOSTRO9:~/devel/misc$ df -B1 | project 5 3
    (same output)

Here’s another example:

    kragen@VOSTRO9:~/devel/misc$ ls -l | tail
    -rw-r--r--  1 kragen kragen   71469 2012-02-01 21:19 when-johnny-comes-marching-home-8bit.wav
    -rw-r--r--  1 kragen kragen    4747 2012-03-07 17:40 wikimedia-commons-nudes
    -rw-r--r--  1 kragen kragen     381 2012-03-07 16:22 wikimedia-commons-nudes.~1~
    -rwxr-xr-x  1 kragen kragen    8471 2009-11-05 20:04 wordhashes.py
    lrwxrwxrwx  1 kragen kragen      11 2009-11-05 18:11 wordlist -> ../wordlist
    -rwxr-xr-x  1 kragen kragen    1523 2012-03-23 00:35 xpose
    -rwxr-xr-x  1 kragen kragen      50 2012-03-23 00:11 xpose.~1~
    -rwxr-xr-x  1 kragen kragen    1694 2012-01-19 05:48 youtubeogg.py
    -rwxr-xr-x  1 kragen kragen      90 2012-01-16 12:51 youtubeogg.py.~1~
    -rwxr-xr-x  1 kragen kragen    1429 2012-01-16 13:19 youtubeogg.py.~2~
    kragen@VOSTRO9:~/devel/misc$ ls -l | tail | project 7 4
    when-johnny-comes-marching-home-8bit.wav  71469
    wikimedia-commons-nudes                   4747 
    wikimedia-commons-nudes.~1~               381  
    wordhashes.py                             8471 
    wordlist                                  11   
    xpose                                     1523 
    xpose.~1~                                 50   
    youtubeogg.py                             1694 
    youtubeogg.py.~1~                         90   
    youtubeogg.py.~2~                         1429 

And another:

    kragen@VOSTRO9:~/devel/misc$ (cd; du -sb * | sort -n | tail)
    10876202	trainstation_data
    19134912	Pictures
    23199071	Desktop
    25935163	public_html
    45263310	pkgs
    159810766	Downloads
    230311267	Videos
    242231066	music
    281248344	Documents
    408474679	devel
    kragen@VOSTRO9:~/devel/misc$ (cd; du -sb * | sort -n | tail) | ./project 1 0
    trainstation_data  10876202
    Pictures           19134912
    Desktop            23199071
    public_html        25935163
    pkgs               45263310
    Downloads          159810766
    Videos             230311267
    music              242231066
    Documents          281248344
    devel              408474679

Column names are matched by case-sensitive substring, and one name can
match more than one column:

    kragen@VOSTRO9:~/devel/misc$ route -n | tail -n +2 | ./project G
    Gateway  Genmask
    0.0.0.0  255.255.255.0
    0.0.0.0  255.255.0.0  
    192.168.1.1  0.0.0.0      
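
As a rough sketch of that matching rule (the helper name
`matching_columns` is made up for illustration and is not part of
`project`, and the header list is assumed to be a typical `route -n`
header), each header is tested for containing the argument as a
substring, and every matching index is kept:

```ruby
# Hypothetical helper, not part of project: returns the indices of all
# headers that contain `name` as a case-sensitive substring.
def matching_columns(headers, name)
  headers.each_index.select { |i| headers[i].include?(name) }
end

# Assumed header line, as printed by a typical `route -n`.
headers = %w[Destination Gateway Genmask Flags Metric Ref Use Iface]
p matching_columns(headers, 'G')   # prints [1, 2]
```

`Flags` is not selected because the match is case-sensitive.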

In order to function in a fully streaming fashion, `project` does not
do any lookahead, so it may need to widen its columns as it goes.  The
`df` example shows how this looks.  A quick kludge to solve this
problem is to pass the output through `xpose` twice:

    kragen@VOSTRO9:~/devel/misc$ df -B1 | project Mounted Available | ./xpose | ./xpose
    Mounted                                  Available
    /                                        284463104
    /lib/init/rw                             520904704
    /var/run                                 520769536
    /var/lock                                520904704
    /dev                                     520757248
    /dev/shm                                 520814592
    /lib/modules/2.6.28-19-generic/volatile  518660096
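
The widening behaviour can be sketched as follows.  This is a
simplified illustration, not the code `project` itself runs, and the
class name `StreamingPadder` is hypothetical.  Each column remembers
the widest value seen so far, so a row is padded using only the rows
that came before it:

```ruby
# Hypothetical illustration of streaming column padding without lookahead.
class StreamingPadder
  def initialize
    @widths = []   # widest value seen so far, per column
  end

  # Pads each value to the running maximum width for its column.
  def pad(values)
    values.each_with_index.map do |value, i|
      @widths[i] = [@widths[i] || 0, value.length].max
      value.ljust(@widths[i])
    end
  end
end

padder = StreamingPadder.new
p padder.pad(%w[/ 284516352])
p padder.pad(%w[/lib/init/rw 520904704])
p padder.pad(%w[/dev 520757248])
```

The first row is padded only to its own width, while the third row is
padded to the width of `/lib/init/rw`.  Rows already printed are never
revisited, which is why early lines can end up narrower than later
ones.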

BUGS

When a header name isn’t found, `project` doesn’t report which one, or
print the header line it wasn’t found in.

Currently `project` only allows each input column to appear at most
once in the output, silently dropping duplicates.
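
This happens because the column lists from all arguments are merged
with duplicates removed.  A minimal sketch of that merge, using a
made-up function name rather than the one in the script:

```ruby
# Hypothetical stand-in for the merge step: concatenates the
# per-argument column lists, keeping only the first occurrence of each
# column number.
def merge_unique(column_lists)
  column_lists.flatten(1).uniq
end

p merge_unique([[5], [3], [5]])   # prints [5, 3]
```

So something like `project 5 3 5` prints column 5 only once.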

Matching headers by substring could be a problem if one header is a
substring of another.

Header names containing whitespace are split into several headers,
which shifts every later header out of line with its data column.

It should produce a usage message instead of a bunch of blank lines
when invoked with no arguments.

XXX why doesn’t this work?

    $ ./every 1 './xpose < /proc/meminfo | head -n 2' | ./project Active
    Active:  Active(anon):  Active(file):
    Active:  Active(anon):  Active(file):
    Active:  Active(anon):  Active(file):
    Active:  Active(anon):  Active(file):
    Active:  Active(anon):  Active(file):
    Active:  Active(anon):  Active(file):
Active:  Active(anon):  Active(file):

EOF

class NumericQuery
  def initialize(n)
    @n = n
  end

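  # Numeric queries ignore the header line; this no-op exists so that
  # both query classes share the same interface.
  def header_line(headers)
  end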

  def column_numbers
    [@n]
  end
end

class HeaderNotFound < StandardError
end

class PatternQuery
  def initialize(pattern)
    @pattern = Regexp.new(Regexp.escape(pattern))
  end

  def header_line(headers)
    @columns = []
    headers.each_with_index do |header, i|
      @columns << i if @pattern.match(header)
    end
    raise HeaderNotFound.new(@pattern) if @columns.empty?
  end

  def column_numbers
    @columns
  end
end

# Concatenates the column lists of all queries, keeping only the first
# occurrence of each column number.
def merge_columnspecs(columnspecs)
  columnspecs.flatten(1).uniq
end

# Integer arguments select a column by position; anything else is
# treated as a header-name pattern.
def query_from_arg(arg)
  NumericQuery.new(Integer(arg))
rescue ArgumentError
  PatternQuery.new(arg)
end

class Projection
  def initialize(column_numbers)
    @column_numbers = column_numbers
    @widths = [0] * column_numbers.length
  end

  def pad_columns(values)
    values.each_with_index do |value, i|
      value = '' if value.nil?
      @widths[i] = value.length if value.length > @widths[i]
      yield value.ljust(@widths[i]), i
    end
  end

  def emit_line(values)
    pad_columns(values) do |value, i|
      print "  " if i > 0
      print value
    end
    print "\n"
  end

  def project(input)
    @column_numbers.map { |n| input[n] }
  end
end

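# Flush after every write so that output streams even into a pipe,
# starting with the header line.
STDOUT.sync = true

queries = ARGV.map { |arg| query_from_arg arg }

# The first input line is treated as the header; numeric queries ignore it.
first_line = STDIN.gets
exit if first_line.nil?   # empty input: nothing to project
header_line = first_line.split
queries.each { |q| q.header_line(header_line) }
projection = Projection.new(merge_columnspecs(queries.map(&:column_numbers)))
projection.emit_line(projection.project(header_line))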

STDIN.each_line do |line|
  projection.emit_line(projection.project(line.split))
end

