{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f78debde",
   "metadata": {},
   "source": [
    "Perfect Pangrams: a completely unjustifiable optimization problem\n",
    "====\n",
    "\n",
    "My mother wondered\n",
    "what the [\"perfect pangram\"](https://word.tips/spelling-bee-answers/)\n",
    "was for today's Spelling Bee, whose letters were \"jainrot\":\n",
    "\n",
    "> A perfect pangram in the NYT Spelling Bee game is a word that uses every letter of the puzzle's given letters exactly once, including the center letter that forms the basis of the puzzle.\n",
    "\n",
    "Solving it in two or three lines of Python\n",
    "----"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9e3e012f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import collections"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7968d610",
   "metadata": {},
   "source": [
    "The `collections` module provides a `Counter` class that counts\n",
    "the number of occurrences of each distinct element of a sequence.\n",
    "A word is a sequence of letters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9bf006c8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'o': 2, 'f': 1})"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "collections.Counter('foo')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74a81471",
   "metadata": {},
   "source": [
    "We can compare two `Counter`s for equality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "39549042",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "collections.Counter('foo') == collections.Counter('ofo')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d597b09",
   "metadata": {},
   "source": [
    "Unix has a file `/usr/share/dict/words`\n",
    "which lists about a hundred thousand English words,\n",
    "so we can simply examine each one to see if it matches:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "4eed4e2e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 814 ms, sys: 0 ns, total: 814 ms\n",
      "Wall time: 814 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['janitor\\n']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "[line for line in open('/usr/share/dict/words')\n",
    " if collections.Counter(line.strip()) == collections.Counter('jainrot')]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9bc4f64",
   "metadata": {},
   "source": [
    "Solving it six times faster in Python\n",
    "----\n",
    "\n",
    "That is kind of slow, although it's less than a second on one CPU core,\n",
    "so it takes less time than writing smarter code,\n",
    "at least by hand.\n",
    "Still, one core of this laptop runs about\n",
    "5 billion instructions per second,\n",
    "and 4 billion machine instructions seems like an excessive amount\n",
    "of computation for this problem.\n",
    "Isn't there a better way?\n",
    "\n",
    "We could sort the letters of a word into alphabetical order\n",
    "instead of building a table of their frequencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "76ee4662",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['a', 'i', 'j', 'n', 'o', 'r', 't'],\n",
       " ['a', 'i', 'j', 'n', 'o', 'r', 't'],\n",
       " ['a', 'i', 'j', 'n', 'o', 'o', 'r', 't']]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[sorted(word) for word in ['jainrot', 'janitor', 'toornaji']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7417336",
   "metadata": {},
   "source": [
    "And we could precompute our search key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "c9eb2de9",
   "metadata": {},
   "outputs": [],
   "source": [
    "key = sorted('jainrot')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fde5322",
   "metadata": {},
   "source": [
    "Then, maybe we can test all hundred thousand words a bit faster:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "6e13cbb2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 130 ms, sys: 0 ns, total: 130 ms\n",
      "Wall time: 128 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['janitor\\n']"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "[line for line in open('/usr/share/dict/words')\n",
    " if key == sorted(line.strip())]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b3a2888",
   "metadata": {},
   "source": [
    "That's noticeably faster: more than six times as fast.  We've saved\n",
    "over 80% of the work by building sorted lists of letters instead\n",
    "of `Counter`s.  We could see if we can win more by using just a string,\n",
    "so that comparison is faster:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b7297899",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 147 ms, sys: 3.32 ms, total: 150 ms\n",
      "Wall time: 145 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['janitor\\n']"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "key = ''.join(sorted('jainrot'))\n",
    "[line for line in open('/usr/share/dict/words')\n",
    " if key == ''.join(sorted(line.strip()))]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71e7e2fc",
   "metadata": {},
   "source": [
    "Nope, that makes it slower, though only a little.\n",
    "\n",
    "Building a database in a text file with 5 lines of Python to solve it $2000\\times$ faster\n",
    "----\n",
    "\n",
    "If we had this problem to solve several times, we could build a database\n",
    "and look up keys in the database; even a simple text file would work.\n",
    "Building the database takes extra time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "11d9697c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 373 ms, sys: 19 ms, total: 392 ms\n",
      "Wall time: 391 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "db = [(''.join(sorted(line.strip())), line.strip())\n",
    "     for line in open('/usr/share/dict/words')]\n",
    "\n",
    "db.sort()\n",
    "\n",
    "with open('sorted.pangrams', 'w') as fo:\n",
    "    fo.writelines(f'{key} {word}\\n' for key, word in db)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8b4040b",
   "metadata": {},
   "source": [
    "The database file looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "e9a808ad",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "''''cefloss fo'c's'le's\n",
      "'''bnoss bo's'n's\n",
      "'''ceflos fo'c's'le\n",
      "'''cefloss fo'c's'les\n",
      "'''cefloss fo'c'sle's\n",
      "''ALmorsu L'Amour's\n",
      "''Adeorszz d'Arezzo's\n",
      "''BOeinrs O'Brien's\n",
      "''Baaahhllsu Baha'ullah's\n",
      "''Baahis Baha'i's\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "head sorted.pangrams"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57a4f8ed",
   "metadata": {},
   "source": [
    "But then you can find things in it much more quickly, like\n",
    "2 to 3 milliseconds instead of 120 milliseconds:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "7d18e172",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aijnort janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.002s\n",
      "user\t0m0.002s\n",
      "sys\t0m0.000s\n",
      "\n",
      "real\t0m0.002s\n",
      "user\t0m0.002s\n",
      "sys\t0m0.000s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ggilnow glowing\n",
      "bghhioy highboy\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.003s\n",
      "user\t0m0.000s\n",
      "sys\t0m0.002s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "iloqrsu liquors\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.002s\n",
      "user\t0m0.002s\n",
      "sys\t0m0.000s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "time look aijnort sorted.pangrams\n",
    "time look ggilnow sorted.pangrams\n",
    "time look bghhioy sorted.pangrams\n",
    "time look iloqrsu sorted.pangrams"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6675ca6",
   "metadata": {},
   "source": [
    "The `look` command is capable of searching the hundred-thousand-word\n",
    "dictionary in 2 ms instead of 120 ms,\n",
    "because it's using binary search\n",
    "instead of sequentially examining every item,\n",
    "and because it's not written in slow-ass interpreted Python.\n",
    "Also, it's doing less work on each item,\n",
    "since it doesn't have to sort the letters.\n",
    "\n",
    "With ltrace we can see what words it's looking at:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "de466e18",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "setlocale(LC_ALL, \"\")                            = \"en_US.UTF-8\"\n",
      "bindtextdomain(\"util-linux\", \"/usr/share/locale\") = \"/usr/share/locale\"\n",
      "textdomain(\"util-linux\")                         = \"util-linux\"\n",
      "__cxa_atexit(0x5585d2799b40, 0, 0x5585d279c008, 0) = 0\n",
      "getenv(\"WORDLIST\")                               = nil\n",
      "getopt_long(3, 0x7ffee2fdfc88, \"adft:Vh\", 0x5585d279bbc0, nil) = -1\n",
      "open(\"sorted.pangrams\", 0, 00)                   = 3\n",
      "fstat(3, 0x7ffee2fdfaa0)                         = 0\n",
      "mmap(nil, 1970168, 0b1, 0x1, 3, 0)               = 0x7f689a10a000\n",
      "strlen(\"aijnort\")                                = 7\n",
      "malloc(8)                                        = 0x558600586410\n",
      "strncmp(\"abgikkm\", \"aijnort\", 7)                 = -7\n",
      "strncmp(\"bhort t\", \"aijnort\", 7)                 = 1\n",
      "strncmp(\"aeefrtu\", \"aijnort\", 7)                 = -4\n",
      "strncmp(\"aggiilm\", \"aijnort\", 7)                 = -2\n",
      "strncmp(\"bccehor\", \"aijnort\", 7)                 = 1\n",
      "strncmp(\"aiimnor\", \"aijnort\", 7)                 = -1\n",
      "strncmp(\"alprssw\", \"aijnort\", 7)                 = 3\n",
      "strncmp(\"ainnoor\", \"aijnort\", 7)                 = 4\n",
      "strncmp(\"ailmpsy\", \"aijnort\", 7)                 = 2\n",
      "strncmp(\"aiknoor\", \"aijnort\", 7)                 = 1\n",
      "strncmp(\"aiinprs\", \"aijnort\", 7)                 = -1\n",
      "strncmp(\"aijlmno\", \"aijnort\", 7)                 = -2\n",
      "strncmp(\"aijnort\", \"aijnort\", 7)                 = 0\n",
      "strncmp(\"aijls j\", \"aijnort\", 7)                 = -2\n",
      "strncmp(\"aijnopp\", \"aijnort\", 7)                 = -2\n",
      "strncmp(\"aijnopp\", \"aijnort\", 7)                 = -2\n",
      "strncmp(\"aijnors\", \"aijnort\", 7)                 = -1\n",
      "strncmp(\"aijnort\", \"aijnort\", 7)                 = 0\n",
      "strncmp(\"aijnort\", \"aijnort\", 7)                 = 0\n",
      "strncmp(\"aijnort\", \"aijnort\", 7)                 = 0\n",
      "putc('a', 0x7f689a4d45c0)                        = 97\n",
      "putc('i', 0x7f689a4d45c0)                        = 105\n",
      "putc('j', 0x7f689a4d45c0)                        = 106\n",
      "putc('n', 0x7f689a4d45c0)                        = 110\n",
      "putc('o', 0x7f689a4d45c0)                        = 111\n",
      "putc('r', 0x7f689a4d45c0)                        = 114\n",
      "putc('t', 0x7f689a4d45c0)                        = 116\n",
      "putc(' ', 0x7f689a4d45c0)                        = 32\n",
      "putc('j', 0x7f689a4d45c0)                        = 106\n",
      "putc('a', 0x7f689a4d45c0)                        = 97\n",
      "putc('n', 0x7f689a4d45c0)                        = 110\n",
      "putc('i', 0x7f689a4d45c0)                        = 105\n",
      "putc('t', 0x7f689a4d45c0)                        = 116\n",
      "putc('o', 0x7f689a4d45c0)                        = 111\n",
      "putc('r', 0x7f689a4d45c0)                        = 114\n",
      "putc('\\n', 0x7f689a4d45c0)                       = 10\n",
      "strncmp(\"aiklmmn\", \"aijnort\", 7)                 = 1\n",
      "free(0x558600586410)                             = <void>\n",
      "__errno_location()                               = 0\n",
      "ferror(0x7f689a4d45c0)                           = 0\n",
      "fflush(0x7f689a4d45c0)                           = 0\n",
      "fileno(0x7f689a4d45c0)                           = 1\n",
      "dup(1, 0, -1, 0)                                 = 4\n",
      "close(4)                                         = 0\n",
      "ferror(0x7f689a4d44e0)                           = 0\n",
      "fflush(0x7f689a4d44e0)                           = 0\n",
      "fileno(0x7f689a4d44e0)                           = 2\n",
      "dup(2, 0, 0, 0x7f689a4d45c0)                     = 4\n",
      "close(4)                                         = 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aijnort janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "+++ exited (status 0) +++\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "ltrace look aijnort sorted.pangrams"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39282289",
   "metadata": {},
   "source": [
    "We can see that it called `strncmp` 21 times\n",
    "with different keys from the database\n",
    "(binary search needs only about $\\log_2 100000 \\approx 17$ probes),\n",
    "so the binary search algorithm is cutting the work it does by about\n",
    "a factor of 5000, or maybe 1000 if we take into account\n",
    "the extra complexity of what it does on each key.\n",
    "Being in C rather than Python\n",
    "would normally be about a $40\\times$ speedup,\n",
    "but in this case probably most of the work was being done in things\n",
    "like `sorted()` and `==` anyway.\n",
    "\n",
    "Probably most of those 2&ndash;3 ms are starting up and shutting down\n",
    "the `look` executable, anyway.  How many CPU instructions does our query\n",
    "take inside `look`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "7c98aaa5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==98970== Callgrind, a call-graph generating cache profiler\n",
      "==98970== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.\n",
      "==98970== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info\n",
      "==98970== Command: look aijnort sorted.pangrams\n",
      "==98970== \n",
      "==98970== For interactive control, run 'callgrind_control -h'.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aijnort janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==98970== \n",
      "==98970== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw\n",
      "==98970== Collected : 211546 49675 19804 1474 1529 704 1448 1085 655\n",
      "==98970== \n",
      "==98970== I   refs:      211,546\n",
      "==98970== I1  misses:      1,474\n",
      "==98970== LLi misses:      1,448\n",
      "==98970== I1  miss rate:    0.70%\n",
      "==98970== LLi miss rate:    0.68%\n",
      "==98970== \n",
      "==98970== D   refs:       69,479  (49,675 rd + 19,804 wr)\n",
      "==98970== D1  misses:      2,233  ( 1,529 rd +    704 wr)\n",
      "==98970== LLd misses:      1,740  ( 1,085 rd +    655 wr)\n",
      "==98970== D1  miss rate:     3.2% (   3.1%   +    3.6%  )\n",
      "==98970== LLd miss rate:     2.5% (   2.2%   +    3.3%  )\n",
      "==98970== \n",
      "==98970== LL refs:         3,707  ( 3,003 rd +    704 wr)\n",
      "==98970== LL misses:       3,188  ( 2,533 rd +    655 wr)\n",
      "==98970== LL miss rate:      1.1% (   1.0%   +    3.3%  )\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "valgrind --tool=callgrind --cache-sim=yes look aijnort sorted.pangrams"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69c8cf57",
   "metadata": {},
   "source": [
    "It only ran 0.2 million CPU instructions,\n",
    "which, at 5 million instructions per millisecond,\n",
    "would normally be about 0.04 milliseconds.  So the 2-millisecond time\n",
    "is almost entirely process startup and shutdown overhead.  We've made it 2000 times faster."
   ]
  },
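  {
   "cell_type": "markdown",
   "id": "0b15ec77",
   "metadata": {},
   "source": [
    "The binary search that `look` performs is easy to sketch in pure Python\n",
    "with the standard `bisect` module.  This is an illustrative in-memory\n",
    "version over a toy word list, not the program `look` actually runs:\n",
    "\n",
    "```python\n",
    "import bisect\n",
    "\n",
    "# Build the same sorted (key, word) database, then probe it with\n",
    "# O(log n) comparisons instead of scanning all n entries.\n",
    "words = ['janitor', 'glowing', 'liquors']\n",
    "db = sorted((''.join(sorted(w)), w) for w in words)\n",
    "keys = [k for k, w in db]\n",
    "lo = bisect.bisect_left(keys, 'aijnort')\n",
    "hi = bisect.bisect_right(keys, 'aijnort')\n",
    "matches = [w for k, w in db[lo:hi]]  # ['janitor']\n",
    "```\n",
    "\n",
    "Scaled up to the hundred-thousand-word dictionary, this needs only about\n",
    "17 comparisons, which squares with the 21 `strncmp` calls counted above."
   ]
  },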
  {
   "cell_type": "markdown",
   "id": "ecd58603",
   "metadata": {},
   "source": [
    "Solving it by brute force again, but this time in C so it's $40\\times$ faster\n",
    "-----\n",
    "\n",
    "What if we just do the Python logic in C?  I prompted GPT-5 Mini as follows:\n",
    "\n",
    "> Hi! Please write a C program to print out all the words from a dictionary file whose (single-byte, ASCII) letters are bag-equivalent to its first command-line argument, that is, where each letter occurs the same number of times in the word as it does in the command-line argument. Write an insertion-sort subroutine to sort the letters in a word into a canonical order, so that bag-equivalence reduces to simple memcmp(). Each word in the input file, whose name is provided as the second command-line argument, is followed by a `\\n`, including the last word. Exit with an error if an input word is over 64 bytes long rather than using any dynamic allocation. Don't make any checks for \"validity\" of the bytes in the argument or the input lines; simply assume they're all letters. Make it fast. Use stdio for I/O. Any clarifying questions?\n",
    "\n",
    "And it responded with this C program:\n",
    "\n",
    "```c\n",
    "/* bagmatch.c\n",
    " *\n",
    " * Print all words from a dictionary file whose letters (single-byte ASCII)\n",
    " * form the same multiset (\"bag\") as the first command-line argument.\n",
    " *\n",
    " * Usage: ./bagmatch KEYWORD DICTIONARY_FILE\n",
    " *\n",
    " * Constraints & behavior per request:\n",
    " * - Words and KEYWORD are assumed to contain only letters; no validity checks.\n",
    " * - Maximum input word length is 64 bytes; program exits with error if exceeded.\n",
    " * - Sort letters in a word using an insertion-sort subroutine into canonical order.\n",
    " * - Compare canonical forms via memcmp.\n",
    " * - Use stdio for I/O; be fast and avoid dynamic allocation.\n",
    " */\n",
    "\n",
    "#include <stdio.h>\n",
    "#include <stdlib.h>\n",
    "#include <string.h>\n",
    "\n",
    "#define MAXWORD 64\n",
    "\n",
    "/* insertion_sort_chars:\n",
    " * Sorts n bytes in-place (array of unsigned char) using insertion sort.\n",
    " * This is optimized for small fixed-size keys (<=64).\n",
    " */\n",
    "static void insertion_sort_chars(unsigned char *a, size_t n) {\n",
    "    for (size_t i = 1; i < n; ++i) {\n",
    "        unsigned char key = a[i];\n",
    "        size_t j = i;\n",
    "        while (j > 0 && a[j - 1] > key) {\n",
    "            a[j] = a[j - 1];\n",
    "            --j;\n",
    "        }\n",
    "        a[j] = key;\n",
    "    }\n",
    "}\n",
    "\n",
    "int main(int argc, char *argv[]) {\n",
    "    if (argc != 3) {\n",
    "        fputs(\"Usage: bagmatch KEYWORD DICTIONARY_FILE\\n\", stderr);\n",
    "        return 2;\n",
    "    }\n",
    "\n",
    "    unsigned char keybuf[MAXWORD];\n",
    "    size_t keylen = strlen(argv[1]);\n",
    "    if (keylen > MAXWORD) {\n",
    "        fputs(\"Error: keyword too long (max 64 bytes)\\n\", stderr);\n",
    "        return 2;\n",
    "    }\n",
    "    /* copy then sort into canonical form */\n",
    "    memcpy(keybuf, argv[1], keylen);\n",
    "    insertion_sort_chars(keybuf, keylen);\n",
    "\n",
    "    FILE *f = fopen(argv[2], \"rb\");\n",
    "    if (!f) {\n",
    "        perror(\"fopen\");\n",
    "        return 2;\n",
    "    }\n",
    "\n",
    "    /* Read file line-by-line, each word followed by '\\n' including last.\n",
    "       We'll read into a buffer and detect newline. Use fread for speed. */\n",
    "    unsigned char inbuf[MAXWORD + 2]; /* allow room to detect overflow and newline */\n",
    "    size_t bufpos = 0;\n",
    "    int c;\n",
    "\n",
    "    while ((c = fgetc(f)) != EOF) {\n",
    "        if (c == '\\n') {\n",
    "            /* word is in inbuf[0..bufpos-1] */\n",
    "            if (bufpos > MAXWORD) {\n",
    "                fputs(\"Error: input word too long (max 64 bytes)\\n\", stderr);\n",
    "                fclose(f);\n",
    "                return 2;\n",
    "            }\n",
    "            /* sort a copy so we don't disturb original for printing */\n",
    "            unsigned char work[MAXWORD];\n",
    "            memcpy(work, inbuf, bufpos);\n",
    "            insertion_sort_chars(work, bufpos);\n",
    "\n",
    "            if (bufpos == keylen && memcmp(work, keybuf, keylen) == 0) {\n",
    "                /* match: print original word (as bytes) followed by newline */\n",
    "                if (fwrite(inbuf, 1, bufpos, stdout) != bufpos || putchar('\\n') == EOF) {\n",
    "                    perror(\"write\");\n",
    "                    fclose(f);\n",
    "                    return 2;\n",
    "                }\n",
    "            }\n",
    "            bufpos = 0;\n",
    "        } else {\n",
    "            /* append byte */\n",
    "            if (bufpos < sizeof(inbuf)) {\n",
    "                inbuf[bufpos++] = (unsigned char)c;\n",
    "            } else {\n",
    "                /* we purposely allow detecting overflow; keep reading until newline\n",
    "                   but will error when newline encountered (or we can exit now) */\n",
    "                /* consume until newline to keep file position consistent, but\n",
    "                   per spec we should exit on word > 64 bytes rather than trying\n",
    "                   to handle dynamically. So we can report error now. */\n",
    "                fputs(\"Error: input word too long (max 64 bytes)\\n\", stderr);\n",
    "                fclose(f);\n",
    "                return 2;\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "\n",
    "    /* If file doesn't end with newline, per spec that shouldn't happen.\n",
    "       But we won't process a final partial line (spec says each word is followed by '\\n'). */\n",
    "    fclose(f);\n",
    "    return 0;\n",
    "}\n",
    "```\n",
    "\n",
    "This seems to work:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "38daa88b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.020s\n",
      "user\t0m0.020s\n",
      "sys\t0m0.000s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cc -Wall -O5 bagmatch.c -o bagmatch\n",
    "time ./bagmatch jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "730e4d92",
   "metadata": {},
   "source": [
    "This is 40 times faster than the original Python,\n",
    "but, as you can see, it's much slower than the database lookup, because it's visiting\n",
    "all 100k words.\n",
    "\n",
    "Doubling the speed of the AI's C\n",
    "----\n",
    "\n",
    "Also, it's sorting the words of the wrong length, which is most of them.\n",
    "Hoisting the length check outward like this\n",
    "\n",
    "```c\n",
    "            /* Test words of the right length further. */\n",
    "            if (bufpos == keylen) {\n",
    "                /* sort a copy so we don't disturb original for printing */\n",
    "                unsigned char work[MAXWORD];\n",
    "                memcpy(work, inbuf, bufpos);\n",
    "                insertion_sort_chars(work, bufpos);\n",
    "\n",
    "                if (memcmp(work, keybuf, keylen) == 0) {\n",
    "                    /* match: print original word (as bytes) followed by newline */\n",
    "                    if (fwrite(inbuf, 1, bufpos, stdout) != bufpos || putchar('\\n') == EOF) {\n",
    "                        perror(\"write\");\n",
    "                        fclose(f);\n",
    "                        return 2;\n",
    "                    }\n",
    "                }\n",
    "            }\n",
    "\n",
    "            bufpos = 0;\n",
    "```\n",
    "\n",
    "should speed it up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "b9454b8f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.011s\n",
      "user\t0m0.011s\n",
      "sys\t0m0.000s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cc -Wall -O5 bagmatch2.c -o bagmatch2\n",
    "time ./bagmatch2 jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62cb07f9",
   "metadata": {},
   "source": [
    "That's almost another factor of 2: from 40 times faster to 70 times faster than the original."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "9618dee3",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==101121== Callgrind, a call-graph generating cache profiler\n",
      "==101121== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.\n",
      "==101121== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info\n",
      "==101121== Command: ./bagmatch2 jainrot /usr/share/dict/words\n",
      "==101121== \n",
      "==101121== For interactive control, run 'callgrind_control -h'.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==101121== \n",
      "==101121== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw\n",
      "==101121== Collected : 30117465 8437141 5203742 1365 1221 618 1337 911 576\n",
      "==101121== \n",
      "==101121== I   refs:      30,117,465\n",
      "==101121== I1  misses:         1,365\n",
      "==101121== LLi misses:         1,337\n",
      "==101121== I1  miss rate:       0.00%\n",
      "==101121== LLi miss rate:       0.00%\n",
      "==101121== \n",
      "==101121== D   refs:      13,640,883  (8,437,141 rd + 5,203,742 wr)\n",
      "==101121== D1  misses:         1,839  (    1,221 rd +       618 wr)\n",
      "==101121== LLd misses:         1,487  (      911 rd +       576 wr)\n",
      "==101121== D1  miss rate:        0.0% (      0.0%   +       0.0%  )\n",
      "==101121== LLd miss rate:        0.0% (      0.0%   +       0.0%  )\n",
      "==101121== \n",
      "==101121== LL refs:            3,204  (    2,586 rd +       618 wr)\n",
      "==101121== LL misses:          2,824  (    2,248 rd +       576 wr)\n",
      "==101121== LL miss rate:         0.0% (      0.0%   +       0.0%  )\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "valgrind --tool=callgrind --cache-sim=yes ./bagmatch2 jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10bb48be",
   "metadata": {},
   "source": [
    "30 million instructions in 11 milliseconds is about 2.7 million instructions per millisecond,\n",
    "less than my guesstimate above of 5, but in the ballpark.  `ltrace` reveals that `fgetc`\n",
    "is being invoked repeatedly through, presumably, the library call interface (since\n",
    "`ltrace` wouldn't be able to interpose an inlined call):\n",
    "\n",
    "```\n",
    "fgetc(0x5576770222a0)                            = 'a'\n",
    "fgetc(0x5576770222a0)                            = 'm'\n",
    "fgetc(0x5576770222a0)                            = 'm'\n",
    "fgetc(0x5576770222a0)                            = 'o'\n",
    "fgetc(0x5576770222a0)                            = '\\n'\n",
    "fgetc(0x5576770222a0)                            = 'a'\n",
    "fgetc(0x5576770222a0)                            = 'm'\n",
    "fgetc(0x5576770222a0)                            = 'm'\n",
    "fgetc(0x5576770222a0)                            = 'o'\n",
    "fgetc(0x5576770222a0)                            = 'n'\n",
    "fgetc(0x5576770222a0)                            = 'i'\n",
    "fgetc(0x5576770222a0)                            = 'a'\n",
    "fgetc(0x5576770222a0)                            = '\\n'\n",
    "memcmp(\"aaimmno\", \"aijnort\", 7)                  = -1\n",
    "```\n",
    "\n",
    "That seems like a reasonable thing to target for speedup;\n",
    "the trivial fix is to use the macro `getc`:\n",
    "\n",
    "```c\n",
    "    while ((c = getc(f)) != EOF) {\n",
    "```\n",
    "\n",
    "but that doesn't help on this platform, because `getc` is just a function.\n",
    "Instead I'll try `fgets`:\n",
    "\n",
    "```c\n",
    "    for (;;) {\n",
    "        if (!fgets(inbuf, MAXWORD + 2, f)) {\n",
    "            if (feof(f)) break;\n",
    "            perror(\"fgets\");\n",
    "            fclose(f);\n",
    "            return 2;\n",
    "        }\n",
    "\n",
    "        /* strchrnul */\n",
    "        size_t bufpos = 0;\n",
    "        while (inbuf[bufpos] != '\\n' && inbuf[bufpos]) bufpos++;\n",
    "```\n",
    "\n",
    "This allows some simplification of the rest of the code,\n",
    "and it does seem to provide a minor speedup:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "aca2a5a4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.010s\n",
      "user\t0m0.006s\n",
      "sys\t0m0.003s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cc -Wall -O5 bagmatch3.c -o bagmatch3\n",
    "time ./bagmatch3 jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b09532f",
   "metadata": {},
   "source": [
    "That's taking maybe 6 to 10 ms instead of 11."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "f1ec2ff8",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==103211== Callgrind, a call-graph generating cache profiler\n",
      "==103211== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.\n",
      "==103211== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info\n",
      "==103211== Command: ./bagmatch3 jainrot /usr/share/dict/words\n",
      "==103211== \n",
      "==103211== For interactive control, run 'callgrind_control -h'.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==103211== \n",
      "==103211== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw\n",
      "==103211== Collected : 27665695 5418917 3305724 1380 1232 618 1356 914 576\n",
      "==103211== \n",
      "==103211== I   refs:      27,665,695\n",
      "==103211== I1  misses:         1,380\n",
      "==103211== LLi misses:         1,356\n",
      "==103211== I1  miss rate:       0.00%\n",
      "==103211== LLi miss rate:       0.00%\n",
      "==103211== \n",
      "==103211== D   refs:       8,724,641  (5,418,917 rd + 3,305,724 wr)\n",
      "==103211== D1  misses:         1,850  (    1,232 rd +       618 wr)\n",
      "==103211== LLd misses:         1,490  (      914 rd +       576 wr)\n",
      "==103211== D1  miss rate:        0.0% (      0.0%   +       0.0%  )\n",
      "==103211== LLd miss rate:        0.0% (      0.0%   +       0.0%  )\n",
      "==103211== \n",
      "==103211== LL refs:            3,230  (    2,612 rd +       618 wr)\n",
      "==103211== LL misses:          2,846  (    2,270 rd +       576 wr)\n",
      "==103211== LL miss rate:         0.0% (      0.0%   +       0.0%  )\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes \\\n",
    "--cache-sim=yes ./bagmatch3 jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63031db5",
   "metadata": {},
   "source": [
    "Yeah, so about 8% less instructions executed.  But still 27 million, over $100\\times$\n",
    "what `look` was requiring.\n",
    "\n",
    "Resorting to AI a second time to double the C's speed again\n",
    "----\n",
    "\n",
    "Looking at the output in kcachegrind, it seems to be spending 62% of its time in `fgets`.\n",
    "So to get much faster we'd need to avoid using stdio, but plausibly we *could* get faster\n",
    "by not using stdio.  Maybe `mmap` would be a help.  So I asked GPT-5 Mini, in a new chat:\n",
    "\n",
    "> Hi! Please write a C program for Linux that uses `mmap` to map an input file (given as the first command-line argument) into virtual memory, then iterates over the newline-terminated lines of that (ASCII) file, printing the number of bytes in each line.\n",
    "\n",
    "Then I griped about its response:\n",
    "\n",
    "> Maybe use `memchr` instead of an explicit loop to find the newlines so that can be vectorized for performance.\n",
    "\n",
    "And it produced:\n",
    "\n",
    "```c\n",
    "// mmap_line_lengths_memchr.c\n",
    "// Usage: ./mmap_line_lengths_memchr inputfile\n",
    "// Prints the number of bytes in each newline-terminated ASCII line.\n",
    "// Only lines that are terminated by '\\n' are reported.\n",
    "\n",
    "#define _POSIX_C_SOURCE 200809L\n",
    "#include <stdio.h>\n",
    "#include <stdlib.h>\n",
    "#include <sys/mman.h>\n",
    "#include <sys/stat.h>\n",
    "#include <fcntl.h>\n",
    "#include <unistd.h>\n",
    "#include <stdint.h>\n",
    "#include <errno.h>\n",
    "#include <string.h>\n",
    "#include <inttypes.h>\n",
    "\n",
    "int main(int argc, char *argv[]) {\n",
    "    if (argc < 2) {\n",
    "        fprintf(stderr, \"Usage: %s inputfile\\n\", argv[0]);\n",
    "        return 2;\n",
    "    }\n",
    "\n",
    "    const char *path = argv[1];\n",
    "    int fd = open(path, O_RDONLY);\n",
    "    if (fd == -1) {\n",
    "        perror(\"open\");\n",
    "        return 3;\n",
    "    }\n",
    "\n",
    "    struct stat st;\n",
    "    if (fstat(fd, &st) == -1) {\n",
    "        perror(\"fstat\");\n",
    "        close(fd);\n",
    "        return 4;\n",
    "    }\n",
    "\n",
    "    off_t filesize = st.st_size;\n",
    "    if (filesize == 0) {\n",
    "        close(fd);\n",
    "        return 0;\n",
    "    }\n",
    "\n",
    "    const unsigned char *map = mmap(NULL, (size_t)filesize, PROT_READ, MAP_PRIVATE, fd, 0);\n",
    "    if (map == MAP_FAILED) {\n",
    "        perror(\"mmap\");\n",
    "        close(fd);\n",
    "        return 5;\n",
    "    }\n",
    "\n",
    "    const unsigned char *ptr = map;\n",
    "    const unsigned char *end = map + filesize;\n",
    "\n",
    "    while (ptr < end) {\n",
    "        const void *found = memchr(ptr, '\\n', (size_t)(end - ptr));\n",
    "        if (!found) break;\n",
    "        const unsigned char *nl = found;\n",
    "        ptrdiff_t len = nl - ptr; // bytes before newline\n",
    "        printf(\"%\" PRIdPTR \"\\n\", len);\n",
    "        ptr = nl + 1;\n",
    "    }\n",
    "\n",
    "    if (munmap((void *)map, (size_t)filesize) == -1) {\n",
    "        perror(\"munmap\");\n",
    "    }\n",
    "    close(fd);\n",
    "    return 0;\n",
    "}\n",
    "```\n",
    "\n",
    "Muddling that together with the above code and doing a little cleanup, I got this:\n",
    "\n",
    "```c\n",
    "/* bagmatch.c, mostly by GPT-5 Mini.\n",
    " *\n",
    " * Print all words from a dictionary file whose letters (single-byte ASCII)\n",
    " * form the same multiset (\"bag\") as the first command-line argument.\n",
    " *\n",
    " * Usage: ./bagmatch KEYWORD DICTIONARY_FILE\n",
    " *\n",
    " * Constraints & behavior per request:\n",
    " * - Words and KEYWORD are assumed to contain only letters; no validity checks.\n",
    " * - Maximum input word length is 64 bytes; program exits with error if exceeded.\n",
    " * - Sort letters in a word using an insertion-sort subroutine into canonical order.\n",
    " * - Compare canonical forms via memcmp.\n",
    " * - Use mmap for input; be fast and avoid dynamic allocation.\n",
    " */\n",
    "\n",
    "#define _POSIX_C_SOURCE 200809L\n",
    "#include <stdio.h>\n",
    "#include <stdlib.h>\n",
    "#include <sys/mman.h>\n",
    "#include <sys/stat.h>\n",
    "#include <fcntl.h>\n",
    "#include <unistd.h>\n",
    "#include <stdint.h>\n",
    "#include <errno.h>\n",
    "#include <string.h>\n",
    "#include <inttypes.h>\n",
    "#include <stddef.h>\n",
    "\n",
    "#define MAXWORD 64\n",
    "\n",
    "/* insertion_sort_chars:\n",
    " * Sorts n bytes in-place (array of unsigned char) using insertion sort.\n",
    " * This is optimized for small fixed-size keys (<=64).\n",
    " */\n",
    "static void insertion_sort_chars(unsigned char *a, size_t n) {\n",
    "  for (size_t i = 1; i < n; ++i) {\n",
    "    unsigned char key = a[i];\n",
    "    size_t j = i;\n",
    "    while (j > 0 && a[j - 1] > key) {\n",
    "      a[j] = a[j - 1];\n",
    "      --j;\n",
    "    }\n",
    "    a[j] = key;\n",
    "  }\n",
    "}\n",
    "\n",
    "int main(int argc, char *argv[]) {\n",
    "  int error_return = 0;\n",
    "  if (argc != 3) {\n",
    "    fputs(\"Usage: bagmatch KEYWORD DICTIONARY_FILE\\n\", stderr);\n",
    "    return 2;\n",
    "  }\n",
    "\n",
    "  unsigned char keybuf[MAXWORD];\n",
    "  size_t keylen = strlen(argv[1]);\n",
    "  if (keylen > MAXWORD) {\n",
    "    fputs(\"Error: keyword too long (max 64 bytes)\\n\", stderr);\n",
    "    return 2;\n",
    "  }\n",
    "  /* copy then sort into canonical form */\n",
    "  memcpy(keybuf, argv[1], keylen);\n",
    "  insertion_sort_chars(keybuf, keylen);\n",
    "\n",
    "  const char *path = argv[2];\n",
    "  int fd = open(path, O_RDONLY);\n",
    "  if (fd == -1) {\n",
    "    perror(path);\n",
    "    return 3;\n",
    "  }\n",
    "\n",
    "  struct stat st;\n",
    "  if (fstat(fd, &st) == -1) {\n",
    "    perror(\"fstat\");\n",
    "    close(fd);\n",
    "    return 4;\n",
    "  }\n",
    "\n",
    "  off_t filesize = st.st_size;\n",
    "  if (filesize == 0) {\n",
    "    close(fd);\n",
    "    return 0;\n",
    "  }\n",
    "\n",
    "  const unsigned char *map = mmap(NULL, (size_t)filesize, PROT_READ, MAP_PRIVATE, fd, 0);\n",
    "  if (map == MAP_FAILED) {\n",
    "    perror(\"mmap\");\n",
    "    close(fd);\n",
    "    return 5;\n",
    "  }\n",
    "\n",
    "  const unsigned char *nl;\n",
    "  const unsigned char *end = map + filesize;\n",
    "\n",
    "  for (const unsigned char *ptr = map; ptr < end; ptr = nl + 1) {\n",
    "    /* Read file line-by-line, each word followed by '\\n' including last. */\n",
    "    const void *found = memchr(ptr, '\\n', (size_t)(end - ptr));\n",
    "    if (!found) break;\n",
    "    nl = found;\n",
    "    ptrdiff_t len = nl - ptr; // bytes before newline\n",
    "\n",
    "    /* word is in ptr[0..len-1] */\n",
    "\n",
    "    /* Skip words of the wrong length. */\n",
    "    if (len != keylen) continue;\n",
    "\n",
    "    /* sort a copy so we don't disturb original */\n",
    "    unsigned char work[MAXWORD];\n",
    "    memcpy(work, ptr, len);\n",
    "    insertion_sort_chars(work, len);\n",
    "\n",
    "    if (memcmp(work, keybuf, keylen) == 0) {\n",
    "      /* match: print original word (as bytes) followed by newline */\n",
    "      if (fwrite(ptr, 1, len, stdout) != len || putchar('\\n') == EOF) {\n",
    "        perror(\"write\");\n",
    "        error_return = 2;\n",
    "        break;\n",
    "      }\n",
    "    }\n",
    "  }\n",
    "\n",
    "  if (munmap((void *)map, (size_t)filesize) == -1) {\n",
    "    perror(\"munmap\");\n",
    "    error_return = 8;\n",
    "  }\n",
    "\n",
    "  close(fd);\n",
    "  return error_return;\n",
    "}\n",
    "```"
   ]
  },
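  {
   "cell_type": "markdown",
   "id": "b46c01aa",
   "metadata": {},
   "source": [
    "The core trick above, as in every version of this program, is that sorting a word's letters\n",
    "into canonical order reduces bag-equivalence to plain string comparison, so the inner loop\n",
    "is just an insertion sort and a `memcmp`.  As a sketch of the invariant (illustration only,\n",
    "not part of the benchmark), the same idea in Python:\n",
    "\n",
    "```python\n",
    "def canon(word):\n",
    "    # Sort the letters into canonical order; two words are\n",
    "    # bag-equivalent iff their canonical forms are equal.\n",
    "    return ''.join(sorted(word))\n",
    "\n",
    "assert canon('janitor') == canon('jainrot')\n",
    "assert canon('janitor') != canon('jam')\n",
    "```"
   ]
  },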
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "cafc6bc0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.005s\n",
      "user\t0m0.000s\n",
      "sys\t0m0.005s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cc -Wall -O5 -g bagmatch4.c -o bagmatch4\n",
    "time ./bagmatch4 jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50d5cd28",
   "metadata": {},
   "source": [
    "This seems to be a real improvement, four times as fast as the first C version and 160 times\n",
    "as fast as the Python.  What does cachegrind think?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "df3d75ea",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==108041== Callgrind, a call-graph generating cache profiler\n",
      "==108041== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.\n",
      "==108041== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info\n",
      "==108041== Command: ./bagmatch4 jainrot /usr/share/dict/words\n",
      "==108041== \n",
      "==108041== For interactive control, run 'callgrind_control -h'.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==108041== \n",
      "==108041== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw\n",
      "==108041== Collected : 7345817 870984 479574 1264 16696 656 1244 16259 571\n",
      "==108041== \n",
      "==108041== I   refs:      7,345,817\n",
      "==108041== I1  misses:        1,264\n",
      "==108041== LLi misses:        1,244\n",
      "==108041== I1  miss rate:      0.02%\n",
      "==108041== LLi miss rate:      0.02%\n",
      "==108041== \n",
      "==108041== D   refs:      1,350,558  (870,984 rd + 479,574 wr)\n",
      "==108041== D1  misses:       17,352  ( 16,696 rd +     656 wr)\n",
      "==108041== LLd misses:       16,830  ( 16,259 rd +     571 wr)\n",
      "==108041== D1  miss rate:       1.3% (    1.9%   +     0.1%  )\n",
      "==108041== LLd miss rate:       1.2% (    1.9%   +     0.1%  )\n",
      "==108041== \n",
      "==108041== LL refs:          18,616  ( 17,960 rd +     656 wr)\n",
      "==108041== LL misses:        18,074  ( 17,503 rd +     571 wr)\n",
      "==108041== LL miss rate:        0.2% (    0.2%   +     0.1%  )\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes \\\n",
    "--cache-sim=yes ./bagmatch4 jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9310b870",
   "metadata": {},
   "source": [
    "Now it's down to 7.4 million instructions: still $35\\times$ what `look` needs to consult its\n",
    "database, and about 7.5 instructions per byte of /usr/share/dict/words.  kcachegrind tells\n",
    "me that these are about half and half main(), called once, and `__memchr_sse2`,\n",
    "called 104334 times, once for each line in the input file.  The hot part of main()\n",
    "is the (inlined) insertion search function.\n",
    "\n",
    "If we believe the 2.7 billion instructions per second, this is about 3 milliseconds\n",
    "of actual execution, plus probably about 2 million of fork/exec/exit/wait overhead and\n",
    "dynamic loading.\n",
    "\n",
    "I suspect that we could get it to be faster still with better vectorization, but I'm not sure I want to.\n",
    "\n",
    "I thought I'd see if a static build does better:"
   ]
  },
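  {
   "cell_type": "markdown",
   "id": "b46c02bb",
   "metadata": {},
   "source": [
    "(That execution-time estimate is just the callgrind instruction count divided by the\n",
    "assumed instruction rate; spelled out:)\n",
    "\n",
    "```python\n",
    "instructions = 7_345_817   # Ir count from the callgrind run above\n",
    "rate = 2.7e9               # assumed instructions per second\n",
    "ms = instructions / rate * 1000\n",
    "assert 2.7 < ms < 2.8      # i.e., about 3 milliseconds\n",
    "```"
   ]
  },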
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "0f81cd70",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.004s\n",
      "user\t0m0.004s\n",
      "sys\t0m0.000s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cc -Wall -O5 -g -static bagmatch4.c -o bagmatch4s\n",
    "time ./bagmatch4s jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a12b1aad",
   "metadata": {},
   "source": [
    "Maybe?  Seems within a margin of error?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "95283725",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==108053== Callgrind, a call-graph generating cache profiler\n",
      "==108053== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.\n",
      "==108053== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info\n",
      "==108053== Command: ./bagmatch4s jainrot /usr/share/dict/words\n",
      "==108053== \n",
      "==108053== For interactive control, run 'callgrind_control -h'.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==108053== \n",
      "==108053== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw\n",
      "==108053== Collected : 7370128 862373 476384 584 16357 191 584 16282 175\n",
      "==108053== \n",
      "==108053== I   refs:      7,370,128\n",
      "==108053== I1  misses:          584\n",
      "==108053== LLi misses:          584\n",
      "==108053== I1  miss rate:      0.01%\n",
      "==108053== LLi miss rate:      0.01%\n",
      "==108053== \n",
      "==108053== D   refs:      1,338,757  (862,373 rd + 476,384 wr)\n",
      "==108053== D1  misses:       16,548  ( 16,357 rd +     191 wr)\n",
      "==108053== LLd misses:       16,457  ( 16,282 rd +     175 wr)\n",
      "==108053== D1  miss rate:       1.2% (    1.9%   +     0.0%  )\n",
      "==108053== LLd miss rate:       1.2% (    1.9%   +     0.0%  )\n",
      "==108053== \n",
      "==108053== LL refs:          17,132  ( 16,941 rd +     191 wr)\n",
      "==108053== LL misses:        17,041  ( 16,866 rd +     175 wr)\n",
      "==108053== LL miss rate:        0.2% (    0.2%   +     0.0%  )\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes \\\n",
    "--cache-sim=yes ./bagmatch4s jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5dc88e54",
   "metadata": {},
   "source": [
    "Yeah, not clearly faster, maybe slower."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "920c144a",
   "metadata": {},
   "source": [
    "Solving it in LuaJIT 16 times faster than Python (but less than half as fast as C)\n",
    "----\n",
    "\n",
    "I prompted GPT-5 Mini:\n",
    "\n",
    "> Hi! Please write a LuaJIT program to print out all the words from a dictionary file whose (single-byte, ASCII) letters are bag-equivalent to its first command-line argument, that is, where each letter occurs the same number of times in the word as it does in the command-line argument. Write an insertion-sort subroutine to sort the letters in a word into a canonical order, so that bag-equivalence reduces to simple string comparison. Each word in the input file, whose name is provided as the second command-line argument, is followed by a `\\n`, including the last word. Don't make any checks for \"validity\" of the bytes in the argument or the input lines; simply assume they're all letters. Make it fast. In LuaJIT that usually means explicit loops over tables and strings. Any clarifying questions?\n",
    "\n",
    "It output the following Lua program, which almost worked:\n",
    "\n",
    "```lua\n",
    "-- bagmatch.lua\n",
    "-- Usage: luajit bagmatch.lua <key> <dictfile>\n",
    "-- Prints dictionary words whose multiset of ASCII single-byte letters\n",
    "-- equals that of <key>.\n",
    "--\n",
    "-- Approach:\n",
    "-- 1. Sort letters of the key with an insertion sort into canonical order.\n",
    "-- 2. For each line (word) in the file (each terminated by '\\n'), strip the '\\n',\n",
    "--    sort its letters with the same insertion sort, compare strings; on match print.\n",
    "-- Implementation favors explicit loops and string operations for LuaJIT speed.\n",
    "\n",
    "local arg = assert(arg, \"no args\") -- keep code concise; will error if missing\n",
    "\n",
    "local key = arg[1]\n",
    "local fname = arg[2]\n",
    "\n",
    "-- canonical sort: insertion sort on ASCII bytes of the given string\n",
    "-- returns a new string with characters sorted ascending by byte\n",
    "local function sort_string(s)\n",
    "  local n = #s\n",
    "  if n <= 1 then return s end\n",
    "  -- build array of bytes (1-based)\n",
    "  local bytes = {}\n",
    "  for i = 1, n do\n",
    "    bytes[i] = string.byte(s, i)\n",
    "  end\n",
    "  -- insertion sort over bytes\n",
    "  for i = 2, n do\n",
    "    local v = bytes[i]\n",
    "    local j = i - 1\n",
    "    while j >= 1 and bytes[j] > v do\n",
    "      bytes[j+1] = bytes[j]\n",
    "      j = j - 1\n",
    "    end\n",
    "    bytes[j+1] = v\n",
    "  end\n",
    "  -- convert back to string\n",
    "  return string.char(table.unpack(bytes, 1, n))\n",
    "end\n",
    "\n",
    "-- Precompute sorted key\n",
    "local sorted_key = sort_string(key)\n",
    "\n",
    "-- Open file for reading in binary mode to ensure exact single-byte handling\n",
    "local f, err = io.open(fname, \"rb\")\n",
    "if not f then\n",
    "  io.stderr:write(\"error opening file: \", err, \"\\n\")\n",
    "  os.exit(1)\n",
    "end\n",
    "\n",
    "-- Read file in reasonably large chunks and process line by line.\n",
    "-- Simpler approach: read full file into memory if it's expected to fit.\n",
    "-- For speed and simplicity here, read in blocks and handle line buffering.\n",
    "local chunk_size = 65536\n",
    "local buffer = \"\"\n",
    "local out = io.stdout\n",
    "\n",
    "while true do\n",
    "  local chunk = f:read(chunk_size)\n",
    "  if not chunk then\n",
    "    -- process remaining buffer only if it ends with '\\n' assumption? Problem states each word is followed by '\\n' including last, so buffer contains final line with '\\n'.\n",
    "    -- still process if buffer non-empty\n",
    "    if buffer ~= \"\" then\n",
    "      -- process any remaining lines (should be one, ending with '\\n')\n",
    "      -- split by '\\n'\n",
    "      local start = 1\n",
    "      while true do\n",
    "        local s, e = buffer:find(\"\\n\", start, true)\n",
    "        if not s then break end\n",
    "        local word = buffer:sub(start, s-1)\n",
    "        if #word == #key then\n",
    "          if sort_string(word) == sorted_key then\n",
    "            out:write(word, \"\\n\")\n",
    "          end\n",
    "        end\n",
    "        start = e + 1\n",
    "      end\n",
    "    end\n",
    "    break\n",
    "  end\n",
    "\n",
    "  buffer = buffer .. chunk\n",
    "\n",
    "  -- extract complete lines\n",
    "  local start = 1\n",
    "  while true do\n",
    "    local s, e = buffer:find(\"\\n\", start, true)\n",
    "    if not s then\n",
    "      -- keep remainder in buffer\n",
    "      buffer = buffer:sub(start)\n",
    "      break\n",
    "    end\n",
    "    local word = buffer:sub(start, s-1)\n",
    "    -- quick length check before sorting\n",
    "    if #word == #key then\n",
    "      if sort_string(word) == sorted_key then\n",
    "        out:write(word, \"\\n\")\n",
    "      end\n",
    "    end\n",
    "    start = e + 1\n",
    "    -- if we've consumed whole buffer, clear it\n",
    "    if start > #buffer then\n",
    "      buffer = \"\"\n",
    "      break\n",
    "    end\n",
    "  end\n",
    "end\n",
    "\n",
    "f:close()\n",
    "```\n",
    "\n",
    "The fatal flaw was that the line\n",
    "\n",
    "```lua\n",
    "  return string.char(table.unpack(bytes, 1, n))\n",
    "```\n",
    "\n",
    "needed to be\n",
    "\n",
    "```lua\n",
    "  return string.char(unpack(bytes))\n",
    "```\n",
    "\n",
    "And then this runs in about 50 milliseconds:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "50ee369c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.050s\n",
      "user\t0m0.045s\n",
      "sys\t0m0.005s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "time luajit ./bagmatch.lua jainrot /usr/share/dict/words"
   ]
  },
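  {
   "cell_type": "markdown",
   "id": "b46c03cc",
   "metadata": {},
   "source": [
    "The chunk-reading loop in that Lua program follows the usual pattern for block I/O:\n",
    "accumulate a buffer, peel off complete lines, and carry any partial line over to the next\n",
    "chunk.  A Python sketch of the same pattern (illustration only, not benchmarked here):\n",
    "\n",
    "```python\n",
    "import io\n",
    "\n",
    "def lines_from_chunks(f, chunk_size=65536):\n",
    "    # Yield newline-terminated lines (without the newline) from a\n",
    "    # binary file object, reading fixed-size chunks and carrying\n",
    "    # any partial line over into the next iteration.\n",
    "    buffer = b''\n",
    "    while True:\n",
    "        chunk = f.read(chunk_size)\n",
    "        if not chunk:\n",
    "            break  # every word ends with '\\n', so buffer is empty here\n",
    "        buffer += chunk\n",
    "        *complete, buffer = buffer.split(b'\\n')\n",
    "        for line in complete:\n",
    "            yield line\n",
    "\n",
    "words = list(lines_from_chunks(io.BytesIO(b'foo\\nbar\\nbaz\\n'), chunk_size=4))\n",
    "assert words == [b'foo', b'bar', b'baz']\n",
    "```"
   ]
  },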
  {
   "cell_type": "markdown",
   "id": "e11deb15",
   "metadata": {},
   "source": [
    "I suggested maybe using normal line-oriented I/O, which the AI suggested would be\n",
    "\n",
    "```lua\n",
    "for line in f:lines() do\n",
    "  -- 'line' excludes the trailing '\\n' (file guarantees every word ends with '\\n')\n",
    "  if #line == #key and sort_string(line) == sorted_key then\n",
    "    io.stdout:write(line, \"\\n\")\n",
    "  end\n",
    "end\n",
    "```\n",
    "\n",
    "which simplifies the script a lot but makes it, I think, slightly slower:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "28069f3d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.058s\n",
      "user\t0m0.058s\n",
      "sys\t0m0.000s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "time luajit ./bagmatch2.lua jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "afe99cda",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==110101== Callgrind, a call-graph generating cache profiler\n",
      "==110101== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.\n",
      "==110101== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info\n",
      "==110101== Command: luajit ./bagmatch2.lua jainrot /usr/share/dict/words\n",
      "==110101== \n",
      "==110101== For interactive control, run 'callgrind_control -h'.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "janitor\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==110101== \n",
      "==110101== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw\n",
      "==110101== Collected : 182950938 46153297 30147411 53581 926776 372336 4032 2057 5374\n",
      "==110101== \n",
      "==110101== I   refs:      182,950,938\n",
      "==110101== I1  misses:         53,581\n",
      "==110101== LLi misses:          4,032\n",
      "==110101== I1  miss rate:        0.03%\n",
      "==110101== LLi miss rate:        0.00%\n",
      "==110101== \n",
      "==110101== D   refs:       76,300,708  (46,153,297 rd + 30,147,411 wr)\n",
      "==110101== D1  misses:      1,299,112  (   926,776 rd +    372,336 wr)\n",
      "==110101== LLd misses:          7,431  (     2,057 rd +      5,374 wr)\n",
      "==110101== D1  miss rate:         1.7% (       2.0%   +        1.2%  )\n",
      "==110101== LLd miss rate:         0.0% (       0.0%   +        0.0%  )\n",
      "==110101== \n",
      "==110101== LL refs:         1,352,693  (   980,357 rd +    372,336 wr)\n",
      "==110101== LL misses:          11,463  (     6,089 rd +      5,374 wr)\n",
      "==110101== LL miss rate:          0.0% (       0.0%   +        0.0%  )\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes \\\n",
    "--cache-sim=yes luajit ./bagmatch2.lua jainrot /usr/share/dict/words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2716ca16",
   "metadata": {},
   "source": [
    "LuaJIT seems to be getting a little more ILP than GCC is here, like\n",
    "3.1 instructions per clock (IPC) rather than about 2.7.  But it's still slower.\n",
    "kcachegrind is not as useful in figuring out why; not only can't it show\n",
    "the source code for the JITted machine code,\n",
    "callgrind didn't even dump the actual instructions,\n",
    "instead leaving poor kcachegrind to try to fish them out of the executable with objdump,\n",
    "which is never going to work because they aren't there."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71519d90",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
