\ sample code from http://www.yosefk.com/blog/my-history-with-forth-stack-machines.html

: mean_std ( sum2 sum inv_len — mean std )
  \ precise_mean = sum * inv_len;
  tuck u* \ sum2 inv_len precise_mean
  \ mean = precise_mean >> FRAC;
  dup FRAC rshift -rot3 \ mean sum2 inv_len precise_mean
  \ var = (((unsigned long long)sum2 * inv_len) >> FRAC) - (precise_mean * precise_mean >> (FRAC*2));
  dup um* nip FRAC 2 * 32 - rshift -rot \ mean precise_mean^2 sum2 inv_len
  um* 32 FRAC - lshift swap FRAC rshift or \ mean precise_mean^2 sum*inv_len
  swap - isqrt \ mean std
;

( We have some fixed-point math here.  Let’s admit that, and define
the fixed-point math primitives. It looks like one of sum and inv_len
is a fixed-point fraction, while the other is an ordinary integer; I’m
guessing that sum2 is the sum of the squares, and inv_len is the
reciprocal of the length, which therefore must be a fraction, while
sum and sum2 are ordinary integers.  Adopting "." as an indicator
character for fixed-point arithmetic, even though that conflicts with
its usual Forth meaning of “output”: )
: s.* u* ;   : .>s FRAC rshift ;
: .* um* 32 FRAC - lshift swap FRAC rshift or ; 

\ Now this should be quite straightforward.
variable sumsq  variable sum  variable len_recip  variable .mean
: variance  sumsq @  len_recip @ s.*  .mean @ dup .*  - .>s ;
: mean_std  len_recip ! sum ! sumsq !
    sum @  len_recip @ s.*  .mean !
    .mean @  .>s  variance isqrt ;

( I believe that the only performance differences between this and the
original version should be:

1. It doesn't use an integer multiply to multiply FRAC by 2 at
run-time!

2. There are some functions that would benefit from being inlined.
This is somewhat clumsy but doable even if the compiler doesn't do
optimization; you just end up with definitions like

    : s.* postpone u* ; immediate

and the like.  It’s probably also worthwhile to change `32 FRAC -
lshift` to `[ 32 FRAC - ] literal lshift` for a similar reason.

3. It uses memory operations instead of stack operations. The original
uses eight stack operations; this version uses ten memory operations
[plus a stack operation].  Reducing that a bit by using the stack
and/or the return stack would be pretty straightforward.  Still, I
think that after sufficient optimization, you’ll always end up with
something as ugly and incomprehensible as the original version.

4. It might be wrong, since I haven’t tested it, and I assume Yossi’s
original code was copy-pasted from working code.

)