From kragen@dnaco.net Thu Sep 24 23:43:16 1998
Date: Thu, 24 Sep 1998 23:43:14 -0400 (EDT)
From: Kragen <kragen@dnaco.net>
X-Sender: kragen@pike
To: sittler@erim-int.com
Subject: standard deviation
Message-ID: <Pine.GSO.3.96.980924232011.16764L-100000@pike>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Keywords:
X-UID: 2145
Status: RO
X-Status: 

Standard deviation of x is sqrt((sum(i from 1 to n) of (x[i] - average x)^2)/n).

That sum is, again:
sum(i from 1 to n) of (x[i] - average x)^2.

If we multiply out the stuff in the parens, we get:

sum(i from 1 to n) of (x[i]^2 - 2 * x[i] * average x + (average x)^2).

Since the sum distributes over addition, we get

sum(i from 1 to n) of x[i]^2 + sum (i from 1 to n) of -2 * x[i] * average x
	+ sum (i from 1 to n) of (average x)^2.

We can take the factors that are constant with regard to i out of these sums; 
we get

sum(i from 1 to n) of x[i]^2 - 2 * average x * sum (i from 1 to n) of x[i]
	+ (average x)^2 sum (i from 1 to n) of 1.

Now, sum (i from 1 to n) 1 is just n.  And "average x" really means
"(sum (i from 1 to n) of x[i])/n)".  So we get:


sum(i from 1 to n) of x[i]^2 
	- 2 * ((sum(i from 1 to n) of x[i])/n) * sum (i from 1 to n) x[i]
	+ ((sum(i from 1 to n) of x[i])/n)^2 * n.

So we distribute the squaring in the last term over the division, and 
rearrange the various factors in the last two terms, and we get

sum(i from 1 to n) of x[i]^2 
	- 2 * (sum(i from 1 to n) of x[i])^2/n
	+ (sum(i from 1 to n) of x[i])^2/n.

Lucky us -- the last two terms are fixed multiples of the same function
of the data.  So we can add them, which reduces the formula to:

sum(i from 1 to n) of x[i]^2 - (sum(i from 1 to n) of x[i])^2/n.

So if we substitute this into the original formula, we get a formula
that expresses the standard deviation of a series in terms of three
functions of the data: the sum of the data, the sum of the squares of
the data, and the number of data.  This is a better way of calculating
the standard deviation than by making one pass over the data to find
the average, then another pass to find the sum of the squares of the
differences between the data and the average.

I think that it's possible to reduce other similar formulae, like the
formulae for skewness and higher moments, to terms like these,
requiring only a single pass over the data.  I'm not certain of it,
because I don't know the formulae for those higher moments.

Here's a Perl program that computes the standard deviation of a set of
data using the above formula:

#!/usr/bin/perl -wn
# Inputs are reasonably-formatted numbers, one per input line.
$sum += $_;
$sumsq += $_ ** 2;
$n ++;
END { print sqrt(($sumsq - $sum ** 2 / $n)/$n), "\n" }

By the way, I cheated a little on this.  I remembered that I used to
use a pocket calculator that could take the standard deviation of an
arbitrary-sized set of data, and it did this by remembering the sum of
the data, the sum of the squares of the data, and the number of data.
So I knew there had to be a way to do it; it only remained to find out
what it was, which just took a little algebra.

Kragen

-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The sages do not believe that making no mistakes is a blessing. They believe, 
rather, that the great virtue of man lies in his ability to correct his 
mistakes and continually make a new man of himself.  -- Wang Yang-Ming


