From kragen@dnaco.net Thu Sep 24 23:43:16 1998 Date: Thu, 24 Sep 1998 23:43:14 -0400 (EDT) From: Kragen X-Sender: kragen@pike To: sittler@erim-int.com Subject: standard deviation Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Keywords: X-UID: 2145 Status: RO X-Status: Standard deviation of x is sqrt((sum(i from 1 to n) of (x[i] - average x)^2)/n). That sum is, again: sum(i from 1 to n) of (x[i] - average x)^2. If we multiply out the stuff in the parens, we get: sum(i from 1 to n) of (x[i]^2 - 2 * x[i] * average x + (average x)^2). Since the sum distributes over addition, we get sum(i from 1 to n) of x[i]^2 + sum (i from 1 to n) of -2 * x[i] * average x + sum (i from 1 to n) of (average x)^2. We can take the factors that are constant with regard to i out of these sums; we get sum(i from 1 to n) of x[i]^2 - 2 * average x * sum (i from 1 to n) of x[i] + (average x)^2 sum (i from 1 to n) of 1. Now, sum (i from 1 to n) 1 is just n. And "average x" really means "(sum (i from 1 to n) of x[i])/n)". So we get: sum(i from 1 to n) of x[i]^2 - 2 * ((sum(i from 1 to n) of x[i])/n) * sum (i from 1 to n) x[i] + ((sum(i from 1 to n) of x[i])/n)^2 * n. So we distribute the squaring in the last term over the division, and rearrange the various factors in the last two terms, and we get sum(i from 1 to n) of x[i]^2 - 2 * (sum(i from 1 to n) of x[i])^2/n + (sum(i from 1 to n) of x[i])^2/n. Lucky us -- the last two terms are fixed multiples of the same function of the data. So we can add them, which reduces the formula to: sum(i from 1 to n) of x[i]^2 - (sum(i from 1 to n) of x[i])^2/n. So if we substitute this into the original formula, we get a formula that expresses the standard deviation of a series in terms of three functions of the data: the sum of the data, the sum of the squares of the data, and the number of data. This is a better way of calculating the standard deviation than by making one pass over the data to find the average, then another pass to find the sum of the squares of the differences between the data and the average. I think that it's possible to reduce other similar formulae, like the formulae for skewness and higher moments, to terms like these, requiring only a single pass over the data. I'm not certain of it, because I don't know the formulae for those higher moments. Here's a Perl program that computes the standard deviation of a set of data using the above formula: #!/usr/bin/perl -wn # Inputs are reasonably-formatted numbers, one per input line. $sum += $_; $sumsq += $_ ** 2; $n ++; END { print sqrt(($sumsq - $sum ** 2 / $n)/$n), "\n" } By the way, I cheated a little on this. I remembered that I used to use a pocket calculator that could take the standard deviation of an arbitrary-sized set of data, and it did this by remembering the sum of the data, the sum of the squares of the data, and the number of data. So I knew there had to be a way to do it; it only remained to find out what it was, which just took a little algebra. Kragen -- Kragen Sitaker The sages do not believe that making no mistakes is a blessing. They believe, rather, that the great virtue of man lies in his ability to correct his mistakes and continually make a new man of himself. -- Wang Yang-Ming