\documentclass[psfig,preprint]{aastex}
\makeatletter
\renewcommand\theequation{\thesection.\arabic{equation}}
\@addtoreset{equation}{section}
\renewcommand\thefigure{\thesection.\arabic{figure}}
\@addtoreset{figure}{section}
\renewcommand\thetable{\thesection.\arabic{table}}
\@addtoreset{table}{section}
\makeatother
\begin{document}
\setcounter{section}{-1}
\title{LEAST-SQUARES AND CHI-SQUARE FOR THE BUDDING AFICIONADO: \\
ART AND PRACTICE }
\author{\copyright Carl Heiles \today }
In our never-ending attempt to make your life easier, we present
you with this highly instructive, time-saving, and labor-saving
informative document! Here we give heuristic derivations, discussions,
examples, and the prescription for doing least-squares the easy way
using matrix techniques generally, and specifically in IDL. This
prescription is given as an example in \S\ref{numexample}, and the {\it
power-user} can skip the details and go directly there.
This document is an update, correction, clarification, and
elaboration of a previous one made exclusively for the undergraduate lab
class. Here we extend the discussion considerably to cover most of what
anyone will need in future professional life. This makes the document
longer, but the first parts (\S \ref{sectionone} to
\ref{chauvenetsection}) are still accessible at the introductory level
because they haven't changed much. We occasionally refer to the books
Bevington and Robinson (1992; BR), Cowan (1998), Press et al.\ (2001;
Numerical Recipes, NR) and Taylor (1997; T97), and we update the
notation to partially conform with NR. We owe sections
\ref{chauvenetsectiontwo} to \ref{bothsection} to the fantastically
excellent website of Stetson, {\it
http://nedwww.ipac.caltech.edu/level5/Stetson/Stetson4.html}.
We begin with least-squares in the classic sense, meaning we
minimize the sum of squares instead of minimizing $\chi^2$. In
astronomy, more often than not you don't have an independent assessment
of the intrinsic uncertainty in the data, which means you cannot
evaluate $\chi^2$, and the least squares approach is the only option.
However, often in astronomy you do want to weight observations
differently, e.g.\ because of integration time, and this requires an
approach similar to the $\chi^2$ one. In later sections we generalize to
the $\chi^2$ and this other weighted-observations case.
\tableofcontents
\section{LEAST-SQUARES FITTING FOR TWO PARAMETERS, AS WITH A STRAIGHT
LINE.} \label{sectionone}
\subsection{The closed-form expressions for a straight-line fit}
First consider the least-squares fit to a straight line. Let
$y_m$ be the $m^{th}$ measurement of the observed quantity (in this
example, $y_m$ is zenith distance); let $t_m$ be the time of the
$m^{th}$ measurement; and let $M$ be the total number of observations,
i.e.\ $m$ runs from 0 to $M-1$. Remember that in the least-squares
technique, quantities such as $t_m$ are regarded as known with high
accuracy while the quantity $y_m$ has uncertainties in its measurement.
We expect the zenith distance $y_m$ to change linearly with
time as follows:
\begin{equation} \label{one}
A + B t_m = y_m \; .
\end{equation}
\noindent Given this, one does the maximum likelihood (ML) estimate assuming
Gaussian statistics. When all measurements have the same intrinsic
uncertainty, this is the same as looking for the solution that minimizes
the sum of the squares of the residuals (which we will define later).
This leads to the pair of equations (Taylor 8.8, 8.9), called the {\it
normal equations}
\begin{mathletters} \label{normalone}
\begin{equation}
AN + B \ \sum t_m = \sum y_m
\end{equation}
\begin{equation}
A \ \sum t_m + B \ \sum t_m^2 = \sum t_m y_m \ .
\end{equation}
\end{mathletters}
\noindent Two equations and two unknowns---easy to solve! The
closed-form equations for $(A,B)$ are Taylor's equations 8.10 to 8.12.
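As a quick sanity check of those closed-form expressions, here is a minimal sketch in Python (the worked examples in this document use IDL); the data are made up so that the points lie exactly on $y = 3 + 2t$:

```python
# Closed-form straight-line fit y = A + B*t (Taylor eqs. 8.10-8.12),
# obtained by solving the pair of normal equations directly.
def straight_line_fit(t, y):
    M = len(t)
    St  = sum(t)                            # [t]
    Stt = sum(x * x for x in t)             # [t^2]
    Sy  = sum(y)                            # [y]
    Sty = sum(a * b for a, b in zip(t, y))  # [t y]
    Delta = M * Stt - St**2
    A = (Stt * Sy - St * Sty) / Delta       # intercept
    B = (M * Sty - St * Sy) / Delta         # slope
    return A, B

# Noise-free data lying exactly on y = 3 + 2 t:
A, B = straight_line_fit([0, 1, 2, 3], [3, 5, 7, 9])
print(A, B)   # -> 3.0 2.0
```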
\subsection{Better is the following generalized notation.}
We want a way to generalize this approach to include any
functional dependence on $t$ and even other variables, and to have an
arbitrarily large number of unknown coefficients instead of just the two
$(A,B)$. This is very easy using matrix math. We will ease into this
matrix technique gently, by first carrying through an intermediate
stage of notation.
First generalize the straight-line fit slightly by having two
functional dependences instead of one. We have something other than the
time $t_m$; call it $s_m$. For example, we could have $s_m = \cos (t_m)$
or $s_m = t_m^2$; or we could have $s_m = x_m$, where $x_m$ is the
position from which the observation was taken. To correspond to equation
\ref{one}, $s_m = 1$. Then we rewrite equation \ref{one} to include this
extra dependence
\begin{equation} \label{two}
A s_m + B t_m = y_m \; .
\end{equation}
\noindent There are still only two unknown parameters, so this is an
almost trivial generalization; later we'll generalize to more
parameters.
We have $M$ equations like equation \ref{two}, one for each
measurement. They are known as the {\it equations of condition} because
they are the equations that specify the theoretical model to which we
are fitting the data. There are $M$ equations of condition and only two
unknowns ($A$ and $B$). This is too many equations! We have to end up
with a system in which the number of equations is equal to the number of
unknowns.
To accomplish this, from equation \ref{two} we form the {\it
normal equations}. The number of normal equations is equal to the
number of unknowns, so in this case we will have two. We could carry
through the same ML derivation to derive equations equivalent to
equation \ref{normalone}; the result is
\begin{mathletters} \label{ones}
\begin{equation}
A\ \sum s_m^2 + B \ \sum s_m t_m = \sum s_m y_m
\end{equation}
\begin{equation}
A \ \sum s_m t_m + B \ \sum t_m^2 = \sum t_m y_m \ .
\end{equation}
\end{mathletters}
\noindent We can rewrite these equations using the notation $[st] =
\sum s_m t_m$, etc.:
\begin{mathletters} \label{normaltwo}
\begin{equation}
A [ s^2 ] + B [ s t ] = [ s y ]
\end{equation}
\begin{equation}
A [ s t ] + B [ t^2 ] = [ t y ] \ .
\end{equation}
\end{mathletters}
\noindent This is, of course, precisely analogous to equation
\ref{normalone}. And now it's clear how to generalize to more
parameters!
\section{LEAST-SQUARES FITTING FOR MANY PARAMETERS, AS WITH A CUBIC}
\label{sectiontwo}
With this notation it's easy to generalize to more ($N$)
unknowns: the method is obvious because in each equation of condition
(like equation \ref{two}) we simply add equivalent additional terms such
as $C u_m$, $D v_m$, etc; and in the normal equations (equation
\ref{normaltwo}) we have more products and also more normal equations.
Let's take an example with four unknowns ($N=4$), which we will
denote by $A, B, C, D$; this would be like fitting a cubic. With $N=4$
we need at least five datapoints ($M=5$), so there must be at least
five equations of condition. The generalization of equation \ref{two}
is the $M$ equations
\begin{equation} \label{eqcond}
A s_m + B t_m + C u_m + D v_m = y_m \; ,
\end{equation}
\noindent with $m = 0 \rightarrow (M-1)$. Again, the
least-squares-fitting process assumes that the $s_m, t_m, u_m, v_m$ are
known with zero uncertainty; all of the uncertainties are in the
measurements of $y_m$. We then form the four normal equations; the
generalization of equation \ref{normaltwo} written in matrix format is:
\begin{eqnarray} \label{smeqn}
\left[
\begin{array}{cccc}
{[ ss ]} & {[ st ]} & {[ su ]} & {[ sv ]} \\
{[ ts ]} & {[ tt ]} & {[ tu ]} & {[ tv ]} \\
{[ us ]} & {[ ut ]} & {[ uu ]} & {[ uv ]} \\
{[ vs ]} & {[ vt ]} & {[ vu ]} & {[ vv ]} \\
\end{array}
\; \right]
\left[
\begin{array}{c}
A \\
B \\
C \\
D \\
\end{array}
\; \right]
\; =
\left[
\begin{array}{c}
{[ s y ]} \\
{[ t y ]} \\
{[ u y ]} \\
{[ v y ]} \\
\end{array}
\; \right]
\end{eqnarray}
\noindent The $N \times N$ matrix on the left is symmetric. With $N$
equations and $N$ unknowns, you can actually {\it solve} for $A, B, C,
D$!
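To make the bracket-sum notation concrete, here is a minimal Python sketch with made-up data (this document's own examples use IDL) that assembles the $N \times N$ normal-equation matrix and its right-hand side by explicit loops and then solves for the coefficients:

```python
import numpy as np

# Made-up data: N = 4 basis functions (s, t, u, v) evaluated at M = 5 points.
rng = np.random.default_rng(0)
M, N = 5, 4
cols = rng.normal(size=(M, N))   # column j holds the j-th basis function
y = rng.normal(size=M)           # the measurements

# Assemble the normal equations from bracket sums, e.g. [st] = sum s_m t_m.
alpha = np.zeros((N, N))         # the N x N matrix of bracket sums
beta = np.zeros(N)               # the right-hand side [sy], [ty], ...
for i in range(N):
    beta[i] = sum(cols[m, i] * y[m] for m in range(M))
    for j in range(N):
        alpha[i, j] = sum(cols[m, i] * cols[m, j] for m in range(M))

coeffs = np.linalg.solve(alpha, beta)   # the N unknowns A, B, C, D
```

The explicit loops are exactly the "cumbersome" part that the matrix formulation of the next section eliminates.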
\section{FAR, FAR BEST AND EASIEST: MATRIX ALGEBRA} \label{matrixmethod}
The above equations are terribly cumbersome to solve in a
computer code because they require lots of loops. However, it becomes
trivial if we use matrices. Here we designate a {\bf matrix} by {\bf
boldface} type.
We illustrate the matrix method by carrying through the above
$N=4$ example, and we assume that there are 5 independent measurements
($M=5$). We first define the matrices
\begin{mathletters} \label{matrixdefinition}
\begin{eqnarray}
{\bf X} = \left[
\begin{array}{cccc}
{s_0} & {t_0} & {u_0} & {v_0} \\
{s_1} & {t_1} & {u_1} & {v_1} \\
{s_2} & {t_2} & {u_2} & {v_2} \\
{s_3} & {t_3} & {u_3} & {v_3} \\
{s_4} & {t_4} & {u_4} & {v_4} \\
\end{array}
\; \right]
\end{eqnarray}
%\end{mathletters}
%\begin{mathletters}
\begin{eqnarray}
{\bf a} = \left[
\begin{array}{c}
A \\
B \\
C \\
D \\
\end{array} \; \right]
\end{eqnarray}
%\end{mathletters}
%\begin{mathletters}
\begin{eqnarray}
{\bf Y} = \left[
\begin{array}{c}
{y_0} \\
{y_1} \\
{y_2} \\
{y_3} \\
{y_4} \\
\end{array}
\; \right]
\end{eqnarray}
\end{mathletters}
\noindent so, in matrix form, the equations of condition (equation
\ref{eqcond}) reduce to the single matrix equation
\begin{equation}
\label{equationofcondition}
{\bf X \cdot a} = {\bf Y} \; .
\end{equation}
\noindent The notation for these equations corresponds to NR's. Note
that the matrix elements here are calculated without dividing by
$\sigma_{meas}$, i.e.\ that we are doing least squares instead of
chi-square fitting. For chi-square fitting, see \S
\ref{chisqsection} and \ref{chicoeffs}.
Our matrix $\mathbf{X}$ corresponds to NR's ``design
matrix'' $\bf A$ of Figure 15.4.1, except that our elements are not
divided by $\sigma_{meas,m}$, and the matrix equation of condition
(equation \ref{equationofcondition}) is identical to the expression
inside the square brackets of NR's equation 15.4.6. The differences
arise because here we are discussing least-squares fitting instead of
chi-square fitting, i.e.\ we have omitted the factors involving
$\sigma_{meas,m}$, the intrinsic measurement uncertainties (\S
\ref{chisqsection}).
Again, there are more equations than unknowns so we can't solve
this matrix equation directly. So next we form the normal equations
from these matrices. In matrix form, the normal equations (equation
\ref{smeqn}) reduce to the single equation
\begin{equation} \label{matrixnormal}
\mathbf{[\alpha] \cdot a} = \mathbf{[\beta]} \; ,
\end{equation}
\noindent (NR equation 15.4.10), where
\begin{mathletters}
\label{ssdef}
\begin{equation}
[\alpha] = {\bf X^T \cdot X}
\end{equation}
\begin{equation}
[\beta] = {\bf X^T \cdot Y} \ .
\end{equation}
\end{mathletters}
\noindent The matrix $[\alpha]$ is known as the {\it curvature
matrix} because each element is half the second derivative (the
curvature) of $\sigma^2$ (or $\chi^2$) with respect to the
corresponding pair of coefficients.
The number of equations is equal to the number of unknowns, so
the solution of the matrix equation is easy---just rewrite it by
multiplying both sides by the inverse of $[\alpha]$ (that is, by
$[\alpha]^{-1})$, which gives
\begin{equation}
\label{adef}
{\bf a} = [\alpha]^{-1} \mathbf{ \cdot [\beta]} \; .
\end{equation}
\noindent All of these matrix operations are trivially easy in IDL (\S
\ref{numexample}).
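The whole prescription fits in a few lines of linear algebra. Here is a sketch in Python/numpy with made-up data (the steps are identical in IDL, as \S \ref{numexample} shows), cross-checked against numpy's built-in least-squares solver:

```python
import numpy as np

# Made-up equations of condition X . a = Y with M = 5, N = 4.
rng = np.random.default_rng(1)
M, N = 5, 4
X = rng.normal(size=(M, N))
Y = rng.normal(size=M)

alpha = X.T @ X                  # [alpha] = X^T . X
beta = X.T @ Y                   # [beta]  = X^T . Y
a = np.linalg.inv(alpha) @ beta  # a = [alpha]^-1 . [beta]

# Cross-check against numpy's built-in least-squares solver:
a_check = np.linalg.lstsq(X, Y, rcond=None)[0]
```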
\section{UNCERTAINTIES IN THE DERIVED COEFFICIENTS} \label{sigmas}
How about the uncertainties in the derived quantities contained
in the matrix ${\bf a}$?
The first thing to do is derive the {\it sample} variance $s^2$
(the square of standard deviation, or mean error, or dispersion, etc) of
the individual datapoints using the generalization of the usual
definition for a straight average of $x$, $s^2 = [ \sum_0^{M-1} (x_m -
\overline{ x_m})^2/(M-1)]$. The generalization is, simply, to replace
the $M-1$ in the denominator by $\nu = M-N$. In the straight-average
case, $N=1$ so this fits. Here $\nu$ is known as the number of {\it
degrees of freedom} and $N$, the number of unknown coefficients, is
known as the number of {\it constraints}. So we have
\begin{equation} \label{samplevarianceone}
s^2 = {1 \over M - N} \sum_{m=0}^{M-1} (y_m -
\overline{y_m})^2 \; ,
\end{equation}
\noindent where $\overline{y_m}$ are the values for $y_m$ {\it
predicted by the derived quantities ${\bf a}$}. Note the difference:
$y_m$ are the {\it observed} values, while $\overline{y_m}$ are the
values {\it predicted by the least-squares fit}. The predicted values
are those that are computed from the derived coefficients $A, B, C$\dots
The $M$ quantities
\begin{equation} \label{sequation}
\delta y_m = y_m - \overline{y_m}
\end{equation}
\noindent are called the {\it residuals} or {\it deviations} from the
fit.
It's worth reiterating some essentials about $s^2$, and in
particular the denominator $(M-N)$. First consider the case of a
single-parameter fit, i.e.\ $N=1$. Then we cannot possibly derive a
sample variance from only one measurement ($M=1$); but we can from two
($M=2$). So the denominator makes sense from that standpoint. The same
goes for $N>1$.
Next consider the effect of using $(M-N)$ in the denominator: it
increases $s^2$ by the ratio ${M \over M-N}$ over what you'd get if you
just took a straight average and used $M$. This compensates for the fact
that you are subtracting $\overline {y_m}$, which is derived from
the data, instead of the {\it truly} correct value $y^*$. (In
formal statistical language, $y^*$ is the mean of the parent
population from which your set of measurements is drawn.) If you used
the truly correct value $y^*$, then the sum would be larger than
when using $\overline{ y_m}$. The use of $M-N$ in the denominator
compensates for this larger value in exactly the right way: the
expectation value $E_{(s^2)}$ for a large number of experiments is
precisely equal to the normal variance $\sigma^2$, which you'd get by
using [$y^*$ and $M$] instead of [$\overline{y_m}$ and
$(M-N)$] in equation \ref{samplevarianceone}; see Cowan equations 5.9 and 5.10.
So $s^2$ is, in fact, exactly the number we want: an unbiased estimate
of the true variance of our sample. Why not use [$y^*$ and $M$]
in equation \ref{samplevarianceone}? The reason is obvious: we don't know
$y^*$! (If we did, we wouldn't be doing this analysis!)
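The unbiasedness of $s^2$ is easy to verify numerically. The following Python sketch (a made-up Monte Carlo, not part of this document's IDL prescription) simulates many straight-average experiments ($N=1$) with known $\sigma = 2$ and compares the average of $s^2$ computed with $M-1$ in the denominator against the biased version that divides by $M$:

```python
import numpy as np

# Monte Carlo check that s^2 with (M-1) in the denominator is unbiased:
# average s^2 over many synthetic experiments and compare with sigma^2.
rng = np.random.default_rng(42)
M, sigma, ntrials = 5, 2.0, 200_000
x = rng.normal(0.0, sigma, size=(ntrials, M))
xbar = x.mean(axis=1, keepdims=True)          # sample mean of each experiment
ss = ((x - xbar) ** 2).sum(axis=1)            # sum of squared residuals

s2_unbiased = (ss / (M - 1)).mean()   # expect ~ sigma^2 = 4
s2_biased   = (ss / M).mean()         # expect ~ sigma^2 (M-1)/M = 3.2
```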
It's easy to calculate the $\overline{y_m}$ with matrices.
First define the matrix ${\bf \overline{ Y}}$ that contains these
values:
\begin{eqnarray}
{\bf \overline{ Y}} = \left[
\begin{array}{c}
\overline{y_0} \\
\overline{y_1} \\
\overline{y_2} \\
\overline{y_3} \\
\overline{y_4} \\
\end{array}
\; \right]
\end{eqnarray}
\noindent Calculating ${\bf \overline{ Y}}$ is simple:
\begin{equation} \label{ybardefinition}
{\bf \overline{ Y}} = {\bf X \cdot a} \; .
\end{equation}
\noindent Note that ${\bf X}$ is already defined (equation
\ref{matrixdefinition}) and ${\bf a}$ was solved for in equation
\ref{adef}. It's convenient to define the residual matrix
\begin{equation}
{\bf \delta Y } = {\bf Y - \overline{ Y}}
\end{equation}
\noindent so we can write
\begin{equation} \label{samplevariancetwo}
s^2 = {1 \over M - N} {\bf \delta Y^T \cdot \delta Y} \; .
\end{equation}
This is the sample variance of the datapoints, not the variances
in the derived coefficients. We can obtain these as before, by
generalizing the results from the two-parameter case like the
straight-line fit discussed in \S \ref{sectionone}. We won't go through
the derivation here; you can find it in Taylor \S 8.4 and equation 8.16,
8.17. The result is
\begin{equation} \label{coeffvarianceone}
{\bf s_a}^2 = s^2 \, {\rm diag} \{[\alpha]^{-1} \} \ .
\end{equation}
\noindent Or, to put it simply in words: to get the variance of
coefficient $n$ in the matrix ${\bf a}$, multiply $s^2$ by the $n^{th}$
diagonal element of $[\alpha]^{-1}$.
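Equation \ref{coeffvarianceone} can likewise be checked by simulation. In this hypothetical Python sketch (made-up data, not this document's IDL), a straight line is fit to many noisy realizations with true $\sigma = 1$, and the empirical scatter of the derived coefficients is compared with the square roots of the diagonal of $[\alpha]^{-1}$:

```python
import numpy as np

# Check the coefficient-error formula against Monte Carlo scatter:
# fit y = A + B t (true A = 1, B = 2, noise sigma = 1) many times.
rng = np.random.default_rng(7)
t = np.arange(10.0)
X = np.column_stack([np.ones_like(t), t])     # equations of condition
alpha_inv = np.linalg.inv(X.T @ X)            # [alpha]^-1

ntrials = 50_000
Y = 1.0 + 2.0 * t + rng.normal(size=(ntrials, t.size))
a = Y @ X @ alpha_inv.T                       # batched normal-equation fits

predicted_sigma = np.sqrt(np.diag(alpha_inv))  # formula, with sigma = 1
empirical_sigma = a.std(axis=0)                # scatter of the fits
```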
Although the above equation for ${\bf s_a}^2$ is correct, there
is more to the story because of covariances, which are the off-diagonal
elements. We return to this topic in \S \ref{ncov} and \S \ref{chicoeffs}.
\section{A NUMERICAL EXAMPLE AND ITS SOLUTION IN IDL} \label{numexample}
If the following sounds like Greek to you, take a look at \S
\ref{matrixmethod} and \ref{sigmas}.
\subsection{Generation of the numerical example}
Suppose that we make four measurements of the angle $y$ and
we want to fit to a parabolic function in time $t$. In the notation of
equation~\ref{eqcond}, $s$ would be unity, $t$ the time, and $u$ the
time squared, so the number of unknowns is three ($N=3$). Because there
are four independent measurements ($M=4$) the subscripts run from $m = 0
\rightarrow 3$. Suppose that the four values of time are 5, 7, 9, 11.
\begin{figure}[h!]
\begin{center}
\leavevmode
%\includegraphics[height=7.5in, width=6.0in]{twolvl3b.ps}
\includegraphics[scale=.55, angle=90]{lsfitfig.ps}
%\includegraphics{twolvla.ps}
\end{center}
\caption{Our numerical example. Stars are the four datapoints; the solid
line is the fit. We perform two fits: one uses the original definition
of time; the other uses $(time-8)$, in effect moving the $y$-axis to the
dashed line. The two fits give the same line but the coefficients and
their errors differ greatly.\label{lsfitfig}}
\end{figure}
First we create the matrix {\bf X} in IDL
\begin{equation}
{\bf X = fltarr(N,M) = fltarr(3,4)}
\end{equation}
\noindent and then we populate it with numbers. In your own work, you
would normally do this by reading a data file and transferring the
numbers to the matrix using IDL commands; to work through this example,
you might manually type them in. After populating the matrix, in direct
correspondence with equation \ref{matrixdefinition}a we have $s_m = 1$,
$t_m = time_m$, $u_m = time_m^2$:
\begin{mathletters}
\begin{eqnarray}
{\bf X} = \left[
\begin{array}{ccc}
1 & 5 & 25 \\
1 & 7 & 49 \\
1 & 9 & 81 \\
1 & 11 & 121 \\
\end{array} \; \right] \ .
\end{eqnarray}
\end{mathletters}
\noindent Suppose that the four measured values of $y$ are
(equation~\ref{matrixdefinition}c)
\begin{mathletters}
\begin{eqnarray}
{\bf Y} = \left[
\begin{array}{c}
142 \\
168 \\
211 \\
251 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent Figure \ref{lsfitfig} shows the datapoints, together with the
fitted curve.
One word of caution here: in IDL, to get these into a column
matrix, which is how we've treated $\bf Y$ above, you have to define
${\bf Y}$ as a two-dimensional array because the second dimension
represents the column. When working in IDL it's more convenient to
define a row vector, which has only one dimension; in IDL you do this
by defining ${\bf Y} = [142,168,211,251]$; you can make it into the
necessary column vector by taking its transpose, i.e.\ ${\bf Y =
transpose(Y)}$.
\subsection{Solution of the Numerical Example in IDL} \label{idlsolution}
In IDL we calculate the normal equation matrices and denote the
$[\alpha]$ in equation \ref{ssdef}a by ${\bf XX}$:
\begin{mathletters} \label{eqn24}
\begin{equation}
{\bf XX} = {\bf transpose(X) \# \# X} \; ,
\end{equation}
\noindent and we denote the $[\beta]$ in equation \ref{ssdef}b by {\bf
XY}:
\begin{equation}
{\bf XY} = {\bf transpose(X) \# \# Y} \; .
\end{equation}
\end{mathletters}
\noindent In IDL we take the inverse of $[\alpha]$ (same as $\bf XX$) by
\begin{equation}
{\bf XXI} = {\bf invert(XX)} \; .
\end{equation}
The least-squares fitted quantities are in the matrix {\bf a}
(equation \ref{adef}), which we obtain in IDL with
\begin{equation}
{\bf a} = {\bf XXI \ \#\# \ XY} \; .
\end{equation}
In IDL we denote the matrix of predicted values $\overline{
y_m}$ by ${\bf YBAR}$, which is
\begin{equation}
\label{bteqn}
{\bf YBAR} = {\bf X \ \#\# \ a} \; ,
\end{equation}
\noindent and we can also define the residuals in $\bf Y$ as
\begin{equation}
{\bf DELY} = {\bf Y - YBAR} \ .
\end{equation}
\noindent In IDL we denote $s^2$ in equations \ref{samplevarianceone}
and \ref{samplevariancetwo} by $s\_sq$ and write
\begin{mathletters}
\label{sigsq}
\begin{equation}
s\_sq = {\bf transpose(DELY) \#\# DELY}/(M-N) \; ,
\end{equation}
\noindent or
\begin{equation}
s\_sq = total({ \bf DELY} \wedge 2)/(M-N) \; .
\end{equation}
\end{mathletters}
\noindent It is {\it always} a good idea to plot all three quantities
(the measured values {\bf Y}, the fitted values {\bf YBAR}, and the
residuals ${\bf DELY}$) to make sure your fit looks reasonable and to
check for bad datapoints.
To get the error in the derived coefficients we need a way to
select the diagonal elements of a matrix. Obviously, any $N \times N$
matrix has $N$ diagonal elements; a convenient way to get them is
\begin{equation}
diag \ elements \ of \ {\bf XXI}= {\bf XXI[(N+1) * indgen(N)]} \ .
\end{equation}
\noindent In IDL, we define the variances of
the $N$ derived coefficients by {\bf vardc} (think of ``{\bf var}iances of
{\bf d}erived {\bf c}oefficients''). You can get this as in equation
\ref{coeffvarianceone} from\footnote{If you used equation \ref{sigsq}a
instead of \ref{sigsq}b, then IDL considers $s\_ sq$ an array and you
need to use a $\#$ instead of a $*$ in this equation.}
\begin{equation} \label{lscoefferrorone}
{\bf vardc} = s\_sq * {\bf XXI[(N+1)*indgen(N)]} \; .
\end{equation}
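For readers who prefer Python to IDL, the entire numerical example translates line-for-line into numpy (the IDL {\tt \#\#} operator becomes {\tt @}, and {\tt invert} becomes {\tt np.linalg.inv}); this sketch reproduces the numbers quoted in the next subsection:

```python
import numpy as np

# The parabola example: s = 1, t = time, u = time^2, with the
# four measurements given above (numpy here plays the role of IDL).
X = np.array([[1.,  5.,  25.],
              [1.,  7.,  49.],
              [1.,  9.,  81.],
              [1., 11., 121.]])
Y = np.array([142., 168., 211., 251.])
M, N = X.shape

XX  = X.T @ X               # [alpha]
XXI = np.linalg.inv(XX)     # [alpha]^-1
a   = XXI @ (X.T @ Y)       # derived coefficients A, B, C

DELY = Y - X @ a                       # residuals
s_sq = (DELY @ DELY) / (M - N)         # sample variance
vardc = s_sq * np.diag(XXI)            # variances of the coefficients
sigma_a = np.sqrt(vardc)               # their errors
```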
\subsection {Discussion of the numerical example}
For this numerical example, the solution (the array of derived
coefficients) is
\begin{mathletters}
\label{avalues}
\begin{eqnarray}
{\bf a} = \left[
\begin{array}{c}
96.6250 \\
4.5000 \\
0.8750 \\
\end{array} \; \right]
\end{eqnarray}
\noindent and the errors in the derived coefficients [the square roots of
the $\sigma^2$'s of the derived coefficients, i.e.\ $[\sigma_n^2]^{1/2}$
or, in IDL, $sqrt({\bf vardc})$ in equation \ref{lscoefferrorone}] are:
\begin{eqnarray}
{\bf \sigma_A} = \left[
\begin{array}{c}
34.012 \\
9.000 \\
0.5590 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent These results look {\it horrible}: the uncertainties are large
fractions of the derived coefficients.
The reason: we have specifically chosen an example with terrible
covariance. And the great thing is this can be fixed easily (at least in
this case---certainly not always), without taking more data!
\section{THE COVARIANCE MATRIX AND ITS NORMALIZED COUNTERPART}
\label{ncov}
First we provide a general discussion, then we apply it to the
above numerical example.
\subsection{ Definition of the normalized covariance matrix}
The variances in the derived coefficients are obtained from the
diagonal elements of ${\bf XXI}$. The off-diagonal elements represent
the {\it covariances} between the derived coefficients. Covariance
means, specifically, the degree to which the {\it uncertainty} in {\it
one} derived coefficient affects the uncertainty in {\it another}
derived coefficient.
Because the covariance matrix elements relate pairwise to the
various coefficients, the units of the matrix elements are all
different. This makes it convenient to reduce all the matrix elements to
a standard set of units---namely, no units at all. So before discussing
the covariance matrix {\it per se}, we first discuss its normalized
counterpart.
The normalized covariance
matrix\footnote{It is a pleasure to thank Doug Finkbeiner for
introducing me to this concept.} {\bf ncov} is derived from ${\bf XXI}$
by dividing each element by the square root of the product of the
corresponding diagonal elements. Let ${\bf ncov}$ be the normalized
covariance matrix; then
\begin{equation}
ncov_{ik} = {XXI_{ik} \over \sqrt {XXI_{ii} \ XXI_{kk}}} \ .
\end{equation}
\noindent This is the same normalization that one does with the Pearson
linear correlation coefficient of two variables. In fact, the elements
of the normalized covariance matrix {\it are} the correlation
coefficients. In IDL, you do the following:
\begin{mathletters} \label{idlcovariance}
\begin{equation}
{\bf dc = XXI[(N+1)*indgen(N)]}
\end{equation}
\begin{equation}
{\bf ncov = XXI/{\it sqrt}(dc \# \#dc)} \ .
\end{equation}
\end{mathletters}
\noindent In the above, ${\bf dc \# \# dc}$ is an $N \times N$ matrix
consisting of products of the diagonals of ${\bf XXI}$, so dividing
${\bf XXI}$ by $sqrt({\bf dc \# \# dc})$ generates the normalized version.
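The same two-line normalization in Python/numpy, applied to the ${\bf XXI}$ of our numerical example (a translation for illustration, not part of the IDL prescription above):

```python
import numpy as np

# Normalized covariance matrix: divide each element of XXI by the
# square root of the product of its two diagonal elements.
X = np.array([[1.,  5.,  25.],
              [1.,  7.,  49.],
              [1.,  9.,  81.],
              [1., 11., 121.]])
XXI = np.linalg.inv(X.T @ X)

dc = np.diag(XXI)                         # the diagonal elements
ncov = XXI / np.sqrt(np.outer(dc, dc))    # correlation coefficients
```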
Because {\bf ncov} is a {\it normalized} covariance matrix, you
might think that its non-normalized parent is {\bf XXI}---and you'd be
{\it almost} right. For the least-squares case we are discussing, the
true covariance matrix ${\bf C}$ is\footnote{For chi-square, you use
$\sigma_{meas}^2$ instead of $s^2$; see \S \ref{chisqsection}.}
\begin{equation}
{\bf C} = s^2 \ {\bf XXI} \ .
\end{equation}
In ${\bf ncov}$, the diagonal elements are all unity and the
off-diagonal elements reflect the interdependence of the derived
coefficients on each other. The off-diagonal elements can range from
$-1 \rightarrow 1$. Each matrix element is the correlation coefficient
between the {\it uncertainties} of its two parameters. In particular,
suppose that the data happen to produce a coefficient that differs from
its true value by some positive number. If the normalized matrix element
is negative, then the other coefficient will tend to differ from its
true value by a negative number.
Here's a more detailed discussion of what the covariance means.
Suppose you are least-squares fitting for two derived coefficients
($A_0$ and $A_1$). When you do a least-squares fit to a set of data,
you are fitting one set of data out of an infinity of possible
sets that you'd get by repeating the experiment, and your particular set
of data happens to produce specific values of $\overline{ A_0}$ and $\overline{
A_1}$, which differ from the {\it true} values $(A_0^*, A_1^*)$ by
amounts $\delta A_0$ and $\delta A_1$. If their covariance is zero,
then in the infinity of data sets you'd find that $\delta A_0$ is
uncorrelated with $\delta A_1$. But if it is nonzero, these two
quantities would be correlated.
A high covariance is bad because the derived variables depend on
each other. For one thing, this means that with noisy data, power can be
shifted from one parameter to its covariant
counterpart(s). As we shall see in \S \ref{chicoeffs}, it also
significantly influences the uncertainties in derived coefficients.
Often a high covariance results from a poor choice of functions that you
are fitting or even a bad choice of the zero point of the independent
variable---as in our numerical example (see the next subsection). And,
as in that example, you can sometimes eliminate the bad covariance by
reformulating the problem---you don't even need to take more data! The
best reformulation involves using a set of orthonormal functions.
However, sometimes your interest is a specific set of functions that are
{\it not} orthogonal, and in such cases it makes no sense to convert to
orthogonal functions---because you just have to convert back again and
do the error propagation after-the-fact instead of letting the
least-squares process do it for you.
\subsection{ The covariance in our numerical example} \label{covarianceone}
Apply equation \ref{idlcovariance} to obtain the covariance
matrix for our numerical example:
\begin{eqnarray}
{\bf ncov} = \left[
\begin{array}{ccc}
1 & -0.989848 & 0.969717 \\
-0.989848 & 1 & -0.993808 \\
0.969717 & -0.993808 & 1 \\
\end{array} \; \right] \; .
\end{eqnarray}
The off-diagonal elements are {\it huge}. This is the reason
why our derived coefficients have such large uncertainties. Note,
however, that the predicted fit is nevertheless a good one, even with
these large uncertainties.
In this seemingly innocuous example we have an excellent case of
a poor choice of zero point for the independent variable (the time).
The reason is clear upon a bit of reflection. We are fitting for $y =
A_0 + A_1 t + A_2 t^2$. The coefficient $A_0$ is the $y$-intercept and
$A_1$ is the slope. Inspection of Figure \ref{lsfitfig} makes it very
clear that an error in the slope has a big effect on the $y$-intercept.
Now we retry the example, making the zero point of the time
equal to the mean of all the times, that is we set $(time_m = time_m -
8)$. We get the same fitted line, but the derived coefficients are
completely different---and amazingly better! We get
\begin{mathletters}
\begin{eqnarray}
{\bf A} = \left[
\begin{array}{c}
188.625 \\
18.500 \\
0.87500 \\
\end{array} \; \right]
\end{eqnarray}
\begin{eqnarray}
{\bf \sigma_A} = \left[
\begin{array}{c}
3.58 \\
1.00 \\
0.559 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent In redefining the origin of the independent variable, we have
made the zero intercept completely independent of the slope: changing
the slope has no effect at all on the intercept. You can see this from
the normalized covariance matrix, which has become
\begin{eqnarray}
{\bf ncov} = \left[
\begin{array}{ccc}
1 & 0 & -0.78086881 \\
0 & 1 & 0 \\
-0.78086881 & 0 & 1 \\
\end{array} \; \right] \; ,
\end{eqnarray}
\noindent which is nice, but not perfect: the orthogonalization is only
{\it partial}, because the second-order coefficient $A_2$ still affects
$A_0$: over the range $[(time - 8) = -3 \rightarrow +3]$, the quantity
$[A_2 \; (time_m - 8)^2]$ is always positive and is thereby correlated
with $A_0$.
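The improvement from shifting the time origin is easy to reproduce. This Python/numpy sketch (a translation of the IDL steps earlier in this document) refits the example with $time_m - 8$ and recovers the coefficients, errors, and normalized covariance matrix just quoted:

```python
import numpy as np

# Refit the example with the time origin moved to the mean time (8):
t = np.array([5., 7., 9., 11.]) - 8.0
X = np.column_stack([np.ones(4), t, t**2])
Y = np.array([142., 168., 211., 251.])
M, N = X.shape

XXI = np.linalg.inv(X.T @ X)
a = XXI @ (X.T @ Y)                    # derived coefficients
DELY = Y - X @ a                       # residuals
s_sq = (DELY @ DELY) / (M - N)         # sample variance
sigma_a = np.sqrt(s_sq * np.diag(XXI)) # coefficient errors

dc = np.diag(XXI)
ncov = XXI / np.sqrt(np.outer(dc, dc)) # normalized covariance matrix
```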
We could complete the process of orthogonalization by following
the prescription in BR chapter 7.3, which discusses the general
technique of orthogonalizing the functions in least-squares fitting. The
general case is a royal pain, analytically speaking, so much so that we
won't even carry it through for our example. But for numerical work you
accomplish the orthogonalization using Singular Value Decomposition
(SVD), which is of course trivial in IDL (\S \ref{SVD}).
For some particular cases, standard pre-defined functions are
orthogonal. For example, if $t_m$ is a set of uniformly spaced points
between $(-1 \rightarrow 1)$ and you are fitting a polynomial, then the
appropriate orthogonal set is Legendre polynomials. This is good if your
only goal is to represent a bunch of points by a polynomial function,
because the coefficients of low-order polynomials are independent of the
higher ones. However, it's more work and, moreover, often you are
interested in the coefficients for specific functions that don't happen
to be orthogonal; in such cases, you should just forge ahead.
But {\it always} look at the normalized covariance matrix.
Suppose one pair of off-diagonal elements departs significantly from
zero. Then their corresponding functions are far from being orthogonal
and the variances of the derived coefficients will suffer as a result.
You might be able to eliminate one of the parameters to make the fit
more robust. For example, suppose one function is $t \cos (t)$ and the
other is $\sin(t) \cos (t)$. If the range of $t$ is small, these two
functions are indistinguishable and have a large covariance; you should
eliminate one from the fit. If the range of $t$ is large, there is no
problem.
For further discussion of covariance, see \S \ref{chicoeffs}.
You might also want to try out another example in Taylor's \S 8.5.
\section{REJECTING BAD DATAPOINTS I.: CHAUVENET'S CRITERION}
\label{chauvenetsection}
Least-squares fitting is derived from the maximum likelihood
argument assuming the datapoint residuals $\delta y_m$ have a Gaussian
pdf. This means that the errors are distributed as
\begin{equation}
p_{(\delta y; \sigma)} = {1 \over \sqrt{2\pi} \sigma}
e^{-\left( \delta y^2 \over 2 \sigma^2 \right)} \ ,
\end{equation}
\noindent where $\sigma^2$ is the true variance of the datapoints, i.e.\
$s^2$ in equation \ref{samplevarianceone} (to be precise, $s^2$ needs to
be averaged over many experiments).
More importantly, the probability of finding datapoints inside
the limits $\pm \Delta y$ is
\begin{equation}
P_{(|\delta y| < \Delta y)} = \int_{-\Delta y}^{+\Delta y}
p_{(\delta y; \sigma)} d(\delta y) =
{\rm erf}\left({\Delta y \over \sqrt{2} \sigma} \right) \ ,
\end{equation}
\noindent where we use the commonly-defined error function ${\rm
erf}(X) = {1 \over \sqrt{\pi}} \int_{-X}^{+X} e^{-x^2} dx$. A
particularly important value is for $\Delta y = \sigma$, for which
\begin{equation}
P_{(|\delta y| < \sigma)} = 0.683 \ .
\end{equation}
If we have an experiment with $M$ datapoints, then the number of
datapoints we expect to lie outside the interval $\pm \Delta y$ is
\begin{equation}
M_{(outside \ \Delta y)} = M \left[ 1 -
{\rm erf} \left({\Delta y \over \sqrt{2} \sigma} \right) \right] \ .
\end{equation}
\noindent Chauvenet's criterion simply says: \begin{enumerate}
\item Find $\Delta y$ such that $M_{(outside \ \Delta y)} =
0.5$. This is given by
\begin{equation} \label{chauvenetexplicit}
{\Delta y \over \sigma} = \sqrt{2} \ {\rm erf}^{-1}\left( 1 - {1 \over 2M} \right) \ .
\end{equation}
\noindent This criterion leads to the moderately interesting set of
numbers in the associated table. Many
astronomers adopt $3 \sigma$, which is clearly inappropriate for large
$M$!
\begin{table} [!h]
\begin{center}
%\hline
%\caption{ Chauvenet's criterion versus $M$} \label{chauvenettable}
\begin{tabular}{cc}
%\hline
\\ \hline \hline
\multicolumn{2}{c}{ Chauvenet's criterion versus $M$}
\\ \hline
$M$ & $\Delta y \over \sigma$ \\ \hline
100 & 2.81 \\
1000 & 3.48 \\
$10^4$ & 4.06 \\
$10^5$ & 4.56 \\ \hline
\hline
\end{tabular}
\end{center}
\end{table}
\item Discard all datapoints outside this range.
\end{enumerate}
We offer the following important {\it Comments}:
\begin{itemize}
\item This assumes data are Gaussian-distributed. In real life
this doesn't often happen because of ``glitches''. Examples of glitches
include interference in radio astronomy, meteors in optical astronomy,
and cosmic-ray hits on CCD chips. These glitches produce bad points that
depart from Gaussian statistics. They are often called {\it outliers}.
It is very important to get rid of the outliers because the
least-squares process minimizes the {\it squares} of the residuals.
Outliers, being the points with the largest residuals, have a
disproportionately evil effect on the result.
On the other hand, if your data don't follow Gaussian statistics
as their {\it intrinsic} pdf, then you should think twice before using
least squares! (Like, maybe you should try the median fitting discussed in
\S \ref{medianfitting}.)
\item You may wish to relax Chauvenet's criterion by {\it
increasing} the $\Delta y$ beyond which you discard points. This is
being conservative and, in the presence of some non-Gaussian statistics,
not a bad idea. But think about why you are doing this before you do it.
Maybe the intrinsic statistics aren't Gaussian?
You should {\it never} make Chauvenet's criterion more stringent
by {\it decreasing} the $\Delta y$ beyond which you discard points. This
rule hardly needs elaboration: it means you are discarding datapoints
that follow the assumed pdf!
\item Most statistics books (e.g. Taylor, BR) harp on the purity
aspect. One extreme: don't throw out any datum without examining it from all
aspects to see if discarding it is justified. The other extreme: apply
Chauvenet's criterion, but do it {\it only once} and certainly not
repeatedly.
Being real-life astronomers, our approach is different. There
{\it do} exist outliers. They increase the calculated value of $\sigma$.
When you discard them, you are left with a more nearly perfect
approximation to Gaussian statistics and the new $\sigma$ calculated
therefrom will be smaller than when including the outliers. Because the
original $\sigma$ was too large, there may be points that should have
been discarded that weren't. So our approach is: repeatedly apply
Chauvenet's criterion until it converges.
If it doesn't converge, or if it discards an inordinately large
number of datapoints, you've got real problems and need to look at the
situation from a global perspective.
\item Many observers use the $3\sigma$ criterion: discard any
points with residuals exceeding $3\sigma$. This is definitely {\it not}
a good idea: the limit $3\sigma$ is Chauvenet's criterion for $M=185$
datapoints. Very often $M$ exceeds this, often by a lot.
\item To apply Chauvenet's criterion it's most convenient to
calculate the inverse error function. For this, you have two choices.
One (for sissies like myself), you can use {\bf inverf.pro} from my area
$\sim$heiles/idl/gen . But the real he-man will want to learn about
using a root-finding algorithm such as Newton's method (NR \S 9.4 and
9.6) together with the error function; both procedures exist in IDL as
{\bf newton} and {\bf errorf}. You at least ought to skim lightly some
of NR's chapter 9 about root finding, because some day you'll need it.
\end{itemize}
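Our repeat-until-convergence recipe can be sketched in a few lines of Python (standing in for IDL; {\tt chauvenet\_reject} is our own name, and the bisection on {\tt math.erf} stands in for using {\bf newton} together with {\bf errorf}):

```python
import math
import numpy as np

def chauvenet_reject(y, max_iter=20):
    """Repeatedly apply Chauvenet's criterion until no membership change occurs."""
    y = np.asarray(y, dtype=float)
    keep = np.ones(y.size, dtype=bool)
    for _ in range(max_iter):
        M = keep.sum()
        mean, s = y[keep].mean(), y[keep].std(ddof=1)
        # solve erf(x) = 1 - 1/(2M) by bisection; the limit is sqrt(2) * x * s
        target, lo, hi = 1.0 - 1.0 / (2.0 * M), 0.0, 10.0
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if math.erf(mid) < target else (lo, mid)
        limit = math.sqrt(2.0) * 0.5 * (lo + hi) * s
        new_keep = np.abs(y - mean) < limit
        if (new_keep == keep).all():        # converged: no points changed status
            break
        keep = new_keep
    return keep

# demo: 1000 Gaussian points plus two gross outliers (a hypothetical dataset)
rng = np.random.default_rng(42)
y = np.concatenate([rng.normal(0.0, 1.0, 1000), [15.0, -20.0]])
keep = chauvenet_reject(y)
```

Note that the first pass computes $s$ with the outliers included, so $s$ is inflated; the outliers are nevertheless rejected, and the next pass uses the smaller, more nearly Gaussian $s$, exactly as argued above.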
\section{NONLINEAR LEAST SQUARES} \label{nonlinearls}
The least-squares formulation requires that the data values
depend {\it linearly} on the unknown coefficients. For example, in
equation \ref{one}, the unknown coefficients $A$ and $B$ enter linearly.
Suppose you have a nonlinear dependence, such as wanting to
solve for $A$ and $B$ with equations of condition that look like
\begin{equation} \label{basicone}
\sin(A t_m) + B t_m = y_m \; .
\end{equation}
\noindent What do you do here? You linearize the process, using the
following procedure.
First, assume trial values for $A$ and $B$; call these $A_0$
and $B_0$. You should pick values that are close to the correct ones. In
our example you wouldn't need to do this for $B$, but it's easier to
treat all coefficients identically. These trial values produce {\it
predicted} values $y_{0,m}$:
\begin{equation} \label{trialone}
\sin(A_0 t_m) + B_0 t_m = y_{0,m} \; .
\end{equation}
\noindent Subtract equation \ref{trialone} from \ref{basicone}, and
express the differences as derivatives. Letting $\delta A = A - A_0$ and
$\delta B = B - B_0$, this gives
\begin{equation} \label{difference}
\delta A [t_m \cos(A_0 t_m)] + \delta B t_m = y_m - y_{0,m} \; .
\end{equation}
\noindent This is linear in $(\delta A, \delta B)$ so you can solve for
them using standard least squares. Increment the original guessed values
to calculate $A_{0,new}=A_0 + \delta A$ and $B_{0,new}=B_0 + \delta B$.
These won't be exact because higher derivatives (including cross
derivatives) come into play, so you need to use these new values to
repeat the process. This is an iterative procedure and you keep going
until the changes become ``small''. The generalization to an arbitrarily
large number of unknown coefficients is obvious.
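The whole procedure for our example, equations \ref{basicone} to \ref{difference}, fits in a few lines of Python (standing in for IDL; the function name and the demo values are our own). It also anticipates two of the remarks below: a damping factor ${\cal F} < 1$ and a convergence test on the corrections.

```python
import numpy as np

def fit_sin_linear(t, y, A0, B0, F=0.75, tol=1e-10, maxiter=100):
    """Iterated, linearized least squares for y = sin(A t) + B t."""
    for _ in range(maxiter):
        y0 = np.sin(A0 * t) + B0 * t                   # predicted values
        X = np.column_stack([t * np.cos(A0 * t), t])   # derivatives wrt A and B
        dA, dB = np.linalg.lstsq(X, y - y0, rcond=None)[0]
        A0 += F * dA                                   # damped update
        B0 += F * dB
        if abs(dA) < tol and abs(dB) < tol:            # corrections are "small"
            break
    return A0, B0

# noiseless demo with a known answer and an initial guess close to the truth
t = np.linspace(0.1, 5.0, 100)
y = np.sin(1.05 * t) + 0.5 * t
A_fit, B_fit = fit_sin_linear(t, y, A0=1.0, B0=0.45)
```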
We now offer some cautionary and practical remarks.
{\bf (0)} In linear least squares, the curvature and covariance
matrices are set by the values of the independent variable, which here
is denoted by $t$, and are independent of the datapoint values. Here,
the matrix elements change from one iteration to the next because they
depend on the guessed parameters, and sometimes they even depend on the
datapoint values.
{\bf (1) Multiple minima:} Nonlinear problems often have
multiple minima in $\sigma^2$. A classical case is fitting multiple
Gaussians to a spectral line profile. Gaussians are most definitely not
orthogonal functions and in some cases several solutions may give almost
comparably good values of $\sigma^2$, each one being a local minimum.
For example, for the case of two blended Gaussians, one can often fit
two narrow Gaussians or the combination of a wide and narrow Gaussian,
the two fits giving almost equal $\sigma^2$. The lower of these is the
real minimum but, given the existence of systematic errors and such, not
necessarily the best solution. The best solution is often determined by
physical considerations; in this case, for example, you might have
physical reasons to fit a broad plus narrow Gaussian, so you'd choose
this one even if its $\sigma^2$ weren't the true minimum.
{\bf (2) The Initial Guess:} When there are multiple minima,
the one to which the solution converges is influenced by your initial
guess. To fully understand the range of possible solutions, you should
try different initial guesses and see what happens. If the solutions
always converge to the same answer, then you can have some confidence
(but not {\it full} confidence) that the solution is unique.
{\bf (3) Iterative stability:} If your initial guess is too far
from the true solution, then the existence of higher derivatives means
that the computed corrections can be too large and drive the iterative
solution into instability. It is often a good idea to multiply the
derived correction factors ($\delta A$ and $\delta B$ above) by a factor
${\cal F}$ less than unity, for example ${\cal F} = 0.5$ or 0.75. This
increases the number of iterations required for convergence but often
allows convergence instead of producing instability.
{\bf (4) Convergence criteria:} How do you know when the
solution has converged? One way: for each iteration, calculate the
uncertainties in the derived coefficients. If the uncertainty exceeds
the correction, then you are getting close. An alternate way, which I
usually use: if the fractional correction (e.g.~$\delta A \over A_0$)
decreases below some threshold, say $1\%$, you're close (some
parameters, such as angles, need a threshold that is absolute instead of
fractional). At this point, if you are using ${\cal F} \neq 1$, set
${\cal F} = 1$, do a few more iterations, and you're done.
{\bf (5) Numerical derivatives:} Sometimes the equations of
condition are so complicated that taking the derivatives, as in
obtaining equation \ref{difference}, is a huge job and subject to
mistakes. So you can take numerical derivatives instead of analytic
ones. Be careful, though; it's safest to use double precision and think
a bit about numerical accuracy; take a look at NR's section 5.7 on
evaluating numerical derivatives.
{\bf (6) Canned nonlinear least squares (particularly
Levenberg-Marquardt, and various Gaussian fit routines):} Packages like
IDL offer canned nonlinear least squares routines. They are designed to
work well for a wide range of different problems. However, for the
specific problem at hand you can often do better by tailoring things
(such as the factor ${\cal F}$ and convergence criteria above). A good
example is Gaussian fitting: IDL's fitting program doesn't converge for
multiple overlapping Gaussians, while for many of these cases the
program that I wrote myself works fine; and conversely, my program
doesn't work well for single Gaussians with a small number of
datapoints, in which case IDL's \verb$GAUSSFIT$ is much better.
When convergence is slow or doesn't occur because your functions
are complicated, you might wish to try the Levenberg-Marquardt method
(NR \S 15.5), implemented as the IDL function {\bf LMFIT}. This technique involves
increasing the diagonal elements of the curvature matrix by a set of
suitably chosen factors; when you get close to the minimum, it resets
these factors to unity. LM is the gold standard for nonlinear
least-squares fitting because it is supposed to converge faster than
other methods. Because of its sterling reputation, many people think
it's the panacea. How many times have I seen journal articles saying
that the LM method was used---as if that's all one needs to know---but
without saying anything about the important stuff, such as how parameter
space was explored to determine uniqueness of the solution! See the
discussion in NR. I've done lots of nonlinear fits and have never had
to resort to any tactic other than the simple, straightforward
linearization process discussed above.
{\bf (7) Be careful and LOOK at the solution before accepting
it!} These nonlinear problems can produce surprising results, sometimes
completely meaningless results. Don't rely on them to be automatic or
foolproof!
{\bf (8) Reformulate! (?)} Sometimes you can avoid all this by
reformulating the problem. There are two cases: the harmless case and
the not-so-harmless case.
An example of the harmless case is fitting for the phase $\phi$
in the function $y = \cos( \theta + \phi)$. This is definitely a
nonlinear fit! But it's easy to reformulate it as a linear fit using the
usual trig identities to write $y = A \cos \theta - B \sin \theta$,
where ${B \over A} = \tan \phi$. Solve for $(A,B)$ using linear least
squares, calculate $\phi$, and propagate the uncertainties.
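The harmless reformulation takes only a few lines; here is a Python sketch (our own demo values; $\phi_{true}$ is an assumption for the example). The key identity is $\cos(\theta + \phi) = \cos\phi \cos\theta - \sin\phi \sin\theta$, so $A = \cos\phi$ and $B = \sin\phi$:

```python
import numpy as np

# synthetic data with a known phase (phi_true is an assumed demo value)
theta = np.linspace(0.0, 2.0 * np.pi, 40)
phi_true = 0.7
y = np.cos(theta + phi_true)

# linear fit: y = A cos(theta) - B sin(theta), with tan(phi) = B/A
X = np.column_stack([np.cos(theta), -np.sin(theta)])
A, B = np.linalg.lstsq(X, y, rcond=None)[0]
phi = np.arctan2(B, A)    # arctan2 keeps the quadrant right
```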
An example of the not-so-harmless case is in NR's \S 15.4
example: fit for $(A,B)$ with equations of condition $y_m = A
e^{-Bx_m}$. They suggest linearizing by rewriting as $\log (y_m) = C -
Bx_m$, solving for $(B,C)$, and deriving $A$ after-the-fact. This is
not-so-harmless because you are applying a nonlinear function to the
{\it observed} values $y_m$; thus the associated errors
$\sigma_{meas,m}$ are {\it also} affected. This means you have to do
weighted fitting, which is discussed in \S \ref{chisqsection} below.
Suppose that $A=1$, your datapoints all have $\sigma_{meas,m} = 0.05$,
and the observed $y_m$ ranges from 0.05 to 1. The datapoint with $y_m =
0.05$ has a manageable $\sigma_{meas,m}$, but what is the corresponding
value of $\sigma_{meas,m}$ for $\log y_m = \log 0.05$? It's ill-defined
and asymmetric about the central value. Or even, God forbid, you have
an observed $y_m$ that's {\it negative}??? Even for $y_m$ not near zero,
you need to calculate new $\sigma_{meas,m}$ by error propagation; in
this case, you need to reassign $\sigma(\log y) = {d \log y \over dy}
\sigma(y) = {\sigma(y) \over y}$. This is OK when $y_m$ is large enough
so that the linear approximation is accurate, but if not the converted
noise becomes non-Gaussian.
You should regard your datapoints as sacrosanct and never apply
any nonlinear function to them.
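A quick numerical experiment illustrates the danger. The Python sketch below (our own; the specific numbers echo the example above) draws Gaussian samples and compares the propagated error $\sigma(\ln y) = \sigma(y)/y$ with the actual spread of $\ln y$:

```python
import numpy as np

def log_error_check(y_true, sigma=0.05, n=200_000, seed=1):
    """Compare the propagated error sigma/y with the actual spread of ln(y)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(y_true, sigma, n)
    good = y > 0                      # ln(y) is undefined for y <= 0
    prop = sigma / y_true             # linearized sigma(ln y) = sigma(y)/y
    actual = np.std(np.log(y[good]))  # spread of ln(y), positive samples only
    return prop, actual, 1.0 - good.mean()
```

For $y_{true} = 1$ the propagated and actual spreads agree nicely. For $y_{true} = 0.05$ a sizable fraction of the samples is negative (no logarithm at all!), and the surviving $\ln y$ values are badly skewed and non-Gaussian, just as argued above.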
\section{CHI-SQUARE FITTING, AND WEIGHTED FITTING: DISCUSSION IGNORING
COVARIANCE } \label{chisqsection}
In least-squares fitting, the derived parameters minimize the sum of
squares of residuals as in equation \ref{samplevarianceone}, which we
repeat here:
$$ s^2 = {1 \over M - N} \sum_{m=0}^{M-1} \delta y_m^2 \, , $$
\noindent where the $m^{th}$ residual $\delta y_m = (y_m -
\overline{y_m})$. Chi-square fitting is similar except for two
differences. One, we divide each residual by its intrinsic measurement
error $\sigma_{m}$; and two, we define $\chi^2$ as the sum
\begin{mathletters} \label{chisqdefinition}
\begin{equation}
\chi^2 = \sum_{m=0}^{M-1} {\delta y_m^2
\over \sigma_{m}^2 } \; .
\end{equation}
\noindent Along with $\chi^2$ goes the {\it reduced} chi-square,
$\widehat{\chi^2} = {\chi^2 \over M-N}$:
\begin{equation}
\widehat{\chi^2} = {1 \over M - N} \sum_{m=0}^{M-1} {\delta y_m^2
\over \sigma_{m}^2 } \; ,
\end{equation}
\end{mathletters}
\noindent which is more directly analogous to the definition of $s^2$.
Chi-square fitting is very much like our least-squares fitting
except that we divide each datapoint by its intrinsic measurement
uncertainty $\sigma_{m}$. Thus, the reduced chi-square
($\widehat{\chi^2}$) is equal to the ratio of the {\it variance of the
datapoint residuals} ($s^2$) to the {\it adopted intrinsic
measurement variances} ($\sigma_{m}^2$). So it should be obvious
that in chi-square fitting, you must know the measurement uncertainties
$\sigma_{m}$ of the individual datapoints beforehand. If you want
to give the various datapoints weights based on something other than
$\sigma_{m}$, then that is just like chi-square fitting except that
you can adopt an arbitrary scale factor for the uncertainties
(section~\ref{diatribe}).
Chi-square fitting treats uncertainties of the derived
parameters in a surprising way. Getting the
coefficient uncertainties with chi-square fitting is a tricky business
because \begin{enumerate}
\item With the standard treatments, the errors in the derived parameters
don't depend on the residuals of the datapoints from the fit (!).
\item The errors in the derived parameters can depend on their mutual
covariances. This discussion requires a separate section, which we
provide below in \S \ref{chicoeffs}.
\end{enumerate}
\noindent In this section we treat chi-square fitting ignoring
covariance. We begin by illustrating the difference between least
squares and chi-square fitting by discussing the simplest chi-square
fitting case of a weighted mean; then we generalize to the multivariate
chi-square fitting case.
\subsection{ The weighted mean: the simplest chi-square fit}
\label{chicoeffsnoco}
First, recall the formulas for an ordinary {\it unweighted}
average in which the value of each point is $y_m$ and the residual of
each point from the mean is $\delta y_m$:
\begin{mathletters}
\begin{equation}
mean = {\sum y_m \over M}
\end{equation}
\begin{equation}
%\sigma^2 = {\sum \delta y_m^2 \over M-1}
s^2 = {\sum \delta y_m^2 \over M-1}
\end{equation}
\begin{equation} \label{meanvariance}
%\sigma_{mean}^2 = {\sigma^2 \over M} = {\sum \delta y_m^2 \over M(M-1)} \ ,
s_{mean}^2 = {s^2 \over M} = {\sum \delta y_m^2 \over M(M-1)} \ ,
\end{equation}
\end{mathletters}
\noindent where $s_{mean}^2$ is the variance of the mean and $s^2$
is the variance of the datapoints around the mean. Recall that in this
case the mean is the least-squares fit to the data, so to use least squares
jargon we can also describe $s_{mean}$ as the error in the derived
coefficient for this single-parameter least-squares fit.
Now for a {\it weighted} average in which the weight of each
point is $w_{meas,m} = {1 \over \sigma_{m}^2}$. Applying maximum
likelihood, in an unweighted average the quantity that is minimized is
$\sum \delta y_m^2$; in a weighted average the quantity minimized is
$\chi^2 = \sum {\delta y_m^2 \over \sigma_{m}^2} = \sum w_{meas,m}
\delta y_m^2 \rightarrow w_{meas,m} \sum \delta y_m^2$, where to the
right of the arrow we assume all $w_{meas,m}$ are identical. So your
intuition says that the three equations corresponding to the above would
become
\begin{mathletters} \label{wgtideal}
\begin{equation} \label{wmean}
mean_{w, intuit} = {\sum w_{meas,m}y_m \over \sum w_{meas,m}} \rightarrow { \sum y_m \over M}
\end{equation}
\noindent Again, to the right of the arrow we assume all $w_{meas,m}$
are identical and the subscript {\it intuit} means ``intuitive''. For
the variances the intuitive expressions are
\begin{equation} \label{wvariance}
%\sigma_{w,intuit}^2 =
s_{w,intuit}^2 =
{M \over M-1} {\sum w_{meas,m} \delta y_m^2 \over \sum w_{meas,m}}
= {\widehat{\chi^2} \over (\sum w_{meas,m}/M)}
\rightarrow {\sum \delta y_m^2 \over M-1}
\end{equation}
\begin{equation} \label{wmeanvariancecc}
s_{mean,intuit}^2 = {s_{w,intuit}^2 \over M} =
{\sum w_{meas,m} \delta y_m^2 \over (M-1) \sum w_{meas,m}}
= {\widehat{\chi^2} \over \sum w_{meas,m}}
\rightarrow {\sum {\delta y_m^2} \over M(M-1) } \ .
\end{equation}
\end{mathletters}
\noindent In fact, after a formal derivation, the first two equations
(\ref {wmean} and \ref{wvariance}) are correct, so we will drop the
additional subscripts {\it intuit} and {\it formal} on $mean$ and
$s_w^2$. However, after a formal derivation, the last of these
equations becomes, and is always written (e.g.\ BR equation 4.19; Taylor
equation 7.12)
\begin{equation} \label{wmeanvariancebb}
s_{mean,formal}^2 = {1 \over \sum w_{meas,m}} \rightarrow { \sigma_{meas}^2 \over M}
\ .
\end{equation}
\noindent {\it This is a problem}, for the following reason.
Note the excruciatingly painful difference between the intuitive
equation \ref{wgtideal}c and the formally correct equation
\ref{wmeanvariancebb}: the former depends on the {\it variance of the
datapoint residuals} $s_w^2$, as you'd think it should, while the latter
depends on only the {\it adopted intrinsic measurement variances of the
data} $\sigma_{m}^2$, which are chosen by the guy doing the fit. If you
do an unweighted average, and derive a certain variance, and next do a
weighted average in which you choose some values for $\sigma_{m}$ that
happen to be wrong, the two fits give different results for
$s_{mean}^2$. This is crazy.
To get around this difficulty, we follow the procedure in BR
equations 4.20 to 4.26. This introduces an arbitrary multiplicative
factor for the weights and goes through the ML calculation to derive,
instead of equation \ref{wmeanvariancebb}, the {\it far superior}
\begin{equation} \label{wmeanvarianceaa}
s_{mean,BR}^2 = {s_w^2 \over M} = {\widehat{\chi^2} \over \sum w_{meas,m}} \ ,
\end{equation}
\noindent which is precisely the same as our intuitive guess, equation
\ref{wmeanvariancecc}. The difference between equation
\ref{wmeanvarianceaa} and the formally-derived equation
\ref{wmeanvariancebb} is the numerator, which contains the reduced
chi-square $\widehat{\chi^2}$; for the case where all $\sigma_{meas,m}$
are identical, $\widehat{\chi^2} = {s_w^2 \over \sigma_{meas}^2}$.
Note that $\chi^2$ and $\widehat{\chi^2}$ are defined in equations
\ref{chisqdefinition}.
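All three quantities for the weighted mean fit in a few lines of Python ({\tt numpy} standing in for IDL; the function name is our own). The comments tie each line to the equations above:

```python
import numpy as np

def weighted_mean(y, sigma):
    """Weighted mean with both error estimates discussed in the text."""
    w = 1.0 / sigma**2                          # w_meas,m
    mean = np.sum(w * y) / np.sum(w)            # the weighted mean
    M = y.size
    chi2_hat = np.sum(w * (y - mean)**2) / (M - 1)
    var_formal = 1.0 / np.sum(w)                # textbook formula: no residuals!
    var_BR = chi2_hat / np.sum(w)               # the far superior BR version
    return mean, var_formal, var_BR

# equal weights: reduces to the ordinary mean and its variance
y = np.array([1.0, 2.0, 3.0])
mean, var_formal, var_BR = weighted_mean(y, np.ones(3))
```

With equal unit $\sigma_m$ and these demo data, $\widehat{\chi^2}$ happens to equal 1, so the two variance estimates agree; change the adopted $\sigma_m$ by a common factor and {\tt var\_formal} changes while {\tt var\_BR} does not, which is precisely the point.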
\subsection{The multivariate chi-square fit}
\label{sectionchisq}
Here we generalize \S \ref{chicoeffsnoco}, which dealt with the
weighted average, to the multivariate case. In this case, chi-square
fitting is just like least-squares fitting except for the following:
\begin{enumerate}
\item In the least-squares matrix $\bf X$ of equation
\ref{matrixdefinition}a, each row $m$ is a different measurement with a
different intrinsic variance $\sigma_{m}$. For chi-square fitting you
generate a new matrix $\mathbf {X_\chi}$, which is identical to ${\bf
X}$ except that each row $m$ (which contains a particular equation of
condition) is divided by $\sigma_{m}$. This new matrix is the same as
NR's {\it design matrix} (Figure 15.4.1), which they denote by $\bf A$.
\item For chi-square fitting, divide each datapoint $y_m$ in
equation \ref{matrixdefinition}b by $\sigma_{m}$. You are generating a
new data vector $\mathbf {Y_\chi}$, which is identical to ${\bf Y}$
except that each datapoint is divided by $\sigma_{m}$. This new data
vector is the same as NR's vector {\bf b}.
\item {\it Note} that the above two steps can be accomplished
matrixwise by defining the $M \times M$ diagonal matrix $[\sigma]$ in
which the diagonal elements are $\sigma_m$.
\begin{eqnarray} \label{diagsigma}
[\sigma] =
\left[
\begin{array}{cccc}
\sigma_0 & 0 & {\dots} & 0 \\
0 & \sigma_1 & {\dots} & 0 \\
\vdots & \vdots & \ddots & 0 \\
0 & 0 & 0 & \sigma_{M-1}
\end{array}
\right]
\end{eqnarray}
\noindent in which case we can write
\begin{mathletters}
\begin{equation}
{\bf X_\chi} = [\sigma]^{-1} {\bf \cdot X}
\end{equation}
\begin{equation}
{\bf Y_\chi} = [\sigma]^{-1} {\bf \cdot Y} \ .
\end{equation}
\end{mathletters}
\item Carry through the matrix calculations in equations
\ref{matrixcalcs} below (using the matrices subscripted with
$\chi$).
\end{enumerate}
\noindent You've divided each row, i.e.\ the equation of condition for
each row $m$, by a common factor, so the solution of that particular
equation of condition is unchanged. However, in the grand scheme of
things---i.e.\ the normal equations---it receives a greater or lesser
weight by a factor ${1 \over \sigma_m^2}$.
To perform the chi-square fit, we use the following equations:
\begin{mathletters} \label{matrixcalcs}
\begin{equation}
{\bf X_\chi} = [\sigma]^{-1} {\bf \cdot X}
\end{equation}
\begin{equation}
{\bf Y_\chi} = [\sigma]^{-1} {\bf \cdot Y} \ .
\end{equation}
\begin{equation}
{\bf X_\chi \cdot a} = {\bf Y_\chi}
\end{equation}
\begin{equation} \label{curvmatrixalpha}
[\alpha] = {\bf X_\chi^T \cdot X_\chi}
\end{equation}
\begin{equation}
[\beta] = {\bf X_\chi^T \cdot Y_\chi}
\end{equation}
\begin{equation} \label{a_equation}
{\bf a} = [\alpha]^{-1} \cdot [\beta] \ .
\end{equation}
\noindent Having calculated the derived coefficients $\bf a$, we can
calculate the residuals. In doing so we must recall that $\bf X_\chi$
and $\bf Y_\chi$ contain factors of $1 \over \sigma_m$ and
$[\alpha]^{-1}$ contains factors of $\sigma_m^2$. With all this, we can
write the chi-square fit predicted data values as
\begin{equation}
\overline {\bf Y_\chi} = {\bf X_\chi \cdot a}
\end{equation}
\noindent and the chi-square residuals as
\begin{equation}
\delta {\bf Y_\chi} = {\bf Y_\chi} - \overline {\bf Y_\chi}
\end{equation}
\noindent Because the data vector $\bf Y_\chi$ contains factors of $1
\over \sigma_m$, so do the residuals $\delta {\bf Y_\chi}$. You should,
of course, always look at the residuals from the fit, so {\it remember
these scale factors affect the residual values!} For example, if all
$\sigma_m$ are identical and equal to $\sigma$, then ${\bf Y_\chi} =
{\bf Y \over \sigma}$. If they aren't identical, then when you plot the residuals
$\delta {\bf Y_\chi}$ {\it each one will have a different scale factor!}
Moving on, we have
\begin{equation}
\label{chisq}
\chi^2 = {\bf \delta Y_\chi^T \cdot \delta Y_\chi}
\end{equation}
\begin{equation}
\label{reducedchisq}
\widehat{\chi^2} = { {\bf \delta Y_\chi^T \cdot \delta Y_\chi} \over M-N} \ .
\end{equation}
\end{mathletters}
\noindent Finally, we have the analogy of equation \ref{wmeanvarianceaa}
expressed in matrix form as in equation \ref{coeffvarianceone}:
\begin{equation} \label{coeffvariancegood}
{\bf s_{a,intuit}}^2 = \widehat{\chi^2} \ diag \{[\alpha]^{-1} \} \ .
\end{equation}
\noindent This {\it intuitively-derived} result is in contrast to the
result derived from a {\it formal derivation}, which is the analogy to
equation \ref{wmeanvariancebb}; again, it omits the $\widehat{\chi^2}$
factor:
\begin{equation} \label{coeffvarianceawful}
{\bf s_{a,formal}}^2 = diag \{[\alpha]^{-1} \} \ .
\end{equation}
\noindent This formally-derived result is what's quoted in textbooks
(e.g.\ NR equation 15.4.15, BR equation 7.25). It provides parameter
errors that are independent of the datapoint residuals, and leads to the
same difficulties discussed above for the weighted mean case.
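The entire chain of matrix operations in equations \ref{matrixcalcs} through \ref{coeffvarianceawful} translates directly into Python ({\tt numpy} standing in for IDL; the function name and demo data are our own):

```python
import numpy as np

def chisq_fit(X, y, sigma):
    """Chi-square fit via the matrix recipe in the text."""
    Xc = X / sigma[:, None]          # X_chi = [sigma]^{-1} . X
    Yc = y / sigma                   # Y_chi = [sigma]^{-1} . Y
    alpha = Xc.T @ Xc                # curvature matrix [alpha]
    beta = Xc.T @ Yc                 # [beta]
    a = np.linalg.solve(alpha, beta)
    resid = Yc - Xc @ a              # chi-square residuals (note the 1/sigma scaling!)
    M, N = X.shape
    chi2_hat = (resid @ resid) / (M - N)
    var_formal = np.diag(np.linalg.inv(alpha))   # the formally-derived variances
    var_intuit = chi2_hat * var_formal           # the intuitively-derived variances
    return a, var_intuit, var_formal

# straight-line demo: y = 2 + 3t exactly, with equal adopted sigmas
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(t), t])
a, var_intuit, var_formal = chisq_fit(X, 2.0 + 3.0 * t, np.full(5, 0.5))
```

Because the demo data fit the model perfectly, the residuals vanish: the intuitive variances go to zero while the formal ones stay finite, a stark illustration of the difference between the two prescriptions.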
\subsection{Which equation---\ref{coeffvariancegood} or
\ref{coeffvarianceawful}?}
In most cases---but not all---we recommend that you use equation
\ref{coeffvariancegood}. Equation \ref{coeffvariancegood} is very
reasonable. Suppose, for example, that the least-squares fit model is
perfect and the only deviations from the fitted curve result from
measurement error. Then by necessity we have $s^2 \approx
\sigma_{meas}^2$ and $\widehat{\chi^2} \approx 1$. (We write
``$\approx$'' instead of ``='' because different experiments produce
somewhat different values of $s^2$ owing to statistical fluctuations;
an average gives $\sigma^2 = \langle s^2 \rangle$.) In this situation,
though, equations \ref{coeffvariancegood} and \ref{coeffvarianceawful}
are identical. However, if the least-squares fit model is {\it not
correct}, meaning that it doesn't apply to the data, then the residuals
will be larger than the intrinsic measurement errors, which will lead to
larger values of $\chi^2$ and $\widehat{\chi^2}$.
However, equation \ref{coeffvariancegood} is not a panacea. The
numerical value of $\widehat{\chi^2}$ is subject to statistical
variation. If the number of datapoints $M$ is small (or, more properly,
if the number of degrees of freedom $(M-N)$ is small), then the
fractional statistical variation in $\widehat{\chi^2}$ is large and this
affects the normalization inherent in equation \ref{coeffvariancegood}.
Alternatively, if you {\it really do know} the experimental errors, then
equation \ref{coeffvarianceawful} is appropriate.
Use your head!
\subsection{Datapoints with known {\it relative} but unknown {\it
absolute} dispersions} \label{weightedcase}
Here the $\sigma_{m}$ are all different. The $m^{th}$ row
of the equation-of-condition matrix ${\bf X}$ and the $m^{th}$
element of the data vector ${\bf Y}$ get divided by their
corresponding $\sigma_{m}$. The equation embodied in each row of
the matrix equation~\ref{equationofcondition} remains unchanged, but the
different rows are weighted differently with respect to each other.
Consider two measurements with intrinsic measurement
uncertainties $(\sigma_{1}, \sigma_{2})$; suppose
$\sigma_{1} < \sigma_{2}$. After being divided by their
respective $\sigma_{m}$'s, all of the numbers in row 1 are larger
than those in row 2. In all subsequent matrix operations, these larger
numbers contribute more to all of the matrix-element products and sums.
Thus, the measurement with smaller uncertainty has more influence on the
final result, as it should.
Suppose that the above two measurements were taken under
identical conditions except that measurement 1 received more integration
time than measurement 2; we have ${\sigma_{1} \over \sigma_{2}} =
\left({\tau_1 \over \tau_2}\right)^{-1/2}$, so the rows of $\mathbf
{X_\chi}$ are weighted as $\tau^{1/2}$. This means that during the
computation of $[\alpha] = \mathbf {X_\chi^T \cdot X_\chi}$, the
self-products of row 1 are weighted as $\tau_1$. This means that each
datapoint is weighted as $\tau$, which is exactly what you'd expect!
Note that this is also exactly the same weighting scheme used in a
weighted average, in which the weights are proportional to $\left(1
\over \sigma_{m}\right)^2$. We conclude that the weighting scheme of
the first two steps in section~\ref{sectionchisq} agrees with common
sense.
Suppose you don't know the intrinsic measurement dispersion
$\sigma_{m}$, but you {\it do} know the {\it relative} dispersion of
the various measurements. For example, this would be the case if the
datapoints were taken under identical conditions except for integration
time; then $\sigma_{m} \propto \tau^{-1/2}$. In this case, multiply
each row by its weight $w \propto {1 \over \sigma_{m}}$ and proceed
as above. (The factors ${1 \over \sigma_{m}}$ in the equations of
condition become ${1 \over \sigma_{m}^2}$ in the normal equations.)
\subsection{ Persnickety Diatribe on Choosing $\sigma_{m}$ }
\label{diatribe}
\subsubsection{Choosing and correcting $\sigma_{m}$}
\label{diatrabeone}
In the previous section, equation \ref{coeffvarianceawful}
taught us that---formally, at least---the variances in the derived fit
parameters (or their uncertainties, which are the square roots) depend
only on the adopted uncertainties $\sigma_{m}$ and not on the {\it
actual variance} of the {\it datapoints}.
Are you bothered by the fact that the variances of the derived
parameters ${\bf s_a}$ are independent of the data residuals? You should be:
it is obvious that the residuals should affect ${\bf s_a}$.
Formally, ${\bf s_a}$ depends only on the {\it adopted}
uncertainties $\sigma_{m}$, which are chosen beforehand by
you---you're supposed to be such a good experimentalist that you really do
know the intrinsic uncertainty in your measured values. Moreover, you
are assuming that there are no other sources of uncertainty---such as
``cosmic scatter'' or an inappropriate model to which you are fitting
the data. Suppose your adopted values of $\sigma_{m}$ are off by a
common scale factor, i.e.\ if $\sigma_{m,adopted} =
f\sigma_{m,true}$. Then $\widehat{\chi^2} \approx f^{-2}$ instead of
$\widehat{\chi^2} \approx 1$. And to obtain the parameter errors from
$\delta \chi^2$, you must find the offset $\delta x$ such that $\Delta
\chi^2 = f^{-2} \approx \widehat{\chi^2}$.
You can correct for this erroneous common factor $f$ by dividing
your adopted values of $\sigma_{m}$ by $f$. Of course, you don't
know what this factor $f$ is until you do the chi square fit. Dividing
them by $f$ is equivalent to multiplying them by $\widehat{\chi}$. And,
of course, the same as multiplying $\sigma_{m}^2$ by
$\widehat{\chi^2}$.
\subsubsection{When you're using equation \ref{coeffvariancegood}\dots}
\label{diatrabetwo}
To be kosher, after having run through the problem once with the
adopted $\sigma_{m}$, calculate the $\widehat{\chi^2}$; multiply all
$\sigma_{m}$ by $\widehat{\chi}$; and redo the problem so that the new
$\widehat{\chi^2} = 1$. Then the derived variance ${\bf s_a}$ is also
correct. You can obtain it either as the corresponding diagonal to the
covariance matrix (equations \ref{coeffvariancegood} and
\ref{coeffvarianceawful}, which are identical in this case) or by
finding what departure from $x_0$ is necessary to make $\Delta \chi^2 =
1$.\footnote{To understand this comment about $\Delta \chi^2 = 1$, see
\S \ref {chicoeffs}.} This redoing the fit may seem like unnecessary
work, but when we deal with multiparameter error estimation it's the
best way to go to keep yourself from getting confused.
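This run-twice recipe is mechanical enough to sketch in Python ({\tt numpy} standing in for IDL; the helper names are our own). For the common-factor case it makes the new $\widehat{\chi^2}$ exactly unity:

```python
import numpy as np

def rescale_and_refit(X, y, sigma):
    """One chi-square pass, then multiply all sigma_m by chi-hat and refit."""
    def fit(sig):
        Xc, Yc = X / sig[:, None], y / sig
        a = np.linalg.lstsq(Xc, Yc, rcond=None)[0]
        r = Yc - Xc @ a
        M, N = X.shape
        return a, (r @ r) / (M - N)
    a, chi2_hat = fit(sigma)
    sigma_new = sigma * np.sqrt(chi2_hat)   # multiply each sigma_m by chi-hat
    a_new, chi2_hat_new = fit(sigma_new)    # redo the problem
    return a_new, sigma_new, chi2_hat_new

# demo: fit a constant to three points whose adopted sigmas are too small
X = np.ones((3, 1))
y = np.array([1.0, 2.0, 3.0])
a2, sigma2, chi2_hat2 = rescale_and_refit(X, y, np.full(3, 0.5))
```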
\subsubsection{Think about your results!}
In the case $\widehat{\chi^2} \approx 1$, the dispersions of the
observed points $s_m$ are equal to
the intrinsic dispersions of the datapoints $\sigma_{m}$ and the
mathematical model embodied in the least-squares fit is perfect. That,
at least, is the {\it theoretical} conclusion. In practice, however,
your obtaining such a low, good value for $\widehat{\chi^2}$ might mean
instead that you are using too large values for $\sigma_{m}$:
you are ascribing more error to your datapoints than they really have,
perhaps by not putting enough faith in your instrument.
But there is {\it another way} you can get artificially
small values for $\widehat{\chi^2}$. This will occur if your {\it
measurements are correlated}. Suppose, for example, that by mistake you
include the same measurements several times in your fit. Then your
measurements are no longer independent. Cowan discusses this possibility
in his \S 7.6.
High values of $\widehat{\chi^2}$ indicate that the model
is not perfect and could be improved by the use of a different model,
such as the addition of more parameters---or, alternatively, that you
think your equipment works better than it really does and you are
ascribing {\it less} error to your datapoints than they really have. And
in this case, using equation \ref{coeffvarianceawful} instead of
\ref{coeffvariancegood} is disastrous.
Think about your results.
\subsubsection{When your measurements are correlated\dots}
One more point, a rather subtle one. There are circumstances
in which your datapoints are not independent. Then the formulation
of chi-square fitting (and least-squares fitting, for that matter) is
more complicated. You need to calculate the covariance matrix for the
measured values $y_m$; call this covariance matrix ${\bf V}$. If this
matrix is not diagonal, then $\chi^2$ is no longer given by equation
\ref{chisq}. Rather, it is given by
\begin{mathletters}
\begin{equation}
\label{chisq_cov}
\chi^2 = {\bf \delta Y^T \cdot V^{-1} \cdot \delta Y} \ .
\end{equation}
\noindent Of course, this leads to a different expression for {\bf a},
which replaces equation \ref{a_equation},
\begin{equation}
{\bf a} = ({\bf X^T \cdot V^{-1} \cdot X})^{-1}
{\bf \cdot X^T \cdot V^{-1} \cdot Y} \ ,
\end{equation}
\noindent and also to a different equation for the covariance matrix,
\begin{equation}
[\alpha]^{-1} = ({\bf X^T \cdot V^{-1} \cdot X})^{-1} \ .
\end{equation}
\end{mathletters}
Correlated datapoints can occur when the measured $y_m$ are
affected by systematic errors or instrumental effects. Cowan \S 7.6
discusses this case. For example, suppose you take an image with a
known point-spread function (psf) and want to fit an analytic function
to this image. Example: a background intensity that changes linearly
across the field plus a star. Here the independent variables in the
function are the $(x,y)$ pixel positions and the data are the
intensities in each pixel. You'd take the intensity in each individual
pixel and fit the assumed model. But here your data values are
correlated because of the psf. Because you know the psf, you know the
correlation between the various pixels. Such a formulation is required
for CBR measurements because of the sidelobes of the radio telescope
(which is just another way of saying ``psf'').
Another case of correlated measurements occurs when your assumed
model is incorrect. This is the very definition of correlation, because
the residual $\delta y_m$ is correlated with the data value $y_m$. But
how do you calculate ${\bf V}$? If you could do a large number $J$ of
experiments, each with $M$ datapoints producing measured values
$y_{m,j}$, each measured at different values of $x_{m}$, then each
element of the covariance matrix would be $V_{mn} = {1 \over J} \sum_j (y_{m,j} -
\overline{y_m}) (y_{n,j} - \overline{y_n})$. You don't normally have
this opportunity. Much better is to look at your residuals; if the
model doesn't fit, use another one!
Normally, and in particular everywhere in this
tutorial, we assume the measurements are uncorrelated, so ${\bf V}$ is
diagonal; with the $1/\sigma_m$ weighting already folded in, this amounts
to taking ${\bf V} = {\bf I}$ (the identity matrix).
\section{CHI-SQUARE FITTING AND WEIGHTED FITTING: DISCUSSION INCLUDING
COVARIANCE } \label{chicoeffs}
\subsection{ Phenomenological description}
Consider the first two coefficients in our example of \S
\ref{covarianceone}. In this example, the fit gives $y = A_0 + A_1 t +
A_2 t^2$, where the numerical values are given in vector form by
equation~\ref{avalues}. The coefficient $A_0$ is the $y$-intercept and
$A_1$ is the slope. They have derived values $A_0 = 96 \pm 34$ and $A_1
= 4 \pm 9$.
Remember what these uncertainties really mean: in an infinity of
similar experiments, you'll obtain an infinity of values of $(A_0,A_1)$
that are normally distributed with dispersions (34,9). Loosely
speaking, this means that $A_0$ lies between $(96-34=62)$ and
$(96+34=130)$ and $A_1$ lies between $-5$ and $13$.
Suppose you are interested in knowing about $A_0$ {\it without
regard to $A_1$}. By this we mean that as $A_0$ is varied from its
optimum value of 96, $\chi^2$ increases from its minimum value. As we
vary $A_0$, if we allow $A_1$ to take on whatever value it needs to for
the purpose of minimizing $\chi^2$, then this is what we mean by
``knowing about $A_0$ without regard to $A_1$''. For this case, the
uncertainty of $A_0$ is indeed 34. Ditto for $A_1$. In other words,
equations \ref{coeffvarianceone} and \ref{coeffvariancegood} apply.
However, if you are interested in knowing about {\it both}, you
must include their covariance. In our example, the large negative
covariance follows logically just from looking at a graph: if you fit
some points, all of which lie at positive $t$, then a more negative
derived slope will raise the $y$-intercept.
Specifically, the large negative covariance means that positive
departures of $A_0$ are associated with negative departures of $A_1$.
So even though the {\it individual} values $\delta A_0=+34$ and $\delta
A_1=+9$ are acceptable, you {\it cannot} conclude that the {\it pair} of
values $(\delta A_0,\delta A_1) = (+34, +9)$ is acceptable, because this
pair has both positive. In contrast, what {\it is} acceptable here
would be something like $(\delta A_0,\delta A_1) = (+34, -9)$.
We stress that the acceptable ranges of values depend on what
you are interested in. This is sort of like the observer's influence in
quantum mechanics. If you are interested in $A_1$ alone, then you can
say $A_1 = 4 \pm 9$ and, in making this statement, you have to realize
that, as $A_1$ varies over this range, $A_0$ can vary over (formally, at
least) the range $(\infty \rightarrow -\infty)$: you just don't give a
damn {\it what} happens to $A_0$ because you're not interested. But the
moment you become interested and restrict its possible range, that
influences the possible range for $A_1$, too.
There is no simple relationship between the covariance matrix
elements and the acceptable ranges. For two variables, the best way to
express this is to construct the ellipses that define the loci of
constant $\Delta \chi^2$ and present them on a graph with axes $(\delta
a_0, \delta a_1)$ as in BR Figure 11.2 or NR Figure 14.5.4. For three
variables, these ellipses become ellipsoids; for four, they become
four-dimensional volumes, etc.
\begin{figure}[h!]
\begin{center}
\leavevmode
%\includegraphics[height=7.5in, width=6.0in]{twolvl3b.ps}
%\includegraphics[scale=.55, angle=90]{lsfitfig.ps}
\includegraphics{chisq.ps}
\end{center}
\caption{Illustrating
the meaning of variance and covariance between $(a_1,a_2)$ for our numerical
example. See text for discussion. \label{figchisq}}
\end{figure}
We illustrate these concepts for the $(a_1,a_2)$ parameters in
our numerical example. We subtracted 7.75 from all times so that the
covariance would be small enough to illustrate the difference between
the tangents to the ellipses and the end points of the ellipses.
Contours are calculated as described in \S \ref{easycalc} and are at
$\Delta \chi^2 = 1$ and 2.3. The dashed horizontal and vertical lines
are at $\delta a = \pm \sigma_a$.
First consider the pair of vertical lines, which are drawn at
$\delta a_1 = \pm \sigma_{a_1}$, where $\sigma$ is the square root of
the variance of the parameters as described in equations
\ref{coeffvarianceone}, \ref{lscoefferrorone}, \ref{coeffvariancegood},
and \ref{coeffvarianceawful}. If the datapoints are projected
downward, i.e.\ if we form the marginal pdf of $\delta a_1$ by taking
small strips of $\delta a_1$ and integrating over $\delta a_2$, that
marginal pdf is Gaussian; ditto for the other
coordinate. Thus, $68\%$ of the points lie between these dashed lines.
This is what we mean by the phrase ``being interested in knowing about
$a_1$ without regard to $a_2$''. If we allow $a_2$ to vary so as to
minimize $\chi^2$ as we consider departures $\delta a_1$, then the pdf
of $\delta a_1$ has dispersion $\sigma_{a_1}$. Alternatively, we can say
that in a large number of experiments, the pdf of $\delta a_1$ follows a
chi-square pdf with one degree of freedom if we don't care what happens
to $\delta a_2$.
If, however, we are concerned about the pair, then we must look
not at the projection down one axis or the other, but rather at the
two-dimensional distribution. This is characterized by the tilted
ellipses. Here, for a large number of experiments, the pair $(a_1,a_2)$
follows a chi-square distribution with 2 degrees of freedom (if we don't
care about $a_0$; if we do, it's 3 degrees of freedom and the ellipse
becomes an ellipsoid, but this is very hard to plot!). For $\nu=2$,
$68.3\%$ of the points lie within $\Delta \chi^2=2.3$, where we have
drawn the outer contour in Figure \ref{figchisq}. The points inside this
ellipse are darker; $68.3\%$ of the points lie within that ellipse.
The best description of the specifics of calculating these
ellipsoids is in BR \S11.5 (Confidence Intervals, Confidence Levels for
Multiparameter Fits). To describe it, we'll talk specifically about our
numerical example, which has $M=4$ measurements and $N=3$ unknowns. The
unknowns are ${\bf a} = [a_0, a_1, a_2]$. We'll first begin by
discussing the case of a single parameter; then we'll generalize.
\subsection{ Calculating the uncertainties of a single
parameter---gedankenexperiment}
\label{gencalc0}
First, suppose we want to know the value $\sigma_{a_0}$ without
regard to the values of $a_1$ and $a_2$. Having already done the
solution, we know the best-fit value of $a_0$, so we consider
variations $\delta a_0$ around this best value.
Pick a nonzero value of $\delta a_0$ and redo the least-squares
solution for $[a_1, a_2]$; because of the covariance, these adopt values
different from those when $\delta a_0 = 0$. This gives a new value for
$\chi^2$ which is, of course, larger than the minimum value that was
obtained with $\delta a_0=0$. Call this difference $\Delta \chi_{\delta
a_0}^2$. Determine the dependence of $\Delta \chi_{\delta a_0}^2$ upon
$\delta a_0$ and find the value of $\delta a_0$ such that $\Delta
\chi_{\delta a_0}^2 = 1$. This is the desired result, namely the value
$\sigma_{a_0}$ without regard to the values of $a_1$ and $a_2$.
This value is $\sigma_{a_0}^2 = [\alpha]_{00}^{-1}$, the same
result quoted in equation \ref {coeffvarianceawful}.
Consider now what you've done in this process. For each
least-squares fit you used a trial value of $\delta a_0$. In specifying
$\delta a_0$ you had exactly one degree of freedom because you are
fixing one and only one parameter. Having done this, you could do a
large number of experiments (or Monte Carlo trials) to determine the
resultant distribution of $\Delta \chi_{\delta a_0}^2$. It should be
clear that this distribution follows a chi-square distribution with one
degree of freedom ($\nu = 1$). So the uncertainty $\sigma_{a_0}$ is
that value for which $\Delta \chi_{\delta a_0}^2 = 1$. (The chi-square
fit for the other two parameters has $M-2$ degrees of freedom, but this
is irrelevant because---by hypothesis---you don't care what happens to
those variables.)
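This gedankenexperiment is cheap to carry out numerically. The following numpy sketch (invented quadratic data, our own variable names) fixes $a_0$ at an offset from its best-fit value, refits the remaining parameters, and confirms that $\Delta \chi^2 = 1$ occurs at $\delta a_0 = \sigma_{a_0}$:

```python
import numpy as np

# Sketch of the gedankenexperiment: fix a0 at an offset from its best-fit
# value, refit the remaining parameters, and watch Delta chi^2.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
X = np.column_stack([np.ones_like(x), x, x**2])   # basis [1, x, x^2]
sigma = 1.0
y = 2.0 + 0.5 * x - 0.1 * x**2 + rng.normal(0, sigma, x.size)

a_best, *_ = np.linalg.lstsq(X, y, rcond=None)
chi2_min = np.sum((y - X @ a_best) ** 2) / sigma**2

def chi2_fixed_a0(a0):
    """Refit [a1, a2] with a0 held fixed; return the new chi^2."""
    Xr = X[:, 1:]
    ar, *_ = np.linalg.lstsq(Xr, y - a0, rcond=None)
    return np.sum((y - a0 - Xr @ ar) ** 2) / sigma**2

# Delta chi^2 = 1 should occur at delta a0 = sigma_a0 = sqrt([alpha]^-1_00).
cov = np.linalg.inv(X.T @ X) * sigma**2
sigma_a0 = np.sqrt(cov[0, 0])
dchi2 = chi2_fixed_a0(a_best[0] + sigma_a0) - chi2_min
print(round(dchi2, 6))   # -> 1.0 for this linear model
```

For a linear model the agreement is exact; for a nonlinear model with large errors it holds only near the minimum, which is the point of the gedanken technique.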
\subsection{ Calculating the uncertainties of two
parameters---gedankenexperiment} \label{gencalc1}
Suppose we want to know the value $(\sigma_{a_0},\sigma_{a_2})$
without regard to the value of $a_1$. Now we consider variations
$(\delta a_0, \delta a_2)$ around the best values $(a_0, a_2)$.
Pick values for $(\delta a_0, \delta a_2)$ and redo the
least-squares solution for $a_1$. This gives a new value for $\chi^2$
which is, of course, larger than the minimum value that was obtained
with $(\delta a_0,\delta a_2)=0$. Call this difference $\Delta
\chi_{(\delta a_0,\delta a_2)}^2$. As above, this follows a chi-square
distribution, but now with $\nu = 2$. Determine the dependence of
$\Delta \chi_{(\delta a_0,\delta a_2)}^2$ upon $(\delta a_0, \delta
a_2)$ and find the set of values of $(\delta a_0, \delta a_2)$ such that
$\Delta \chi_{(\delta a_0, \delta a_2)}^2 = 2.3$. This is the desired
result, namely the ellipse within which the actual values $(\delta a_0,
\delta a_2)$ lie with a probability of $68.3\%$, without regard to the
value of $a_1$.
These values can be defined in terms of the curvature matrix
$[\alpha]$, as we discuss below.
Consider now what you've done in this process. For each
least-squares fit you used trial values of $(\delta a_0,\delta a_2)$.
In specifying them you had exactly two degrees of freedom because you
are fixing two parameters. This distribution follows a chi-square
distribution with two degrees of freedom ($\nu = 2$). So the uncertainty
contour is the locus for which $\Delta \chi_{(\delta a_0,\delta a_2)}^2 =
2.3$, which follows from the integrated probability for the chi-square
distribution for $\nu=2$. (The chi-square fit for the third parameter
$a_1$ has $M-1$ degrees of freedom, but again this is irrelevant.)
One can expand this discussion in the obvious way. Consider
finally\dots
\subsection{ Calculating the uncertainties of three
parameters---gedankenexperiment} \label{gencalc2}
Suppose we want to know the values of all three parameters (or,
generally, all $N$ parameters). Then we pick trial values for all
three. There is no least-squares fit for the remaining parameters,
because there are none. For each combination of the three (or $N$)
parameters we obtain $\Delta \chi_{\bf a}^2$, which defines a 3- (or $N$-)
dimensional ellipsoid. This follows a chi-square distribution with
$\nu = 3$ (or $N$). We find the (hyper)surface such that $\Delta
\chi_{\bf a}^2$ is that value within which the integrated probability is
$68.3\%$. This defines the (hyper)surface of $\sigma_{\bf a}$.
\subsection{ Doing these calculations the non-gedanken easy way}
\label{easycalc}
The obvious way to do the gedanken calculations described above
is to set up a grid of values in the parameters of interest $(\delta
a_n)$; perform the chi-square fit on the remaining variables, keeping
track of the resulting grid of $\chi^2$; and plot the results in terms
of a contour plot (for two parameters of interest) or higher dimensions.
There's an easier way which is applicable {\it unless} you are
doing a nonlinear fit and the parameter errors are large\footnote{In
which case you use the gedanken technique!}. The curvature matrix
$[\alpha]$ of equation \ref{curvmatrixalpha} contains the matrix of the
second derivatives of $\chi^2$ with respect to all pairwise combinations
of $\delta a_n$, evaluated at the minimum $\chi^2$; it's known as the
curvature matrix for this reason. Clearly, as long as the Taylor
expansion is good we can write
\begin{equation}
\Delta \chi_{\bf a}^2 = {\bf \delta a^T \cdot [\alpha] \cdot \delta a} \ .
\end{equation}
\noindent Knowing the curvature matrix, we don't have to redo the fits
as we described above. Rather, we can use the already-known matrix
elements.
Suppose, however, that you are interested in an $N_i$ (for
``$N_{interested}$'') subset of the $N$ parameters, and you want to know
their variances (and covariance matrix) {\it without regard to the
values of the other $(N - N_i)$ parameters}. You could use the gedanken
technique, but you can also use the ``non-gedanken easy way'' by using
the following procedure [see NR \S15.6 (Probability Distribution of
Parameters in the Normal Case)]. \begin{enumerate}
\item Decide which set of parameters you are interested in; call
this number $N_i$ and denote their vector by ${\bf a_i}$. Here we use
the above example and consider $N_i=2$ and ${\bf a_i} = [a_0,a_2]$.
\item From the $N \times N$ covariance matrix $[\alpha]^{-1}$,
extract the rows and columns corresponding to the $N_i$ parameters and
form a new $N_i \times N_i$ covariance matrix $[\alpha]_i^{-1}$; in our
case the original covariance matrix is
\begin{mathletters}
\begin{eqnarray}
[\alpha]^{-1} = {\bf XXI} = \left[
\begin{array}{ccc}
1156.8125 & -303.000 & 18.4375 \\
-303.000 & 81.000 & -5.000 \\
18.4375 & -5.000 & 0.31250 \\
\end{array} \; \right]
\end{eqnarray}
\noindent and it becomes
\begin{eqnarray}
[\alpha]_i^{-1} = {\bf XXI} = \left[
\begin{array}{cc}
1156.8125 & 18.4375 \\
18.4375 & 0.31250 \\
\end{array} \; \right] \ .
\end{eqnarray}
\end{mathletters}
\item Invert this new covariance matrix to form a new curvature
matrix $[\alpha]_i$. The elements differ from those in the original
curvature matrix.
\item As usual, we have
\begin{equation}
\Delta \chi_{\bf a_i}^2 = {\bf \delta a_i^T \cdot [\alpha]_i \cdot \delta
a_i} \ ,
\end{equation}
\noindent so find the locus of ${\bf a_i}$ such that the integrated
probability of $\Delta \chi_{\bf a_i}^2$ for $\nu = N_i$ contains $68.3\%$
of the space; e.g.\ for $\nu=2$ this is $\Delta \chi_{\bf a_i}^2 = 2.3$.
\end{enumerate}
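The four steps above take only a few lines of numpy. Here they are, using the $3 \times 3$ covariance matrix from our numerical example (the trial offsets are arbitrary):

```python
import numpy as np

# Step 1-2: interested in a_i = [a0, a2], so extract rows/columns 0 and 2
# of the 3x3 covariance matrix [alpha]^-1 from the text.
cov = np.array([[1156.8125, -303.000, 18.4375 ],
                [-303.000,    81.000, -5.000  ],
                [  18.4375,   -5.000,  0.31250]])
idx = [0, 2]
cov_i = cov[np.ix_(idx, idx)]      # new N_i x N_i covariance matrix

# Step 3: invert it to form the new curvature matrix [alpha]_i, whose
# elements differ from those of the original curvature matrix.
alpha_i = np.linalg.inv(cov_i)

# Step 4: Delta chi^2 for a trial offset (delta a0, delta a2); the contour
# Delta chi^2 = 2.3 encloses 68.3% of the probability for nu = 2.
da = np.array([1.0, 0.05])         # arbitrary trial offsets
dchi2 = da @ alpha_i @ da
```

Evaluating `dchi2` on a grid of `da` values and contouring at 2.3 produces exactly the tilted ellipses of Figure \ref{figchisq}.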
You may well wonder why, in steps 2 and 3, you need to derive a
new curvature matrix from the extracted elements of the original
covariance matrix. Why not just use the extracted elements of the
original curvature matrix? To understand this, read NR's discussion
surrounding equation (15.6.2); this is not very clear, in my opinion. I
find it easier to recall the way covariance matrices propagate errors
according to the derivatives of the quantities derived, as in Cowan's
equation 1.54. I could explain this here but, quite frankly, don't have
the time; maybe in the next edition of these notes!
\subsection{ Important comments about uncertainties}
Having said all the above, we offer the following important {\it
Comments}: \begin{itemize}
\item The easiest way to calculate these (hyper)surfaces is to
set up a grid in $N_i$-dimensional space of trial values for $\delta
{\bf a_i}$ and use a contour plot or volume plot package to plot the
loci of constant $\Delta \chi_{\bf a_i}^2$.
\item The procedure described in \S \ref{easycalc} works well
for linear fits, or nonlinear fits in which the $\sigma_{\bf a}$ are
small so that $\Delta \chi^2$ is well-approximated by the second
derivative curvature matrix. This is not necessarily the case; an
example is shown in BR Figure 11.2. Here, the higher-order curvature
terms are important and it's better to actually redo the fit for the
grid of trial values of $\bf a_i$ as described above in \S
\ref{gencalc0}, \ref{gencalc1}, and \ref{gencalc2}. In other words, use
the gedanken technique.
\item The variance (i.e., uncertainty squared) of the derived
parameters {\bf a} depends {\it only on the elements in the covariance
matrix $[\alpha]^{-1}$}. These, in turn, depend only on the curvature
matrix $[\alpha]$. These, in turn, depend only on ${\bf X}$. This
matrix {\bf X} is the matrix of the quantities that are {\it known
exactly}. For example, we began with the example in which the elements
of {\bf X} were the times at which the measurements were taken.
Generally, then, the curvature and covariance matrix elements
depend on the {\it locations} of the datapoints (the ensemble of $t_m$
in equation \ref{one}) and on your adopted values for $\sigma_{m}$, but
not on the {\it measured values} (the ensemble of $y_m$ in equation
\ref{one}). Because of this\dots
\item Think {\it before} making your measurements about the
covariance matrix and how to minimize the off-diagonal elements. By
taking measurements at {\it well-chosen} times, or {\it well-chosen}
values of the independent variable $x_m$ whatever it is, you can really
optimize the accuracy-to-effort ratio! For example, in our numerical
example if you can get a few measurements at negative times your efforts
will be repaid in terms of much better accuracy for the $y$-intercept.
\end{itemize}
\section{REJECTING BAD DATAPOINTS II: STETSON'S METHOD PLUS
CHAUVENET'S CRITERION}
\label{chauvenetsectiontwo}
Chauvenet's criterion is an on-off deal: either you include the
datapoint or you don't. This makes sense from a philosophical point of view:
either a datapoint is good or not, so you should either include it or
exclude it.
However, when doing a nonlinear fit this presents a problem. As
you iterate, the solution changes, and a given datapoint can change from
being ``bad'' to ``good''. Or vice-versa. You can imagine being in a
situation in which the iteration oscillates between two solutions, one
including a particular datapoint and the other excluding it; the solution
never converges, it just keeps chasing its tail.
Enter Stetson's beautiful technique\footnote{Stetson is one of
those anomalies, a {\it true} expert on fitting. He invented many of the
stellar photometry routines used in {\it daophot}, all of which use
least-squares techniques. He provides a lively, engaging discussion of
many fascinating and instructive aspects in his website: {\it
http://nedwww.ipac.caltech.edu/level15/Stetson/Stetson4.html}.}. Stetson
reasons that we shouldn't have an on-off criterion. Rather, it should
relieve a datapoint of its influence adiabatically: as its residual gets
larger and larger, its weight gets smaller and smaller. With this, in
nonlinear fitting all datapoints are always included and their weights
automatically adjust as the fit parameters home in on their correct
values. And you can't get into the chasing-tail syndrome that can happen
with the strict on-off inclusion.
\subsection{Stetson's sliding weight} \label{sliding}
Stetson recommends using a sliding weight. To explain this, we
review the ML concept of chi-square fitting. We define chi-square as
\begin{mathletters} \label{stetsonchisq1}
\begin{equation} \chi^2 = \sum_{m=0}^{M-1} { (y_m -
{\bf a \cdot f(x_m)})^2 \over \sigma_m^2} \ ,
\end{equation}
\noindent and we minimize $\chi^2$ by setting its derivative with
respect to each parameter $a_n$ equal to zero:
\begin{equation}
{d \chi^2 \over da_n} = -2 \sum_{m=0}^{M-1} { f_n(x_m) \Delta y_m \over
\sigma_m^2} \ .
\end{equation}
\noindent Here $\Delta y_m = (y_m - {\bf a \cdot f(x_m)})$. For each
coefficient $a_n$ setting this to zero gives
\begin{equation} \label{sevenonec}
\sum_{m=0}^{M-1} { f_n(x_m) \Delta y_m \over \sigma_m^2} = 0 \ .
\end{equation}
\end{mathletters}
Now we wish to modify this equation by introducing a weight
$w_{(|\Delta y_m|)}$ that makes datapoints with large $|\Delta y_m|$
contribute less, so it reads like this:
\begin{equation} \label{wgtmodone}
\sum_{m=0}^{M-1} { w_{(|\Delta y_m|)} f_n(x_m) \Delta y_m
\over \sigma_m^2} = 0 \ .
\end{equation}
\noindent It's clear that we need the following properties for
$w_{(\Delta y_m)}$: \begin{enumerate}
\item $w_{(\Delta y_m)} = w_{(|\Delta y_m|)}$, meaning simply that
it should depend on the absolute value of the residual and not bias the
solution one way or the other.
\item $w_{(|\Delta y_m|)} \rightarrow 1$ as $|\Delta y_m|
\rightarrow 0$, meaning that datapoints with small residuals contribute their
full weight.
\item $w_{(|\Delta y_m|)} \rightarrow 0$ as $|\Delta y_m|
\rightarrow \infty$, meaning that datapoints with large residuals contribute
nothing. \end{enumerate}
\noindent Stetson recommends
\begin{equation} \label{stetsonwgteqn}
w_{(|\Delta y_m|)} = {1 \over 1 +
\left( |\Delta y_m| \over \alpha \sigma \right) ^\beta } \ .
\end{equation}
\noindent This function $w_{(|\Delta y_m|)}$ has the desired properties.
Also, for all $\beta$ it equals 0.5 for $|\Delta y_m| = \alpha \sigma$.
As $\beta \rightarrow \infty$ the cutoff gets steeper and steeper, so in
this limit it becomes equivalent to a complete cutoff for $|\Delta y_m|
> \alpha \sigma$.
Stetson recommends $\alpha = 2$ to 2.5, $\beta = 2$ to 4 on the
basis of years of experience. Stetson is a true expert and we should
take his advice seriously; he provides a vibrant discussion to justify
these choices in real life, including an interesting set of numerical
experiments.
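Equation \ref{stetsonwgteqn} is one line of code. A quick sketch verifying its three required properties, plus the half-weight point (plain Python, our own function name):

```python
def stetson_weight(dy, sigma, alpha=2.0, beta=2.0):
    """Stetson's sliding weight, equation (stetsonwgteqn):
    w = 1 / (1 + (|dy| / (alpha*sigma))**beta)."""
    return 1.0 / (1.0 + (abs(dy) / (alpha * sigma)) ** beta)

# Property 2: small residuals keep full weight.
print(stetson_weight(0.0, 1.0))                   # -> 1.0
# Half-weight point: w = 0.5 at |dy| = alpha*sigma, for any beta.
print(stetson_weight(2.0, 1.0))                   # -> 0.5
print(stetson_weight(2.0, 1.0, beta=4.0))         # -> 0.5
# Property 3: large residuals contribute essentially nothing.
print(stetson_weight(1e6, 1.0) < 1e-6)            # -> True
```

Increasing `beta` steepens the transition, approaching the sharp on-off cutoff in the limit $\beta \rightarrow \infty$.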
However, for large $M$ I see a problem with the choice $\alpha =
2$ to 2.5. For large $\beta$, for which the cutoff is sharp, it seems to
me that the cutoff should duplicate Chauvenet's criterion. Referring to
equation \ref{chauvenetexplicit}, this occurs by setting
\begin{equation} \label{alphaeqn}
\alpha = {\rm erf}^{-1}\left( {1 - {1 \over 2M}} \right)
\end{equation}
\noindent and I recommend making this change, at least for problems
having reasonably large $M$; this makes $\alpha$ larger than Stetson's
choice. I'm more of a purist than Stetson, probably because I'm a radio
astronomer and often fit thousands of spectral datapoints that are,
indeed, characterized mainly by Gaussian statistics. Stetson is an
optical astronomer and probably sees many more non-Gaussian departures from
things like cosmic rays. Nevertheless, in a CCD image with millions of pixels,
of which only a fraction are characterized by non-Gaussian problems such
as cosmic ray hits, it seems to me only reasonable to increase $\alpha$
above Stetson's recommended values by using equation \ref{alphaeqn}.
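Equation \ref{alphaeqn} requires the inverse error function. A stdlib-only sketch (we invert {\tt math.erf} by bisection rather than assume {\tt scipy.special.erfinv} is available; the function names are ours):

```python
import math

def erfinv(p, lo=0.0, hi=6.0, tol=1e-12):
    """Invert math.erf by bisection (erf is monotonic; 0 <= p < 1)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.erf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def chauvenet_alpha(M):
    """alpha = erf^-1(1 - 1/(2M)), equation (alphaeqn)."""
    return erfinv(1.0 - 1.0 / (2.0 * M))

# alpha grows slowly with M and exceeds Stetson's 2-2.5 for large M:
print(chauvenet_alpha(1_000_000))
```

The slow growth with $M$ is why the correction matters mainly for the large-$M$ problems discussed above.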
\subsection{Implementation of the weight in our matrix equations}
Clearly, implementing Stetson's method requires a weighted fit,
so you have to use the chi-square technique discussed in \S
\ref{chisqsection}. There equation \ref{diagsigma} defines a matrix of
weights (which is diagonal) in which
\begin{equation}
{\bf W}_{m,m} = {1 \over \sigma_m} \ .
\end{equation}
\noindent Comparing this with equation \ref{sevenonec}, it's clear what
to do: we modify this equation to read
\begin{equation} \label{wgteqntwo}
{\bf W}_{m,m} = {w_m^{1/2} \over \sigma_m} \ ,
\end{equation}
\noindent where the weight $w_m$ is defined in equation
\ref{stetsonwgteqn}.
Now you must not forget that the solution depends on the
weights $w_m$, which in turn depend on the solution. Thus, when you
implement this technique, you must iterate until the solution converges,
i.e.\ until it stops changing.
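The iteration loop is short. Here is a numpy sketch with invented straight-line data containing one gross outlier (parameter choices from Stetson's recommended range; all names are ours):

```python
import numpy as np

# Iterating Stetson's weights to convergence on invented data.
x = np.linspace(0.0, 9.0, 10)
y = 1.0 + 2.0 * x
y[7] += 30.0                          # a gross outlier
sigma = np.ones_like(y)
alpha_s, beta_s = 2.5, 2.0            # within Stetson's recommended range

X = np.column_stack([np.ones_like(x), x])
w = np.ones_like(y)                   # start with full weights
for _ in range(50):
    W = np.diag(np.sqrt(w) / sigma)   # W_mm = w_m^(1/2) / sigma_m
    a = np.linalg.lstsq(W @ X, W @ y, rcond=None)[0]
    dy = y - X @ a                    # residuals depend on the solution...
    w_new = 1.0 / (1.0 + (np.abs(dy) / (alpha_s * sigma)) ** beta_s)
    if np.allclose(w_new, w, atol=1e-12):
        break                         # ...so iterate until weights stop changing
    w = w_new

# The outlier's weight collapses and the fit recovers the true line.
```

Unlike a strict on-off cut, no datapoint ever flips discontinuously between included and excluded, so the iteration cannot chase its tail.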
\section{MEDIAN/MWARS, INSTEAD OF LEAST-SQUARES, FITTING}
\label{medianfitting}
Least-squares fitting minimizes the squares of the residuals.
This means that datapoints having large residuals contribute importantly
to the fit. If these datapoints are really bad, you've got problems;
this is why it's important to get rid of outliers! Sometimes you're
faced with a set of datapoints that look bimodal: most datapoints have a
Gaussian-like pdf, and many lie outside the main distribution; sometimes
it's difficult to decide where to draw the line between outliers and
good datapoints. Or you might have non-Gaussian statistics. In these
cases, using least squares might not be a great idea because least
squares gives greatest weight to the datapoints having the largest
residuals, but you don't know what the residuals are until after you've
done the fit---and the fit is influenced, and maybe even dominated, by
the outliers!
In these cases the median is often a good solution. The median
is the solution for which there are as many positive as negative residuals,
{\it irrespective of how big they are}. The median works especially well
when the discrepant datapoints are asymmetrically distributed with
respect to sign.
The median isn't always appropriate: for example, for pdfs that
are symmetrically dominated by large residuals and have few small ones,
datapoints near the expectation value are few and far between so the
median has high error. Also, if the statistics are Gaussian then the
error of the median is somewhat greater than that of the mean (by a
factor of $\sqrt{\pi \over 2} \approx 1.25$ for large $M$, which is
straightforward to calculate).
If you have non-Gaussian statistics, then apply to your situation
these somewhat contradictory introductory remarks both extolling and
trashing the median. If you decide that a median fit is appropriate,
read on!
\subsection{ The Median versus the MARS and the MWARS}
Consider ARS, the sum of the absolute values of the residuals
(the Absolute Residual Sum) and suppose we want to minimize this. This
is the MARS: the {\it Minimum Absolute Residuals Sum}. For the standard
median (that takes the ``average'' of a bunch of numbers), the MARS is
identical to the median. This isn't true for more complicated functions,
as we briefly discuss in the next two subsections.
\subsubsection{ For the Standard Median---it's the MARS}
When we take the median of a bunch of points that's like taking
a biased average of the points. In this case the residual $\Delta y_m =
y_m - y_{MARS}$, where $y_{MARS}$ is the derived median value. Normally
we take the median by finding that datapoint for which the number of
positive residuals is equal to the number of negative ones.
Here we take a more general approach. First, write the ARS as a
function of a trial value $y_{ARS}$; we will find that the median
value is $y_{MARS}$, the one that gives the minimum of the ARS. So write
\begin{mathletters}
\begin{equation} \label{arsdef1}
ARS(y_{ARS}) = \sum_{m=0}^{M-1} | (y_m - y_{ARS})|
= \sum_{m=0}^{M-1} | \Delta y_m |
\end{equation}
\noindent The absolute value signs are horrible to deal with, so we
rewrite this as
\begin{equation}
ARS(y_{ARS}) =
\sum_{\Delta y_m > 0} ( y_m - y_{ARS})
- \sum_{\Delta y_m < 0} ( y_m - y_{ARS})
\end{equation}
\end{mathletters}
\noindent To find the minimum of the ARS we take its derivative with
respect to $y_{ARS}$ and set it equal to zero. Each term in the first sum
contributes $-1$ and each term in the second contributes $+1$, so we get
\begin{equation} \label{expmedian1}
{ d ARS \over dy_{ARS}}= - \sum_{\Delta y > 0} 1
+ \sum_{\Delta y < 0} 1 = M(\Delta y < 0) - M(\Delta y > 0) \ .
\end{equation}
\noindent Setting this equal to zero requires choosing $M(\Delta y > 0)
= M(\Delta y < 0)$. In plain English: $y_{MARS}$ is that value of
$y_{ARS}$ which provides equal numbers of positive and negative
residuals.
This is the very definition of the median! Thus, for the
``average'' of a bunch of numbers, the median is the same as the MARS.
The median is defined unambiguously only if $M$ is odd: the
median datapoint then has equal numbers of positive and negative
residuals. If $M$ is even, then the median lies anywhere between the two
middle points. Similarly, the ARS doesn't change between the two middle
points because as you increase the residual from one point the residual
from the other decreases by an identical amount, so the sum of the
absolute values doesn't change.
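A brute-force numerical check of this equivalence (invented data, numpy; we simply evaluate the ARS on a grid of trial values):

```python
import numpy as np

# Check that the minimum of ARS(y) = sum |y_m - y| sits at the median.
y = np.array([1.0, 2.0, 3.5, 7.0, 20.0])   # odd M: median unambiguous

trial = np.linspace(0.0, 25.0, 5001)
ars = np.abs(y[:, None] - trial[None, :]).sum(axis=0)
y_mars = trial[np.argmin(ars)]

print(y_mars, np.median(y))   # both equal the median, 3.5
```

Note the minimum is insensitive to the outlier at 20: move it to 200 and neither `y_mars` nor the median budges.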
\subsubsection{ For an arbitrary function, e.g.\ the slope---it's the
MWARS}
Things change if we are fitting a function other than the
``average'' (i.e., the standard median). Suppose, for example, that we
want to fit the slope $s$ using MARS. Then $y_m = s_{MARS} x_m$ so the
analog to equation \ref{arsdef1} is
\begin{equation} \label{arsdef2}
ARS(s_{ARS}) = \sum_{m=0}^{M-1} | (y_m - x_m s_{ARS})|
= \sum_{m=0}^{M-1} | \Delta y_m |
\end{equation}
\noindent Carrying out the same steps, the analog of equation
\ref{expmedian1} becomes
\begin{equation} \label{expmedian2}
{ d ARS \over ds_{ARS}}= - \sum_{\Delta y > 0} x_m
+ \sum_{\Delta y < 0} x_m \ .
\end{equation}
\noindent This is no longer the median. Rather, it's a {\it weighted
ARS}, or WARS. In calculating the ARS, each datapoint is weighted by its
$x$ value.
This weighting is exactly the same weighting that would occur in
a standard least-squares fit for the slope. To show this, carry out the
steps in equations \ref{stetsonchisq1} for a single function $f(x) =
sx$. This makes sense because points with large $x$ are intrinsically
able to define the slope better than points with small $x$.
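Carrying out those steps explicitly (a quick sketch, taking uniform
$\sigma_m$ for clarity), the least-squares and MARS minimization
conditions for the single function $f(x) = sx$ are, respectively,
\begin{equation}
\sum_{m=0}^{M-1} x_m \, \Delta y_m = 0 \qquad {\rm and} \qquad
\sum_{m=0}^{M-1} x_m \, {\rm sgn}(\Delta y_m) = 0 \ ;
\end{equation}
\noindent in both conditions each datapoint enters with the factor
$x_m$, so points at large $x$ carry proportionally more weight.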
\begin{figure}[h!]
\begin{center}
%\leavevmode
%\includegraphics[height=7.5in, width=6.0in]{twolvl3b.ps}
%\includegraphics[scale=.55, angle=90]{lsfitfig.ps}
\includegraphics{slope_medianplot.ps}
\end{center}
\caption{Illustrating the difference between the {\it median} and {\it
MWARS} fits for a slope. \label{slope_medianplot}}
\end{figure}
Let's explore the effect of this MWARS by considering a simple
example, shown in Figure \ref{slope_medianplot}. Five datapoints are
shown with black circles. The dashed line is the median fit; there are
two points above and two below the line, so it is the true median. But
this fit is unsatisfying because it's the datapoints at large $x$ that
matter for determining the slope.
The solid line is the {\it MWARS} fit. This is
satisfying because the points at large $x$ cluster around it instead of
lying to one side, and the points at small $x$ (which have comparable
residuals to the others) don't (and intuitively shouldn't) matter as much.
Our intuition, as well as the math in equation \ref{expmedian2},
tells us that the {\it MWARS} is the appropriate choice. This
is valid for not only this example, but for {\it any} combination of
functions such as the general case formulated in equations
\ref{stetsonchisq1}.
\subsection{ The General Technique for MWARS Fitting}
By being clever with weights $w_m$ we can formulate a general
technique for MWARS fitting\footnote{It is fortunate that
MWARS, instead of median, fitting is what we want. Otherwise
we could not formulate the solution using $w_m$ because, when $y$
depends on more than one function of $x$, one would need a separate
weight for each function---and that is completely incompatible with the
general least-squares approach.}. Consider chi-square fitting the
function $y = {\bf a \cdot f(x)}$, in which ${\bf a}$ is an $N$-long
vector of unknown parameters and ${\bf f(x)}$ a set of $N$ basis
functions. In \S \ref{sliding} we reviewed the definition of $\chi^2$
and the resulting equations for minimization; this discussion culminated
in equation \ref{wgtmodone} in which we included a weight $w_m$,
specified by the user to accomplish some particular objective. We repeat
that equation here:
\begin{equation} \label{wgtmodone2}
\sum_{m=0}^{M-1} { w_m f_n(x_m) \Delta y_m
\over \sigma_m^2} = 0 \ .
\end{equation}
Here our objective is to reproduce the spirit of equation
\ref{expmedian2}, in which all dependence on $\Delta y_m$ (but not on
$f_n$ or $\sigma_m^2$) is removed so that the sum is the multifunction
equivalent of equation \ref{expmedian2}. To accomplish
this, we simply choose
\begin{equation} \label{wmmm}
w_m = {1 \over |\Delta y_m|} \ .
\end{equation}
\noindent So MWARS fitting is just another version of the
sliding weight technique of \S \ref{sliding}.
To implement MWARS fitting, you include a factor $w_m^{1/2}$ in
the diagonal elements of equation \ref{diagsigma}. Begin with $w_m = 1$
and iterate, keeping track of the minimum value of $|\Delta y_m|$. When
this minimum value stops changing by very much you've iterated enough.
One more point: you might get too close to one point, making its $\Delta
y_{m, central} \rightarrow 0$. Then in equation \ref{wmmm}
$w_{m,central} \rightarrow \infty$ and your solution goes crazy. You can
build in a limit to take the lesser value, i.e.\ something like $w_m =
(10^9 < {1/|\Delta y_m|})$ ($<$ means ``whichever is smaller'').
All this works for multivariate fits using any set of functions
$f{(x_m)}$. However, we caution again that the weights $w_m$ depend on
the parameters {\bf a} and the parameters depend on the weights, so you
must iterate. In the experiments that I've performed the iteration is slow
(but sure).
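The iteration just described can be sketched in a few lines. Here is a
minimal Python version (Python rather than IDL; the cap value $10^9$,
the fixed iteration count in place of monitoring the minimum $|\Delta
y_m|$, and the slope-only model $y = s\,x$ are our illustrative
choices, not the author's code):

```python
# Sketch of iterated MWARS (sliding-weight) fitting for y = s*x.
# Weights are w_m = min(1e9, 1/|residual|), per the capped-weight advice.

def mwars_slope(x, y, n_iter=40, w_max=1e9):
    """Iterate weighted least squares with w_m = min(w_max, 1/|dy_m|)."""
    w = [1.0] * len(x)                    # begin with w_m = 1
    s = 0.0
    for _ in range(n_iter):
        num = sum(wm * xm * ym for wm, xm, ym in zip(w, x, y))
        den = sum(wm * xm * xm for wm, xm in zip(w, x))
        s = num / den
        # sliding weights: 1/|residual|, capped to avoid blowup
        w = [min(w_max, 1.0 / max(abs(ym - s * xm), 1.0 / w_max))
             for xm, ym in zip(x, y)]
    return s

# A line of true slope 2 with one wild outlier at x = 5:
x = list(range(1, 11))
y = [2.0 * xi for xi in x]
y[4] = 100.0                              # outlier
print(mwars_slope(x, y))                  # close to 2: outlier down-weighted
```

A plain least-squares fit to the same data gives a slope near 3.2; the
iterated weights suppress the outlier almost completely.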
How about errors in the derived parameters? This is a tough one
because they depend on the pdf of the datapoint errors. For example, if
you have a single wildly discrepant datapoint and evaluate $\sigma^2$ in
the usual way, then the sum $\sum \Delta y_m^2$ reflects that single
datapoint and this colors all the derived errors. This isn't fixed by
using the reduced chi-square $\widehat{\chi^2}$, which will be nowhere
near unity. Replacing the sum of residuals squared by the ARS seems
reasonable until you realize that the MWARS coefficient values
are completely independent of the residuals---yet, you'd expect the
errors to be smaller for better data! A similar concern holds for
replacing the ARS by its weighted counterpart. Frankly, I've put too
much time into this topic for now and will think more about this later
(do I say this every year I teach this class???).
\subsection{ Pedantic Comment: The MWARS and the Double-sided
Exponential pdf}
In fact, there is a specific pdf for which the MWARS is the
{\it theoretically correct} solution: the double-sided exponential. Here
the pdf of the measured datapoints is
\begin{equation}
p_{(\Delta y_m)} = { e^{- |\Delta y_m| / \sigma_m} \over 2 \sigma_m} \ .
\end{equation}
\noindent where, again, $\Delta y_m = (y_m - {\bf a \cdot f(x_m)})$.
For this, the logarithm of the likelihood function is (we exclude the
term involving $\log \Pi_{m=0}^{M-1} {1 \over \sigma_m}$ for simplicity)
\begin{mathletters}
\begin{equation}
{\cal L}_{(\Delta y_m)} = \log( L_{(\Delta y_m)}) = - \sum_{m=0}^{M-1}
\left[ { |y_m - {\bf a \cdot f(x_m)}| \over \sigma_m} \right] \ .
\end{equation}
\noindent The absolute value signs are horrible to deal with, so we
rewrite this as
\begin{equation}
{\cal L}_{(\Delta y_m)} =
- \sum_{\Delta y_m > 0} { y_m - {\bf a \cdot f(x_m)} \over \sigma_m}
+ \sum_{\Delta y_m < 0} { y_m - {\bf a \cdot f(x_m)} \over \sigma_m}
\ .
\end{equation} \end{mathletters}
\noindent Now we take the derivative of $\cal L$ with respect to $a_n$ and
set it equal to zero to find the maximum. This gives
\begin{equation} \label{expmedian}
{ d {\cal L} \over da_n}= \sum_{\Delta y > 0} { f_n(x_m) \over \sigma_m}
- \sum_{\Delta y < 0} { f_n(x_m) \over \sigma_m} = 0 \ ,
\end{equation}
\noindent which is the MWARS fit.
\subsection{IDL's related resources}
IDL provides the \verb$median$ function, which uses sorting and
is much faster than our general weighted-MWARS technique---but of
course cannot deal with functional forms. IDL also provides
\verb$ladfit$ (``least absolute deviation fit''), which does a
weighted-MWARS fit for a straight line. My routine,
\verb$polyfit_median$, does a weighted-MWARS fit for an arbitrary
polynomial; it is slightly less accurate than \verb$ladfit$ for an odd
number of points but is slightly better for an even number.
\section{FITTING WHEN BOTH X AND Y HAVE UNCERTAINTIES}
\label{bothsection}
We've mentioned that one of the essential assumptions of least
squares is that the independent variables are known with high precision
and the errors occur only in the measured data. Suppose you're fitting
two variables, $t$ and $y$, as in equation~\ref{one}. This
essential assumption means that $t$ is known with high precision and all
the uncertainty is in $y$, and you are minimizing the squares of
the residuals in the $y$-direction only. If {\it both} variables
have uncertainties, then you have to be careful because the essential
assumption is violated. If you go ahead with a standard least-squares
fit when there are errors in both coordinates, the slope will be
systematically too small, as we discuss briefly below.
Thanks to Stetson, the ML formulation of this problem is
straightforward. Nevertheless, as far as I know the proper formulation
is rather recent; Stetson's treatment is the first I've seen. Before
reviewing Stetson's formulation, let's see two approaches: \begin{enumerate}
\item Taylor \S 8.4 argues that you can account for $x$-variance
$\sigma_{x_m}^2$ by increasing the $y$-variance by the usual error
propagation, i.e.\ define an equivalent $y$-variance
$\sigma_{y_m}^2(equiv) = [\sigma_{y_m}^2 + (a_1\sigma_{x_m})^2]$, where
$a_1$ is the slope. This is equivalent to our results below.
\item Isobe et al.\ (1990, ApJ 364, 104) discuss the case
incorrectly. Look in particular at their Section V, where they make 5
numbered recommendations. Two of these are incorrect: \begin{enumerate}
\item Number 3 says, in essence, that if you have measurement
errors in $y$ but not in $x$, and want to predict $x$ from $y$ in some
future dataset, that you should least-squares fit the $x$ values (which
have no errors) to the $y$. This is {\it flat wrong}. Again, it leads to
a slope that is systematically too small. The proper procedure is to
fit $y$ to $x$ in the standard way, which is consistent with the ML
formulation and gives the right answer; then use the resulting
parameters, whose errors you know about, to predict $x$ from $y$ in the
future.
\item Number 4 says that if both $x$ and $y$ have errors, and your
main focus is finding the true slope, you should use their ``bisector''
method. I won't explain this because this concept is wrong.
\end{enumerate}
\end{enumerate}
Stetson has a beautiful discussion of the problem, which is
obviously correct---both because it makes sense and because numerical
experiments show that the derived parameters don't depend on whether you
consider $x$ or $y$ the independent variable. With his technique you are
not restricted to fitting a straight line (a first-degree polynomial). NR
\S 15.3 provides another method, which is more complicated and is
restricted to a straight line. Here we review Stetson's method.
\subsection{A preliminary: Why the slope is systematically small}
Why is the derived slope systematically too small if you use the
standard least-squares technique when both variables have errors? To see
this, take a look back at equation \ref{normaltwo}, where we explicitly
write the normal equations for fitting a straight line of the form $A
s_m + B t_m = y_m$. To focus the discussion and make it easy,
replace that problem with a single-parameter solution for only the slope
$B$, and use the usual variables $(x,y)$ in place of $(t,y)$. Then
we are fitting the set of $M$ equations
\begin{mathletters}
\begin{equation}
B x_m = y_m \ .
\end{equation}
\noindent The set of two normal equations becomes just the single equation
\begin{equation}
B [x^2] = [ xy ] \ ,
\end{equation}
\noindent or, writing out the sums explicitly,
\begin{equation} \label{eqnforbone}
B = {\sum_{m=0}^{M-1} {x^*}_m y_m \over \sum_{m=0}^{M-1} {x^*}_m^2
} \ .
\end{equation}
\end{mathletters}
Here we use the star to designate the perfectly-known
independent variable ${x^*}_m$. It is important to realize that the $x_m$
that appear in this equation are the perfectly-known ones ${x^*}_m$; this
is a fundamental tenet of least-squares fitting, which comes from the
principle of maximum likelihood (ML).
Because $B$ is defined by the ${x^*}_m$ and we are asking what
happens when we use the imperfectly known $x_m$ instead, let us reduce
the problem to the essence and imagine that $y_m$ is perfectly known,
i.e.\ $y_m={y^*}_m$; and that
\begin{equation} \label{deltaxeqnfirst}
{x^*}_m = x_m - \delta x_m \ ,
\end{equation}
\noindent where $\delta x_m$ is the observational error in point $m$. If
we do standard least squares on this situation, then we (incorrectly)
rewrite equation \ref{eqnforbone} to read
\begin{mathletters}
\begin{equation} \label{eqnforbtwo}
B = {\sum_{m=0}^{M-1} x_m y_m \over \sum_{m=0}^{M-1} x_m^2 } \ ,
\end{equation}
\noindent that is, using $x_m$ instead of ${x^*}_m$ (because we don't
know what ${x^*_m}$ is!). Substituting equation \ref{deltaxeqnfirst}, and
remembering that $y_m = {y^*}_m = B {x^*}_m$, we have
\begin{equation} \label{eqnforbthree}
B = {\sum_{m=0}^{M-1} {y^*}_m ({x^*}_m + \delta x_m) \over
\sum_{m=0}^{M-1} ({x^*}_m^2 + 2 {{x^*}}_m \delta x_m + \delta x_m^2)} \ .
\end{equation}
\end{mathletters}
Now all terms having $\delta x_m$ sum to zero because the errors
are distributed symmetrically around zero. But the denominator contains
$\delta x_m^2$. The denominator is irrevocably increased by this term,
which decreases the derived value of $B$ from its true value. Yes, this
is only a second-order effect, but it matters---after all, $\chi^2$ is a
second-order quantity! Try some numerical experiments!
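Here is one such numerical experiment (our own, in Python; the true
slope, error level, and uniform distribution of $x^*$ are arbitrary
choices). It fits $B x_m = y_m$ by standard least squares when the
errors are in $x$, not $y$:

```python
# Demonstrate the systematic attenuation of the slope when x has errors.
import random

random.seed(42)
B_true, sigma_x, M = 2.0, 2.0, 5000
xstar = [random.uniform(0.0, 10.0) for _ in range(M)]   # true x values
y = [B_true * xs for xs in xstar]                        # perfect y
x = [xs + random.gauss(0.0, sigma_x) for xs in xstar]    # measured x

# Standard least-squares slope using the noisy x values:
B_fit = sum(xm * ym for xm, ym in zip(x, y)) / sum(xm * xm for xm in x)
print(B_fit)   # systematically below B_true = 2
```

The $\delta x_m^2$ term in the denominator pulls the fitted slope well
below the true value of 2; shrinking \verb$sigma_x$ makes the bias
vanish.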
\subsection{Stetson's method}
\subsubsection{Philosophy} \label{philosophy}
To make things clear, we discuss the case of a straight line.
Thus we assume
\begin{equation} \label{stareqn}
{y^*} = a_0 + a_1{x^*} \ ,
\end{equation}
\noindent where the $^*$ indicates the true quantity without errors. The
measured points differ from the true ones:
\begin {equation} \label{diffeqn}
\delta x_m = x_m - {x^*}_m \ \ ; \ \ \delta y_m = y_m - {y^*}_m \ ,
\end{equation}
\noindent and we define, as usual, the {\it apparent} residual in $y_m$ as
\begin{equation} \label{deltayeqn}
\Delta y_m= [y_m - a_0] - a_1 x_m = y_m' - a_1 x_m \ .
\end{equation}
\noindent Note that this differs from the {\it true} residual in $y_m$,
which is $\delta y_m$; the {\it apparent} residual $\Delta y_m$ assumes
that $x_m$ has no error, i.e.\ that $\Delta x_m=0$. Also, we define
$y_m' = [y_m - a_0]$ for convenience below; $a_0$ need not be just a
constant, but can be the sum of any set of functions of independent
variables that we include in the least-squares solution, as long as they
have no errors.
\begin{figure}[h!] \begin{center} \leavevmode
%\includegraphics[height=7.5in, width=6.0in]{twolvl3b.ps}
%\includegraphics[scale=.55, angle=90]{lsfitfig.ps}
\includegraphics[scale=1.0]{xyfig.ps} \end{center}
\caption{Illustrating the discussion of fitting when both variables have
errors. Here, as usual in this document, differences are ``measured
minus true'', e.g.\ $\delta y_m = y_m - {y^*}_m$, etc.; in the figure,
$\Delta y_m < 0$, $\delta y_m < 0$, $\delta x_m > 0$. Note that the {\it
apparent} residuals are $\Delta x_m=0$ and $\Delta y_m = y_m' - a_1x_m$.
\label{xyfig}}
\end{figure}
Look at Figure \ref{xyfig}. There $(x_m,y_m)$ (the square plot
symbol) is a measured point. The ellipse centered on it has semiaxes
along $(x,y)$ proportional to $(\sigma_{x_m}, \sigma_{y_m})$ and traces out the locus where the
point's contribution to $\chi^2$ is constant; call this contribution
$\chi_m^2$. The straight line is the fitted straight line, assumed to be
perfectly correct, i.e.\ the line given by equation \ref{stareqn}. The
most probable place where the observed point $(x_m,y_m)$ would lie if it
had no errors is $({x^*}_m,{y^*}_m)$; it lies on the smallest ellipse
centered on $(x_m,y_m)$, as pictured. This is the ellipse having the
smallest possible $\chi_m^2$. The true differences between the observed
and most probable points are $(\delta x_m, \delta y_m)$ given by
equation \ref{diffeqn} above, where these are measured from the center
to the starred point on the ellipse.
It's easy to calculate $(\delta x_m, \delta y_m)$ in terms of
$\Delta y_m$. These are the distances from the measured to the starred
point. We realize that the starred point must satisfy two criteria:
\begin{enumerate}
\item The straight-line fit is tangent to the ellipse, so the
slopes must match. The ellipse has equation
\begin{equation}
{ \delta x_m^2 \over \sigma_{x_m}^2 } +
{ \delta y_m^2 \over \sigma_{y_m}^2 } = const = \chi_m^2 \ ,
\end{equation}
\noindent so the slope is
\begin{equation} \label{eqnonee}
{d \delta y_m \over d \delta x_m} =
- {\sigma_{y_m}^2 \over \sigma_{x_m}^2 }
{\delta x_m \over \delta y_m} \ ,
\end{equation}
\noindent and this, evaluated at $({x^*}_m, {y^*}_m)$, must equal $a_1$.
\item The actual values must match. Rather than solve for these,
we can solve in terms of $\Delta y_m$ directly. \end{enumerate}
From the figure it's clear that
\begin{equation} \label{eqntwoo}
\Delta y_m = \delta y_m - a_1 \delta x_m
\end{equation}
\noindent so combining equations \ref{eqnonee} and \ref{eqntwoo} we get
\begin{mathletters} \label{deltaxyeqns}
\begin{equation} \label{deltaxeqn}
\delta x_m = { - {\Delta y_m \over a_1} \over
1 + {\sigma_{y_m}^2 \over a_1^2 \sigma_{x_m}^2} } \ ,
\end{equation}
\noindent and
\begin{equation}
\delta y_m = { \Delta y_m \over
1 + { a_1^2 \sigma_{x_m}^2 \over \sigma_{y_m}^2 }} \ .
\end{equation}
\end{mathletters}
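To fill in the intervening algebra (a brief sketch): setting the
tangency condition, equation \ref{eqnonee}, equal to $a_1$ gives
\begin{equation}
\delta x_m = - {a_1 \sigma_{x_m}^2 \over \sigma_{y_m}^2} \, \delta y_m \ ,
\end{equation}
\noindent and substituting this into equation \ref{eqntwoo} gives
$\Delta y_m = \delta y_m \, (1 + a_1^2 \sigma_{x_m}^2 / \sigma_{y_m}^2)$,
which rearranges directly into equations \ref{deltaxyeqns}.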
You could minimize $\chi^2$ by brute force. You would begin
with the standard least-squares solution to get an estimate of the
correct parameter values. Then you'd make up a grid in the $(a_0,a_1)$
plane to find the location of the minimum in $\chi^2$, which is
\begin{equation} \label{chisqtwov}
\chi^2 = \sum_{m=0}^{M-1} \chi_m^2 =
\sum_{m=0}^{M-1} \left( { \delta x_m^2 \over \sigma_{x_m}^2} +
{ \delta y_m^2 \over \sigma_{y_m}^2} \right) \ ,
\end{equation}
\noindent which gives the parameters $(a_0,a_1)$. The quantities
$(\delta x_m^2, \delta y_m^2)$ are given by equation \ref{deltaxyeqns}.
The errors are given by the usual $\Delta \chi^2$ contours as discussed
in \S \ref{chicoeffs}.
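Here is a concrete sketch of that brute-force approach (our own Python
example, not the author's IDL; the tiny grid and the error-free test
data are assumed for illustration). Using equations \ref{deltaxyeqns},
the per-point contribution collapses algebraically to the compact form
$\Delta y_m^2/(a_1^2\sigma_{x_m}^2 + \sigma_{y_m}^2)$, which is what the
code evaluates:

```python
# Brute-force chi-square grid for a straight-line fit with errors in
# both coordinates.  Per-point chi^2 is  Dy^2 / (a1^2 sx^2 + sy^2),
# equal to  dx^2/sx^2 + dy^2/sy^2  with dx, dy from the tangency rules.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi for xi in x]          # error-free data on y = 0 + 2x
sx, sy = 0.5, 0.5                   # adopted 1-sigma errors

def chisq(a0, a1):
    return sum((yi - a0 - a1 * xi) ** 2 / (a1**2 * sx**2 + sy**2)
               for xi, yi in zip(x, y))

grid0 = [-0.1, 0.0, 0.1]            # trial intercepts a0
grid1 = [1.9, 2.0, 2.1]             # trial slopes a1
best = min(((a0, a1) for a0 in grid0 for a1 in grid1),
           key=lambda p: chisq(*p))
print(best)   # (0.0, 2.0): the minimum sits at the true parameters
```

With error-free data the minimum $\chi^2$ is exactly zero at the true
$(a_0, a_1)$, a useful sanity check before trying real data.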
\subsubsection{Direct solution}
\label{directsoln}
The basic idea is to formulate this in terms of a traditional,
but modified, least-squares fit. There are two modifications: one,
defining a modified value for $\bf X$; and two, defining appropriate
weights for a weighted least-squares fit. We approach this by expressing
$\chi_m^2$ in terms of $\Delta y_m$ [instead of the combination $(\delta
x_m, \delta y_m)$] and proceeding in the usual way, i.e.\ using the
principle of maximum likelihood (ML).
As usual with ML, we find the minimum $\chi^2$ by taking the
derivative of $\chi^2$ with respect to each parameter $a_n$ and setting
the derivative equal to zero. To illustrate, we work this through for
the slope. We are fitting for the slope in the equation
\begin{equation}
y_m' = a_1 x_m
\end{equation}
\noindent for which $\chi^2$ is given by equation \ref{chisqtwov}.
Express $(\delta x_m, \delta y_m)$ in terms of $\Delta y_m$, as in
equations \ref{deltaxyeqns}, and rewrite equation \ref{chisqtwov} to
read
\begin{equation} \label{ninetyfour}
\chi^2 = \sum_{m=0}^{M-1}
{(y_m' - a_1 x_m)^2 \over a_1^2 \sigma_{x_m}^2 + \sigma_{y_m}^2}
= \sum_{m=0}^{M-1}
{\Delta y_m^2 \over a_1^2 \sigma_{x_m}^2 + \sigma_{y_m}^2} \ .
\end{equation}
\noindent Differentiate with respect to $a_1$ and set equal to zero. After
canceling extraneous factors, and using equation \ref{deltaxyeqns}a, we
get the normal equation (which corresponds to equation \ref
{matrixnormal})
\begin{equation} \label{normaltwov}
\sum_{m=0}^{M-1} { (y_m'- a_1 x_m) (x_m - \delta x_m) \over
a_1^2 \sigma_{x_m}^2 + \sigma_{y_m}^2} = 0 \ .
\end{equation}
\noindent In contrast, if we had standard least squares and {\it no}
errors in $x$, the normal equation \ref{normaltwov} would instead read
\begin{equation} \label{normalonev}
\sum_{m=0}^{M-1} { (y_m'- a_1 x_m) (x_m) \over \sigma_{y_m}^2} = 0 \ .
\end{equation}
Looking at these and, also, referring back to \S
\ref{matrixmethod}, it's easy to see how to obtain equation
\ref{normaltwov} using matrices by defining a modified $\bf X$ and a set
of weights $\bf W$. To see this, we begin by examining how we derive
the normal equations from the equations of condition. Normally, we begin
with the equations of condition ${\bf X \cdot A = Y}$ and premultiply by
$\bf X^T$ (whose elements are evaluated at $x_m$), obtaining
\begin{equation}
{\bf X^T \cdot X \cdot A} = {\bf X^T \cdot Y}
\end{equation}
\noindent and solve in the usual manner.
Here, however, we define a modified $\bf X$ matrix, which we
call $\bf X\_MOD$; its elements are evaluated not at $x_m$, but instead
at $x_m - \delta x_m$. Thus
\begin{equation} \label{xmodeqn}
{\bf X\_MOD} = {\bf X} \ {\rm evaluated \ at \ } (x_m - \delta x_m) \ .
\end{equation}
\noindent We premultiply by $\bf X\_MOD^T$ instead of $\bf X^T$.
Moreover, we also define a modified weight matrix $\bf W\_MOD$, which we
apply to both ${\bf X}$ and $\bf X\_MOD$. The weighted equations of
condition are ${\bf (W\_MOD \cdot X) \cdot A = W\_MOD \cdot Y}$ and
the modified normal equations become
\begin{equation} \label{modnormaleqn}
{\bf (W\_MOD \cdot X\_MOD)^T \cdot (W\_MOD \cdot X) \cdot A} =
{\bf (W\_MOD \cdot X\_MOD)^T \cdot (W\_MOD \cdot Y)} \ .
\end{equation}
\noindent These are the same as the set of equations \ref{normaltwov}.
Again, we solve these in the usual manner.
Thus, we follow this procedure:
\begin{enumerate}
\item Define a modified $\bf X$ matrix ${\bf X\_MOD}$ and
evaluate it as discussed below in \S \ref{evaldelx}.
\item Define the array of modified weights $\bf W\_MOD$ and
evaluate it as discussed below in \S \ref{evaldelx}.
\item Define the weighted versions of ${\bf X}$,
${\bf X\_MOD}$, and ${\bf Y}$
\begin{mathletters} \label{xwwxtwo}
\begin{equation}
{\bf XW} = {\bf W\_MOD \cdot X}
\end{equation}
\begin{equation}
{\bf X\_MODW} = {\bf W\_MOD \cdot X\_MOD}
\end{equation}
\begin{equation}
{\bf YW} = {\bf W\_MOD \cdot Y} \ .
\end{equation}
\end{mathletters}
\item We now have the weighted equations of condition
\begin{mathletters}
\begin{equation}
{\bf XW \cdot A} = {\bf YW} \ .
\end{equation}
\noindent Incorporate ${\bf X\_MODW}$ in your solution to these:
\begin{equation} \label{normaltwov1}
{\bf X\_MODW^T \cdot XW \cdot A} = {\bf X\_MODW^T \cdot YW} \ ,
\end{equation}
\noindent or
\begin{equation}
[\alpha\_mod] {\bf \cdot A} = [\beta\_mod] \ ,
\end{equation}
\noindent where
\begin{equation}
[\alpha\_mod] = {\bf X\_MODW^T \cdot XW}
\end{equation}
\begin{equation}
[\beta\_mod] = {\bf X\_MODW^T \cdot YW } \ .
\end{equation}
\item Solve these equations for ${\bf A}$ in the obvious way.
\item For the errors, use equations \ref{deltaxyeqns} and
\ref{chisqtwov} to evaluate the chi-square. Then use the equivalent of
equation \ref{coeffvariancegood} for the parameter errors, i.e.
\begin{equation}
{\bf s_a^2} = \widehat{ \chi}^2 diag\{ [\alpha\_modmod]^{-1} \} \ ,
\end{equation}
\noindent where $[\alpha\_modmod]^{-1}$ is the covariance matrix, given by
\begin{equation}
[\alpha\_modmod]^{-1} = ([\alpha\_mod]^{-1}) {\bf \cdot
(X\_MODW^T \cdot X\_MODW) \cdot} ([\alpha\_mod]^{-1})^{\bf T } \ ,
\end{equation}
\noindent or, equivalently,
\begin{equation}
[\alpha\_modmod]^{-1} =
\{ {\bf X\_MODW \cdot} ([\alpha\_mod]^{-1})^{\bf T} \} ^{\bf T} \ {\bf \cdot} \
\{ {\bf X\_MODW \cdot} ([\alpha\_mod]^{-1})^{\bf T} \} \ .
\end{equation}
\end{mathletters}
\noindent {\bf Caution: I have not thoroughly checked this equation with
numerical experiments!!} If you are fitting a straight line, then the
slope $a_1$ is everywhere the same and in this case
$[\alpha\_modmod]=[\alpha\_mod]$.
\end{enumerate}
\subsection{Evaluating $\bf X\_MOD$ and $\bf W\_MOD$} \label{evaldelx}
First, $\bf X\_MOD$. Comparing equations \ref{normaltwov},
\ref{normalonev}, and \ref{normaltwov1} we must define $\bf X\_MOD$ as
being the same as $\bf X$, but evaluated at $(x_m - \delta x_m)$, where
$\delta x_m$ is from equation \ref{deltaxeqn}. If we are not doing a
straight-line fit, then $y_m = f(x_m)$ and we have the $m^{th}$ element
of $\bf X\_MOD$ as
\begin{equation}
{\bf X\_MOD_m} = f(x_m - \delta x_m) \ .
\end{equation}
Next, $\bf W\_MOD$. In normal least squares, $\chi^2$ has
denominators $\sigma_m^2 = \sigma_{y,m}^2$. Here we replace those by
\begin{mathletters} \label{localslope}
\begin{equation} \label{localslopea}
\sigma_{m}^2 = (a_m^2 \sigma_{x_m}^2 +
\sigma_{y_m}^2) \ ,
\end{equation}
\noindent where the local slope $a_m$ is evaluated at $(x_m - \delta
x_m)$:
\begin{equation}
a_m = \left. {\partial y \over \partial x} \right| _{(x_m - \delta x_m,\,
y_m - \delta y_m)} \ .
\end{equation}
\end{mathletters}
\noindent So we define $\bf W\_MOD$ as the $M \times M$ diagonal matrix
in which the elements are ${\bf W\_MOD}_{m,m} = {1 \over \sigma_{m}}$
(from equation \ref{localslopea}).
But, you ask, how can we form the matrices ${\bf X\_MOD}$ and
${\bf W\_MOD}$ if we don't know the slope $a_m$? The answer: iterate.
That is, \begin{enumerate}
\item Solve $y_m = a_0 + a_1 f(x_m)$ using standard least squares
or chi-square; or use any other technique to get an initial guess for
the parameters {\bf a}. Calculate the local slopes using $(x_m, y_m)$.
\item Determine the $\Delta y_m$ and calculate $\delta x_m$
using equation \ref{deltaxeqn}.
\item Calculate the matrices ${\bf X\_MOD}$, etc.\ using equation
\ref{xmodeqn}.
\item Solve for new parameters $\bf a$ using equation
\ref{modnormaleqn}.
\item Using the new parameters $\bf a$, recalculate the local
slopes using $(x_m- \delta x_m, y_m - \delta y_m)$.
\item Loop back and repeat steps 2, 3, 4, and 5 until
convergence.
\item After convergence, calculate the uncertainties, as usual,
by the equivalent of equation \ref{coeffvariancegood}, i.e.
\begin{equation}
{\bf s_a^2} = \widehat{ \chi}^2 diag\{ [\alpha\_modmod]^{-1} \} \ .
\end{equation}
\end{enumerate}
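The loop above can be sketched compactly. The following Python version
(ours, not the author's IDL or \verb$fitexy$) fits a straight line $y =
a_0 + a_1 x$ with errors in both coordinates and compares the result
with the naive fit; all data values and error levels are invented for
the illustration:

```python
# Iterative fit of y = a0 + a1*x with errors in both x and y, following
# steps 1-6: naive fit, then repeatedly rebuild delta-x, X_MOD, and the
# effective-variance weights, and re-solve the modified normal equations.
import random

random.seed(1)
M, a0_t, a1_t, sx, sy = 2000, 1.0, 2.0, 1.0, 1.0
xs = [random.uniform(0.0, 10.0) for _ in range(M)]      # true abscissae
x = [xi + random.gauss(0.0, sx) for xi in xs]           # measured x
y = [a0_t + a1_t * xi + random.gauss(0.0, sy) for xi in xs]

def wls_line(x, y, w):
    """Ordinary weighted least squares for y = a0 + a1*x."""
    Sw = sum(w); Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    d = Sw * Sxx - Sx * Sx
    return (Sxx * Sy - Sx * Sxy) / d, (Sw * Sxy - Sx * Sy) / d

a0, a1 = wls_line(x, y, [1.0] * M)          # step 1: naive initial guess
a1_naive = a1
for _ in range(50):                         # steps 2-6: iterate
    S = a1 * a1 * sx * sx + sy * sy         # effective variance (scalar here)
    dy = [yi - a0 - a1 * xi for xi, yi in zip(x, y)]     # apparent residuals
    # x_m - delta x_m, with delta x_m = -a1*sx^2*dy_m/S :
    xmod = [xi + a1 * sx * sx * d / S for xi, d in zip(x, dy)]
    # modified normal equations X_MODW^T.XW.A = X_MODW^T.YW, weights 1/S:
    w = 1.0 / S
    Sw = M * w; Sx1 = w * sum(x); Sm1 = w * sum(xmod)
    Smx = w * sum(xm * xi for xm, xi in zip(xmod, x))
    Sy1 = w * sum(y); Smy = w * sum(xm * yi for xm, yi in zip(xmod, y))
    det = Sw * Smx - Sx1 * Sm1
    a0, a1 = (Smx * Sy1 - Sx1 * Smy) / det, (Sw * Smy - Sm1 * Sy1) / det

print(a1_naive, a1)   # the naive slope is biased low; the iterated one is not
```

With these parameters the naive slope comes out near 1.8, attenuated by
the $x$ errors, while the iterated slope recovers the true value of 2 to
within the sampling noise.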
\subsection{Commentary on Stetson's solution}
\subsubsection{Stetson's solution is general and makes sense}
Stetson's solution makes sense. Consider the limit
$\sigma_{x_m} \rightarrow 0$; here we should revert to standard
least squares. Also, when the slope is zero it's like taking an average
of $y_m$, so it shouldn't matter what $\sigma_x$ is. Equation
\ref{deltaxeqn} satisfies these expectations because the denominator
$\rightarrow \infty$: the ellipse becomes tall and thin so $\delta x_m
\rightarrow 0$.
Similarly, if $\sigma_{y_m} \rightarrow 0$ or $a_1 \rightarrow
\infty$ then the ellipse becomes very short and fat: $\delta x_m
\rightarrow -{\Delta y_m \over a_1}$, meaning that {\it all} of the
correction goes to $\delta x_m$. This is equivalent to $y$ being the
independent variable, and you can solve it this way, but of course it is
much more efficient to use standard least squares with $y$ {\it
actually} being the independent variable.
Finally, there's no reason why we had to use a straight-line
fit. It could have been any function; in an arbitrary case, we replace
the slope by the {\it local} slope, namely the derivative of the
function evaluated at the current value $x_m-\delta x_m$. So this
technique is {\it general}. In his website, Stetson provides an example
of fitting a {\it circle} to points in $(x,y)$!
\subsubsection{Generality}
Above we developed the solution for the special case of a
straight-line fit assuming that the one independent variable, $x$, had
measurement errors. The resulting solution for both $a_0$ and $a_1$ is
unbiased. The Goddard library IDL routine \verb$fitexy$ handles this
case.
It seems reasonable that $a_0$ need not be a constant, but could
be a parameter for any function of an independent variable, say $z$, as
long as $z$ had no measurement errors. And by extension, there is no
reason why the solution wouldn't work for any number of independent
variables, as long as only one of them has measurement errors. I have
an IDL procedure called \verb$ls_xy_3.pro$ that treats this more general
case.
\subsubsection{More Generality}
But you might encounter the case in which {\it more than one}
independent variable has measurement errors. The above equations won't
work for this case. But all is not lost! See Jefferys (1980) for a much
more general treatment of least squares. It treats {\it all} the
observed variables completely equivalently, putting all on the ``left
hand side'' and setting the right hand side equal to zero. Thus the
equations of condition are written as ${\bf f(x,a)} = 0$, where $\bf x$ is
the vector of observed quantities and $\bf a$ the vector of parameters
to be fitted. He also adds the possibility of constraints on the
parameters; an example of such a constraint is fitting three angles of a
triangle, whose sum must be $180^\circ$. These equations, together with
the requirement to minimize the sum of chi-square, lead to a Lagrange
multiplier solution. Again, matrix techniques make this tractable. I
have not had time to examine this paper in the detail it clearly
deserves.
\section{USING SINGULAR VALUE DECOMPOSITION (SVD)} \label{SVD}
Occasionally, a normal-equation matrix $[\alpha] = {\bf X^T
\cdot X}$ is sufficiently ill-posed that inverting it using standard
matrix inversion doesn't work. In its \verb$invert$ function, IDL even
provides a keyword called \verb$status$ to check on this (although I
find that it is not perfectly reliable; the best indicator of
reliability is to check that the matrix product ${\bf [\alpha]
\cdot [\alpha]^{-1}} = {\bf I}$, the identity matrix). In these cases,
Singular Value Decomposition (SVD) comes to the rescue.
First, we reiterate the least-squares problem. {\it For this
entire section we change our variable names from those in the remainder
of this document to those used by NR}. Least squares begins with
equations of condition (equation \ref{equationofcondition}), which are
expressed in matrix form as
\begin{equation} \label{eqnofcond}
{\bf A \cdot x} = {\bf b}
\end{equation}
\noindent In our treatments above we premultiply both sides by $\bf
A^T$, on the left generating the curvature matrix $[\alpha]$ to obtain
the normal equations (equation \ref{matrixnormal} or its equivalent
derived from equations \ref{matrixcalcs})
\begin{equation} \label{alphabeta1}
[\alpha] {\bf \cdot x} = {\bf A^T \cdot b} = [\beta] \ ,
\end{equation}
\noindent for which the solution is, of course,
\begin{equation} \label{alphabeta2}
{\bf x} = \left( [\alpha]^{-1} {\bf \cdot A^T} \right) {\bf \cdot b} =
[\alpha]^{-1} {\bf \cdot} [\beta] \ .
\end{equation}
\noindent We need to find the inverse matrix $[\alpha]^{-1}$. Above in
this document we did this using simple matrix inversion.
SVD provides a bombproof and interestingly informative way to do
least squares. Perhaps surprisingly, with SVD you don't form normal
equations. Rather, you solve for the coefficients directly, writing
equation \ref{eqnofcond} as
\begin{equation} \label{eqnofcondsoln}
{\bf x} = {\bf A^{-1} \cdot b} \ .
\end{equation}
\noindent SVD gives $\bf A^{-1}$ directly! By comparison of equations
\ref{alphabeta1} and \ref{eqnofcondsoln}, we have ${\bf A^{-1}} = \left(
{\bf [\alpha]^{-1} \cdot A^T} \right)$. We'll see how this works below in \S
\ref {svdandidl}.
You don't need to slog through the ``phenomenological
description'' of SVD below in \S \ref{svdphenom} to use the SVD
technique in IDL. Implementing SVD in our least-squares solutions is
trivially easy, and we provide the IDL prescription below in \S
\ref{svdandidl} and \S \ref{myidl}. But {\it be sure} to look at \S
\ref{svdimport}!!
\subsection{Phenomenological description of SVD} \label{svdphenom}
SVD is, among other things, a technique for inverting a square
symmetric $N \times N$ matrix $\bf A$. But SVD is far more general, and
in particular provides solutions to the least-squares problem in which the
equation-of-condition matrix is inherently nonsquare. For a discussion of
the details of SVD, see NR \S 2.6; for SVD applied to least squares, see
NR \S 15.4; if you are rusty on matrix algebra, look at NR \S 11.0. Here
we first discuss the square-matrix case in \S \ref{square} and then the
non-square case in \S \ref{nonsquare}.
The cornerstone of SVD is that any $M \times N$ matrix {\bf A},
where the number of rows $M$ and the number of columns $N$ satisfy $M
\ge N$, can be expressed as a product of three matrices:
\begin{equation} \label{basicsvd}
{\bf A} = {\bf U \cdot} [ w ] {\bf \cdot V^T}
\ ,
\end{equation}
\noindent where \begin{enumerate}
\item $\bf U$ is $M \times N$, $[w]$ is $N \times N$ and
diagonal, and $\bf V$ is $N \times N$; and
\item the columns of $\bf U$ and $\bf V$ are unit vectors that are
orthonormal. Because $\bf V$ is square, its rows are also orthonormal so
that ${\bf V \cdot V^T} = 1$. Recall that, for square orthonormal
vectors, the transpose equals the inverse so $\bf V^T = V^{-1}$.
\end{enumerate}
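Equation \ref{basicsvd} and these orthonormality properties are easy to
verify numerically. Here is a sketch with numpy (our choice of tool; the
text's own examples use IDL), including the SVD pseudoinverse solution
of a small least-squares problem:

```python
# Verify A = U.[w].V^T and use it to solve a least-squares problem.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # M = 3 rows, N = 2 columns

U, w, VT = np.linalg.svd(A, full_matrices=False)   # "thin" SVD: U is M x N

# Columns of U and V are orthonormal:
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(VT @ VT.T, np.eye(2))

# Reconstruction A = U.diag(w).V^T:
assert np.allclose(U @ np.diag(w) @ VT, A)

# Least-squares solution x = V.diag(1/w).U^T.b (the SVD pseudoinverse):
b = np.array([1.0, 2.0, 3.0])
x = VT.T @ np.diag(1.0 / w) @ U.T @ b
print(x)
```

For this particular $\bf b$ the system happens to be exactly solvable,
so the SVD solution reproduces $\bf b$ exactly; for an inconsistent
$\bf b$ it returns the least-squares solution instead.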
\subsubsection{SVD to Obtain the Inverse of a Square Matrix} \label{square}
First we consider using SVD to obtain the inverse of a square
matrix. This is directly applicable to the usual normal equations
(equation \ref{alphabeta1}), where we need to determine $[\alpha]^{-1}$.
Suppose we have the following equation in which $\bf A$ is square:
\begin{equation}
{\bf A \cdot x} = {\bf b} \ ,
\end{equation}
\noindent and wish to obtain ${\bf A^{-1}}$. The two vectors $\bf x$ and
$\bf b$ both have $N$ dimensions. The matrix $\bf A$ is $N \times N$, so
if it is nonsingular then, no matter what the values of $\bf b$, an $\bf
x$ can be found that satisfies this equation; in the conventional
nomenclature of matrix algebra, the columns of $\bf A$ form a basis
set that spans the space. But if $\bf A$ is singular, this is no longer
true. For example, if two rows of $\bf A$ are identical, then there is
one dimension of $\bf b$ that cannot be reproduced by $\bf x$, no matter
how $\bf x$ is chosen. The term {\it nullity}, which we denote by $NL$,
is the number of degenerate rows in $\bf A$. The term {\it rank}, which
we denote by $R$, is the number of independent rows in $\bf A$. Clearly,
$R + NL = N$.
Because $\bf A$ is square, both $\bf U$ and $\bf V$ are square.
The columns of $\bf U$ and $\bf V$ are orthonormal, so their squareness
means that ${\bf U^{-1} = U^T}$ and ${\bf V^{-1} = V^T}$. Given this,
equation \ref{basicsvd}, and the general matrix identity ${\bf (P \cdot
T)^T = T^T \cdot P^T}$, it's clear that $\bf A^{-1}$ can be expressed
as a product of three matrices:
\begin{equation} \label{basicsvdsquare}
{\bf A^{-1}} = {\bf V \cdot} \left[ {1 \over w} \right] {\bf \cdot U^T}
\ ,
\end{equation}
\noindent where \begin{enumerate}
\item $w$ is the set of $N$ weights (also: the $N$ {\it singular
values}) of $\bf A$, and the $N \times N$ matrix $[{1 \over w}]$ is
diagonal, with the diagonal elements being ${ 1 \over w_n}$. Some of
these $w_n$ might be zero, in which case $\bf A$ is singular. The number
of {\it nonzero} $w_n$ is the rank $R$, i.e.\ the number of independent
rows in $\bf A$. The number of {\it zero} $w_n$ is the nullity $NL$,
i.e.\ the number of degenerate rows in $\bf A$.
\item ${\bf U}$ is a square matrix of $N$ column vectors. $R$ of
these are associated with nonzero $w_n$; these are an orthonormal set of
$R$ basis vectors that, when multiplied by an infinite variety of
vectors $\bf x$, can reproduce a set of $R$ dimensions in vectors $\bf
b$. The $n^{th}$ column of $\bf U$ corresponds to the $n^{th}$ element
of $w$.
\item ${\bf V}$ is a square matrix of $N$ column vectors. $NL$
of these are associated with zero $w_n$; these are an orthonormal set of
$NL$ basis vectors that, when multiplied by an infinite variety of vectors
$\bf x$, can reproduce the set of $NL$ dimensions in $\bf b$ that {\it
cannot} be attained using $\bf U$. The $n^{th}$ column of $\bf V$
corresponds to the $n^{th}$ element of $w$.
\end{enumerate}
The inverse matrix $\bf A^{-1}$ is defined by the equation
\begin{equation}
{\bf A \cdot A^{-1}} = {\bf I} \ ,
\end{equation}
\noindent and, after a little thought, it is clear that the process of
evaluating the inverse of our matrix $\bf A$ can be thought of as
solving for the $N$ column vectors $x_n$, separately, using the $N$
equations
\begin{mathletters}
\begin{equation}
{\bf A \cdot x_0} = [1, 0, 0 ...]
\end{equation}
\begin{equation}
{\bf A \cdot x_1} = [0, 1, 0 ...]
\end{equation}
$$ {\rm \dots} $$
\begin{equation}
{\bf A \cdot x_{N-1}} = [ 0, 0, ..., 1] \ ,
\end{equation}
\end{mathletters}
\noindent and putting the $N$ $\bf x$ vectors as columns in $\bf
A^{-1}$. If $\bf A$ is degenerate, then there will be only $R$ equations
that have a solution; these constitute the corresponding basis vectors
in $\bf U$. They are associated with the $R$ nonzero weights; think of
these weights as the projections of $\bf x$ into the $R$ dimensions of the
right hand side above.
Because of the degeneracy, $NL$ of the above equations have no
solution---in other words, {\it no} projection onto the $NL$ dimensions
with which they are associated. This means that their associated weights
are zero. This, in turn, means that any combination of their associated
$\bf V$ vectors is consistent with the solution.
When calculating $\bf A^{-1}$, equation \ref{basicsvdsquare}
uses the inverses of the weights $[{1 \over w_n}]$. It's easy to see
why: in order to reproduce a dimension on which $\bf x$ always has a
small projection, i.e.\ one that is represented by a small weight, you
need to compensate by multiplying the basis vectors by a big number. And
if the projection is zero (i.e., $w_n=0$), you need to multiply by
infinity!
But because the projection is zero, {\it any} multiple of the
small-weight $\bf V$ vectors can be added to the solution for $\bf
A^{-1}$. What multiple is appropriate? Common sense says that, because
these $\bf V$ vectors have no meaning for the solution, the multiple
should be {\it zero}. So instead of trying, in vain, to express these
$NL$ vectors by a huge multiple, you toss in the towel and eliminate
them altogether by replacing their corresponding $w_n^{-1} = \infty$ by
$w_n^{-1} = 0$. So we have the rule: {\it wherever $w_n = 0$, replace
$w_n^{-1}$ by 0!} This replacement provides the minimum length for $\bf
x$ and thereby constitutes the least-squares solution.
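This rule can be tried on a small example. The following sketch (in
Python/NumPy rather than IDL, with an invented singular matrix) builds
the inverse from the $\bf V \cdot [1/w] \cdot U^T$ recipe and checks
that zeroing the infinite $w_n^{-1}$ yields the minimum-length solution:

```python
import numpy as np

# A singular 3 x 3 matrix: the third row repeats the first,
# so the rank is R = 2 and the nullity is NL = 1.
A = np.array([[1., 2., 3.],
              [0., 1., 1.],
              [1., 2., 3.]])

U, w, VT = np.linalg.svd(A)

# The rule: wherever w_n = 0 (numerically, w_n tiny), replace 1/w_n by 0.
winv = np.zeros_like(w)
good = w > 1e-10 * w.max()
winv[good] = 1.0 / w[good]
assert good.sum() == 2          # R = 2 nonzero weights, NL = 1 zero weight

# A^{-1} = V . [1/w] . U^T; with the zeroed weights this is the pseudoinverse.
Ainv = VT.T @ np.diag(winv) @ U.T
assert np.allclose(Ainv, np.linalg.pinv(A))

# For an attainable right-hand side b, x = Ainv . b solves A.x = b and is
# shorter than the vector used to generate b: the minimum-length solution.
x0 = np.array([1., 1., 1.])
b = A @ x0
x = Ainv @ b
assert np.allclose(A @ x, b)
assert np.linalg.norm(x) < np.linalg.norm(x0)
```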
\subsubsection{Using SVD for Least Squares} \label{nonsquare}
Above in \S \ref{square} we used SVD to take the inverse of the
square matrix $[\alpha]$. However, this is {\it not} the proper way to
apply SVD to least squares. Regard our discussion of \S \ref{square} as
a pedagogical one.
With SVD you {\it don't form normal equations}. SVD works for
non-square matrices as well as square ones. Instead of applying SVD to
the normal equations (equation \ref{alphabeta1}), we can apply it
directly to the equations of condition (equation \ref{eqnofcond})
without first making the explicit calculations of $[\alpha]$ and
$[\beta]$. Thus equation \ref{basicsvdsquare} applies {\it directly to
the equations of condition}, and all of our discussion in \S
\ref{square} also applies.
In particular, our discussion of $w_n$, and how $w_n^{-1}$ should
be set to zero when $w_n$ is small, also applies; and this procedure
provides the least-squares solution in the presence of degeneracy.
One still requires the covariance matrix to evaluate
uncertainties in the derived coefficients. It isn't obvious how to get
it, but NR \S 14.3 shows that the covariance matrix is given by
\begin{equation}
[\alpha ^{-1}] = {\bf V \cdot} \left[ {1 \over w^2} \right] \cdot {\bf V^T}
\ .
\end{equation}
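To make this concrete, here is a sketch (in Python/NumPy rather than
IDL, with invented data) of SVD applied directly to the equations of
condition for a straight-line fit, together with the covariance matrix
just given:

```python
import numpy as np

# Invented straight-line data, y = a_0 + a_1 t, with unit measurement errors.
t = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# The M x 2 equation-of-condition matrix; no normal equations are formed.
X = np.column_stack([np.ones_like(t), t])

U, w, VT = np.linalg.svd(X, full_matrices=False)
a = VT.T @ np.diag(1.0 / w) @ U.T @ y      # least-squares coefficients

# Same answer as the normal equations [alpha] . a = [beta]:
a_normal = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(a, a_normal)

# Covariance matrix of the coefficients: V . [1/w^2] . V^T
cov = VT.T @ np.diag(1.0 / w**2) @ VT
assert np.allclose(cov, np.linalg.inv(X.T @ X))
```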
\subsection{Important Conclusion for Least Squares!!!} \label{svdimport}
For our task of least-squares fitting, the curvature matrix
$[\alpha]$ is symmetric. Just above we discussed the case in which two
rows of $\alpha$ were degenerate. Suppose instead that two rows of this
matrix are {\it almost} degenerate; then because of the matrix symmetry
this near-degeneracy of rows is also a near-degeneracy of columns. The
near-degeneracy is revealed by a {\it small but non-zero} $w_n$. The
near-degeneracy means that two derived parameters that are associated
with those two columns have huge covariance.
{\it This, in turn, means that the formulation of the
least-squares model is faulty}: two parameters represent, or nearly
represent, the same physical quantity, or at least one parameter is
(nearly) a linear combination of other parameters.
The explicit evaluation of the $w_n$ is a big advantage for the
SVD technique. The model parameter that is associated with any small
weight is highly covariant, or degenerate, with another parameter. The
SVD technique tells you which parameters have problems. You should then
take another look at your physical model and reformulate it to get rid
of the covariance. So SVD not only tells about true degeneracy with
$w_n=0$; it also tells about {\it approaches} to degeneracy with $w_n$
small.
\subsection{How Small is ``Small''?}
When looking at $w_n$, just exactly how small is ``small''?
\subsubsection{Strictly Speaking\dots}
Strictly speaking, this is governed by numerical machine
accuracy. NR suggests measuring this by the ratio of the largest $w_n$
to the smallest. This ratio is the {\it condition number}. Single
precision 32-bit floats have accuracy $\sim 10^{-6}$, so if the
condition number is larger than $10^6$ you certainly have problems.
However, in my limited experience a large condition number---even much
larger than this---does not necessarily represent degeneracy. Rather, I
have found a small weight (getting down towards $10^{-6}$ for single
precision) to be a more reliable indicator. But NR's discussion makes
more sense, and my experience is limited!
\subsubsection{Practically Speaking\dots}
Practically speaking, it's not machine accuracy that counts.
Rather, it's the accuracy of your data and the ability of those
uncertain data to define the parameters. In any real problem, plot the
vector $w$. If the values span a large range, then the data with small
$w_n$ might not be defining the associated vectors well enough. If so,
then {\it practically} speaking, these vectors are degenerate.
How to tell what constitutes ``small $w_n$''? As far as I know,
and as far as NR says, it's an art.
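One way to get a feeling for this is to construct a model with two
nearly identical basis functions and look at the weights. A sketch in
Python/NumPy; both the model and the $10^{-6}$ cutoff are invented for
illustration:

```python
import numpy as np

# Two nearly degenerate basis functions: the second column is almost a
# multiple of the first, so one singular value is tiny but nonzero.
t = np.linspace(0., 1., 50)
X = np.column_stack([t, t * (1. + 1e-8 * np.sin(50. * t))])

w = np.linalg.svd(X, compute_uv=False)
condition_number = w.max() / w.min()
assert condition_number > 1e6     # flags trouble by the NR criterion

# An illustrative "small w" cutoff relative to the largest weight;
# choosing this threshold is the art referred to above.
small = w < 1e-6 * w.max()
assert int(small.sum()) == 1      # exactly one near-degenerate direction
```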
\subsection {Doing SVD in IDL} \label{svdandidl}
Doing SVD in IDL is easy. IDL's \verb$svdc$ procedure provides
the SVD decomposition. Thus for equation \ref {basicsvd}, the IDL
SVD decomposition is given by
\begin{equation}
{\bf svdc, A, w, U, V} \ ;
\end{equation}
\noindent the notation is identical to that in equation \ref {basicsvd}.
\subsection{My routine {\it lsfit\_svd}} \label{myidl}
I've written an IDL routine \verb$lsfit_svd$ that makes the
above process of dealing with the weights easy. The documentation is
detailed enough to remind you what to do. It allows you to mess with
the inverse weights and it provides not only the coefficients but also
their errors and the covariance matrix.
\section{BRUTE FORCE CHI-SQUARE AND THE CURVATURE MATRIX}
\subsection{Parameter Uncertainties in Brute Force chi-square Fitting}
There are times when ``brute force'' least squares is
appropriate. For example, if you have a nonlinear problem in which
taking derivatives is complicated, and if the number of unknown
coefficients is small, then it might be easier to search through the
coefficient parameter space, calculate the $\chi^2$ or $s^2$ for each
combination of parameters, and find the minimum. This provides the
best-fit parameter values.
How about the parameter uncertainties? Here we describe the case
for a single parameter fit; call this parameter $a$. Generalizing to
more parameters is straightforward.
For a chi-square fit, getting the uncertainty is easy. Calculate
$\chi^2$ as a function of the guessed values of $a$. As usual, define
$\Delta \chi^2$ as $\chi^2$ minus its minimum value; the minimum value
gives the best estimate of $a$. The uncertainty in $a$ is that offset
where $\Delta \chi^2 = 1$. See \S \ref{chicoeffs}.
For a least-squares fit, it's exactly the same idea. A
least-squares fit implies that the measurement uncertainties $\sigma_m$
are all identical, equal to $\sigma$. Thus, the sample variance $s^2 =
{1 \over M-1} \sum \Delta y_m^2$ is equal to ${\chi^2 \sigma^2 \over
(M-1)}$. In other words, $\chi^2 = {(M-1) \over \sigma^2} s^2$, which has
expectation value $(M-1)$. Therefore, the uncertainty in $a$ is that
offset where $\Delta \chi^2 = 1$, i.e.\ where $\Delta s^2 = {\sigma^2
\over (M-1)}$.
To be totally explicit: for the fitted value $a_{fit}$, the sample
variance is
\begin{equation}
s_{min}^2 = {1 \over M-1} \sum (y_m - a_{fit})^2 \ .
\end{equation}
\noindent As $a$ is moved from its fitted value, $s^2$ increases, so we
can speak of the minimum sample variance $s_{min}^2$. As we move $a$
from its fitted value by amounts $\Delta a$, the uncertainty in $a$ is
that value of $\Delta a$ for which $s^2$ increases by ${s_{min}^2 \over
M-1}$, i.e.\ that value of $\Delta a$ for which
\begin{equation}
\Delta s^2 = s^2 - s_{min}^2 = {s_{min}^2 \over M-1}
\end{equation}
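As a check on this recipe, here is a brute-force sketch (in Python/NumPy,
with invented data) for the simplest case of fitting a single constant
$a$: scan trial values, locate $s_{min}^2$, and read off the uncertainty
where $\Delta s^2 = {s_{min}^2 \over M-1}$. The result agrees with the
familiar standard error of the mean, $s_{min}/\sqrt{M}$:

```python
import numpy as np

# Brute-force fit of a single constant a to M measurements y_m.
y = np.array([9.8, 10.3, 10.1, 9.6, 10.4, 10.0, 9.9, 10.2])
M = len(y)

# Scan trial values of a and compute the sample variance at each one.
a_grid = np.linspace(9.0, 11.0, 20001)
s2 = np.array([np.sum((y - a)**2) / (M - 1) for a in a_grid])

a_fit = a_grid[np.argmin(s2)]          # best-fit value (the mean)
s2_min = s2.min()
delta_s2 = s2 - s2_min

# Uncertainty: half-width of the region where Delta s^2 <= s_min^2/(M-1).
inside = a_grid[delta_s2 <= s2_min / (M - 1)]
sigma_a = 0.5 * (inside.max() - inside.min())

assert abs(a_fit - np.mean(y)) < 1e-3
# This matches the standard error of the mean, s_min/sqrt(M):
assert abs(sigma_a - np.std(y, ddof=1) / np.sqrt(M)) < 1e-3
```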
\section{ NOTATION COMPARISON WITH NUMERICAL RECIPES}
I learned least squares from Appendix A of Chauvenet (1863). He
didn't use $\chi^2$ and didn't use matrix techniques, but \S
\ref{sectionone} and \ref{sectiontwo} follow his development quite
closely. I wrote the first version of this document before knowing of
NR's treatment, which explains my orientation towards least squares
instead of chi-square. I'm fortunate in this approach because it made me
realize the pitfalls one can get into with chi-square, as I discuss in
\S \ref{chisqsection}.
On the other hand, NR describe the least-squares approach with
some disdain in the discussion of equation (15.1.6) and warn that it is
``dangerous'' because you aren't comparing the residuals to the
intrinsic inaccuracies of the data. In astronomy, though, more often
than not you don't have an independent assessment of $\sigma_{m}$.
But you might know the relative weights, and this is a plus for
chi-square fitting. In any case, heed our warnings about chi-square
fitting in \S \ref{chisqsection}.
In this writeup I have revised my old notation to agree,
partially, with NR's. This effort wasn't completely successful because I
didn't read NR very carefully before starting. To make it easier to
cross-reference this document with NR, I provide the following table of
correspondences (left of the double arrow is ours, to the right is
theirs):
\begin{mathletters}
\begin {equation}
{\bf X} \longleftrightarrow {\bf A}
\end{equation}
\begin{equation}
{\bf Y} \longleftrightarrow {\bf b}
\end{equation}
\begin{equation}
{\bf X^T \cdot X} = {\bf XX} = [\alpha] \longleftrightarrow
{\bf A^T \cdot A} = [\alpha]
\end{equation}
\begin{equation}
{\bf XX}^{-1} = {\bf XXI} = [\alpha]^{-1} \longleftrightarrow
[\alpha]^{-1} = [C] = {\bf C}
\end{equation}
\noindent I use $M$ for the number of measurements and $N$ for the
number of unknown coefficients; NR uses the opposite, so we have
\begin{equation}
N \longleftrightarrow M
\end{equation}
\begin{equation}
M \longleftrightarrow N
\end{equation}
\end{mathletters}
\noindent Confusing, hey what?
\acknowledgements It is a great pleasure to thank Tim Robishaw for his
considerable effort in providing detailed comments on several aspects of
this paper. These comments led to significant revisions and improvements.
He also fixed bad formatting and manuscript glitches.
\begin{references}
\reference {} Bevington, P.R. \& Robinson, D. 1992, Data Reduction and
Error Analysis for the Physical Sciences (WCB/McGraw-Hill).
\reference {} Chauvenet, W. 1863, A Manual of Spherical and Practical
Astronomy, Dover Press.
\reference {} Cowan, G. 1998, Statistical Data Analysis, Clarendon
Press.
\reference {} Jefferys, W.H. 1980, AJ, 85, 177.
\reference {} Press, W.H., Flannery, B.P., Teukolsky, S.A., \&
Vetterling, W.T. 2001, Numerical Recipes (second edition), Cambridge
University Press.
\reference {} Stetson, P.B., {\it
http://nedwww.ipac.caltech.edu/level5/Stetson/Stetson4.html}.
\reference {} Taylor, J.R. 1997, An Introduction to Error Analysis,
University Science Books.
\end{references}
\end{document}