\documentclass[psfig,preprint]{aastex}
\begin{document}
\title{LEAST-SQUARES AND CHI-SQUARE FITTING FOR AY250 \\
\today}
\author{Carl Heiles}
In our never-ending attempt to make your life easier, we present
you with this highly instructive, time-saving, and labor-saving
informative document! Here we give heuristic derivations, discussions,
examples, and the prescription for doing least-squares the easy way
using matrix techniques generally, and specifically in IDL. This
prescription is given as an example in \S\ref{numexample}, and the {\it
power-user} can skip the details and go directly there.
This document is an update, correction, clarification, and
elaboration of a similar document made for the undergraduate lab class.
Here we update the notation to conform with {\it Numerical Recipes}
(NM), and occasionally refer to the books Taylor (1997; T97), Bevington
and Robinson (1992; BR), and Press et al.\ (19**; Numerical Recipes, NM).
We begin with least squares in the classic sense, meaning we minimize
the sum of squares instead of minimizing $\chi^2$. This is often the
practical case in astronomy. In later sections we generalize to the
$\chi^2$ case.
\tableofcontents
\section{LEAST-SQUARE FITTING FOR TWO PARAMETERS, AS WITH A STRAIGHT
LINE.} \label{sectionone}
\subsection{The closed-form expressions for a straight-line fit}
First consider the least-squares fit to a straight line. Let
$\theta_m$ be the $m^{th}$ measurement of the observed quantity (in this
example, the zenith distance); let $t_m$ be the time of the $m^{th}$
measurement; and let $M$ be the total number of observations, i.e.\ $m$
runs from 0 to $M-1$. Remember that in the least-squares technique,
quantities such as $t_m$ are regarded as known with high accuracy,
while the quantity $\theta_m$ has uncertainties in its measurement.
We expect the zenith distance $\theta_m$ to change linearly with
time as follows:
\begin{equation} \label{one}
A + B t_m = \theta_m \; .
\end{equation}
\noindent Given this, one does the maximum likelihood estimate assuming
Gaussian statistics. When all measurements have the same intrinsic
uncertainty, this is the same as looking for the solution that minimizes
the sum of the squares of the residuals (which we will define later).
This leads to the pair of equations (Taylor 8.8, 8.9), called the {\it
normal equations}
\begin{mathletters} \label{normalone}
\begin{equation}
AN + B \ \sum t_m = \sum \theta_m
\end{equation}
\begin{equation}
A \ \sum t_m + B \ \sum t_m^2 = \sum t_m \theta_m
\end{equation}
\end{mathletters}
\noindent Two equations and two unknowns---easy to solve! The
closed-form equations for $(A,B)$ are Taylor's equations 8.10 to 8.12.
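For concreteness, here is a minimal sketch of that closed-form solution. We use Python here and in later examples rather than IDL so that the sketches can be run anywhere; the data are invented so that the fit is exact.

```python
# Closed-form solution of the straight-line normal equations
# (Taylor's eqs. 8.10-8.12) for the model theta = A + B*t.
t = [1.0, 2.0, 3.0, 4.0]
theta = [5.0, 7.0, 9.0, 11.0]      # constructed so theta = 3 + 2*t exactly

M = len(t)
St   = sum(t)                                  # [t]
Stt  = sum(x * x for x in t)                   # [t^2]
Sth  = sum(theta)                              # [theta]
Stth = sum(x * y for x, y in zip(t, theta))    # [t theta]

Delta = M * Stt - St ** 2          # determinant of the 2x2 normal matrix
A = (Stt * Sth - St * Stth) / Delta    # intercept
B = (M * Stth - St * Sth) / Delta      # slope
print(A, B)                        # -> 3.0 2.0
```

Because the data lie exactly on a line, the fit recovers the generating coefficients exactly.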
\subsection{Better is the following generalized notation.}
We want a way to generalize this approach to include any
functional dependence on $t$ and even other variables, and to have an
arbitrarily large number of unknown coefficients instead of just the two
$(A,B)$. This is very easy using matrix math. We will ease into this
matrix technique gently, by first carrying through an intermediate
stage of notation.
First generalize the straight-line fit slightly by having two
functional dependences instead of one. We have something other than the
time $t_m$; call it $s_m$. For example, we could have $s_m = \cos (t_m)$
or $s_m = t_m^2$; or we could have $s_m = x_m$, where $x_m$ is the
position from which the observation was taken. To correspond to equation
\ref{one}, $s_m = 1$. Then we rewrite equation \ref{one} to include this
extra dependence
\begin{equation} \label{two}
A s_m + B t_m = \theta_m \; .
\end{equation}
\noindent There are still only two unknown parameters, so this is an
almost trivial generalization; later we'll generalize to more
parameters.
We have $M$ equations like equation \ref{two}, one for each
measurement. They are known as the {\it equations of condition} because
they are the equations that specify the theoretical model to which we
are fitting the data. There are $M$ equations of condition and only two
unknowns ($A$ and $B$). This is too many equations! We have to end up
with a system in which the number of equations is equal to the number of
unknowns.
To accomplish this, from equation \ref{two} we form the {\it
normal equations}. The number of normal equations is equal to the
number of unknowns, so in this case we will have two. We could carry
through the same ML derivation to derive equations equivalent to
equation \ref{normalone}; the result is
\begin{mathletters} \label{ones}
\begin{equation}
A\ \sum s_m^2 + B \ \sum s_m t_m = \sum s_m\theta_m
\end{equation}
\begin{equation}
A \ \sum s_m t_m + B \ \sum t_m^2 = \sum t_m \theta_m \ .
\end{equation}
\end{mathletters}
\noindent We can rewrite these equations using the notation $[st] =
\sum s_m t_m$, etc.:
\begin{mathletters} \label{normaltwo}
\begin{equation}
A [ s^2 ] + B [ s t ] = [ s \theta ]
\end{equation}
\begin{equation}
A [ s t ] + B [ t^2 ] = [ t \theta ] \ .
\end{equation}
\end{mathletters}
\noindent This is, of course, precisely analogous to equation
\ref{normalone}. And now it's clear how to generalize to more
parameters!
\section{LEAST-SQUARE FITTING FOR MANY PARAMETERS, AS WITH A CUBIC}
With this notation it's easy to generalize to more ($N$)
unknowns: the method is obvious because in each equation of condition
(like equation \ref{two}) we simply add equivalent additional terms such
as $C u_m$, $D v_m$, etc; and in the normal equations (equation
\ref{normaltwo}) we have more products and also more normal equations.
Let's take an example with four unknowns ($N=4$), which we will
denote by $A, B, C, D$; this would be like fitting a cubic. With $N=4$
we need at least five data points ($M=5$) so that there is at least one
spare degree of freedom for estimating uncertainties; thus there must be
at least five equations of condition. The generalization of equation \ref{ones}
is the $M$ equations
\begin{equation} \label{eqcond}
A s_m + B t_m + C u_m + D v_m = \theta_m \; ,
\end{equation}
\noindent with $m = 0 \rightarrow (M-1)$. Again, the least squares
fitting process assumes that the $s_m, t_m, u_m, v_m$ are known with
zero uncertainty; all of the uncertainties are in the measurements of
$\theta_m$. We then form the four normal equations; the generalization
of equation \ref{normaltwo} written in matrix notation is:
\begin{eqnarray} \label{smeqn}
\left[
\begin{array}{cccc}
{[ ss ]} & {[ st ]} & {[ su ]} & {[ sv ]} \\
{[ ts ]} & {[ tt ]} & {[ tu ]} & {[ tv ]} \\
{[ us ]} & {[ ut ]} & {[ uu ]} & {[ uv ]} \\
{[ vs ]} & {[ vt ]} & {[ vu ]} & {[ vv ]} \\
\end{array}
\; \right]
\left[
\begin{array}{c}
A \\
B \\
C \\
D \\
\end{array}
\; \right]
\; =
\left[
\begin{array}{c}
{[ s \theta ]} \\
{[ t \theta ]} \\
{[ u \theta ]} \\
{[ v \theta ]} \\
\end{array}
\; \right]
\end{eqnarray}
\noindent The $N \times N$ matrix on the left is symmetric. With $N$
equations and $N$ unknowns, you can actually {\it solve} for $A, B, C,
D$!
\section{FAR, FAR BEST AND EASIEST: MATRIX ALGEBRA.} \label{matrixmethod}
The above equations are terribly cumbersome to solve in a
computer code because they require lots of loops. However, the solution
becomes trivial if we use matrices. Here we designate a {\bf matrix} by {\bf
boldface} type.
We illustrate the matrix method by carrying through the above
$N=4$ example, and we assume that there are 5 independent measurements
($M=5$). We first define the matrices
\begin{mathletters} \label{matrixdefinition}
\begin{eqnarray}
{\bf X} = \left[
\begin{array}{cccc}
{s_0} & {t_0} & {u_0} & {v_0} \\
{s_1} & {t_1} & {u_1} & {v_1} \\
{s_2} & {t_2} & {u_2} & {v_2} \\
{s_3} & {t_3} & {u_3} & {v_3} \\
{s_4} & {t_4} & {u_4} & {v_4} \\
\end{array}
\; \right]
\end{eqnarray}
%\end{mathletters}
%\begin{mathletters}
\begin{eqnarray}
{\bf a} = \left[
\begin{array}{c}
A \\
B \\
C \\
D \\
\end{array} \; \right]
\end{eqnarray}
%\end{mathletters}
%\begin{mathletters}
\begin{eqnarray}
{\bf Y} = \left[
\begin{array}{c}
{\theta_0} \\
{\theta_1} \\
{\theta_2} \\
{\theta_3} \\
{\theta_4} \\
\end{array}
\; \right]
\end{eqnarray}
\end{mathletters}
\noindent so, in matrix form, the equations of condition (equation
\ref{eqcond}) reduce to the single matrix equation
\begin{equation}
\label{equationofcondition}
{\bf X \cdot a} = {\bf Y} \; .
\end{equation}
\noindent The notation for these equations agrees with NM's. The matrix
$\mathbf{X}$ is almost the same as NM's ``design matrix'' of Figure
15.4.1, and the matrix equation of condition (equation
\ref{equationofcondition}) is almost the same as the expression inside
the square brackets of NM's equation 15.4.6. The differences arise
because we have omitted the factors involving $\sigma_m$, the intrinsic
measurement uncertainties; we include them below in \S
\ref{chisqsection}.
Again, there are more equations than unknowns so we can't solve
this matrix equation directly. So next we form the normal equations
from these matrices. In matrix form, the normal equations (equation
\ref{smeqn}) reduce to the single equation
\begin{equation} \label{matrixnormal}
\mathbf{[\alpha] \cdot a} = \mathbf{[\beta]} \; ,
\end{equation}
\noindent (NM equation 15.4.10), where
\begin{mathletters}
\label{ssdef}
\begin{equation}
[\alpha] = {\bf X^T \cdot X}
\end{equation}
\begin{equation}
[\beta] = {\bf X^T \cdot Y}
\end{equation}
\end{mathletters}
\noindent The matrix $[\alpha]$ is known as the {\it curvature matrix}
because each element is proportional to the curvature of $s^2$ plotted
against the corresponding product of variables, as we discuss below in
\S \ref{chicoeffs}.
The number of equations is equal to the number of
unknowns, so the solution of the matrix equation is easy---just rewrite
it by multiplying both sides by the inverse of $[\alpha]$ (that is, by
$[\alpha]^{-1}$), which gives
\begin{equation}
\label{adef}
{\bf a} = \mathbf{[\alpha]^{-1} \cdot [\beta]} \; .
\end{equation}
\noindent All of these matrix operations are trivially easy in IDL (\S
\ref{numexample}).
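As a sketch of how few lines this takes, here it is in Python/numpy (our translation, not the memo's IDL; {\tt @} plays the role of IDL's {\tt \#\#}, and the data are those of the numerical example in \S \ref{numexample}):

```python
import numpy as np

# Equations of condition X (M x N) and data Y, from the numerical
# example of Section 5 (a quadratic in time t = 5, 7, 9, 11).
X = np.array([[1., 5., 25.],
              [1., 7., 49.],
              [1., 9., 81.],
              [1., 11., 121.]])
Y = np.array([142., 168., 211., 251.])

alpha = X.T @ X                  # [alpha] = X^T . X  (curvature matrix)
beta  = X.T @ Y                  # [beta]  = X^T . Y
a = np.linalg.inv(alpha) @ beta  # a = [alpha]^(-1) . [beta]
print(a)                         # -> approximately [96.625, 4.5, 0.875]
```

In production code, {\tt np.linalg.solve(alpha, beta)} is numerically preferable to forming the explicit inverse; we show the inverse only to mirror the equations above.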
\section{UNCERTAINTIES IN THE DERIVED COEFFICIENTS.} \label{sigmas}
How about the uncertainties in the derived quantities contained
in the matrix ${\bf a}$?
The first thing to do is derive the sample variance $s^2$ (the
square of the standard deviation, or mean error, or dispersion) of the
individual data points using the generalization of the usual definition
for a straight average of $x$, $s^2 = \sum_{m=0}^{M-1} (x_m - \overline{
x})^2/(M-1)$. The generalization is, simply, to replace the $M-1$ in
the denominator by $\nu = M-N$. In the straight-average case, $N=1$, so
this fits. Here $\nu$ is known as the number of {\it degrees of freedom}
and $N$, the number of unknown coefficients, is known as the number of
{\it constraints}. So we have
\begin{equation} \label{samplevarianceone}
s^2 = {1 \over M - N} \sum_{m=0}^{M-1} (\theta_m -
\overline{\theta_m})^2 \; ,
\end{equation}
\noindent where $\overline{\theta_m}$ are the values for $\theta_m$ {\it
predicted by the derived quantities ${\bf a}$}. Note the difference:
$\theta_m$ are the {\it observed} values, while $\overline{\theta_m}$ are the
values {\it predicted by the least squares fit}. The predicted values
are those that are computed from the derived coefficients $A, B, C$\dots
The $M$ quantities
\begin{equation}
\delta \theta_m = \theta_m - \overline{\theta_m}
\end{equation}
\noindent are called the {\it residuals} or {\it deviations} from the
fit.
It's easy to calculate the $\overline{\theta_m}$ with matrices. First
define the matrix ${\bf \overline{ Y}}$ that contains these values:
\begin{eqnarray}
{\bf \overline{ Y}} = \left[
\begin{array}{c}
\overline{\theta_0} \\
\overline{\theta_1} \\
\overline{\theta_2} \\
\overline{\theta_3} \\
\overline{\theta_4} \\
\end{array}
\; \right]
\end{eqnarray}
{\noindent} Calculating ${\bf \overline{ Y}}$ is simple:
\begin{equation} \label{ybardefinition}
{\bf \overline{ Y}} = {\bf X \cdot a} \; .
\end{equation}
\noindent Note that {\bf X} is already defined (equation
\ref{matrixdefinition}) and {\bf a} was solved for in equation
\ref{adef}. It's convenient to define the residual matrix
\begin{equation}
{\mathbf \delta Y } = {\bf Y - \overline{ Y}}
\end{equation}
\noindent so we can write
\begin{equation} \label{samplevariancetwo}
s^2 = {1 \over M - N} {\bf \delta Y^T \cdot \delta Y} \; .
\end{equation}
This is the sample variance of the datapoints, not the variances
in the derived coefficients. We can obtain these as before, by
generalizing the results from the two-parameter case like the
straight-line fit discussed in \S \ref{sectionone}. We won't go through
the derivation here; you can find it in Taylor \S 8.4 and equations 8.16
and 8.17. The result is
\begin{equation} \label{coeffvarianceone}
{\bf s_a}^2 = s^2 \, {\rm diag} \{[\alpha]^{-1} \}
\end{equation}
\noindent Or, to put it simply in words: to get the variance of
coefficient $n$ in the matrix ${\bf a}$, multiply $s^2$ by the $n^{th}$
diagonal element of $[\alpha]^{-1}$. You can also write this equation as
\begin{equation} \label{coeffvariancetwo}
{\bf s_a}^2 = {{\bf \delta Y^T \cdot \delta Y} \over M-N} \, {\rm diag} \{ [\alpha]^{-1} \}
\end{equation}
\noindent Although the above equations are correct, there is more to the
story because of the covariances, which are the off-diagonal elements of
$[\alpha]^{-1}$. We return to this topic in \S \ref{ncov}.
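These variance formulas, applied in Python/numpy to the data of the numerical example in \S \ref{numexample} (our sketch, with the memo's data):

```python
import numpy as np

# Quadratic fit to the Section 5 example; then form the sample
# variance s^2 and coefficient uncertainties from diag([alpha]^-1).
X = np.array([[1., 5., 25.], [1., 7., 49.],
              [1., 9., 81.], [1., 11., 121.]])
Y = np.array([142., 168., 211., 251.])
M, N = X.shape

alpha_inv = np.linalg.inv(X.T @ X)
a = alpha_inv @ (X.T @ Y)

Ybar = Y_pred = X @ a                # predicted values
dY = Y - Ybar                        # residuals: [1, -3, 3, -1]
s_sq = (dY @ dY) / (M - N)           # sample variance: 20.0 here

sigma_a = np.sqrt(s_sq * np.diag(alpha_inv))
print(s_sq, sigma_a)                 # sigma_a approx [34.0, 9.0, 0.559]
```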
\section{A NUMERICAL EXAMPLE AND ITS SOLUTION IN IDL.} \label{numexample}
If the following sounds like Greek to you, take a look at \S
\ref{matrixmethod} and \ref{sigmas}.
\subsection{Generation and solution of the numerical example}
{\bf (1) First, generate the numerical example.} Suppose that we
make four measurements of the angle $\theta$ and we want to fit to a
parabolic function in time $t$. In the notation of equation~\ref{eqcond},
$s$ would be unity, $t$ the time, and $u$ the time squared, so the
number of unknowns is three ($N=3$). Because there are four independent
measurements ($M=4$) the subscripts run from $m = 0 \rightarrow 3$.
Suppose that the four values of time are 5, 7, 9, 11.
\begin{figure}[h!]
\begin{center}
\leavevmode
%\includegraphics[height=7.5in, width=6.0in]{twolvl3b.ps}
\includegraphics[scale=.55, angle=90]{lsfitfig.ps}
%\includegraphics{twolvla.ps}
\end{center}
\caption{Our numerical example. Stars are the four datapoints; the solid
line is the fit. We perform two fits: one uses the original definition
of time; the other uses $(time-8)$, in effect moving the $y$-axis to the
dashed line. The two fits give the same line but the coefficients and
their errors differ greatly.\label{lsfitfig}}
\end{figure}
First we create the matrix {\bf X} in IDL
\begin{equation}
{\bf X = fltarr(N,M) = fltarr(3,4)}
\end{equation}
\noindent and then we populate it with numbers. In your own work, you
would normally do this by reading a data file and transferring the
numbers to the matrix using IDL commands; to work through this example,
you might manually type them in. After populating the matrix, in direct
correspondence with equation \ref{matrixdefinition}a we have $s_m = 1$,
$t_m = time_m$, $u_m = time_m^2$:
\begin{mathletters}
\begin{eqnarray}
{\bf X} = \left[
\begin{array}{ccc}
1 & 5 & 25 \\
1 & 7 & 49 \\
1 & 9 & 81 \\
1 & 11 & 121 \\
\end{array} \; \right]
\end{eqnarray}
\end{mathletters}
\noindent Suppose that the four measured values of $\theta$ are
(equation~\ref{matrixdefinition}c)
\begin{mathletters}
\begin{eqnarray}
{\bf Y} = \left[
\begin{array}{c}
142 \\
168 \\
211 \\
251 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent One word of caution here: in IDL, to get these into a column
matrix, which is how we've treated $\bf Y$ above, you have to define
${\bf Y}$ as a two-dimensional array because the second dimension
represents the column. When working in IDL it's more convenient to
define a row vector, which has only one dimension; in IDL you do this
by defining ${\bf Y} = [142,168,211,251]$; you can make it into the
necessary column vector by taking its transpose, i.e.\ ${\bf Y =
transpose(Y)}$. Alternatively, you can define ${\bf Y} = [ [fltarr(0)],
[142,168,211,251]]$.
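For readers working in Python/numpy instead of IDL, the analogous row-versus-column distinction is carried by the array shape (a small illustration we add here, not part of the memo's IDL recipe):

```python
import numpy as np

y_row = np.array([142., 168., 211., 251.])   # shape (4,): 1-D vector
y_col = y_row.reshape(-1, 1)                 # shape (4, 1): explicit column
print(y_row.shape, y_col.shape)              # -> (4,) (4, 1)
```

In practice numpy's {\tt @} operator promotes 1-D arrays sensibly on either side of a matrix product, so the explicit column form is rarely needed.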
\noindent Figure \ref{lsfitfig} shows the datapoints, together with the
fitted curve.
{\bf (2) Now solve it in IDL.} In IDL we calculate the normal
equation matrices and denote the $[\alpha]$ in equation \ref{ssdef}a by
${\bf XX}$:
\begin{mathletters}
\begin{equation}
{\bf XX} = {\bf transpose(X) \# \# X} \; ,
\end{equation}
\noindent and we denote the $[\beta]$ in equation \ref{ssdef}b by {\bf
XY}:
\begin{equation}
{\bf XY} = {\bf transpose(X) \# \# Y} \; .
\end{equation}
\end{mathletters}
\noindent In IDL we take the inverse of $[\alpha]$ (same as $\bf XX$) by
\begin{equation}
{\bf XXI} = {\bf invert(XX)} \; .
\end{equation}
The least-square fitted quantities are in the matrix {\bf a}
(equation \ref{adef}), which we obtain in IDL with
\begin{equation}
{\bf a} = {\bf XXI \ \#\# \ XY} \; .
\end{equation}
In IDL we denote the matrix of predicted values $\overline{
\theta_m}$ by ${\bf BARY}$, which is
\begin{equation}
\label{bteqn}
{\bf BARY} = {\bf X \ \#\# \ a} \; ,
\end{equation}
\noindent and we can also define the residuals in $\bf Y$ as
\begin{equation}
{\bf DELY} = {\bf Y - BARY}
\end{equation}
\noindent In IDL we denote $s^2$ in equations \ref{samplevarianceone}
and \ref{samplevariancetwo} by $s\_sq$ and write
\begin{equation}
\label{sigsq}
s\_sq = {\bf transpose(DELY) \#\# DELY}/(M-N) \; .
\end{equation}
\noindent It is {\it always} a good idea to plot all three quantities
(the measured values {\bf Y}, the fitted values {\bf BARY}, and the
residuals ${\bf DELY}$) to make sure your fit looks reasonable and to
check for bad datapoints.
To get the error in the derived coefficients we need a way to
select the diagonal elements of a matrix. Obviously, any $N \times N$
matrix has $N$ diagonal elements; a convenient way to get them is
\begin{equation}
diag \ elements \ of \ {\bf XXI}= {\bf XXI[(N+1) * indgen(N)]}
\end{equation}
\noindent In IDL, we define the variances of
the $N$ derived coefficients by {\bf vardc} (think of ``{\bf var}iances of
{\bf d}erived {\bf c}oefficients''). You can get this as in equation
\ref{coeffvarianceone} from
\begin{equation} \label{lscoefferrorone}
{\bf vardc} = s\_sq * {\bf XXI[(N+1)*indgen(N)]} \; .
\end{equation}
\noindent or as in equation \ref{coeffvariancetwo} from
\begin{equation} \label{lscoefferrortwo}
{\bf vardc} = {\bf (transpose(DELY) \#\# DELY)[0]} * {\bf XXI[(N+1)*indgen(N)]}/(M-N) \; .
\end{equation}
\noindent Here, note the general IDL feature: if you enclose any array
expression in parentheses, you can treat the result as an array and
subscript it as you would any other array.
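The diagonal-extraction trick has a direct numpy counterpart (our sketch; conveniently, the flat-index positions of the diagonal elements are the same in IDL's column-major and numpy's row-major orderings):

```python
import numpy as np

N = 3
X = np.array([[1., 5., 25.], [1., 7., 49.],
              [1., 9., 81.], [1., 11., 121.]])
XXI = np.linalg.inv(X.T @ X)

# IDL's  XXI[(N+1)*indgen(N)]  indexes the flattened array at
# positions 0, N+1, 2(N+1), ... -- exactly the diagonal elements.
diag_by_index = XXI.flat[(N + 1) * np.arange(N)]
diag_builtin  = np.diag(XXI)      # numpy's built-in equivalent
print(diag_by_index, diag_builtin)
```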
\subsection {Discussion of the numerical example}
For this numerical example, the solution (the array of derived
coefficients) is
\begin{mathletters}
\label{avalues}
\begin{eqnarray}
{\bf a} = \left[
\begin{array}{c}
96.6250 \\
4.49805 \\
0.875122 \\
\end{array} \; \right]
\end{eqnarray}
\noindent and the errors in the derived coefficients [the square roots of
the variances of the derived coefficients, i.e.\ in IDL
$sqrt({\bf vardc})$ in equations \ref{lscoefferrorone} and
\ref{lscoefferrortwo}] are:
\begin{eqnarray}
{\bf \sigma_A} = \left[
\begin{array}{c}
34.0118 \\
8.99995 \\
0.559014 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent These results look {\it horrible}: the uncertainties are large
fractions of the derived coefficients.
The reason: we have specifically chosen an example with terrible
covariance. And the great thing is this can be fixed easily (at least in
this case---certainly not always), without taking more data!
\section{THE COVARIANCE MATRIX AND ITS NORMALIZED COUNTERPART}
\label{ncov}
First we provide a general discussion, then we apply it to the
above numerical example.
\subsection{ General discussion}
The variances in the derived coefficients are obtained from the
diagonal elements of ${\bf XXI}$. The off-diagonal elements represent
the {\it covariances} between the derived coefficients. Covariance means,
specifically, the degree to which the {\it uncertainty} in {\it one}
derived coefficient affects the uncertainty in {\it another} derived
coefficient.
Because the covariance matrix elements relate pairwise the
various coefficients, the units of the matrix elements are all
different. This makes it convenient to reduce all the matrix elements to
a standard set of units---namely, no units at all. So before discussing
the covariance matrix {\it per se}, we first discuss its normalized
counterpart.
The normalized covariance
matrix\footnote{It is a pleasure to thank Doug Finkbeiner for
introducing me to this concept.} {\bf ncov} is derived from ${\bf XXI}$
by dividing each element by the square root of the product of the
corresponding diagonal elements. Let ${\bf ncov}$ be the normalized
covariance matrix; then
\begin{equation}
ncov_{ik} = {XXI_{ik} \over \sqrt {XXI_{ii} \ XXI_{kk}}}
\end{equation}
\noindent Or, in IDL, do the following:
\begin{mathletters} \label{idlcovariance}
\begin{equation}
{\bf dc = XXI[(N+1)*indgen(N)]}
\end{equation}
\begin{equation}
{\bf ncov = XXI/{\it sqrt}(dc \# dc)}
\end{equation}
\end{mathletters}
\noindent In the above, ${\bf dc \# dc}$ is an $N \times N$ matrix
consisting of products of the diagonals of ${\bf XXI}$, so dividing
${\bf XXI}$ by $sqrt({\bf dc \# dc})$ generates the normalized version.
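A Python/numpy sketch of equation \ref{idlcovariance}, using the numerical example's ${\bf XXI}$ ({\tt np.outer} plays the role of IDL's {\tt dc \# dc} outer product):

```python
import numpy as np

X = np.array([[1., 5., 25.], [1., 7., 49.],
              [1., 9., 81.], [1., 11., 121.]])
XXI = np.linalg.inv(X.T @ X)

dc = np.diag(XXI)                        # the N diagonal elements
ncov = XXI / np.sqrt(np.outer(dc, dc))   # normalized covariance matrix
print(ncov)                              # diagonal is exactly 1
```

The off-diagonal elements reproduce the huge values quoted in \S \ref{covarianceone}: about $-0.9898$, $0.9697$, and $-0.9938$.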
Because {\bf ncov} is a {\it normalized} covariance matrix, you
might think that its non-normalized parent is {\bf XXI}---and you'd be
{\it almost} right. The true covariance matrix ${\bf C}$ (as defined in
{\it Numerical Recipes} for example) is
\begin{equation}
{\bf C} = \sigma_{data}^2 {\bf XXI}
\end{equation}
\noindent because ${\bf C}$ is defined for $\chi^2$ instead of
$\sigma^2$ (see \S \ref{chisqsection}).
In ${\bf ncov}$, the diagonal elements are all unity and the
off-diagonal elements reflect the interdependence of the derived
coefficients on each other. The off-diagonal elements can range from
$-1 \rightarrow 1$. Each matrix element is the correlation coefficient
between the {\it uncertainties} of its two parameters. In particular,
suppose that the data happens to produce a coefficient that differs from
its true value by some positive number. If the normalized matrix element
is negative, then the other coefficient will tend to differ from its
true value by a negative number.
Here's a more detailed discussion of what the covariance means.
Suppose you are least-squares fitting for two derived coefficients
($A_0$ and $A_1$). When you do a least-squares fit to a set of data,
you are fitting one set of data out of an infinity of possible
sets that you'd get by repeating the experiment, and your particular set
of data happens to produce specific values of $\overline{ A_0}$ and $\overline{
A_1}$, which differ from the {\it true} values $(A_0^*, A_1^*)$ by
amounts $\delta A_0$ and $\delta A_1$. If their covariance is zero,
then in the infinity of data sets you'd find that $\delta A_0$ is
uncorrelated with $\delta A_1$. But if it is nonzero, these two
quantities would be correlated.
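This ``infinity of data sets'' picture can be simulated directly. The following sketch (our illustration, with invented true values and unit noise, not part of the memo) fits a straight line to many noisy realizations taken at the times of our numerical example and measures the correlation between the errors $\delta A_0$ and $\delta A_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.array([5., 7., 9., 11.])
X = np.column_stack([np.ones_like(t), t])   # straight-line fit: A0 + A1*t
A0_true, A1_true = 2.0, 0.5                 # invented "true" values

dA0, dA1 = [], []
for _ in range(2000):                       # 2000 simulated experiments
    y = A0_true + A1_true * t + rng.normal(0.0, 1.0, t.size)
    a = np.linalg.solve(X.T @ X, X.T @ y)
    dA0.append(a[0] - A0_true)
    dA1.append(a[1] - A1_true)

r = np.corrcoef(dA0, dA1)[0, 1]
print(r)    # strongly negative: intercept and slope errors anticorrelate
```

The measured correlation matches the corresponding element of the normalized covariance matrix for this design (about $-0.96$ for these times).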
A high covariance is bad because the derived variables depend on
each other. As we shall see in the next subsection, it also can greatly
increase the uncertainty in derived coefficients. Often a high
covariance results from a poor choice of functions that you are fitting
or even a bad choice of the zero point of the independent variable---as
in our numerical example (see the next subsection). And, as in that
example, you can sometimes eliminate the bad covariance by
reformulating the problem---you don't even need to take more data! The
best reformulation involves using a set of orthonormal functions.
However, sometimes your interest is in a specific set of functions that
are {\it not} orthogonal, and in such cases it makes no sense to convert
to orthogonal functions---because you just have to convert back again
and do the error propagation after-the-fact instead of letting the least
squares process do it for you.
\subsection{ The covariance in our numerical example} \label{covarianceone}
Apply equation \ref{idlcovariance} to obtain the normalized covariance
matrix for our numerical example:
\begin{eqnarray}
{\bf ncov} = \left[
\begin{array}{ccc}
1 & -.989848 & .969717 \\
-.989848 & 1 & -.993808 \\
.969717 & -.993808 & 1 \\
\end{array} \; \right] \; .
\end{eqnarray}
The off-diagonal elements are {\it huge}. This is the reason why
our solution is so bad.
In this seemingly innocuous example we have an excellent case of
a poor choice of zero point for the independent variable (the time).
The reason is clear upon a bit of reflection. We are fitting for $y =
A_0 + A_1 t + A_2 t^2$. The coefficient $A_0$ is the $y$-intercept and
$A_1$ is the slope. Inspection of Figure \ref{lsfitfig} makes it very
clear that an error in the slope has a big effect on the $y$-intercept.
Now we retry the example, making the zero point of the time
equal to the mean of all the times; that is, we replace $time_m$ by
$(time_m - 8)$. We get the same fitted line, but the derived coefficients are
completely different---and amazingly better! We get
\begin{mathletters}
\begin{eqnarray}
{\bf A} = \left[
\begin{array}{c}
188.625 \\
18.500 \\
0.87500 \\
\end{array} \; \right]
\end{eqnarray}
\begin{eqnarray}
{\bf \sigma_A} = \left[
\begin{array}{c}
3.58 \\
1.00 \\
0.559 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent In redefining the origin of the independent variable, we have
made the zero intercept completely independent of the slope: changing
the slope has no effect at all on the intercept. You can see this from
the normalized covariance matrix, which has become
\begin{eqnarray}
{\bf ncov} = \left[
\begin{array}{ccc}
1 & 0 & -0.78086881 \\
0 & 1 & 0 \\
-0.78086881 & 0 & 1 \\
\end{array} \; \right] \; .
\end{eqnarray}
\noindent which is nice, but not perfect: our orthogonalization is only
{\it partial} because the second-order coefficient $A_2$ still affects
$A_0$: over the range $[(time - 8) = -3 \rightarrow +3]$, the quantity
$A_2 \; \Sigma(time_m -8)^2$ is nonzero and thereby mimics the effect of $A_0$.
We could complete the process of orthogonalization by following
the prescription in BR chapter 7.3, which discusses the general technique of
orthogonalizing the functions in least-squares fitting. The general case
is a royal pain, so much so that we won't even carry it through for our
example.
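The improvement from shifting the zero point is easy to verify numerically. Here is a Python/numpy sketch using the memo's data (the helper {\tt fit\_quadratic} is ours):

```python
import numpy as np

time = np.array([5., 7., 9., 11.])
Y = np.array([142., 168., 211., 251.])

def fit_quadratic(t, y):
    """Fit y = A0 + A1*t + A2*t^2; return coefficients and their errors."""
    X = np.column_stack([np.ones_like(t), t, t ** 2])
    XXI = np.linalg.inv(X.T @ X)
    a = XXI @ (X.T @ y)
    dY = y - X @ a
    s_sq = (dY @ dY) / (t.size - 3)
    return a, np.sqrt(s_sq * np.diag(XXI))

a0, sig0 = fit_quadratic(time, Y)        # original zero point
a1, sig1 = fit_quadratic(time - 8., Y)   # shifted zero point
print(sig0)   # approx [34.0, 9.0, 0.559]: huge fractional errors
print(sig1)   # approx [3.58, 1.0, 0.559]: dramatically smaller
```

Both fits trace the identical curve; only the coefficients and their errors differ, exactly as claimed above.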
For some particular cases, standard pre-defined functions are
orthogonal. For example, if $t_m$ is a set of uniformly spaced points
between $(-1 \rightarrow 1)$ and you are fitting a polynomial, then the
appropriate orthogonal set is Legendre polynomials. This is good if your
only goal is to represent a bunch of points by a polynomial function,
because the coefficients of low-order polynomials are independent of the
higher ones. However, it's more work and, moreover, often you are
interested in the coefficients for specific functions that don't happen
to be orthogonal; in such cases, you should just forge ahead.
But {\it always} look at the normalized covariance matrix.
Suppose one pair of off-diagonal elements departs significantly from
zero. Then their corresponding functions are far from being orthogonal
and the variances of the derived coefficients will suffer as a result.
You might be able to eliminate one of the parameters to make the fit
more robust. For example, suppose one function is $t \cos (t)$ and the
other is $\sin(t) \cos (t)$. If the range of $t$ is small, these two
functions are indistinguishable and you should eliminate one from the
fit. If the range of $t$ is large, there is no problem.
For further discussion of covariance, see \S \ref{chicoeffs}.
You might also want to try out another example in Taylor's \S 8.5.
\section{CHI-SQUARE FITTING, AND WEIGHTED FITTING} \label{chisqsection}
Chi-square fitting is very much like our least-squares fitting
above, with one additional wrinkle. The reduced chi-square
($\widehat{\chi^2}$) is equal to the ratio of the {\it variance of the
fit} ($s^2$) to the {\it intrinsic variance of the datapoints}
($\sigma_{data}^2$). So
it should be obvious that in Chi-square fitting, you must know the
measurement uncertainties of the individual data points beforehand. If
you want to give the various datapoints weights based on something other
than $\sigma_{data}$, then that is just like Chi-square fitting except
that you can adopt an arbitrary scale factor for the uncertainties
(section~\ref{weightedcase}).
\subsection{The Chi-square when all datapoints have identical dispersions}
\label{sectionchisq}
Suppose, first, and only in this subsection, that all datapoints
have the same intrinsic measurement uncertainty $\sigma_{data}$. This is
a function {\it only} of the {\it instrument} and the {\it data-taking
procedure}: if you know the properties of the instrument, and if you
understand how it works, and if there are no systematic errors, then you
should be able to specify $\sigma_{data}$. Specifically, $\sigma_{data}$
is most certainly {\it not} the same thing as the dispersion of the
points from the least-squares fit.
This case, in which all datapoints have the same intrinsic
measurement uncertainty $\sigma_{data}$, occurs frequently, for example
when the datapoints are taken under identical conditions. Least-squares
assumes a Gaussian distribution of datapoints around the true
value, which is characterized by the dispersion $\sigma_{data}$. In
this case, Chi-square fitting is just like least-squares fitting except
for the following:
{\bf (1)} In the matrix $\bf X$ of equation
\ref{matrixdefinition}a, each row $m$ is a different measurement with a
different intrinsic variance $\sigma_{data,m}$; for now, we assume these
are all the same, equal to $\sigma_{data}$. You are generating a new
matrix $\mathbf {X_\sigma}$, which is identical to {\bf X} except that
each row $m$ is divided by $\sigma_{data,m}$. This new matrix is the
same as NM's {\it design matrix}, which they denote by $X$; we will
change our notation after this subsection to agree with theirs.
{\bf (2)} Divide each datapoint $\theta_m$ in equation
\ref{matrixdefinition}b by $\sigma_{data,m}$. You are generating a new
matrix $\mathbf {Y_\sigma}$, which is identical to {\bf Y} except that
each row is divided by $\sigma_{data}$. This new matrix is the same as
NM's vector {\bf Y}; we will change our notation after this subsection
to agree with theirs.
Look at the effect of these steps carefully. In
equation~\ref{equationofcondition} you've divided both sides by the same
quantity, $\sigma_{data}$. Obviously, this doesn't change the derived
results for {\bf a}.
However, when you calculate the predicted values ${\bf
\overline{ Y_\sigma}}$ in equation \ref{ybardefinition}, using $\bf
X_\sigma$ instead of $\bf X$, the values are divided by $\sigma_{data}$.
Because $\bf Y_\sigma$ is also divided by $\sigma_{data}$, the residuals
$\bf \delta Y_\sigma$ are as well. Thus when you calculate the sum of
the squares of the residuals as in equation \ref{samplevariancetwo} you get
$s^2 / \sigma_{data}^2$, which is the reduced chi-square, denoted
$\widehat{\chi_{obs}}^2$:
\begin{equation}
\label{chisq}
\widehat{\chi_{obs}}^2 = { {\bf \delta Y_\sigma^T \cdot \delta Y_\sigma} \over M-N}
= { {\bf \delta Y^T \cdot \delta Y} \over \sigma_{data}^2(M-N)}
\end{equation}
\noindent The reduced chi-square $\widehat{\chi_{obs}}^2$ is equal to
the ordinary chi-square $\chi_{obs}^2$ except that it is divided by the
number of degrees of freedom, ordinarily denoted by $\nu$, which is
equal to $(M-N)$. Moreover,
\begin{equation}
\widehat{\chi_{obs}^2} =
{s^2 \over \sigma_{data}^2} \ .
\end{equation}
This is a very reasonable result. Suppose, for example, that the
least-squares fit model is perfect and the only deviations from the
fitted curve result from measurement error. Then by necessity we have
$s^2 = \sigma_{data}^2$ and $\widehat{\chi_{obs}^2} = 1$. In practice,
of course, different experiments produce somewhat different values of
$s^2$ because of statistical fluctuations; an average over many
experiments gives $\sigma_{data}^2 = \langle s^2 \rangle$.
In the case $\widehat{\chi_{obs}}^2 = 1$ the dispersion of the
observed points $\sigma$ is equal to the intrinsic dispersion of the
datapoints $\sigma_{data}$ and the mathematical model embodied in the
least-squares fit is perfect. That, at least, is the {\it theoretical}
conclusion. In practice, however, obtaining such a low value of
$\widehat{\chi_{obs}^2}$ might mean instead that you are using {\it too
large} a value for $\sigma_{data}$: you are ascribing more error to your
datapoints than they really have, perhaps by not putting enough faith in
your instrument. Specifically, if the datapoints really have intrinsic
measurement dispersion $\sigma_{data}$ but you specify $2\sigma_{data}$
when computing $(\mathbf{X_\sigma},\mathbf{Y_\sigma})$,
then you'll find $\widehat{\chi_{obs}^2} = 0.25$.
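As a concrete check of equation~\ref{chisq}, here is a minimal sketch in
Python (rather than IDL); the residuals and the $\sigma_{data}$ values
are invented purely for illustration:

```python
def reduced_chisq(resid, sigma_data, n_params):
    """chi^2/(M-N): residuals divided by sigma_data, squared, summed,
    then divided by the number of degrees of freedom nu = M - N."""
    M = len(resid)
    nu = M - n_params                 # degrees of freedom
    return sum((r / sigma_data) ** 2 for r in resid) / nu

# Invented residuals delta-Y from a hypothetical 2-parameter fit:
resid = [0.8, -1.1, 0.3, -0.6, 1.0, -0.4]
chisq_hat = reduced_chisq(resid, sigma_data=1.0, n_params=2)

# Overstating the errors by a factor of 2 divides chi-hat^2 by 4:
chisq_hat_2sigma = reduced_chisq(resid, sigma_data=2.0, n_params=2)
print(chisq_hat, chisq_hat / chisq_hat_2sigma)   # ratio is 4.0
```

Doubling the assumed $\sigma_{data}$ divides $\widehat{\chi_{obs}^2}$ by
four, which is exactly the $0.25$ effect described above.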
High values of $\widehat{\chi_{obs}^2}$ indicate that the model
is not perfect and could be improved by the use of a different model,
such as the addition of more parameters---or, alternatively, that you
think your equipment works better than it really does and you are
ascribing {\it less} error to your datapoints than they really have.
We defer the interpretation of $\chi^2$ to below.
\subsection{The case in which datapoints have different dispersions,
or different weights: like a weighted average}
\label{weightedcase}
Henceforth, we change our notation from the previous subsection,
within which $\bf Y$ and $\bf Y_\sigma$ were different; henceforth,
consider $\bf Y \equiv \bf Y_\sigma$ and $\bf X \equiv \bf X_\sigma$.
This means that in the definition of $\bf Y$ and $\bf X$ we follow the
first two steps in \S \ref{sectionchisq}. This makes our notation
conform to that in NM.
In this case, the $\sigma_{data,m}$ are all different. The
$m^{th}$ row of the equation-of-condition matrix {\bf X} and the
$m^{th}$ element of the data vector {\bf Y} get divided by their
corresponding $\sigma_{data,m}$. The equation embodied in each row of
the matrix equation~\ref{equationofcondition} remains unchanged, but the
different rows are weighted differently with respect to each other.
Consider two measurements with intrinsic measurement
uncertainties $(\sigma_{data,1}, \sigma_{data,2})$; suppose
$\sigma_{data,1} < \sigma_{data,2}$. After being divided by their
respective $\sigma_{data}$'s, all of the numbers in row 1 are larger
than those in row 2. In all subsequent matrix operations, these larger
numbers contribute more to all of the matrix-element products and sums.
Thus, the measurement with smaller uncertainty has more influence on the
final result, as it should.
Suppose that the above two measurements were taken under
identical conditions except that measurement 1 received more integration
time than measurement 2; we have ${\sigma_{data,1} \over
\sigma_{data,2}} = \left({\tau_1 \over \tau_2}\right)^{-1/2}$, so the
rows of $\mathbf {X}$ are weighted as $\tau^{1/2}$. During the
computation of $\mathbf {X^T \cdot X}$, the self-products of row 1 are
therefore weighted as $\tau_1$: each datapoint is weighted as
$\tau$, which is exactly what you'd expect! Note that this is also
exactly the same weighting scheme used in a weighted average, in which
the weights are proportional to $\left(1 \over \sigma_{data}\right)^2$.
We conclude that the weighting scheme of the first two steps in
section~\ref{sectionchisq} agrees with common sense.
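This equivalence with the classic weighted average is easy to verify
directly. A minimal sketch in Python (rather than IDL), fitting a single
constant; the data values and dispersions are invented:

```python
# Invented measurements of a single quantity with differing uncertainties:
y     = [10.0, 11.0, 10.4, 9.8]
sigma = [ 1.0,  2.0,  0.5,  1.0]

# For X = a column of ones, dividing each row of X and Y by its
# sigma_data and forming the normal equations gives
#   (sum 1/s^2) * A = sum y/s^2
xx = sum(1.0 / s**2 for s in sigma)
xy = sum(yi / s**2 for yi, s in zip(y, sigma))
A_fit = xy / xx

# The classic weighted average, with weights w = 1/sigma^2:
w = [1.0 / s**2 for s in sigma]
A_wavg = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

print(A_fit, A_wavg)   # the two agree
```

The fit and the weighted average coincide, confirming that the row
division by $\sigma_{data}$ weights each datapoint as $1/\sigma_{data}^2$.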
Suppose you don't know the intrinsic measurement dispersion
$\sigma_{data}$, but you {\it do} know the {\it relative} dispersion of
the various measurements. For example, this would be the case if the
datapoints were taken under identical conditions except for integration
time; then $\sigma_{data} \propto \tau^{-1/2}$. In this case, multiply
each row by its weight $w \propto {1 \over \sigma_{data}}$ and proceed
as above.
\subsection{The uncertainties of the derived coefficients }
\label{chicoeffs}
The above two subsections deal with the Chi-square of the fitted
points, and provide information about how well the model fits the data.
It says nothing about the uncertainties in the derived coefficients. To
get this\footnote{Our discussion follows {\it Numerical Recipes} \S
14.5, the first three italicized subsections, which we denote by NR.},
you re-express equation~\ref{lscoefferror} in terms of the new
parameters, namely
\begin{equation}
\label{lscoefferror1}
{\bf sigdcsq} = \widehat{\chi_{obs}^2} * {\bf SSI_X[(N+1)*indgen(N)]} \; ,
\end{equation}
\noindent where $\widehat{\chi_{obs}^2}$ is from equation~\ref{chisq}.
However, {\it this is most definitely not the whole story} because you
are doing a multiparameter fit, and generally speaking the derived
parameters have nonzero covariance.
Consider the example of the first two coefficients in our above
example, which we discussed a bit above in \S \ref{covarianceone}. In
this example, the fit gives $y = A_0 + A_1 t + A_2 t^2$, where the
numerical values are given in vector form by equation~\ref{avalues}. The
coefficient $A_0$ is the $y$-intercept and $A_1$ is the slope. They have
derived values $A_0 = 96 \pm 34$ and $A_1 = 4 \pm 9$.
Remember what these uncertainties really mean: in an infinity of
similar experiments, you'll obtain an infinity of values of $(A_0,A_1)$
that are normally distributed with dispersions (34,9). Loosely
speaking, this means that $A_0$ lies between $(96-34=62)$ and
$(96+34=130)$ and $A_1$ lies between $(4-9=-5)$ and $(4+9=13)$. If you
are interested in knowing about $A_0$ {\it without regard to $A_1$},
then its uncertainty is indeed 34; ditto for $A_1$ (9). In other words,
equations~\ref{lscoefferror} and \ref{lscoefferror1} apply. Thus, if
you don't care about $A_1$, then the value $A_0 = 130$ is acceptable;
and if you don't care about $A_0$, then so is $A_1=13$ (both with
$68.3\%$ probability).
However, if you are interested in knowing about {\it both}, you
must include their covariance. In our example, the large negative
covariance follows logically just from looking at a graph: if you fit
some points, all of which lie at positive $t$, then a more negative
derived slope will raise the $y$-intercept.
Specifically, the large negative covariance means that positive
departures of $A_0$ are associated with negative departures of $A_1$.
So even though the {\it individual} values $A_0=130$ and $A_1=13$ are
acceptable, you {\it cannot} conclude that the {\it pair} of values
$(A_0,A_1) = (130, 13)$ is acceptable, because this pair has large
positive departures of both. In contrast, what {\it is} acceptable here
would be something like $(A_0,A_1) = (130, -5)$.
We stress that the acceptable ranges of values depend on what
you are interested in. This is sort of like the observer's influence in
quantum mechanics. If you are interested in $A_1$ alone, then you can
say $A_1 = 4 \pm 9$ and, in making this statement, you have to realize
that, as $A_1$ varies over this range, $A_0$ can vary over (formally, at
least) the range $(\infty \rightarrow -\infty)$: you just don't give a
damn {\it what} happens to $A_0$ because you're not interested. But the
moment you become interested and restrict its possible range, that
influences the possible range for $A_1$, too.
There is no simple relationship between the covariance matrix
elements and the acceptable ranges. For two variables, the best way to
express these matters is to construct the ellipses that define the loci
of constant $\Delta \chi^2$ and present them on a graph as in NR's
Figure 14.5.4. For three variables, these ellipses become ellipsoids;
for four, they become four-dimensional volumes, etc.
To construct these curves/surfaces/volumes/etc., follow the
prescription defined by the six bullets at the end of NR's section
{\it Probability Distribution of Parameters in the Normal Case}. This
description is a bit curt, but it is accurate. To make sure that you
understand it, take a numerical example and try applying the
prescription to the case of deriving the uncertainties of a single
coefficient (the case $\nu = 1$), which is the case covered by
equations~\ref{lscoefferror} and \ref{lscoefferror1} above. In this
case, you have the usual normal distribution function for that variable,
in which its formal error (as defined in equations~\ref{lscoefferror}
and \ref{lscoefferror1} above) defines a range within which the derived
value will fall $68.3\%$ of the time in the infinity of
experiments---it's just like any other normal distribution.
\section{NONLINEAR LEAST SQUARES} \label{nonlinearls}
The least squares formulation requires that the data values
depend {\it linearly} on the unknown coefficients. For example, in
equation \ref{one}, the unknown coefficients $A$ and $B$ enter linearly.
Suppose you have a nonlinear dependence, such as for example
wanting to solve for $A$ and $B$ with equations of condition that look
like
\begin{equation} \label{basicone}
\sin(A t_m) + B t_m = \theta_m \; .
\end{equation}
\noindent What do you do here? You linearize the equation, using the
following procedure.
First, assume trial values for $A$ and $B$; call these $A_0$
and $B_0$. You should pick values that are close to the correct ones. In
our example you wouldn't need to do this for $B$, but it's easier to
treat all coefficients identically. These trial values produce {\it
predicted} values $\theta_{0,m}$:
\begin{equation} \label{trialone}
\sin(A_0 t_m) + B_0 t_m = \theta_{0,m} \; .
\end{equation}
\noindent Subtract equation \ref{trialone} from \ref{basicone}, and
express the differences as derivatives. Letting $\delta A = A - A_0$ and
$\delta B = B - B_0$, this gives
\begin{equation} \label{difference}
\delta A [t_m \cos(A_0 t_m)] + \delta B t_m = \theta_m - \theta_{0,m} \; .
\end{equation}
\noindent This is linear in $(\delta A, \delta B)$ so you can solve for
them using standard least squares. Increment the original guessed values
to calculate $A_{0,new}=A_0 + \delta A$ and $B_{0,new}=B_0 + \delta B$.
These won't be exact because higher derivatives (including cross
derivatives) come into play, so you need to use these new values to
repeat the process. This is an iterative procedure and you keep going
until the changes become ``small''. The generalization to an arbitrarily
large number of unknown coefficients is obvious.
We now offer some cautionary and practical remarks.
{\bf (1): Multiple minima:} Nonlinear problems often have
multiple minima in $\sigma^2$. A classical case is fitting multiple
Gaussians to a spectral line profile. Gaussians are most definitely not
orthogonal functions and in some cases several solutions may give almost
comparably good values of $\sigma^2$, each one being a local minimum.
For example, for the case of two blended Gaussians, one can often fit
two narrow Gaussians or the combination of a wide and narrow Gaussian,
the two fits giving almost equal $\sigma^2$. The lower of these is the
real minimum but, given the existence of systematic errors and such, not
necessarily the best solution. The best solution is often determined by
physical considerations; in this case, for example, you might have
physical reasons to fit a broad plus narrow Gaussian, so you'd choose
this one even if its $\sigma^2$ isn't the true minimum.
{\bf (2): The Initial Guess:} When there are multiple minima,
the one to which the solution converges is influenced by your initial
guess. To fully understand the range of possible solutions, you should
try different initial guesses and see what happens. If the solutions
always converge to the same answer, then you can have some confidence
(but not {\it full} confidence) that the solution is unique.
{\bf (3): Iterative stability:} If your initial guess is too far
from the true solution, then the existence of higher derivatives means
that the computed corrections can be too large and drive the iterative
solution into instability. It is often a good idea to multiply the
derived correction factors ($\delta A$ and $\delta B$ above) by a factor
${\cal F}$ less than unity, for example ${\cal F} = 0.5$ or 0.75. This
increases the number of iterations required for convergence but often
allows convergence instead of producing instability.
{\bf (4): Convergence criteria:} How do you know when the
solution has converged? One way: for each iteration, calculate the
uncertainties in the derived coefficients. If the uncertainty exceeds
the correction, then you are getting close. An alternate way, which I
usually use: if the fractional correction (e.g.~$\delta A / A_0$)
decreases below some threshold, say $1\%$, you're close (some
parameters, such as angles, need a threshold that is absolute instead of
fractional). At this point, if you are using ${\cal F} \neq 1$, set
${\cal F} = 1$, do a few more iterations, and you're done.
{\bf (5): Numerical derivatives:} Sometimes the equations of
condition are so complicated that taking the derivatives, as in
obtaining equation \ref{difference}, is a huge job and subject to
mistakes. So you can take numerical derivatives instead of analytic
ones. Be careful, though; it's safest to use double precision and think
a bit about numerical accuracy; take a look at NM's section ******* on
evaluating numerical derivatives.
{\bf (6): Canned nonlinear least squares:} Packages like IDL
offer canned nonlinear least squares routines. They are designed to work
well for a wide range of different problems. However, often you can do
better by tailoring things (such as the factor ${\cal F}$ and
convergence criteria above) for the specific problem at hand. A good
example is Gaussian fitting: IDL's fitting program doesn't converge for
some input data, while for many of these cases the program that I wrote
myself works fine.
{\bf (7): Be careful and LOOK at the solution before accepting
it!} These nonlinear problems can produce surprising results, sometimes
completely meaningless results. Don't rely on them to be automatic or
foolproof!
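The iterative scheme of this section, including the damping factor
${\cal F}$ and a fractional convergence criterion, can be sketched as
follows (in Python rather than IDL). The data are synthetic and
noise-free, generated from invented ``true'' values of $A$ and $B$, so
the iteration should recover them exactly:

```python
import math

# Gauss-Newton iteration for sin(A t) + B t = theta, linearized as in
# equation (difference): dA*[t cos(A0 t)] + dB*t = theta - theta0.
A_true, B_true = 1.2, 0.7                       # invented "true" values
t = [0.1 * (m + 1) for m in range(20)]          # t_m, assumed exact
theta = [math.sin(A_true * tm) + B_true * tm for tm in t]

A, B = 1.0, 0.5        # trial values A_0, B_0, close to the truth
F = 0.75               # damping factor F < 1, for iterative stability

for iteration in range(200):
    # residuals theta_m - theta_{0,m} and derivatives at the trial values
    r  = [th - (math.sin(A * tm) + B * tm) for th, tm in zip(theta, t)]
    dA = [tm * math.cos(A * tm) for tm in t]    # d(theta)/dA
    dB = [tm for tm in t]                       # d(theta)/dB

    # 2x2 normal equations, solved by Cramer's rule
    a11 = sum(x * x for x in dA)
    a12 = sum(x * y for x, y in zip(dA, dB))
    a22 = sum(y * y for y in dB)
    b1  = sum(x * ri for x, ri in zip(dA, r))
    b2  = sum(y * ri for y, ri in zip(dB, r))
    det = a11 * a22 - a12 * a12
    delta_A = (b1 * a22 - b2 * a12) / det
    delta_B = (a11 * b2 - a12 * b1) / det

    # apply damped corrections; stop when fractional corrections are tiny
    A += F * delta_A
    B += F * delta_B
    if abs(delta_A / A) < 1e-12 and abs(delta_B / B) < 1e-12:
        break

print(A, B)   # converges to the true values 1.2, 0.7
```

With a reasonable initial guess this converges in a handful of
iterations; moving the guess far from the truth (or shrinking the range
of $t_m$) illustrates the multiple-minima and stability cautions above.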
\section{BRUTE FORCE LEAST SQUARES AND THE CURVATURE MATRIX}
\subsection{Parameter Uncertainties in Brute Force Least Squares for a
Single Variable}
There are times when ``brute'' least squares is appropriate. For
example, if you have a nonlinear problem in which taking derivatives is
complicated, and if the number of unknown coefficients is small, then it
might be easier to search through the coefficient parameter space,
calculate the $\sigma^2$ for each combination of parameters, and find
the minimum. That procedure certainly provides the least squares
solution. It is particularly easy if there is only one coefficient. Let
us consider this case.
How do you determine the uncertainties in the derived
coefficient with this brute force technique? Consider the case of a
simple average (which is the least squares solution for a constant).
Suppose the least-squares value for the unknown coefficient (i.e., the
average) is $x_0$ and that its uncertainty is $\Delta x_0$. Now consider
offsets from $x_0$, namely $x_0 + \delta x$. It is easy to show that
\begin{equation}
\sigma_{\delta x}^2 = \sigma_{x_0}^2 + \delta x^2
\end{equation}
\noindent This is a parabola. Intuitively, one expects that the
``narrower'' the parabola, i.e.\ the larger its curvature at $x_0$, the
smaller $\Delta x_0$. The curvature $\cal C$ is just half
the second derivative of this parabola evaluated at $x_0$, i.e.
\begin{equation}
{\cal C} = {1 \over 2} {d^2\sigma_{\delta x}^2 \over d \delta x^2}
\end{equation}
In this particular case, ${\cal C}=1$ and {\it the curvature is
independent of the values of the datapoints!} This independence of the
data {\it values} is {\it generally the case}. Generally, $\cal C$
depends on the {\it locations} of the datapoints (the ensemble of $t_m$
in equation \ref{one}) but not on the {\it measured values} (the
ensemble of $\theta_m$ in equation \ref{one}); you can see this because the
curvature is proportional to ${\bf SS}$, whose elements depend only on
the locations.
But it is obvious that the data values affect $\Delta x_0$. In
fact, from our experience with least squares, we expect $\Delta x_0
\propto {\sigma_{x_0} \over \sqrt N}$, where $N$ is the number of
datapoints. So we expect something like the following, which is, in
fact, the correct expression:
\begin{equation} \label{onecoeff}
\Delta x_0^2 = {\sigma_{x_0}^2 \over N {\cal C}}
\end{equation}
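Equation~\ref{onecoeff} is easy to check numerically for the simple
average. A sketch in Python (rather than IDL); the datapoints are
invented, and the curvature is evaluated by finite differences:

```python
# Brute-force check of sigma_dx^2 = sigma_x0^2 + dx^2 and of
# Delta x_0^2 = sigma_x0^2 / (N * C), with C = 1 for a simple average.
data = [4.1, 3.8, 4.4, 4.0, 3.7, 4.3, 4.2, 3.9]   # invented datapoints
N = len(data)
x0 = sum(data) / N                               # least-squares constant
s2_min = sum((d - x0) ** 2 for d in data) / N    # sigma_x0^2

def s2(offset):
    """Mean squared residual about the offset value x0 + offset."""
    return sum((d - (x0 + offset)) ** 2 for d in data) / N

# The scan traces the parabola s2(dx) = s2_min + dx^2, so the curvature
# C = (1/2) d^2 s2 / d(dx)^2 = 1, independent of the data values.
h = 1e-3
C = 0.5 * (s2(h) - 2.0 * s2(0.0) + s2(-h)) / h**2   # numerical 2nd deriv

delta_x0_sq = s2_min / (N * C)    # the familiar sigma^2 / N
print(C, delta_x0_sq)
```

The curvature comes out as 1 regardless of the data values, and
$\Delta x_0^2$ reduces to the familiar $\sigma_{x_0}^2/N$ for an average.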
\subsection{The Curvature Matrix}
One can show that the diagonal elements of our matrix ${\bf SS}$
are the curvatures, i.e. ${\cal C}_n = {{\bf SS}_{nn} \over N}$. This
makes us think that we might denote ${\bf SS}$ by the moniker {\it
curvature matrix}. This is close to the usual definition, for example in
{\it Numerical Recipes}, where their curvature matrix is
$\mathbf{\alpha} = { {\bf SS} \over \sigma_{data}^2} $. This makes
${\cal C}_n = {\sigma_{data}^2 \mathbf{\alpha}_{nn} \over N}$.
If there were only a single variable, or if ${\bf SS}$ were
diagonal, then the diagonal elements of its inverse are just the
reciprocals. In other words, for this special case, ${\bf SSI_{nn} =
SS_{nn}^{-1}}$ and ${\cal C}_{n} = {1 \over N \ SSI_{nn}}$; also,
equation \ref{onecoeff} holds. More generally, {\it for the particular
case} in which ${\bf SS}$ is diagonal, then the uncertainty in derived
coefficient $n$ is
\begin{equation} \label{onecoeffgen}
\sigma_n^2 = {\sigma^2 \over N {\cal C}_n}
\end{equation}
However, this is {\it not} true for more than one coefficient
and a nondiagonal covariance matrix. In that case, I {\it think} it's
true that ${\bf SSI_{nn} > SS_{nn}^{-1}}$; this means that ${\cal C}_{n}
> {1 \over N \ SSI_{nn}}$. Thus, equation \ref{onecoeffgen} {\it
under}estimates the error so that
\begin{equation} \label{onecoeffgenp}
\sigma_n^2 > {\sigma^2 \over N {\cal C}_n} \ .
\end{equation}
\noindent Fundamentally this is because the off-diagonal terms add
uncertainty to the derived coefficient because of the correlations among
the coefficient uncertainties.
To derive the coefficient error in brute force least squares,
you can use the curvature and equation \ref{onecoeff} if you have only a
single variable. Otherwise, you need to evaluate the covariance matrix;
if it is diagonal, you could use equation \ref{onecoeffgen}; but having
already evaluated the covariance matrix, you might as well do the
problem using the usual least squares technique.
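The inequality ${\bf SSI_{nn} > SS_{nn}^{-1}}$ asserted above is easy to
spot-check numerically. A sketch in Python (rather than IDL), using an
invented $2\times2$ positive-definite matrix in the role of ${\bf SS}$:

```python
# A nondiagonal, symmetric, positive-definite stand-in for SS:
SS = [[4.0, 3.0],
      [3.0, 9.0]]

# Its inverse SSI, via the closed form for a 2x2 matrix:
det = SS[0][0] * SS[1][1] - SS[0][1] * SS[1][0]
SSI = [[ SS[1][1] / det, -SS[0][1] / det],
       [-SS[1][0] / det,  SS[0][0] / det]]

# Diagonal of the inverse versus reciprocal of the diagonal:
for n in range(2):
    print(SSI[n][n], ">", 1.0 / SS[n][n])
```

Here $SSI_{00} = 9/27 \approx 0.333 > 1/4$ and $SSI_{11} = 4/27 \approx
0.148 > 1/9$: the off-diagonal coupling inflates the diagonal of the
inverse, i.e.\ the coefficient uncertainties.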
\section{YOU WANT MORE? YOU GOT IT!}
\subsection{ The true expert would\dots}
In our numerical example, a true expert (there aren't many of
these, even among professional astronomers) would often do something
more involved, namely to fit an orthogonal set of polynomials (i.e., in
this case the first three terms of Legendre polynomials) instead of just
a cubic. Because this set is orthogonal, the off-diagonal elements of
${\bf ncov}$ would automatically equal zero.
\subsection{ Comparison with Numerical Recipes}
There are two ``bibles'' of least-square fitting. Most
important is Press {\it et al.}'s {\it Numerical Recipes}, which
provides a wide-ranging treatment of computational techniques. Our
treatment follows theirs quite closely (with one important
exception---see next paragraph), but I wrote this before knowing of
their treatment so didn't use their notation.
The exception: their approach is based on $\chi^2$, while ours
is based on $\sigma^2$. Their approach is more flexible because
$\sigma_{data}$ can vary from one measurement to another; ours
implicitly assumes $\sigma_{data}$ is the same for all measurements.
They describe our approach in the discussion of equation (14.1.6) and
warn that it is ``dangerous'' because you aren't comparing the residuals
to the intrinsic inaccuracies of the data. In astronomy, though, more
often than not you don't have an independent assessment of
$\sigma_{data}$ and our approach is the only option. In any case, our
matrices correspond to theirs pairwise as follows (left of the double
arrow is ours, to the right is theirs):
\begin{mathletters}
\begin {equation}
{\bf S} \longleftrightarrow \sigma_{data} {\bf A}
\end{equation}
\begin{equation}
{\bf T} \longleftrightarrow \sigma_{data} {\bf b}
\end{equation}
\begin{equation}
{\bf S^T \cdot S} = {\bf SS} \longleftrightarrow \sigma_{data}^2 {\bf A^T \cdot A}
= \sigma_{data}^2 [\alpha] = \sigma_{data}^2 \mathbf {\alpha}
\end{equation}
\begin{equation}
{\bf SS}^{-1} = {\bf SSI} \longleftrightarrow {1 \over \sigma_{data}^2}
\mathbf {\alpha^{-1}} = {1 \over \sigma_{data}^2} [C] =
{1 \over \sigma_{data}^2} {\bf C}
\end{equation}
\end{mathletters}
\subsection{Chauvenet}
An ancient (1863!---and in this case, age is beauty!) treatise
by William Chauvenet, the Appendix of {\it A Manual of Spherical and
Practical Astronomy}, treats complicated cases such as least-square
fitting in the presence of known constraints (e.g. if three angles of a
triangle are measured, their sum is constrained to be $180^\circ$). But
it doesn't use matrices.
\section{ADDENDUM: FITTING A LINE WHEN BOTH X AND Y HAVE
UNCERTAINTIES}
We've mentioned that one of the essential assumptions of least
squares is that the independent variables are known with high precision
and the errors occur only in the measured data. Suppose you're fitting
two variables, $t$ and $\theta$, as in equation~\ref{one}. This
essential assumption means that $t$ is known with high precision and all
the uncertainty is in $\theta$, and you are minimizing the squares of
the residuals in the $\theta$-direction only.
If {\it both} variables have uncertainties, then you have to be
careful because the essential assumption is violated. Before doing
anything, read (lightly) Isobe, Feigelson, Akritas, and Babu (1990, ApJ
364, 104). Then adopt one of the two following procedures:
{\bf (1)} You may be able to reformulate the problem by defining
two new variables, one of which has much less fractional uncertainty
than the other. The most straightforward reformulation is to define
$x=(\theta + t)$ and $y=(\theta - t)$. The sum, $x$, has smaller
fractional uncertainty than $y$ so its dispersion {\it might} not matter
as much. In such a redefinition, you should include a scaling factor on
one of the original variables to make the two original dispersions about
equal.
Before proceeding with this option, re-read Isobe et al!
{\bf (2)} More generally, use the prescription in Isobe et al.
Lucky for me, I haven't had to delve into this method---but their
formulation looks good!
\references
\reference {} Bevington, P.R. \& Robinson, D. 1992, Data Reduction and
Error Analysis for the Physical Sciences (WCB/McGraw-Hill).
\reference {} Press, W.H., Teukolsky, S.A., Vetterling, W.T., \&
Flannery, B.P. 1992, Numerical Recipes (second edition), Cambridge
University Press.
\reference {} Taylor, J.R. 1997, An Introduction to Error Analysis,
University Science Books.
\end{document}