\documentclass[psfig,preprint]{aastex}
\begin{document}
\title{LEAST-SQUARES AND CHI-SQUARE FITTING FOR THE BUDDING, \\
AND ALSO THE POST-BUDDING, \\ AFICIONADO \\ \today}
\author{Carl Heiles}
In our never-ending attempt to make your life easier, we present
you with this highly instructive, time-saving, and labor-saving
informative document! Here we give the derivations and, also, the
prescription for doing least-squares the easy way using IDL. This
prescription is given as an example in \S 5, and the budding-aficionado
{\it power-user} can skip the details and go directly there. The
sections after the numerical example are for the more advanced
post-budding aficionado.
Anyone graduating from Berkeley who claims to understand least
square fitting needs to visit \S \ref{numexample}, \ref{ncov},
\ref{chisqsection}, and \ref{nonlinearls}. These become progressively
more involved; \S \ref{numexample} is enough to get started. \S
\ref{numexample} may be obscure without first going into the
introductory material in \S \ref{matrixmethod} and \ref{sigmas}.
\tableofcontents
\section{LEAST-SQUARE FITTING FOR TWO PARAMETERS, AS WITH A STRAIGHT
LINE.}
\subsection{The closed-form expressions in Taylor.}
First consider the least squares fit to a straight line. Let
$\theta_i$ be the $i^{th}$ measurement of the observed quantity (e.g.,
the forward measurement of zenith distance angle); $t_i$ be the time of
the $i^{th}$ measurement; $N$ = the total number of observations, i.e.
$i$ runs from 1 to $N$. Remember that in the least-squares technique,
quantities such as $t_i$ are regarded to be known with high accuracy
while the quantity $\theta_i$ has uncertainties in its measurement.
We expect the zenith distance $\theta_i$ to change linearly with
time as follows:
\begin{equation} \label{one}
A + B t_i = \theta_i \; .
\end{equation}
\noindent Given this, one looks at Taylor's equations (8.10) and (8.11)
and sees how to solve for the unknowns $A$ and $B$; these equations are
the closed form of the solution for the pair of equations (8.8) and
(8.9). Specifically, $A$ and $B$ are obtained by solving the pair of
equations
\begin{mathletters}
\begin{equation}
AN + B \ \Sigma t_i = \Sigma \theta_i
\end{equation}
\begin{equation}
A \ \Sigma t_i + B \ \Sigma t_i^2 = \Sigma t_i \theta_i
\end{equation}
\end{mathletters}
\subsection{Better is the following generalized notation.}
Taylor's closed-form expressions are simply longhand ways to
write matrix math. We will express them as matrices below in \S 3. But
to ease into this gently, we first will carry through an intermediate
stage of notation.
First rewrite equation (1) in the more general form
\begin{equation}
A s_i + B t_i = \theta_i \; .
\end{equation}
\noindent To correspond to equation (1), $s_i = 1$ for all $i$, but for
a more general application $s$ might be some parameter other than time
that varies from one measurement to another. We have $N$ equations like
equation (2). They are known as the {\it equations of condition}
because they are the equations that specify the theoretical model to
which we are fitting the data.
There are $N$ equations of condition and only two unknowns ($A$
and $B$). This is too many equations! We have to end up with a system
in which the number of equations is equal to the number of unknowns.
To accomplish this, from equation (2) we form the {\it normal
equations}. The number of normal equations is equal to the number of
unknowns, so in this case we will have two. To do this, first define
the following notation:
\begin{mathletters}
\label{sseqn}
\begin{equation}
[ss] = \sum_{i=0}^{N-1} s_i s_i \;, \\
\end{equation}
\begin{equation} [st] = \sum_{i=0}^{N-1} s_i t_i \; , \\
\end{equation}
\begin{equation} [s\theta] = \sum_{i=0}^{N-1} s_i \theta_i \;, \; {\rm etc.}
\end{equation}
\end{mathletters}
\noindent {\it Note}: Above and from now on, we consider $i = 0
\rightarrow (N-1)$ instead of $i = 1 \rightarrow N$ for consistency with
IDL's convention. Then the two normal equations are
\begin{mathletters}
\begin{equation}
[ss]A + [st]B = [s\theta] \; ,
\end{equation}
\begin{equation}
[ts]A + [tt]B = [t\theta] \; .
\end{equation}
\end{mathletters}
\noindent These are precisely analogous to Taylor's equations (8.8) and
(8.9). You could write the solution in closed form, as in Taylor's
equations (8.10) and (8.11).
\section{LEAST-SQUARE FITTING FOR MANY PARAMETERS, AS WITH A CUBIC.}
With this notation it's easy to generalize to more ($M$)
unknowns: the method is obvious because in each equation of condition
(like equation (2)) we simply add equivalent additional terms such as $C
c_i$, $D d_i$, etc; and in the normal equations (equation 4) we have
more products and also more normal equations.
Let's take an example with four unknowns ($M=4$), which we will
denote by $A, B, C, D$; this would be like fitting a cubic. With $M=4$
we need at least five data points ($N=5$), so there must be at least
five equations of condition. The generalization of equation (2) is the
$N$ equations
\begin{equation}
\label{eqcond}
A s_i + B t_i + C u_i + D v_i = \theta_i \; ,
\end{equation}
\noindent with $i = 0 \rightarrow (N-1)$. Again, the least squares
fitting process assumes that the $s_i, t_i, u_i, v_i$ are known with
zero uncertainty; all of the uncertainties are in the measurements of
$\theta_i$. We then form the four normal equations; the generalization
of equation (4) is:
\begin{eqnarray}
\label{smeqn}
\left[
\begin{array}{cccc}
{[ss]} & {[st]} & {[su]} & {[sv]} \\
{[ts]} & {[tt]} & {[tu]} & {[tv]} \\
{[us]} & {[ut]} & {[uu]} & {[uv]} \\
{[vs]} & {[vt]} & {[vu]} & {[vv]} \\
\end{array}
\; \right]
\left[
\begin{array}{c}
A \\
B \\
C \\
D \\
\end{array}
\; \right]
\; =
\left[
\begin{array}{c}
{[s \theta]} \\
{[t \theta]} \\
{[u \theta]} \\
{[v \theta]} \\
\end{array}
\; \right]
\end{eqnarray}
\noindent Note that this equation is really a matrix equation, and that
the $M \times M$ matrix on the left is symmetric; this means that you can
actually {\it solve} for $A, B, C, D$!
\section{FAR, FAR BEST AND EASIEST: MATRIX ALGEBRA.} \label{matrixmethod}
The above equations are terribly cumbersome to solve in a
computer code because they require lots of loops. However, it becomes
trivial if we use matrices. Here we designate a {\bf matrix} by {\bf
boldface} type.
We illustrate the matrix method by carrying through the above
$M=4$ example, and we assume that there are 5 independent measurements
($N=5$). We first define the matrices
\begin{mathletters}
\label{satmatrices}
\begin{eqnarray}
{\bf S} = \left[
\begin{array}{cccc}
{s_0} & {t_0} & {u_0} & {v_0} \\
{s_1} & {t_1} & {u_1} & {v_1} \\
{s_2} & {t_2} & {u_2} & {v_2} \\
{s_3} & {t_3} & {u_3} & {v_3} \\
{s_4} & {t_4} & {u_4} & {v_4} \\
\end{array}
\; \right]
\end{eqnarray}
%\end{mathletters}
%\begin{mathletters}
\begin{eqnarray}
{\bf A} = \left[
\begin{array}{c}
A \\
B \\
C \\
D \\
\end{array} \; \right]
\end{eqnarray}
%\end{mathletters}
%\begin{mathletters}
\begin{eqnarray}
{\bf T} = \left[
\begin{array}{c}
{\theta_0} \\
{\theta_1} \\
{\theta_2} \\
{\theta_3} \\
{\theta_4} \\
\end{array}
\; \right]
\end{eqnarray}
\end{mathletters}
\noindent so, in matrix form, the equations of condition (equation (5))
are simply the single matrix equation
\begin{equation}
\label{equationofcondition}
{\bf S \cdot A} = {\bf T} \; .
\end{equation}
Again, there are more equations than unknowns so we can't solve
this matrix equation directly. So next we form the normal equations
from these matrices. In matrix form, the normal equations (equation
(6)) reduce to the single equation
\begin{equation}
{\bf SS \cdot A} = {\bf ST} \; ,
\end{equation}
\noindent where
\begin{mathletters}
\label{ssdef}
\begin{eqnarray}
{\bf SS} = \left[
\begin{array}{cccc}
{[ss]} & {[st]} & {[su]} & {[sv]} \\
{[ts]} & {[tt]} & {[tu]} & {[tv]} \\
{[us]} & {[ut]} & {[uu]} & {[uv]} \\
{[vs]} & {[vt]} & {[vu]} & {[vv]} \\
\end{array}
\; \right]
\end{eqnarray}
%\end{mathletters}
\noindent and
%\begin{mathletters}
\begin{eqnarray}
{\bf ST} = \left[
\begin{array}{c}
{[s \theta]} \\
{[t \theta]} \\
{[u \theta]} \\
{[v \theta]} \\
\end{array}
\; \right]
\end{eqnarray}
\end{mathletters}
\noindent Now the number of equations is equal to the number of
unknowns, so the solution of the matrix equation is easy (in IDL, that
is).
In IDL, we generate the matrix {\bf SS} by
\begin{mathletters}
\label{idlssdef}
\begin{equation}
{\bf SS} = {\bf transpose(S) \# \# S} \;
\end{equation}
\noindent and the matrix {\bf ST} by
\begin{equation}
\label{idlssdefb}
{\bf ST} = {\bf transpose(S) \# \# transpose(T)} \; .
\end{equation}
\end{mathletters}
\noindent In IDL, $\#\#$ is the matrix multiplication
operator.\footnote{In equation~\ref{idlssdefb} we could have used {\bf
T} instead of {\bf transpose T}; however, in the unusual case in which
you are solving for a single unknown, {\bf transpose T} works and {\bf
T} doesn't.}
To solve the matrix equation (9) for {\bf A}, first rewrite it
by multiplying both sides by the inverse of ${\bf SS}$ (that is, by
${\bf SS}^{-1})$; this gives
\begin{equation}
\label{adef}
{\bf A} = {\bf SS^{-1} \cdot ST} \; .
\end{equation}
\noindent To compute $\bf SS^{-1}$ in IDL we use IDL's invert function
and we multiply matrices by typing $\#\#$. Thus, in IDL, first
calculate the inverse matrix $\bf SS^{-1}$, which we denote by {\bf
SSI}:
\begin{equation}
\label{ssidef}
{\bf SSI} = {\bf invert(SS)} \; ,
\end{equation}
\noindent and then carry out the multiplication in equation~\ref{adef}:
\begin{equation}
\label{idladef}
{\bf A} = {\bf SSI \ \#\# \ ST} \; .
\end{equation}
\section{UNCERTAINTIES IN THE DERIVED COEFFICIENTS.} \label{sigmas}
How about the uncertainties in the derived quantities contained
in the matrix ${\bf A}$?
The first thing to do is derive the variance $\sigma^2$ (the
square of standard deviation, or mean error, or dispersion, etc) of the
individual data points using the generalization of Taylor's equation
(8.14). The generalization is, simply, to replace the $N-2$ in the
denominator by $N-M$. For example, with the fit to a linear line there
are 2 parameters, Taylor's $A$ and $B$, so $M=2$ and the denominator is
$N-2$. So the generalization is
\begin{equation} \sigma^2 = {1 \over N - M} \sum_{i=0}^{N-1} (\theta_i -
\bar{\theta_i})^2 \; ,
\end{equation}
\noindent where $\bar{\theta_i}$ are the values for $\theta_i$ {\it
predicted by the derived quantities.} Note the difference: $\theta_i$
are the {\it observed} values, while $\bar{\theta_i}$ are the values
{\it predicted by the least squares fit}. The predicted values are
those that are computed from the derived coefficients $A, B, C\dots$ The
$N$ quantities $(\theta_i - \bar{\theta_i})$ are called the {\it
residuals} or {\it deviations} from the fit (Taylor, page 98).
It's easy to calculate the $\bar{\theta_i}$ with matrices. First
define the matrix ${\bf \bar T}$ that contains these values:
\begin{eqnarray}
{\bf \bar T} = \left[
\begin{array}{c}
{\bar{\theta_0}} \\
{\bar{\theta_1}} \\
{\bar{\theta_2}} \\
{\bar{\theta_3}} \\
{\bar{\theta_4}} \\
\end{array}
\; \right]
\end{eqnarray}
{\noindent} Calculating ${\bf \bar T}$ is simple:
\begin{equation}
{\bf \bar T} = {\bf S \cdot A} \; .
\end{equation}
\noindent Note that {\bf S} is already defined (equation 7a) and {\bf A}
was solved for in equation (12) and (14).
Let's do all this in IDL. To do equation (17) in IDL (in IDL we
designate ${\bf \bar T}$ by {\bf BT}):
\begin{equation}
\label{bteqn}
{\bf BT} = {\bf S \ \#\# \ A} \; .
\end{equation}
\noindent To get $\sigma^2$ (equation (15)) in IDL:
\begin{equation}
\label{sigsq}
sigsq = {\bf total}( ({\bf T - BT}) \wedge 2 )/(N-M) \; .
\end{equation}
\noindent $({\bf T - BT})$ is the array of residuals; it is usually a
good idea to plot all three quantities {\bf T}, {\bf BT}, and $({\bf T -
BT})$ to make sure your fit looks reasonable and to check for bad data
points.
OK, now the error of the derived coefficients is given by the
analogy to Taylor's equations (8.15) and (8.16). In IDL, let's
designate the array that contains the squares of errors of the $N$
derived coefficients by {\bf sigdcsq} (think of ``{\bf sig}ma of {\bf
d}erived {\bf c}oefficients {\bf sq}uared''); you get this from the
diagonal elements of {\bf SSI}, whose indices are given in IDL
by\footnote{Specifically, for any M-sized square array {\bf Q} the
vector of its diagonal elements is given by {\bf Q[(M+1)*indgen(M)]}.}
\begin{equation} \label{errcoeff}
\label{lscoefferror}
{\bf sigdcsq} = sigsq * {\bf SSI[(M+1)*indgen(M)]} \; .
\end{equation}
\noindent Or, to put it simply in words: to get the square of the
``error of coefficient $j$'' in the matrix ${\bf A}$, multiply
$\sigma^2$ (that's the single number $sigsq$ in IDL) by the $j^{th}$
diagonal element of the inverse matrix ${\bf
SS^{-1}}$. And it should be obvious that the standard
deviation of the coefficients is just $sqrt({\bf sigdcsq})$.
Although the above equation is correct, there is more to the
story because of covariance. We defer its discussion until the
numerical example.
\section{A NUMERICAL EXAMPLE AND ITS SOLUTION IN IDL.} \label{numexample}
If the following sounds like Greek to you, take a look at \S
\ref{matrixmethod} and \ref{sigmas}.
{\bf (1) First, generate the numerical example.} Suppose that we
make four measurements of the angle $\theta$ and we want to fit to a
parabolic function in time. In the notation of equation~\ref{eqcond},
$s$ would be unity, $t$ the time, and $u$ the time squared, so the
number of unknowns is three ($M=3$). Because there are four independent
measurements ($N=4$) the subscripts run from $i = 0 \rightarrow 3$.
Suppose that the four values of time are 5, 7, 9, 11.
\begin{figure}[h!]
\begin{center}
\leavevmode
%\includegraphics[height=7.5in, width=6.0in]{twolvl3b.ps}
\includegraphics[scale=.55, angle=90]{lsfitfig.ps}
%\includegraphics{twolvla.ps}
\end{center}
\caption{Our numerical example. Stars are the four datapoints; the solid
line is the fit. We perform two fits: one uses the original definition
of time; the other uses $(time-8)$, in effect moving the $y$-axis to the
dashed line. The two fits give the same line but the coefficients and
their errors differ greatly.\label{lsfitfig}}
\end{figure}
First we create the matrix {\bf S} in IDL
\begin{equation}
{\bf S = fltarr(M,N) = fltarr(3,4)}
\end{equation}
\noindent and then we populate it with numbers. In your own work, you
would normally do this by reading a data file and transferring the
numbers to the matrix using IDL commands; to work through this example,
you might manually type them in. After populating the matrix, in direct
correspondence to equation~\ref{satmatrices}a we have $s_i = 1$, $t_i =
time_i$, $u_i = time_i^2$:
\begin{mathletters}
\begin{eqnarray}
{\bf S} = \left[
\begin{array}{ccc}
1 & 5 & 25 \\
1 & 7 & 49 \\
1 & 9 & 81 \\
1 & 11 & 121 \\
\end{array} \; \right]
\end{eqnarray}
\end{mathletters}
\noindent and suppose that four measured values of $\theta$ are
(equation~\ref{satmatrices}c)
\begin{mathletters}
\begin{eqnarray}
{\bf T} = \left[
\begin{array}{c}
142 \\
168 \\
211 \\
251 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent Figure \ref{lsfitfig} shows the datapoints, together with the
fitted curve.
{\bf (2) Now solve it in IDL.} In IDL we calculate the normal
equation matrices (equations~\ref{ssdef}a, ~\ref{idlssdef}a):
\begin{mathletters}
\begin{equation}
{\bf SS} = {\bf transpose(S) \# \# S} \; ,
\end{equation}
\noindent and (equations~\ref{ssdef}b, ~\ref{idlssdef}b)
\begin{equation}
{\bf ST} = {\bf transpose(S) \# \# transpose(T)} \; .
\end{equation}
\end{mathletters}
\noindent In IDL we take the inverse of {\bf SS} (equation~\ref{ssidef}):
\begin{equation}
{\bf SSI} = {\bf invert(SS)} \; .
\end{equation}
The least-square fitted quantities are in the matrix {\bf A}
(equations~\ref{adef}, ~\ref{idladef}), which we obtain in IDL with
\begin{equation}
{\bf A} = {\bf SSI \ \#\# \ ST} \; .
\end{equation}
To get the uncertainties, follow the prescription in \S 4. More
specifically, the IDL statements of equations~\ref{bteqn}, \ref{sigsq},
and \ref{lscoefferror} provide the matrix of squares of the
uncertainties, {\bf sigdcsq}.
For this numerical example, the solution (the array of derived
coefficients) is
\begin{mathletters}
\label{avalues}
\begin{eqnarray}
{\bf A} = \left[
\begin{array}{c}
96.6250 \\
4.49805 \\
0.875122 \\
\end{array} \; \right]
\end{eqnarray}
\noindent and the errors in the derived coefficients [the square root of
the $\sigma^2$'s of the derived coefficients, i.e. $(\sigma_j^2)^{1/2}$
or, in IDL, $sqrt({\bf sigdcsq})$ in equation~\ref{lscoefferror}] are:
\begin{eqnarray}
{\bf \sigma_A} = \left[
\begin{array}{c}
34.0118 \\
8.99995 \\
0.559014 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent These results look {\it horrible}: the uncertainties are large
fractions of the derived coefficients,
The reason: we have specifically chosen an example with terrible
covariance. And the great thing is this can be fixed easily (at least in
this case---not always), without taking more data! Read all about this
in the next section!
\section{THE COVARIANCE MATRIX AND ITS NORMALIZED COUNTERPART}
\label{ncov}
First we provide a general discussion, then we apply it to the
above numerical example.
\subsection{ General discussion}
The uncertainties in the derived coefficients are obtained from
the diagonal elements of ${\bf SSI}$. What do the {\it off-diagonal}
elements represent?
The off-diagonal elements represent the {\it covariance} between
the derived coefficients. Covariance means, specifically, the degree to
which the {\it uncertainty} in {\it one} derived coefficient affects the
uncertainty in {\it another} derived coefficient. Because the covariance
matrix elements relate pairwise the various coefficients, the units of
the matrix elements are all different. This makes it convenient to
reduce all the matrix elements to a standard set of units---namely, no
units at all. So before discussing the covariance matrix {\it per se},
we first discuss its normalized counterpart.
The normalized covariance
matrix\footnote{It is a pleasure to thank Doug Finkbiener for
introducing me to this concept.} {\bf ncov} is derived from ${\bf SSI}$
by dividing each element by the square root of the product of the
corresponding diagonal elements. Let ${\bf ncov}$ be the normalized
covariance matrix; then
\begin{equation}
ncov_{ik} = {SSI_{ik} \over \sqrt {SSI_{ii} \ SSI_{kk}}}
\end{equation}
\noindent Or, in IDL, do the following:
\begin{mathletters} \label{idlcovariance}
\begin{equation}
{\bf dc = SSI[(M+1)*indgen(M)]}
\end{equation}
\begin{equation}
{\bf ncov = SSI/{\it sqrt}(dc \# dc)}
\end{equation}
\end{mathletters}
\noindent In the above, ${\bf dc \# dc}$ is an $M \times M$ matrix
consisting of products of the diagonals of ${\bf SSI}$, so dividing
${\bf SSI}$ by $sqrt({\bf dc \# dc})$ generates the normalized version.
Because {\bf ncov} is a {\it normalized} covariance matrix, you
might think that it's non-normalized parent is {\bf SSI}---and you'd be
{\it almost} right. The true covariance matrix ${\bf C}$ (as defined in
{\it Numerical Recipes} for example) is
\begin{equation}
{\bf C} = \sigma_{data}^2 {\bf SSI}
\end{equation}
\noindent because ${\bf C}$ is defined for $\chi^2$ instead of
$\sigma^2$ (see \S \ref{chisqsection}).
In ${\bf ncov}$, the diagonal elements are all unity and the
off-diagonal elements reflect the interdependence of the derived
coefficients on each other. The off-diagonal elements can range from
$-1 \rightarrow 1$. Each matrix element is the correlation coefficient
between the uncertainty of its two parameters. In particular, suppose
that the data happens to produce a coefficient that differs from its
true value by some positive number. If the normalized matrix element is
negative, then the other coefficient will tend to differ from its true
value by a negative number.
Here's a more detailed discussion of what the covariance means.
Suppose you are least-squares fitting for two derived coefficients
($A_0$ and $A_1$). When you do a least-squares fit to a set of data,
you are fitting one set of data out of a possible infinity of possible
sets that you'd get by repeating the experiment, and your particular set
of data happens to produce specific values of $A_0$ and $A_1$, which
differ from the {\it true} values by amounts $\delta A_0$ and $\delta
A_1$. If their covariance is zero, then in the infinity of data sets
you'd find that $\delta A_0$ is uncorrelated with $\delta A_1$. But if
it is nonzero, these two quantities would be correlated.
A high covariance is bad because the derived variables depend on
each other. As we shall see in the next subsection, it also can greatly
increase the uncertainty in derived coefficients. Often a high
covariance results from a poor choice of functions that you are fitting
or even a bad choice of the zero point of the independent variable---as
in our numerical example (see the next subsection). And, as in that
example, you can often (not always!) eliminate the bad covariance by
reformulating the problem---you don't even need to take more data! The
best reformulation involves using a set of orthonormal functions.
\subsection{ The covariance in our numerical example} \label{covarianceone}
Apply equation \ref{idlcovariance} to obtain the covariance
matrix for our numerical example:
\begin{eqnarray}
{\bf ncov} = \left[
\begin{array}{ccc}
1 & -.989848 & .969717 \\
-.989848 & 1 & -.993808 \\
.969717 & -.993808 & 1 \\
\end{array} \; \right] \; .
\end{eqnarray}
The off-diagonal elements are {\it huge}. This is the reason why
our solution is so bad.
In this seemingly innocuous example we have an excellent case of
a poor choice of zero point for the independent variable (the time).
The reason is clear upon a bit of reflection. We are fitting for $y =
A_0 + A_1 t + A_2 t^2$. The coefficient $A_0$ is the $y$-intercept and
$A_1$ is the slope. Inspection of Figure \ref{lsfitfig} makes it very
clear that an error in the slope has a big effect on the $y$-intercept.
Now we retry the example, making the zero point of the time
equal to the mean of all the times, that is we set $(time_i = time_i -
8)$. We get the same fitted line, but the derived coefficients are
completely different---and amazingly better! We get
\begin{mathletters}
\begin{eqnarray}
{\bf A} = \left[
\begin{array}{c}
188.625 \\
18.500 \\
0.87500 \\
\end{array} \; \right]
\end{eqnarray}
\begin{eqnarray}
{\bf \sigma_A} = \left[
\begin{array}{c}
3.58 \\
1.00 \\
0.559 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
In redefining the origin of the independent variable, we have
made the zero intercept completely independent of the slope: changing
the slope has no affect at all on the intercept. We are taking a partial
step towards fitting coefficients of the orthogonal set of Legendre
polynomials---it's always a good idea to fit orthogonal functions if you
can express the problem in those terms. Our step is {\it partial}: the
second-order coefficient $A_2$ affects $A_0$ because, over the range of
$[(time - 8) = -3 \rightarrow +3]$, the quantity $[A_2 \; \Sigma(time_i
-8)]$ is nonzero and thereby has the same effect as $A_0$.
For further discussion of covariance, see \S \ref {chicoeffs}.
Also, you might also want to try out another example in Taylor's \S 8.5.
\section{CHI-SQUARE FITTING, AND WEIGHTED FITTING} \label{chisqsection}
Chi-square fitting is very much like our least-squares fitting
above, with one additional wrinkle. Chi-square ($\chi^2$) is equal to
the ratio of the variance of the {\it fit} ($\sigma^2$) to the intrinsic
variances of the datapoints ($\sigma_{data}^2$). So it should be
obvious that In Chi-square fitting, you must know the measurement
uncertainties of the individual data points beforehand. If you want to
give the various datapoints different weights, then that is just like
Chi-square fitting except that you can adopt an arbitrary scale factor
for the uncertainties (section~\ref{weightedcase}).
\subsection{The Chi-square when all datapoints have identical dispersions}
\label{sectionchisq}
Suppose, first, that all datapoints have the same intrinsic
measurement uncertainty $\sigma_{data}$.{\footnote{Our discussion follows
{\it Numerical Recipes} \S 14.4.3, {\it Solution by Use of the Normal
Equations}, which we denote by NR.} This is a function {\it only}
of the {\it instrument} and the {\it data-taking procedure}: if you know
the properties of the instrument, and if you understand how it works,
and if there are no systematic errors, then you should be able to
specify $\sigma_{data}$. Specifically, $\sigma_{data}$ is most certainly
{\it not} the same thing as the dispersion of the points from the
least-squares fit.
This case, in which all datapoints have the same intrinsic
measurement uncertainty $\sigma_{data}$, occurs frequently, for example
when the datapoints are taken under identical conditions. Least-squares
assumes a Gaussian distribution of datapoints around the true
value, which is characterized by the dispersion $\sigma_{data}$. In
this case, Chi-square fitting is just like least-squares fitting except
for the following:
{\bf (1)} Divide each equation of condition by $\sigma_{data}$.
You are generating a new matrix $\mathbf {S_X}$, which is identical to
{\bf S} except that each row is divided by $\sigma_{data}$. This new
matrix is the same as NR's {\it design matrix}.
{\bf (2)} Divide each datapoint by $\sigma_{data}$. You are
generating a new matrix $\mathbf {T_X}$, which is identical to {\bf T}
except that each row is divided by $\sigma_{data}$. This new matrix is
the same as NR's vector {\bf b}.
{\bf (3)} In our equation~\ref{sigsq}, obtain the {\it reduced}
observed chi-square $\widehat{\chi_{obs}}^2$ by replacing {\bf T} by
$\mathbf {T_X}$ and {\bf BT} by $\mathbf{BT_X} = \mathbf {S_X \#\#
A_X}$ (cf equation~\ref{bteqn}). That is, in IDL
\begin{equation}
\label{chisq}
\widehat{\chi_{obs}^2} = {\bf total}( ({\bf T_X - BT_X}) \wedge 2 )/(N-M) \; .
\end{equation}
\noindent The reduced chi-square $\widehat{\chi_{obs}}^2$ is equal to the
ordinary chi-square $\chi_{obs}^2$ except that it is divided by the
number of degrees of freedom, ordinarily denoted by $\nu$, which is
equal to $(N-M)$.
Look at the effect of all these three steps somewhat carefully.
In steps {\bf (1)} and {\bf (2)}, you've taken our entire matrix
equation~\ref{equationofcondition} and divided both sides by the same
quantity, $\sigma_{data}$. Obviously, this doesn't change the derived
results for {\bf A}---as it shouldn't. So $\mathbf{A_X} = \mathbf{A}$.
However, in equation~\ref{chisq} the vectors $\mathbf {T_X}$ and
$\mathbf {BT_X}$ are multiplied by the factor $1 \over \sigma_{data}$,
so equation~\ref{chisq} calculates $\widehat{\chi_{obs}^2} =
{\sigma^2 \over \sigma_{data}^2}$.
This is exactly what we {\it should} get! Suppose, for example,
that the least-squares fit model is perfect and the only deviations from
the fitted curve result from measurement error. Then by necessity we
have $\sigma = \sigma_{data}$ and $\widehat{\chi_{obs}^2} = 1$.
In the case $\widehat{\chi_{obs}^2} = 1$ the dispersion of the
observed points $\sigma$ is equal to the intrinsic dispersion of the
datapoints $\sigma_{data}$ and the mathematical model embodied in the
least-squares fit is perfect. That, at least, is the {\it theoretical}
conclusion. In practice, however, you're obtaining such a low, good
value for $\chi^2$ might mean instead that you are using {\it too large}
a value for $\sigma_{data}$: you are ascribing more error to your
datapoints than they really have, perhaps by not putting enough faith in
your instrument. Specifically, if the datapoints really have intrinsic
measurement dispersion $\sigma_{data}$ and you specify that they have $2
\sigma_{data}$ when computing $(\mathbf{S_X},\mathbf{T_X})$, then you'll
find $\widehat{\chi_{obs}^2} = 0.25$.
High values of $\widehat{\chi_{obs}^2}$ indicate that the model
is not perfect and could be improved by the use of a different model,
such as the addition of more parameters---or, alternatively, that you
think your equipment works better than it really does and you are
ascribing {\it less} error to your datapoints than they really have.
We don't discuss the interpretation of $\chi^2$ here, because
that is done very adequately in textbooks on statistics. In particular,
we recommend Taylor's chapter on Chi-square (\S 12), which provides a
nice introduction and discussion.
\subsection{The case in which datapoints have different dispersions,
or different weights: like a weighted average}
\label{weightedcase}
In this case, each row of the equation-of-condition matrix {\bf
S} and each element of the data vector {\bf T} get divided by their
corresponding $\sigma_{data}$. The equation embodied in each row of the
matrix equation~\ref{equationofcondition} remains unchanged, but the
different rows are weighted differently with respect to each other.
Consider two measurements with intrinsic measurement
uncertainties $(\sigma_1, \sigma_2)$; suppose $\sigma_1 < \sigma_2$.
After being divided by their respective $\sigma_{data}$'s, all of the
numbers in row 1 are larger than those in row 2. In all subsequent
matrix operations, these larger numbers contribute more to all of the
matrix-element products and sums. Thus, the measurement with smaller
uncertainty has more influence on the final result, as it should.
Suppose that the above two measurements were taken under
identical conditions except that measurement 1 received more integration
time than measurement 2; we have ${\sigma_1 \over \sigma_2} = \left({t_1
\over t_2}\right)^{-1/2}$, so the rows of $\mathbf {S_X}$ are weighted
as $t^{1/2}$. This means that during the computation of $\mathbf
{SS_X}$, the self-products of row 1 are weighted as $t$ (and each row
multiplies {\it only} itself; see equations~\ref{sseqn} and
\ref{smeqn}). This means that each datapoint is weighted as $t$, which
is exactly what you'd expect! Note that this is also exactly the same
weighting scheme used in a weighted average, in which the weights are
proportional to $\left(1 \over \sigma_{data}\right)^2$. We conclude
that weighting scheme of the three steps in section~\ref{sectionchisq}
agrees with common sense.
Suppose you don't know the intrinsic measurement dispersion
$\sigma_{data}$, but you {\it do} know the {\it relative} dispersion of
the various measurements. For example, this would be the case if the
datapoints were taken under identical conditions except for integration
time; then $\sigma_{data} \propto t^{-1/2}$. In this case, multiply
each row by its weight $w \propto {1 \over \sigma_{data}}$ and proceed
as above.
\subsection{The uncertainties of the derived coefficients }
\label{chicoeffs}
The above two subsections deal with the Chi-square of the fitted
points, and provide information about how well the model fits the data.
It says nothing about the uncertainties in the derived coefficients. To
get this{\footnote{Our discussion follows {\it Numerical Recipes} \S
14.5, the first three italicized subsections, which we denote by NR.},
you re-express equation~\ref{lscoefferror} in terms of the new
parameters, namely
\begin{equation}
\label{lscoefferror1}
{\bf sigdcsq} = \widehat{\chi_{obs}^2} * {\bf SSI_X[(M+1)*indgen(M)]} \; ,
\end{equation}
\noindent where $\widehat{\chi_{obs}^2}$ is from equation~\ref{chisq}.
However, {\it this is most definitely not the whole story} because you
are doing a multiparameter fit, and generally speaking the derived
parameters have nonzero covariance.
Consider the example of the first two coefficients in our above
example, which we discussed a bit above in \S \ref{covarianceone}. In
this example, the fit gives $y = A_0 + A_1 t + A_2 t^2$, where the
numerical values are given in vector form by equation~\ref{avalues}. The
coefficient $A_0$ is the $y$-intercept and $A_1$ is the slope. They have
derived values $A_0 = 96 \pm 34$ and $A_1 = 4 \pm 9$.
Remember what these uncertainties really mean: in an infinity of
similar experiments, you'll obtain an infinity of values of $(A_0,A_1)$
that are normally distributed with dispersions (34,9). Loosely
speaking, this means that $A_0$ lies between $(96-34=62)$ and
$(96+34=130)$ and $A_1$ lies between $-4.5$ and $13.5$. If you are
interested in knowing about $A_0$ {\it without regard to $A_1$}, then
its uncertainty is indeed 96; ditto for $A_1$. In other words,
equations~\ref{lscoefferror} and \ref{lscoefferror1} apply. Thus, if
you don't care about $A_1$, then you can conclude $A_0 = 130$; and if
you don't care about $A_0$, then you can conclude $A_1=13.5$ (both with
$68.3\%$ probability).
However, if you are interested in knowing about {\it both}, you
must include their covariance. In our example, the large negative
covariance follows logically just from looking at a graph: if you fit
some points, all of which lie at positive $t$, then a more negative
derived slope will raise the $y$-intercept.
Specifically, the large negative covariance means that positive
departures of $A_0$ are associated with negative departures of $A_1$.
So even though the {\it individual} values $A_0=130$ and $A_1=13.5$ are
acceptable, you {\it cannot} conclude that the {\it pair} of values
$(A_0,A_1) = (130, 13.5)$ is acceptable, because this pair has large
positive departures of both. In contrast, what {\it is} acceptable here
would be something like $(A_0,A_1) = (130, -4.5)$.
We stress that the acceptable ranges of values depend on what
you are interested in. This is sort of like the observer's influence in
quantum mechanics. If you are interested in $A_1$ alone, then you can
say $A_1 = 4 \pm 9$ and, in making this statement, you have to realize
that, as $A_1$ varies over this range, $A_0$ can vary over (formally, at
least) the range $(\infty \rightarrow -\infty)$: you just don't give a
damn {\it what} happens to $A_0$ because you're not interested. But the
moment you become interested and restrict its possible range, that
influences the possible range for $A_1$, too.
There is no simple relationship between the covariance matrix
elements and the acceptable ranges. For two variables, the best way to
express this matters is to construct the ellipses that define the loci
of constant $\Delta \chi^2$ and present them on a graph as in NR's
Figure 14.5.4. For three variables, these ellipses become ellipsoids;
for four, they become four-dimensional volumes, etc.
To construct these curves/surfaces/volumes/etc., follow the
prescription defined by the six bullets at the end of the section NR's
{\it Probability Distribution of Parameters in the Normal Case}. This
description is a bit curt, but it is accurate. To make sure that you
understand it, take a numerical example and try applying the
prescription to the case of deriving the uncertainties of a single
coefficient (the case $\nu = 1$); that case is the case covered by
equations~\ref{lscoefferror} and \ref{lscoefferror1} above. In this
case, you have the usual normal distribution function for that variable,
in which its formal error (as defined in equations~\ref{lscoefferror}
and \ref{lscoefferror1} above) defines a range within which the derived
value will fall $68.3\%$ of the time in the infinity of
experiments---it's just like any other normal distribution.
\section{NONLINEAR LEAST SQUARES} \label{nonlinearls}
The least squares formulation requires that the data values
depend {\it linearly} on the unknown coefficients. For example, in
equation \ref{one}, the unknown coefficients $A$ and $B$ enter linearly.
Suppose one had a nonlinear dependence, such as for example
\begin{equation} \label{basicone}
\sin(A t_i) + B t_i = \theta_i \; .
\end{equation}
\noindent What do you do here? You linearize the equation, using the
following procedure.
First, assume a trial values for $A$ and $B$; call these $A_0$
and $B_0$. You should pick values that are close to the correct ones. In
our example you wouldn't need to do this for $B$, but it's easier to
treat all coefficients identically. These trial values produce {\it
predicted} values $\theta_{0,i}$:
\begin{equation} \label{trialone}
\sin(A_0 t_i) + B_0 t_i = \theta_{0,i} \; .
\end{equation}
\noindent Subtract equation \ref{trialone} from \ref{basicone}, and
express the differences as derivatives. Letting $\delta A = A - A_0$ and
$\delta B = B - B_0$, this gives
\begin{equation} \label{difference}
\delta A [t_i \cos(A_0 t_i)] + \delta B t_i = \theta_i - \theta_{0,i} \; .
\end{equation}
\noindent Now solve for $\delta A$ and $\Delta B$. These won't be exact
because higher derivatives (including cross derivatives) come into play,
so you need to repeat the process. This is an iterative procedure and
you keep going until the changes become ``small''. The generalization to
an arbitrarily large number of unknown coefficients is obvious.
We now offer some cautionary and practical remarks.
{\bf (1): Multiple minima:} Nonlinear problems often have
multiple minima in $\sigma^2$. A classical case is fitting multiple
Gaussians to a spectral line profile. Gaussians are most definitely not
orthogonal functions and in some cases several solutions may give almost
comparably good values of $\sigma^2$, each one being a local minimum.
For example, for the case of two blended Gaussians, one can often fit
two narrow Gaussians or the combination of a wide and narrow Gaussian,
the two fits giving almost equal $\sigma^2$. The lower of these is the
real minimum but, given the existence of systematic errors and such, not
necessarily the best solution. The best solution is often determined by
physical considerations; in this case, for example, you might have
physical reasons to fit a broad plus narrow Gaussian, so you'd choose
this one even if it's $\sigma^2$ weren't the true minimum.
{\bf (2): The Initial Guess:} When there are multiple minima,
the one to which the solution converges is influenced by your initial
guess. To fully understand the range of possible solutions, you should
try different initial guesses and see what happens. If the solutions
always converge to the same answer, then you can have some confidence
(but not {\it full} confidence) that the solution is unique.
{\bf (3): Iterative stability:} If your initial guess is too far
from the true solution, then the existence of higher derivatives means
that the computed corrections can be too large and drive the iterative
solution into instability. It is often a good idea to multiiply the
derived correction factors ($\delta A$ and $\delta B$ above) by a factor
${\cal F}$ less than unity, for example ${\cal F} = 0.5$ or 0.75. This
increases the number of iterations required for convergence but often
allows convergence instead of producing instability.
{\bf (4): Convergence criteria:} How do you know when the
solution has converged? One way: for each iteration, calculate the
uncertainties in the derived coefficients. If the uncertainty exceeds
the correction, then you are getting close. An alternate way, which I
usually use: if the fractional correction (e.g.~$\delta A \over A_0$)
decreases below some threshold, say $1\%$, you're close (some
parameters, such as angles, need a threshold that is absolute instead of
fractional). At this point, if you are using ${\cal F} \neq 1$, set
${\cal F} = 1$, do a few more iterations, and you're done.
{\bf (5): Numerical derivatives:} Sometimes the equations of
condition are so complicated that taking the derivatives, as in
obtaining equation \ref{difference}, is a huge job and subject to
mistakes. So you can take numerical derivatives instead of analytic ones.
Be careful, though; it's safest to use double precision and think a bit
about numerical accuracy.
{\bf (6): Canned nonlinear least squares:} Packages like IDL
offer canned nonlinear least squares routines. They are designed to work
well for a wide range of different problems. However, often you can do
better by tailoring things (such as the factor ${\cal F}$ and
convergence criteria above) for the specific problem at hand. A good
example is Gaussian fitting: IDL's fitting program doesn't converge for
some input data, while for many of these cases the program that I wrote
myself works fine.
{\bf (7): Be careful and LOOK at the solution before accepting
it!} These nonlinear problems can produce surprising results, sometimes
completely meaningless results. Don't rely on them to be automatic or
foolproof!
\section{BRUTE FORCE LEAST SQUARES AND THE CURVATURE MATRIX}
\subsection{Parameter Uncertainties in Brute Force Least Squares for a
Single Variable}
There are times when ``brute'' least squares is appropriate. For
example, if you have a nonlinear problem in which taking derivatives is
complicated, and if the number of unknown coefficients is small, then it
might be easier to search through the coefficient parameter space,
calculate the $\sigma^2$ for each combination of parameters, and find
the minimum. That procedure certainly provides the least squares
solution. It is particularly easy if there is only one coefficient. Let
us consider this case.
How do you determine the uncertainties in the derived
coefficient with this brute force technique? Consider the case of a
simple average (which is the least squares solution for a constant).
Suppose the least-squares value for the unknown coefficient (i.e., the
average) is $x_0$ and that its uncertainty is $\Delta x_0$. Now consider
offsets from $x_0$, namely $x_0 + \delta x$. It is easy to show that
\begin{equation}
\sigma_{\delta x}^2 = \sigma_{x_0}^2 + \delta x^2
\end{equation}
\noindent This is a parabola. Intuitively, one expects that the
``narrower'' the parabola, i.e. the smaller its curvature at $x_0$, the
smaller $\Delta x_0$. The curvature $\cal C$ is just half
the second derivative of this parabola evaluated at $x_0$, i.e.
\begin{equation}
{\cal C} = {1 \over 2} {d^2\sigma_{\delta x}^2 \over d \delta x^2}
\end{equation}
In this particular case, ${\cal C}=1$ and {\it the curvature is
independent of the values of the datapoints!} This independence of the
data {\it values} is {\it generally the case}. Generally, $\cal C$
depends on the {\it locations} of the datapoints (the ensemble of $t_i$
in equation \ref{one}) but not on the {\it measured values} (the
ensemble of $t_i$ in equation \ref{one}); you can see this because the
curvature is proportional to ${\bf SS}$, whose elements depend only on
the locations.
But it is obvious that the data values affect $\Delta x_0$. In
fact, from our experience with least squares, we expect $\Delta x_0
\propto {\sigma_{x_0} \over \sqrt N}$, where $N$ is the number of
datapoints. So we expect something like the following, which is, in
fact, the correct expression:
\begin{equation} \label{onecoeff}
\Delta x_0^2 = {\sigma_{x_0}^2 \over N {\cal C}}
\end{equation}
\subsection{The Curvature Matrix}
One can show that the diagonal elements of our matrix ${\bf SS}$
are the curvatures, i.e. ${\cal C}_n = {{\bf SS}_{nn} \over N}$. This
makes us think that we might denote ${\bf SS}$ by the moniker {\it
curvature matrix}. This is close to the usual definition, for example in
{\it Numerical Recipes}, where their curvature matrix is
$\mathbf{\alpha} = { {\bf SS} \over \sigma_{data}^2} $. This makes
${\cal C}_n = {\sigma_{data}^2 \mathbf{\alpha}_{nn} \over N}$.
If there were only a single variable, or if ${\bf SS}$ were
diagonal, then the diagonal elements of its inverse are just the
reciprocals. In other words, for this special case, ${\bf SSI_{nn} =
SS_{nn}^{-1}}$ and ${\cal C}_{n} = {1 \over N \ SSI_{nn}}$; also,
equation \ref{onecoeff} holds. More generally, {\it for the particular
case} in which ${\bf SS}$ is diagonal, then the undertainty in derived
coefficient $n$ is
\begin{equation} \label{onecoeffgen}
\sigma_n^2 = {\sigma^2 \over N {\cal C}_n}
\end{equation}
However, this is {\it not} true for more than one coefficient
and a nondiagonal covariance matrix. In that case, I {\it think} it's
true that ${\bf SSI_{nn} > SS_{nn}^{-1}}$; this means that ${\cal C}_{n}
> {1 \over N \ SSI_{nn}}$. Thus, equation \ref{onecoeffgen} {\it
under}estimates the error so that
\begin{equation} \label{onecoeffgenp}
\sigma_n^2 > {\sigma^2 \over N {\cal C}_n} \ .
\end{equation}
\noindent Fundamentally this is because the off-diagonal terms add
uncertainty to the derived coefficient because of the correlations among
the coefficient uncertanties.
To derive the coefficient error in brute force least squares,
you can use the curvature and equation \ref{onecoeff} if you have only a
single variable. Otherwise, you need to evaluate the covariance matrix;
if it is diagonal, you could use equation \ref{onecoeffgenp}; but having
already evaluated the covariance matrix, you might as well do the
problem using the usual least squares technique.
\section{YOU WANT MORE? YOU GOT IT!}
\subsection{ The true expert would\dots}
In our numerical example, a true expert (there aren't many of
these, even among professional astronomers) would often do something
more involved, namely to fit an orthogonal set of polynomials (i.e., in
this case the first three terms of Legendre polynomials) instead of just
a cubic. Because this set is orthogonal, the off-diagonal elements of
${\bf ncov}$ would automatically equal zero.
\subsection{ Comparison with Numerical Recipes}
There are two ``bibles'' of least-square fitting. Most
important is Press {\it et al.}'s {\it Numerical Recipes}, which
provides a wide-ranging treatment of computational techniques. Our
treatment follows theirs quite closely (with one important
exception---see next paragraph), but I wrote this before knowing of
their treatment so didn't use their notation.
The exception: their approach is based on $\chi^2$, while ours
is based on $\sigma^2$. Their approach is more flexible because
$\sigma_{data}$ can vary from one measurement to another; ours
implicitly assumes $\sigma_{data}$ is the same for all measurements.
They describe our approach in the discussion of equation (14.1.6) and
warn that it is ``dangerous'' because you aren't comparing the residuals
to the intrinsic inaccuracies of the data. In astronomy, though, more
often than not you don't have an independent assessment of
$\sigma_{data}$ and our approach is the only option. In any case, our
matrices correspond to theirs pairwise as follows (left of the double
arrow is ours, to the right is theirs):
{\begin{mathletters}
\begin {equation}
{\bf S} \longleftrightarrow \sigma_{data} {\bf A}
\end{equation}
\begin{equation}
{\bf T} \longleftrightarrow \sigma_{data} {\bf b}
\end{equation}
\begin{equation}
{\bf S^T \cdot S} = {\bf SS} \longleftrightarrow \sigma_{data}^2 {\bf A^T \cdot A}
= \sigma_{data}^2 [\alpha] = \sigma_{data}^2 \mathbf {\alpha}
\end{equation}
\begin{equation}
{\bf SS}^{-1} = {\bf SSI} \longleftrightarrow {1 \over \sigma_{data}^2}
\mathbf {\alpha^{-1}} = {1 \over \sigma_{data}^2} [C] =
{1 \over \sigma_{data}^2} {\bf C}
\end{equation}
\end{mathletters}
\subsection{Chauvenet}
An ancient (1863!---and in this case, age is beauty!) treatise
by William Chauvenet, the Appendix of {\it A Manual of Spherical and
Practical Astronomy}, treats complicated cases such as least-square
fitting in the presence of known constraints (e.g. if three angles of a
triangle are measured, their sum is constrained to be $180^\circ$). But
it doesn't use matrices.
\section{ADDENDUM: FITTING A LINE WHEN BOTH X AND Y HAVE
UNCERTAINTIES}
We've mentioned that one of the essential assumptions of least
squares is that the independent variables are known with high precision
and the errors occur only in the measured data. Suppose you're fitting
two variables, $t$ and $\theta$, as in equation~\ref{one}. This
essential assumption means that $t$ is known with high precision and all
the uncertainty is in $\theta$, and you are minimizing the squares of
the residuals in the $\theta$-direction only.
If {\it both} variables have uncertainties, then you have to be
careful because the essential assumption is violated. Before doing
anything, read (lightly) Isobe, Feigelson, Akritas, and Babu (1990, ApJ
364, 104). Then adopt one of the two following procedures:
{\bf (1)} You may be able to reformulate the problem by defining
two new variables, one of which has much less fractional uncertainty
than the other. The most straightforward reformulation is to define
$x=(\theta + t)$ and $y=(\theta - t)$. The sum, $x$, has smaller
fractional uncertainty than $y$ so its dispersion {\it might} not matter
as much. In such a redefinition, you should include a scaling factor on
one of the original variables to make the two original dispersions about
equal.
Before proceeding with this option, re-read Isobe et al!
{\bf (2)} More generally, use the prescription in Isobe et al.
Lucky for me, I haven't had to delve into this method---but their
formulation looks good!
\end{document}