\documentclass[psfig,preprint]{aastex}
\begin{document}
\title{LEAST-SQUARES AND CHI-SQUARE FITTING FOR AY250 \\
\today}
\author{Carl Heiles}
In our never-ending attempt to make your life easier, we present
you with this highly instructive, time-saving, and labor-saving
informative document! Here we give heuristic derivations, discussions,
examples, and the prescription for doing least-squares the easy way
using matrix techniques generally, and specifically in IDL. This
prescription is given as an example in \S\ref{numexample}, and the {\it
power-user} can skip the details and go directly there.
This document is an update, correction, clarification, and
elaboration of a similar document made for the undergraduate lab class.
Here we update the notation to conform with {\it Numerical Recipes}
(NM), and occasionally refer to the books Taylor (1997; T97), Bevington
and Robinson (1992; BR), and Press et al (19**; Numerical Recipes, NM).
We begin with least squares in the classic sense, meaning we minimize
the sum of squares instead of minimizing $\chi^2$.
In astronomy, more often than not you don't have an independent
assessment of the intrinsic uncertainty in the data, which means you
cannot evaluate $\chi^2$, and the least squares approach is the only
option. However, often in astronomy you do want to weight observations
differently, e.g.\ because of integration time, and this requires an
approach similar to the $\chi^2$ one. In later sections we generalize to
the $\chi^2$ and this other weighted-observations case.
\tableofcontents
\section{LEAST-SQUARE FITTING FOR TWO PARAMETERS, AS WITH A STRAIGHT
LINE.} \label{sectionone}
\subsection{The closed-form expressions for a straight-line fit}
First consider the least squares fit to a straight line. Let
$\theta_m$ be the $m^{th}$ measurement of the observed quantity (in this
example, $\theta_m$ is zenith distance); let $t_m$ be the time of the
$m^{th}$ measurement; and let $M$ be the total number of observations,
i.e.\ $m$ runs from 0 to $M-1$. Remember that in the least-squares
technique, quantities such as $t_m$ are regarded as known with high
accuracy, while the quantity $\theta_m$ has uncertainties in its
measurement.
We expect the zenith distance $\theta_m$ to change linearly with
time as follows:
\begin{equation} \label{one}
A + B t_m = \theta_m \; .
\end{equation}
\noindent Given this, one performs the maximum likelihood estimate assuming
Gaussian statistics. When all measurements have the same intrinsic
uncertainty, this is the same as looking for the solution that minimizes
the sum of the squares of the residuals (which we will define later).
This leads to the pair of equations (Taylor 8.8, 8.9), called the {\it
normal equations}
\begin{mathletters} \label{normalone}
\begin{equation}
A M + B \ \sum t_m = \sum \theta_m
\end{equation}
\begin{equation}
A \ \sum t_m + B \ \sum t_m^2 = \sum t_m \theta_m
\end{equation}
\end{mathletters}
\noindent Two equations and two unknowns---easy to solve! The
closed-form equations for $(A,B)$ are Taylor's equations 8.10 to 8.12.
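If you want to check this algebra numerically, here is a minimal sketch in Python/NumPy (the function name and test data are our own illustration; the worked prescriptions later in this document use IDL):

```python
import numpy as np

def straight_line_fit(t, theta):
    """Solve the pair of normal equations for the straight-line fit:
       A*M + B*sum(t)   = sum(theta)
       A*sum(t) + B*sum(t^2) = sum(t*theta)."""
    M = len(t)
    lhs = np.array([[M, t.sum()],
                    [t.sum(), (t**2).sum()]])
    rhs = np.array([theta.sum(), (t * theta).sum()])
    A, B = np.linalg.solve(lhs, rhs)
    return A, B

# Noiseless data on the line theta = 3 + 2t: the fit must recover A=3, B=2.
t = np.array([0., 1., 2., 3., 4.])
theta = 3.0 + 2.0 * t
A, B = straight_line_fit(t, theta)
```

With noiseless data the normal equations must return the generating coefficients exactly, which makes this a handy sanity check on any implementation.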
\subsection{Better is the following generalized notation.}
We want a way to generalize this approach to include any
functional dependence on $t$ and even other variables, and to have an
arbitrarily large number of unknown coefficients instead of just the two
$(A,B)$. This is very easy using matrix math. We will ease into this
matrix technique gently, by first carrying through an intermediate
stage of notation.
First generalize the straight-line fit slightly by having two
functional dependences instead of one. We have something other than the
time $t_m$; call it $s_m$. For example, we could have $s_m = \cos (t_m)$
or $s_m = t_m^2$; or we could have $s_m = x_m$, where $x_m$ is the
position from which the observation was taken. To correspond to equation
\ref{one}, $s_m = 1$. Then we rewrite equation \ref{one} to include this
extra dependence
\begin{equation} \label{two}
A s_m + B t_m = \theta_m \; .
\end{equation}
\noindent There are still only two unknown parameters, so this is an
almost trivial generalization; later we'll generalize to more
parameters.
We have $M$ equations like equation \ref{two}, one for each
measurement. They are known as the {\it equations of condition} because
they are the equations that specify the theoretical model to which we
are fitting the data. There are $M$ equations of condition and only two
unknowns ($A$ and $B$). This is too many equations! We have to end up
with a system in which the number of equations is equal to the number of
unknowns.
To accomplish this, from equation \ref{two} we form the {\it
normal equations}. The number of normal equations is equal to the
number of unknowns, so in this case we will have two. We could carry
through the same ML derivation to derive equations equivalent to
equation \ref{normalone}; the result is
\begin{mathletters} \label{ones}
\begin{equation}
A\ \sum s_m^2 + B \ \sum s_m t_m = \sum s_m\theta_m
\end{equation}
\begin{equation}
A \ \sum s_m t_m + B \ \sum t_m^2 = \sum t_m \theta_m \ .
\end{equation}
\end{mathletters}
\noindent We can rewrite these equations using the notation $[st] =
\sum s_m t_m$, etc.:
\begin{mathletters} \label{normaltwo}
\begin{equation}
A [ s^2 ] + B [ s t ] = [ s \theta ]
\end{equation}
\begin{equation}
A [ s t ] + B [ t^2 ] = [ t \theta ] \ .
\end{equation}
\end{mathletters}
\noindent This is, of course, precisely analogous to equation
\ref{normalone}. And now it's clear how to generalize to more
parameters!
\section{LEAST-SQUARE FITTING FOR MANY PARAMETERS, AS WITH A CUBIC}
\label{sectiontwo}
With this notation it's easy to generalize to more ($N$)
unknowns: the method is obvious because in each equation of condition
(like equation \ref{two}) we simply add equivalent additional terms such
as $C u_m$, $D v_m$, etc; and in the normal equations (equation
\ref{normaltwo}) we have more products and also more normal equations.
Let's take an example with four unknowns ($N=4$), which we will
denote by $A, B, C, D$; this would be like fitting a cubic. With $N=4$
we need more data points than unknowns, i.e.\ at least five ($M \geq 5$),
so there must be at least five equations of condition. The
generalization of equation \ref{two} is the $M$ equations
\begin{equation} \label{eqcond}
A s_m + B t_m + C u_m + D v_m = \theta_m \; ,
\end{equation}
\noindent with $m = 0 \rightarrow (M-1)$. Again, the least squares
fitting process assumes that the $s_m, t_m, u_m, v_m$ are known with
zero uncertainty; all of the uncertainties are in the measurements of
$\theta_m$. We then form the four normal equations; the generalization
of equation \ref{normaltwo} written in matrix notation is:
\begin{eqnarray} \label{smeqn}
\left[
\begin{array}{cccc}
{[ ss ]} & {[ st ]} & {[ su ]} & {[ sv ]} \\
{[ ts ]} & {[ tt ]} & {[ tu ]} & {[ tv ]} \\
{[ us ]} & {[ ut ]} & {[ uu ]} & {[ uv ]} \\
{[ vs ]} & {[ vt ]} & {[ vu ]} & {[ vv ]} \\
\end{array}
\; \right]
\left[
\begin{array}{c}
A \\
B \\
C \\
D \\
\end{array}
\; \right]
\; =
\left[
\begin{array}{c}
{[ s \theta ]} \\
{[ t \theta ]} \\
{[ u \theta ]} \\
{[ v \theta ]} \\
\end{array}
\; \right]
\end{eqnarray}
\noindent The $N \times N$ matrix on the left is symmetric. With $N$
equations and $N$ unknowns, you can actually {\it solve} for $A, B, C,
D$!
\section{FAR, FAR BEST AND EASIEST: MATRIX ALGEBRA.} \label{matrixmethod}
The above equations are terribly cumbersome to solve in a
computer code because they require lots of loops. However, it becomes
trivial if we use matrices. Here we designate a {\bf matrix} by {\bf
boldface} type.
We illustrate the matrix method by carrying through the above
$N=4$ example, and we assume that there are 5 independent measurements
($M=5$). We first define the matrices
\begin{mathletters} \label{matrixdefinition}
\begin{eqnarray}
{\bf X_\sigma} = \left[
\begin{array}{cccc}
{s_0} & {t_0} & {u_0} & {v_0} \\
{s_1} & {t_1} & {u_1} & {v_1} \\
{s_2} & {t_2} & {u_2} & {v_2} \\
{s_3} & {t_3} & {u_3} & {v_3} \\
{s_4} & {t_4} & {u_4} & {v_4} \\
\end{array}
\; \right]
\end{eqnarray}
%\end{mathletters}
%\begin{mathletters}
\begin{eqnarray}
{\bf a_\sigma} = \left[
\begin{array}{c}
A \\
B \\
C \\
D \\
\end{array} \; \right]
\end{eqnarray}
%\end{mathletters}
%\begin{mathletters}
\begin{eqnarray}
{\bf Y_\sigma} = \left[
\begin{array}{c}
{\theta_0} \\
{\theta_1} \\
{\theta_2} \\
{\theta_3} \\
{\theta_4} \\
\end{array}
\; \right]
\end{eqnarray}
\end{mathletters}
\noindent so, in matrix form, the equations of condition (equation
\ref{eqcond}) reduce to the single matrix equation
\begin{equation}
\label{equationofcondition}
{\bf X_\sigma \cdot a} = {\bf Y_\sigma} \; .
\end{equation}
\noindent The notation for these equations corresponds to NM's. We write
them with subscripts $\sigma$ to emphasize that they are calculated
without dividing by $\sigma_{data}$; we generalize to their case below.
Our matrix $\mathbf{X_\sigma}$ corresponds to NM's ``design
matrix'' $\bf X$ of Figure 15.4.1, except that our elements are not
divided by $\sigma_{data,m}$, and the matrix equation of condition
(equation \ref{equationofcondition}) is identical to the expression
inside the square brackets of NM's equation 15.4.6. The differences
arise because we have omitted the factors involving $\sigma_{data,m}$,
the intrinsic measurement uncertainties; we include them below in \S
\ref{chisqsection}.
Again, there are more equations than unknowns so we can't solve
this matrix equation directly. So next we form the normal equations
from these matrices. In matrix form, the normal equations (equation
\ref{smeqn}) reduce to the single equation
\begin{equation} \label{matrixnormal}
\mathbf{[\alpha_\sigma] \cdot a} = \mathbf{[\beta_\sigma]} \; ,
\end{equation}
\noindent (NM equation 15.4.10), where
\begin{mathletters}
\label{ssdef}
\begin{equation}
[\alpha_\sigma] = {\bf X_\sigma^T \cdot X_\sigma}
\end{equation}
\begin{equation}
[\beta_\sigma] = {\bf X_\sigma^T \cdot Y_\sigma}
\end{equation}
\end{mathletters}
\noindent The matrix $[\alpha_\sigma]$ is known as the {\it curvature matrix}
because each element is half the second derivative (the curvature) of the
sum of squared residuals with respect to the corresponding pair of
coefficients, as we discuss below in
\S \ref{chicoeffsnoco}.
The number of equations is equal to the number of unknowns, so
the solution of the matrix equation is easy---just rewrite it by
multiplying both sides by the inverse of $[\alpha_\sigma]$ (that is, by
$[\alpha_\sigma]^{-1})$, which gives
\begin{equation}
\label{adef}
{\bf a} = [\alpha_\sigma]^{-1} \mathbf{ \cdot [\beta_\sigma]} \; .
\end{equation}
\noindent All of these matrix operations are trivially easy in IDL (\S
\ref{numexample}).
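The matrix chain of equations \ref{ssdef} and \ref{adef} translates line-for-line into any matrix language. Here is a sketch in Python/NumPy (variable names and test data are our own illustration); we use \verb|np.linalg.solve| rather than forming the explicit inverse, which is numerically safer but otherwise equivalent to equation \ref{adef}:

```python
import numpy as np

# Equations of condition: each row of X holds (s_m, t_m, u_m, v_m); M=5, N=4.
tm = np.array([1., 2., 3., 4., 5.])
X = np.column_stack([np.ones(5), tm, tm**2, tm**3])  # s=1, t, u=t^2, v=t^3 (a cubic)
true_a = np.array([2., -1., 0.5, 0.25])
Y = X @ true_a                                       # noiseless theta_m

alpha = X.T @ X        # [alpha_sigma] = X^T . X  (normal-equation matrix)
beta = X.T @ Y         # [beta_sigma]  = X^T . Y
a = np.linalg.solve(alpha, beta)   # a = [alpha_sigma]^-1 . [beta_sigma]
```

Because the input data are noiseless, the solution recovers the generating coefficients exactly.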
\section{UNCERTAINTIES IN THE DERIVED COEFFICIENTS.} \label{sigmas}
How about the uncertainties in the derived quantities contained
in the matrix ${\bf a}$?
The first thing to do is derive the sample variance $s^2$ (the
square of the standard deviation, or mean error, or dispersion, etc) of the
individual data points using the generalization of the usual definition
for a straight average of $x$, $s^2 = \sum_0^{M-1} (x_m - \overline{
x_m})^2/(M-1)$. The generalization is, simply, to replace the $M-1$ in
the denominator by $\nu = M-N$. In the straight-average case, $N=1$, so
this fits. Here $\nu$ is known as the number of {\it degrees of freedom}
and $N$, the number of unknown coefficients, is known as the number of
{\it constraints}. So we have
\begin{equation} \label{samplevarianceone}
s^2 = {1 \over M - N} \sum_{m=0}^{M-1} (\theta_m -
\overline{\theta_m})^2 \; ,
\end{equation}
\noindent where $\overline{\theta_m}$ are the values for $\theta_m$ {\it
predicted by the derived quantities ${\bf a}$}. Note the difference:
$\theta_m$ are the {\it observed} values, while $\overline{\theta_m}$ are the
values {\it predicted by the least squares fit}. The predicted values
are those that are computed from the derived coefficients $A, B, C$\dots
The $M$ quantities
\begin{equation}
\delta \theta_m = \theta_m - \overline{\theta_m}
\end{equation}
\noindent are called the {\it residuals} or {\it deviations} from the
fit.
It's easy to calculate the $\overline{\theta_m}$ with matrices.
First define the matrix ${\bf \overline{ Y_\sigma}}$ that contains these
values:
\begin{eqnarray}
{\bf \overline{ Y_\sigma}} = \left[
\begin{array}{c}
\overline{\theta_0} \\
\overline{\theta_1} \\
\overline{\theta_2} \\
\overline{\theta_3} \\
\overline{\theta_4} \\
\end{array}
\; \right]
\end{eqnarray}
\noindent Calculating ${\bf \overline{ Y_\sigma}}$ is simple:
\begin{equation} \label{ybardefinition}
{\bf \overline{ Y_\sigma}} = {\bf X_\sigma \cdot a} \; .
\end{equation}
\noindent Note that ${\bf X_\sigma}$ is already defined (equation
\ref{matrixdefinition}) and ${\bf a}$ was solved for in equation
\ref{adef}. It's convenient to define the residual matrix
\begin{equation}
{\bf \delta Y_\sigma } = {\bf Y_\sigma - \overline{ Y_\sigma}}
\end{equation}
\noindent so we can write
\begin{equation} \label{samplevariancetwo}
s^2 = {1 \over M - N} {\bf \delta Y_\sigma^T \cdot \delta Y_\sigma} \; .
\end{equation}
This is the sample variance of the datapoints, not the variances
in the derived coefficients. We can obtain these as before, by
generalizing the results from the two-parameter case like the
straight-line fit discussed in \S \ref{sectionone}. We won't go through
the derivation here; you can find it in Taylor \S 8.4 and equation 8.16,
8.17. The result is
\begin{equation} \label{coeffvarianceone}
{\bf s_a}^2 = s^2 diag \{[\alpha_\sigma]^{-1} \}
\end{equation}
\noindent Or, to put it simply in words: to get the variance of
coefficient $n$ in the matrix ${\bf a}$, multiply $s^2$ by the $n^{th}$
diagonal element of $[\alpha_\sigma]^{-1}$. You can also write this
equation as
\begin{equation} \label{coeffvariancetwo}
{\bf s_a}^2 = {{\bf \delta Y_\sigma^T \cdot \delta Y_\sigma} \over M-N} \;
diag \{ [\alpha_\sigma]^{-1} \}
\end{equation}
\noindent Although the above equations are correct, there is more to the
story because of the covariances, which are the off-diagonal elements of
$[\alpha_\sigma]^{-1}$. We return to this topic in \S \ref{chisqsection}.
\section{A NUMERICAL EXAMPLE AND ITS SOLUTION IN IDL.} \label{numexample}
If the following sounds like Greek to you, take a look at \S
\ref{matrixmethod} and \ref{sigmas}.
\subsection{Generation and solution of the numerical example}
{\bf (1) First, generate the numerical example.} Suppose that we
make four measurements of the angle $\theta$ and we want to fit to a
parabolic function in time $t$. In the notation of equation~\ref{eqcond},
$s$ would be unity, $t$ the time, and $u$ the time squared, so the
number of unknowns is three ($N=3$). Because there are four independent
measurements ($M=4$) the subscripts run from $m = 0 \rightarrow 3$.
Suppose that the four values of time are 5, 7, 9, 11.
\begin{figure}[h!]
\begin{center}
\leavevmode
%\includegraphics[height=7.5in, width=6.0in]{twolvl3b.ps}
\includegraphics[scale=.55, angle=90]{lsfitfig.ps}
%\includegraphics{twolvla.ps}
\end{center}
\caption{Our numerical example. Stars are the four datapoints; the solid
line is the fit. We perform two fits: one uses the original definition
of time; the other uses $(time-8)$, in effect moving the $y$-axis to the
dashed line. The two fits give the same line but the coefficients and
their errors differ greatly.\label{lsfitfig}}
\end{figure}
First we create the matrix {\bf X} in IDL
\begin{equation}
{\bf X = fltarr(N,M) = fltarr(3,4)}
\end{equation}
\noindent and then we populate it with numbers. In your own work, you
would normally do this by reading a data file and transferring the
numbers to the matrix using IDL commands; to work through this example,
you might manually type them in. After populating the matrix, in direct
correspondence with equation \ref{matrixdefinition}a we have $s_m = 1$,
$t_m = time_m$, $u_m = time_m^2$:
\begin{mathletters}
\begin{eqnarray}
{\bf X} = \left[
\begin{array}{ccc}
1 & 5 & 25 \\
1 & 7 & 49 \\
1 & 9 & 81 \\
1 & 11 & 121 \\
\end{array} \; \right]
\end{eqnarray}
\end{mathletters}
\noindent Suppose that the four measured values of $\theta$ are
(equation~\ref{matrixdefinition}c)
\begin{mathletters}
\begin{eqnarray}
{\bf Y} = \left[
\begin{array}{c}
142 \\
168 \\
211 \\
251 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent One word of caution here: in IDL, to get these into a column
matrix, which is how we've treated $\bf Y$ above, you have to define
${\bf Y}$ as a two-dimensional array because the second dimension
represents the column. When working in IDL it's more convenient to
define a row vector, which has only one dimension; in IDL you do this
by defining ${\bf Y} = [142,168,211,251]$; you can make it into the
necessary column vector by taking its transpose, i.e.\ ${\bf Y =
transpose(Y)}$. Alternatively, you can define the column vector directly
as ${\bf Y} = [[142],[168],[211],[251]]$.
\noindent Figure \ref{lsfitfig} shows the datapoints, together with the
fitted curve.
{\bf (2) Now solve it in IDL.} In IDL we calculate the normal
equation matrices and denote the $[\alpha]$ in equation \ref{ssdef}a by
${\bf XX}$:
\begin{mathletters}
\begin{equation}
{\bf XX} = {\bf transpose(X) \# \# X} \; ,
\end{equation}
\noindent and we denote the $[\beta]$ in equation \ref{ssdef}b by {\bf
XY}:
\begin{equation}
{\bf XY} = {\bf transpose(X) \# \# Y} \; .
\end{equation}
\end{mathletters}
\noindent In IDL we take the inverse of $[\alpha]$ (same as $\bf XX$) by
\begin{equation}
{\bf XXI} = {\bf invert(XX)} \; .
\end{equation}
The least-square fitted quantities are in the matrix {\bf a}
(equation \ref{adef}), which we obtain in IDL with
\begin{equation}
{\bf a} = {\bf XXI \ \#\# \ XY} \; .
\end{equation}
In IDL we denote the matrix of predicted values $\overline{
\theta_m}$ by ${\bf BARY}$, which is
\begin{equation}
\label{bteqn}
{\bf BARY} = {\bf X \ \#\# \ a} \; .
\end{equation}
\noindent We can also define the residuals in $\bf Y$ as
\begin{equation}
{\bf DELY} = {\bf Y - BARY} \; .
\end{equation}
\noindent In IDL we denote $s^2$ in equations \ref{samplevarianceone}
and \ref{samplevariancetwo} by $s\_sq$ and write
\begin{equation}
\label{sigsq}
s\_sq = {\bf transpose(DELY) \#\# DELY}/(M-N) \; .
\end{equation}
\noindent It is {\it always} a good idea to plot all three quantities
(the measured values {\bf Y}, the fitted values {\bf BARY}, and the
residuals ${\bf DELY}$) to make sure your fit looks reasonable and to
check for bad datapoints.
To get the error in the derived coefficients we need a way to
select the diagonal elements of a matrix. Obviously, any $N \times N$
matrix has $N$ diagonal elements; a convenient way to get them is
\begin{equation}
diag \ elements \ of \ {\bf XXI}= {\bf XXI[(N+1) * indgen(N)]}
\end{equation}
\noindent In IDL, we define the variances of
the $N$ derived coefficients by {\bf vardc} (think of ``{\bf var}iances of
{\bf d}erived {\bf c}oefficients''). You can get this as in equation
\ref{coeffvarianceone} from
\begin{equation} \label{lscoefferrorone}
{\bf vardc} = s\_sq * {\bf XXI[(N+1)*indgen(N)]} \; ,
\end{equation}
\noindent or, as in equation \ref{coeffvariancetwo}, from
\begin{equation} \label{lscoefferrortwo}
{\bf vardc} = {\bf ( transpose(DELY) \#\# DELY)[0] \ * \
XXI[(N+1)*indgen(N)]}/(M-N) \; .
\end{equation}
\noindent Here, note the general IDL feature: if you enclose any array
expression in parenthesis, you can treat the result as an array and
subscript it as you would a single array.
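If you don't have IDL handy, the whole prescription of this section runs just as well in Python/NumPy; the sketch below is our translation of the IDL steps above, with variable names chosen to mirror the IDL ones:

```python
import numpy as np

t = np.array([5., 7., 9., 11.])
Y = np.array([142., 168., 211., 251.])
M, N = 4, 3

X = np.column_stack([np.ones(M), t, t**2])  # equations of condition: s=1, t, u=t^2
XX = X.T @ X                                # the normal-equation matrix [alpha]
XXI = np.linalg.inv(XX)
a = XXI @ (X.T @ Y)                         # the derived coefficients
BARY = X @ a                                # predicted values
DELY = Y - BARY                             # residuals
s_sq = DELY @ DELY / (M - N)                # sample variance of the datapoints
vardc = s_sq * np.diag(XXI)                 # variances of the derived coefficients
```

This reproduces ${\bf a} \approx (96.625, \ 4.5, \ 0.875)$ and coefficient errors $\approx (34.0, \ 9.0, \ 0.559)$, matching the values quoted in the next subsection; the tiny differences there are single-precision rounding.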
\subsection {Discussion of the numerical example}
For this numerical example, the solution (the array of derived
coefficients) is
\begin{mathletters}
\label{avalues}
\begin{eqnarray}
{\bf a} = \left[
\begin{array}{c}
96.6250 \\
4.49805 \\
0.875122 \\
\end{array} \; \right]
\end{eqnarray}
\noindent and the errors in the derived coefficients [the square root of
the $\sigma^2$'s of the derived coefficients, i.e. $[\sigma_m^2]^{1/2}$
or, in IDL, $sqrt({\bf vardc})$ in equations \ref{lscoefferrorone} and
\ref{lscoefferrortwo}] are:
\begin{eqnarray}
{\bf \sigma_A} = \left[
\begin{array}{c}
34.0118 \\
8.99995 \\
0.559014 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent These results look {\it horrible}: the uncertainties are large
fractions of the derived coefficients.
The reason: we have specifically chosen an example with terrible
covariance. And the great thing is that this can be fixed easily (at
least in this case---certainly not always), without taking more data!
\section{THE COVARIANCE MATRIX AND ITS NORMALIZED COUNTERPART}
\label{ncov}
First we provide a general discussion, then we apply it to the
above numerical example.
\subsection{ General discussion}
The variances in the derived coefficients are obtained from the
diagonal elements of ${\bf XXI}$. The off-diagonal elements represent
the {\it covariances} between the derived coefficients. Covariance means,
specifically, the degree to which the {\it uncertainty} in {\it one}
derived coefficient affects the uncertainty in {\it another} derived
coefficient.
Because the covariance matrix elements relate pairwise the
various coefficients, the units of the matrix elements are all
different. This makes it convenient to reduce all the matrix elements to
a standard set of units---namely, no units at all. So before discussing
the covariance matrix {\it per se}, we first discuss its normalized
counterpart.
The normalized covariance
matrix\footnote{It is a pleasure to thank Doug Finkbiener for
introducing me to this concept.} ${\bf ncov}$ is derived from ${\bf XXI}$
by dividing each element by the square root of the product of the
corresponding diagonal elements:
\begin{equation}
ncov_{ik} = {XXI_{ik} \over \sqrt {XXI_{ii} \ XXI_{kk}}}
\end{equation}
\noindent Or, in IDL, do the following:
\begin{mathletters} \label{idlcovariance}
\begin{equation}
{\bf dc = XXI[(N+1)*indgen(N)]}
\end{equation}
\begin{equation}
{\bf ncov = XXI/{\it sqrt}(dc \# dc)}
\end{equation}
\end{mathletters}
\noindent In the above, ${\bf dc \# dc}$ is an $N \times N$ matrix
consisting of products of the diagonals of ${\bf XXI}$, so dividing
${\bf XXI}$ by $sqrt({\bf dc \# dc})$ generates the normalized version.
Because {\bf ncov} is a {\it normalized} covariance matrix, you
might think that its non-normalized parent is {\bf XXI}---and you'd be
{\it almost} right. The true covariance matrix ${\bf C}$ (as defined in
{\it Numerical Recipes}, for example) is
\begin{equation}
{\bf C} = \sigma_{data}^2 {\bf XXI}
\end{equation}
\noindent because ${\bf C}$ is defined for $\chi^2$ instead of
$\sigma^2$ (see \S \ref{chisqsection}).
In ${\bf ncov}$, the diagonal elements are all unity and the
off-diagonal elements reflect the interdependence of the derived
coefficients on each other. The off-diagonal elements can range from
$-1 \rightarrow 1$. Each matrix element is the correlation coefficient
between the {\it uncertainties} of its two parameters. In particular,
suppose that the data happens to produce a coefficient that differs from
its true value by some positive number. If the normalized matrix element
is negative, then the other coefficient will tend to differ from its
true value by a negative number.
Here's a more detailed discussion of what the covariance means.
Suppose you are least-squares fitting for two derived coefficients
($A_0$ and $A_1$). When you do a least-squares fit to a set of data,
you are fitting one set of data out of the infinity of possible
sets that you'd get by repeating the experiment, and your particular set
of data happens to produce specific values of $\overline{ A_0}$ and $\overline{
A_1}$, which differ from the {\it true} values $(A_0^*, A_1^*)$ by
amounts $\delta A_0$ and $\delta A_1$. If their covariance is zero,
then in the infinity of data sets you'd find that $\delta A_0$ is
uncorrelated with $\delta A_1$. But if it is nonzero, these two
quantities would be correlated.
A high covariance is bad because the derived variables depend on
each other. As we shall see in the next subsection, it also can greatly
increase the uncertainty in derived coefficients. Often a high
covariance results from a poor choice of functions that you are fitting
or even a bad choice of the zero point of the independent variable---as
in our numerical example (see the next subsection). And, as in that
example, you can sometimes eliminate the bad covariance by
reformulating the problem---you don't even need to take more data! The
best reformulation involves using a set of orthonormal functions.
However, sometimes your interest is in a specific set of functions that
are {\it not} orthogonal, and in such cases it makes no sense to convert
to orthogonal functions---because you just have to convert back again
and do the error propagation after-the-fact instead of letting the least
squares process do it for you.
\subsection{ The covariance in our numerical example} \label{covarianceone}
Apply equation \ref{idlcovariance} to obtain the normalized
covariance matrix for our numerical example:
\begin{eqnarray}
{\bf ncov} = \left[
\begin{array}{ccc}
1 & -.989848 & .969717 \\
-.989848 & 1 & -.993808 \\
.969717 & -.993808 & 1 \\
\end{array} \; \right] \; .
\end{eqnarray}
The off-diagonal elements are {\it huge}. This is the reason why
our solution is so bad.
In this seemingly innocuous example we have an excellent case of
a poor choice of zero point for the independent variable (the time).
The reason is clear upon a bit of reflection. We are fitting for $y =
A_0 + A_1 t + A_2 t^2$. The coefficient $A_0$ is the $y$-intercept and
$A_1$ is the slope. Inspection of Figure \ref{lsfitfig} makes it very
clear that an error in the slope has a big effect on the $y$-intercept.
Now we retry the example, making the zero point of the time
equal to the mean of all the times, that is we set $(time_m = time_m -
8)$. We get the same fitted line, but the derived coefficients are
completely different---and amazingly better! We get
\begin{mathletters}
\begin{eqnarray}
{\bf A} = \left[
\begin{array}{c}
188.625 \\
18.500 \\
0.87500 \\
\end{array} \; \right]
\end{eqnarray}
\begin{eqnarray}
{\bf \sigma_A} = \left[
\begin{array}{c}
3.58 \\
1.00 \\
0.559 \\
\end{array} \; \right] \; .
\end{eqnarray}
\end{mathletters}
\noindent In redefining the origin of the independent variable, we have
made the zero intercept completely independent of the slope: changing
the slope has no effect at all on the intercept. You can see this from
the normalized covariance matrix, which has become
\begin{eqnarray}
{\bf ncov} = \left[
\begin{array}{ccc}
1 & 0 & -0.78086881 \\
0 & 1 & 0 \\
-0.78086881 & 0 & 1 \\
\end{array} \; \right] \; .
\end{eqnarray}
\noindent which is nice, but not perfect: our orthogonalization is only
{\it partial}, because the second-order coefficient $A_2$ still affects
$A_0$: over the range of $[(time - 8) = -3 \rightarrow +3]$, the quantity
$[A_2 \; \Sigma(time_m -8)^2]$ is always positive and is thereby
correlated with $A_0$.
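The shifted-time refit is the same computation with $t-8$ in the design matrix; the following NumPy sketch (our translation) checks the numbers above:

```python
import numpy as np

t = np.array([5., 7., 9., 11.]) - 8.0   # shift the zero point to the mean time
Y = np.array([142., 168., 211., 251.])
M, N = 4, 3

X = np.column_stack([np.ones(M), t, t**2])
XXI = np.linalg.inv(X.T @ X)
a = XXI @ (X.T @ Y)
s_sq = (Y - X @ a) @ (Y - X @ a) / (M - N)
sig_a = np.sqrt(s_sq * np.diag(XXI))    # errors in the derived coefficients

dc = np.diag(XXI)
ncov = XXI / np.sqrt(np.outer(dc, dc))  # normalized covariance matrix
```

This reproduces ${\bf A} = (188.625,\ 18.5,\ 0.875)$ and $\sigma_A \approx (3.58,\ 1.00,\ 0.559)$. Note that $ncov_{01}$ vanishes identically here because $\sum (t_m-8) = \sum (t_m-8)^3 = 0$ for times placed symmetrically about the mean.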
We could complete the process of orthogonalization by following
the prescription in BR chapter 7.3, which discusses the general technique of
orthogonalizing the functions in least square fitting. The general case
is a royal pain, so much so that we won't even carry it through for our
example.
For some particular cases, standard pre-defined functions are
orthogonal. For example, if $t_m$ is a set of uniformly spaced points
between $(-1 \rightarrow 1)$ and you are fitting a polynomial, then the
appropriate orthogonal set is Legendre polynomials. This is good if your
only goal is to represent a bunch of points by a polynomial function,
because the coefficients of low-order polynomials are independent of the
higher ones. However, it's more work and, moreover, often you are
interested in the coefficients for specific functions that don't happen
to be orthogonal; in such cases, you should just forge ahead.
But {\it always} look at the normalized covariance matrix.
Suppose one pair of off-diagonal elements departs significantly from
zero. Then their corresponding functions are far from being orthogonal
and the variances of the derived coefficients will suffer as a result.
You might be able to eliminate one of the parameters to make the fit
more robust. For example, suppose one function is $t \cos (t)$ and the
other is $\sin(t) \cos (t)$. If the range of $t$ is small, these two
functions are indistinguishable and have a large covariance; you should
eliminate one from the fit. If the range of $t$ is large, there is no
problem.
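The $t \cos(t)$ versus $\sin(t) \cos(t)$ example is easy to demonstrate numerically; this sketch (with an arbitrarily chosen small range of $t$) shows the near-perfect covariance:

```python
import numpy as np

t = np.linspace(0.01, 0.1, 20)   # a small range of t, where sin(t) ~ t
X = np.column_stack([t * np.cos(t), np.sin(t) * np.cos(t)])
XXI = np.linalg.inv(X.T @ X)
dc = np.diag(XXI)
ncov = XXI / np.sqrt(np.outer(dc, dc))
# Over this small range the two columns are nearly proportional, so the
# uncertainties in the two coefficients are almost perfectly anticorrelated:
# ncov[0, 1] lies very close to -1.
```

Repeating this with a large range, e.g.\ \verb|t = np.linspace(0, 20, 200)|, drives the off-diagonal element well away from $\pm 1$.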
For further discussion of covariance, see \S \ref {chicoeffs}.
Also, you might also want to try out another example in Taylor's \S 8.5.
\section{CHI-SQUARE FITTING, AND WEIGHTED FITTING} \label{chisqsection}
Chi-square fitting is very much like our least-squares fitting
above, with one additional wrinkle. The reduced Chi-square
($\widehat{\chi}^2$) is equal to the ratio of the {\it variance of the
datapoint residuals} ($\sigma^2$) to the {\it adopted intrinsic
variances of the datapoints} ($\sigma_{data}^2$). So it should be
obvious that in Chi-square fitting, you must know the measurement
uncertainties $\sigma_{data}$ of the individual data points beforehand.
If you want to give the various datapoints weights based on something
other than $\sigma_{data}$, then that is just like Chi-square fitting
except that you can adopt an arbitrary scale factor for the
uncertainties (section~\ref{weightedcase}).
Chi-squared fitting treats uncertainties of the derived
parameters in a surprising way. Getting the
coefficient uncertainties with chi-squared fitting is a tricky business
because \begin{enumerate}
\item With the standard treatments, the errors in the derived parameters
don't depend on the residuals of the datapoints from the fit (!).
\item The errors in the derived parameters depend on their mutual
covariances.
\end{enumerate}
\noindent We treat these issues in the following two subsections.
\subsection{The uncertainties of the derived coefficients: discussion
ignoring covariance}
\label{chicoeffsnoco}
If you read any standard textbook on least squares fitting, you
will discover that the error in a derived parameter is equal to its
diagonal element in the covariance matrix. That is, the formula is just
like our equations \ref{coeffvarianceone} and \ref{coeffvariancetwo} but
with two differences: the $\sigma$ subscripts disappear and there is no
$s^2$. See NM equation 14.3.15; BR 7.25, Taylor 8.16, 8.17. This is not
as bad as it sounds because, without the $\sigma$ subscripts, the
covariance matrix $[\alpha]^{-1}$ contains the factor $\sigma_{data}^2$.
If you've chosen $\sigma_{data}^2$ accurately, and if the model for the
fit is good so that the fitted curve accurately represents the points,
then $s^2 \approx \sigma_{data}^2$ so the results are equivalent.
However, realize that in Chi-squared fitting the covariance
matrix $[\alpha]^{-1}$ depends only on $\bf X$, i.e.\ on the times at
which you took the data---{\it not on the datapoint values}. Because
$[\alpha]^{-1}$ is {\it independent of the datapoint values},
your derived parameter errors are also independent of the datapoint
values. So they don't depend on the datapoint variance $\sigma^2$ or
the $\chi^2$.
This means that the standardly-derived errors in the parameters
depend only on what you chose for $\sigma_{data}$. If you have an
inflated opinion of the quality of your instrument or observing prowess
and choose an unrealistically low value for $\sigma_{data}$, then your
derived parameter errors are too low by the same factor. Or if your
estimate of $\sigma_{data}$ is accurate but you are fitting a poor
model, i.e.\ fitting a quadratic when you should be fitting a sine wave,
then the residuals will be large and your parameter errors will also be
much larger than you calculate.
How many observational papers in the ApJ provide parameter
errors based on Chi-squared fitting that are unrealistically low? Or,
for that matter, high? [probably fewer!]
To understand the derived-parameter uncertainty issue, we first
illustrate the situation with the simplest of weighted chi square fits,
the weighted mean. We also treat the case in which all assumed variances
of the datapoints are identical ($\sigma_{data,m} = \sigma_{data}$) so
that we can focus on the essentials of the problem; this restriction in
no way affects our derived results. This will provide insight that we
can extend to the multivariate chi-squared case.
\subsubsection{ The weighted mean: the simplest chi-squared fit}
First, recall the formulae for an ordinary {\it unweighted}
average in which the value of each point is $y_m$ and the residual of
each point from the weighted mean is $\delta y_m$; we also neglect the
difference between $M$ and $M-1$ to make the comparison to the weighted
case clearer:
\begin{mathletters}
\begin{equation}
mean = {\sum y_m \over M}
\end{equation}
\begin{equation}
\sigma^2 = {\sum \delta y_m^2 \over M}
\end{equation}
\begin{equation} \label{meanvariance}
\sigma_{mean}^2 = {\sum \delta y_m^2 \over M^2} = {\sigma^2 \over M} \ ,
\end{equation}
\end{mathletters}
\noindent $\sigma_{mean}^2$ is the variance of the mean and $\sigma^2$
is the variance of the datapoints around the mean. Recall that in this
case the mean is the least-squares fit to the data, so to use
least-squares jargon we can also describe $\sigma_{mean}$ as the error
in the derived coefficient for this single-parameter least-squares fit.
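These three formulae are easy to check numerically. Here is a minimal sketch in Python/NumPy (the document's examples use IDL, but the arithmetic translates directly); the data values are invented for illustration:

```python
import numpy as np

# Unweighted mean, datapoint variance, and variance of the mean,
# neglecting the M vs. M-1 distinction as in the text.
# The data values y_m are invented for illustration.
y = np.array([9.8, 10.2, 10.1, 9.9, 10.0])
M = len(y)

mean = y.sum() / M              # mean = sum(y_m) / M
dy = y - mean                   # residuals delta y_m
sigma2 = (dy**2).sum() / M      # variance of datapoints about the mean
sigma_mean2 = sigma2 / M        # variance of the mean
```

For these five invented points the mean is 10.0, and $\sigma_{mean}^2$ is smaller than $\sigma^2$ by the factor $M=5$, as equation \ref{meanvariance} requires.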
Now for a weighted average in which the weight of each point is
$w_m = {1 \over \sigma_{data,m}^2} = {1 \over \sigma_{data}^2}$. (Recall we
are considering the case where all $\sigma_{data,m}$ are identical,
equal to $\sigma_{data}$.) In a nonweighted average the quantity that is
minimized is $\sum \delta y_m^2$; in a weighted average the quantity minimized
is $\sum w_m \delta y_m^2$. So you'd think that the three equations
corresponding to the above become
\begin{mathletters} \label{wgtideal}
\begin{equation}
mean_w = {\sum w_my_m \over \sum w_m} = { \sum y_m \over M}
\end{equation}
\begin{equation} \label{wvariance}
\sigma_w^2 = {\sum w_m \delta y_m^2 \over \sum w_m} = {\sum \delta y_m^2 \over M}
\end{equation}
\begin{equation} \label{wmeanvariancecc}
\sigma_{mean}^2 = {\sum w_m \delta y_m^2 \over M \sum w_m} =
{\sigma_w^2 \over M} = {\sum {\delta y_m^2} \over M^2 } \ ,
\end{equation}
\end{mathletters}
\noindent where the rightmost expressions use our assumption that
$\sigma_{data,m}$ are all identical. In fact, however, the last of these
equations is always written (e.g.\ BR equation 4.19; Taylor equation
7.12)
\begin{equation} \label{wmeanvariancebb}
\sigma_{mean}^2 = {1 \over \sum w_m} = { \sigma_{data}^2 \over M} \ .
\end{equation}
Note the excruciatingly painful difference between equation
\ref{wgtideal}c and equation \ref{wmeanvariancebb}: the former depends
on the {\it variance of the datapoint residuals}, as you'd think it
should, while the latter depends on only the {\it adopted intrinsic
variance of the data}, which is chosen by the guy doing the fit. If you
do an unweighted average, and derive a certain variance, and next do a
weighted average in which you choose some value for $\sigma_{data}$ that
happens to be wrong, the two fits give different results for
$\sigma_{mean}^2$. This is crazy.
To get around this difficulty, we follow the procedure in BR
equations 4.20 to 4.26. This introduces an arbitrary multiplicative
factor for the weights and goes through the ML calculation to derive,
instead of equation \ref{wmeanvariancebb}, the far superior
\begin{equation} \label{wmeanvarianceaa}
\sigma_{mean}^2 = {\widehat{\chi}^2 \over \sum w_m} =
{\chi^2 \over M\sum w_m} =
{ \sigma_{w}^2 \over M} \ ,
\end{equation}
\noindent which is precisely the same as our intuitive guess, equation
\ref{wmeanvariancecc}. The difference is the numerator, which contains
the reduced chi-squared $\widehat{\chi}^2$; for this case where all
$\sigma_{data,m}$ are identical, $\widehat{\chi}^2 = {\sigma_w^2 \over
\sigma_{data}^2}$. Here
\begin{equation} \label{chisqone}
\chi^2 = \sum w_m \delta y_m^2 = \sum {\delta y_m^2 \over \sigma_{data,m}^2}
\end{equation}
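To see the difference numerically, here is a sketch (Python/NumPy rather than IDL; the data values and the adopted $\sigma_{data}$ are invented) contrasting the textbook error $1/\sum w_m$ with the $\widehat{\chi}^2$-corrected version. With a deliberately wrong $\sigma_{data}$, the former is off, while the latter reduces to the residual-based result:

```python
import numpy as np

# Weighted mean with all weights identical, w_m = 1/sigma_data^2.
# The adopted sigma_data is deliberately too large, so the textbook
# error 1/sum(w) is inflated, while the chi-hat^2-corrected error
# recovers sum(dy^2)/M^2. Data values are invented; the M vs. M-1
# distinction is neglected as in the text.
y = np.array([9.8, 10.2, 10.1, 9.9, 10.0])
M = len(y)
sigma_data = 0.5                      # adopted intrinsic error (too big)
w = np.full(M, 1.0 / sigma_data**2)

mean_w = (w * y).sum() / w.sum()
dy = y - mean_w
chi2 = (w * dy**2).sum()              # chi^2 = sum w_m dy_m^2
chi2_hat = chi2 / M                   # reduced chi^2 (one parameter)
naive_var = 1.0 / w.sum()             # textbook: sigma_data^2 / M
corrected_var = chi2_hat / w.sum()    # = sum(dy^2) / M^2
```

The corrected variance depends only on the residuals, not on the adopted $\sigma_{data}$.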
\subsubsection{The multivariate Chi-square fit}
\label{sectionchisq}
In chi-squared fitting, you know beforehand the intrinsic
uncertainty in each measurement $\sigma_{data,m}$. This is a function
{\it only} of the {\it instrument} and the {\it data-taking procedure}:
if you know the properties of the instrument, and if you understand how
it works, and if there are no systematic errors, then you should be able
to specify each $\sigma_{data,m}$. Specifically, $\sigma_{data,m}$ is
most certainly {\it not} the same thing as the dispersion of the points
from the fit.
In this case, Chi-square fitting is just like least-squares
fitting except for the following:
{\bf (1)} In the matrix $\bf X_\sigma$ of equation
\ref{matrixdefinition}a, each row $m$ is a different measurement with a
different intrinsic variance $\sigma_{data,m}$. You are generating a
new matrix $\mathbf {X}$, which is identical to ${\bf X_\sigma}$ except
that each row $m$ is divided by $\sigma_{data,m}$. This new matrix is
the same as NR's {\it design matrix} (Figure 15.4.1), which they denote
by $\bf A$.
{\bf (2)} Divide each datapoint $\theta_m$ in equation
\ref{matrixdefinition}b by $\sigma_{data,m}$. You are generating a new
matrix $\mathbf {Y}$, which is identical to ${\bf Y_\sigma}$ except that
each row $m$ is divided by $\sigma_{data,m}$. This new matrix is the same as
NR's vector {\bf b}.
You've divided each row, i.e.\ the equation of condition for
each row $m$, by a common factor, so the solution of the equation is
unchanged. Now suppose that all datapoints have the same intrinsic
measurement uncertainty $\sigma_{data}$. In the full matrix
equation~\ref{equationofcondition} you've divided both sides by the same
quantity, $\sigma_{data}$. Obviously, this doesn't change the derived
results for {\bf a}. (If the $\sigma_{data,m}$ differ from each other,
then $\bf a$ is affected, of course.) Let's summarize the relationships
between the least-squares ($\sigma$-subscripted) and $\chi^2$ matrices
for this specific case where all $\sigma_{data,m}$ are identical:
\begin{mathletters}
\begin{equation}
{\bf X} = {{\bf X_\sigma} \over \sigma_{data}}
\end{equation}
\begin{equation}
{\bf Y} = {{\bf Y_\sigma} \over \sigma_{data}}
\end{equation}
\begin{equation}
[\alpha] = { [\alpha_\sigma] \over \sigma_{data}^2 }
\end{equation}
\begin{equation}
[\alpha]^{-1} = \sigma_{data}^2 [\alpha_\sigma]^{-1}
\end{equation}
\end{mathletters}
When you calculate the predicted values ${\bf \overline{ Y}}$ in
equation \ref{ybardefinition}, using $\bf X$ instead of $\bf X_\sigma$,
the values are divided by $\sigma_{data}$. Because $\bf Y$ is also
divided by $\sigma_{data}$, the residuals $\bf \delta Y$ are as well.
Thus when you calculate the sum of the squares of the residuals as in
equation \ref{samplevariancetwo} you get ${s^2 \over \sigma_{data}^2}$,
which is the reduced chi-square, denoted as $\widehat{\chi}^2$ :
\begin{equation}
\label{reducedchisq}
\widehat{\chi}^2 = { {\bf \delta Y^T \cdot \delta Y} \over M-N}
= { {\bf \delta Y_\sigma^T \cdot \delta Y_\sigma} \over \sigma_{data}^2(M-N)}
\end{equation}
\noindent and, obviously, you could calculate the non-reduced $\chi^2$
\begin{equation}
\label{chisq}
\chi^2 = {\bf \delta Y^T \cdot \delta Y}
= { {\bf \delta Y_\sigma^T \cdot \delta Y_\sigma} \over \sigma_{data}^2}
\end{equation}
\noindent This is precisely analogous to equation \ref{chisqone} for the
weighted mean example above. The reduced chi-square $\widehat{\chi}^2$
is equal to the ordinary chi-square $\chi^2$ except that it is divided
by the number of degrees of freedom, ordinarily denoted by $\nu$, which
is equal to $(M-N)$.
Finally, we have the analogy of equation \ref{wmeanvarianceaa}
expressed in matrix form as in equation \ref{coeffvarianceone}:
\begin{equation} \label{coeffvariancegood}
{\bf s_a}^2 = \widehat{\chi}^2 \ diag \{[\alpha]^{-1} \} =
{ {\bf \delta Y^T \cdot \delta Y} \over M-N} \ diag \{ [\alpha]^{-1} \}
\end{equation}
\noindent This is in contrast to the result quoted in textbooks (e.g. NM
equation 14.2.15*******, BR equation 7.25), which omits the
$\widehat{\chi}^2$ factor:
\begin{equation} \label{coeffvarianceawful}
{\bf s_a}^2 = diag \{[\alpha]^{-1} \}
\end{equation}
\noindent and, as in the standard textbook solution for the weighted
mean case, provides parameter errors that are independent of the
datapoint residuals.
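The whole recipe fits in a few lines. The following sketch (Python/NumPy rather than IDL; the quadratic model, its coefficients, and the adopted $\sigma_{data}$ are all invented for illustration) builds the design matrix, solves the normal equations, and computes both versions of the parameter errors:

```python
import numpy as np

# Chi-square fit of theta = a0 + a1 t + a2 t^2 with identical adopted
# sigma_data. Coefficients, times, and noise level are invented.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 30)
sigma_data = np.full(t.size, 2.0)           # adopted intrinsic errors
theta = 96.0 + 4.5 * t - 2.0 * t**2 + rng.normal(0.0, 2.0, t.size)

Xs = np.column_stack([np.ones_like(t), t, t**2])   # X_sigma
X = Xs / sigma_data[:, None]                # NR's design matrix A
Y = theta / sigma_data                      # NR's vector b

alpha = X.T @ X                             # curvature matrix [alpha]
alpha_inv = np.linalg.inv(alpha)            # covariance matrix
a = alpha_inv @ (X.T @ Y)                   # derived coefficients

dY = Y - X @ a                              # residuals, already /sigma
M, N = X.shape
chi2_hat = (dY @ dY) / (M - N)              # reduced chi-square

err_textbook  = np.sqrt(np.diag(alpha_inv))             # no chi-hat^2
err_corrected = np.sqrt(chi2_hat * np.diag(alpha_inv))  # with chi-hat^2
```

Because the invented noise here happens to match the adopted $\sigma_{data}$, $\widehat{\chi}^2$ comes out near unity and the two error estimates nearly agree; scale the noise without changing $\sigma_{data}$ and they separate.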
Our result, equation \ref{coeffvariancegood}, is very
reasonable. Suppose, for example, that the least-squares fit model is
perfect and the only deviations from the fitted curve result from
measurement error. Then by necessity we have $s^2 \approx
\sigma_{data}^2$ and $\widehat{\chi}^2 \approx 1$. (Different
experiments produce somewhat different values of $s^2$ because of
statistical fluctuations; an average gives $\sigma^2 = \langle s^2
\rangle$). In this situation, though, equations \ref{coeffvariancegood}
and \ref{coeffvarianceawful} are identical.
In the case $\widehat{\chi}^2 \approx 1$ the dispersion of the
observed points $\sigma$ is equal to the intrinsic dispersion of the
datapoints $\sigma_{data}$ and the mathematical model embodied in the
least-squares fit is perfect. That, at least, is the {\it theoretical}
conclusion. In practice, however, your obtaining such a low, good value
for $\widehat{\chi}^2$ might mean instead that you are using {\it too
large} a value for $\sigma_{data}$: you are ascribing more error to your
datapoints than they really have, perhaps by not putting enough faith in
your instrument. Specifically, if the datapoints really have intrinsic
measurement dispersion $\sigma_{data}$ and you specify that they have $2
\sigma_{data}$ when computing $(\mathbf{X},\mathbf{Y})$, then you'll
find $\widehat{\chi}^2 = 0.25$.
High values of $\widehat{\chi}^2$ indicate that the model
is not perfect and could be improved by the use of a different model,
such as the addition of more parameters---or, alternatively, that you
think your equipment works better than it really does and you are
ascribing {\it less} error to your datapoints than they really have. And
in this case, using equation \ref{coeffvarianceawful} instead of
\ref{coeffvariancegood} is disastrous.
We defer the more detailed interpretation of $\chi^2$ to later sections.
\subsubsection{The case in which datapoints have different dispersions,
or different weights: like a weighted average}
\label{weightedcase}
Here the $\sigma_{data,m}$ are all different. The $m^{th}$ row
of the equation-of-condition matrix ${\bf X_\sigma}$ and the $m^{th}$
element of the data vector ${\bf Y_\sigma}$ get divided by their
corresponding $\sigma_{data,m}$. The equation embodied in each row of
the matrix equation~\ref{equationofcondition} remains unchanged, but the
different rows are weighted differently with respect to each other.
Consider two measurements with intrinsic measurement
uncertainties $(\sigma_{data,1}, \sigma_{data,2})$; suppose
$\sigma_{data,1} < \sigma_{data,2}$. After being divided by their
respective $\sigma_{data}$'s, all of the numbers in row 1 are larger
than those in row 2. In all subsequent matrix operations, these larger
numbers contribute more to all of the matrix-element products and sums.
Thus, the measurement with smaller uncertainty has more influence on the
final result, as it should.
Suppose that the above two measurements were taken under
identical conditions except that measurement 1 received more integration
time than measurement 2; we have ${\sigma_{data,1} \over
\sigma_{data,2}} = \left({\tau_1 \over \tau_2}\right)^{-1/2}$, so the
rows of $\mathbf {X}$ are weighted as $\tau^{1/2}$. This means that
during the computation of $[\alpha] = \mathbf {X^T \cdot X}$, the
self-products of row 1 are weighted as $\tau_1$. This means that each
datapoint is weighted as $\tau$, which is exactly what you'd expect!
Note that this is also exactly the same weighting scheme used in a
weighted average, in which the weights are proportional to $\left(1
\over \sigma_{data}\right)^2$. We conclude that the weighting scheme of the
first two steps in section~\ref{sectionchisq} agrees with common sense.
Suppose you don't know the intrinsic measurement dispersion
$\sigma_{data}$, but you {\it do} know the {\it relative} dispersion of
the various measurements. For example, this would be the case if the
datapoints were taken under identical conditions except for integration
time; then $\sigma_{data} \propto \tau^{-1/2}$. In this case, divide
each row by its relative $\sigma_{data,m}$ (so that each datapoint carries
weight $w_m \propto {1 \over \sigma_{data,m}^2}$) and proceed as above.
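A tiny sketch of this integration-time bookkeeping (Python/NumPy; the $\tau$ values are invented): rows scaled by $\tau^{1/2}$ produce weights proportional to $\tau$ in $[\alpha]$.

```python
import numpy as np

# sigma_data,m proportional to tau_m^(-1/2): each row of X and Y is
# divided by sigma_data,m, i.e. multiplied by sqrt(tau_m), so each
# datapoint enters [alpha] = X^T . X with weight tau_m.
# Integration times are invented for illustration.
tau = np.array([1.0, 4.0, 9.0])        # integration times
rel_sigma = tau**-0.5                  # relative sigma_data,m
row_scale = 1.0 / rel_sigma            # = sqrt(tau)
weight_in_alpha = row_scale**2         # contribution to X^T . X
```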
\subsection{The uncertainties of the derived coefficients: discussion
including covariance }
\label{chicoeffs}
Consider the first two coefficients in our above example, which
we discussed a bit above in \S \ref{covarianceone}. In this example,
the fit gives $y = A_0 + A_1 t + A_2 t^2$, where the numerical values
are given in vector form by equation~\ref{avalues}. The coefficient
$A_0$ is the $y$-intercept and $A_1$ is the slope. They have derived
values $A_0 = 96 \pm 34$ and $A_1 = 4.5 \pm 9$.
Remember what these uncertainties really mean: in an infinity of
similar experiments, you'll obtain an infinity of values of $(A_0,A_1)$
that are normally distributed with dispersions (34,9). Loosely
speaking, this means that $A_0$ lies between $(96-34=62)$ and
$(96+34=130)$ and $A_1$ lies between $-4.5$ and $13.5$.
Suppose you are interested in knowing about $A_0$ {\it without
regard to $A_1$}. By this we mean that as $A_0$ is varied from its
optimum value of 96, $\chi^2$ increases from its minimum value. As we
vary $A_0$, if we allow $A_1$ to take on whatever value it needs to for
the purpose of minimizing $\chi^2$, then this is what we mean by
``knowing about $A_0$ without regard to $A_1$.'' For this case, the
uncertainty of $A_0$ is indeed 34. Ditto for $A_1$. In other words,
equations \ref{coeffvarianceone}, \ref{coeffvariancetwo}, and
\ref{coeffvariancegood} apply.
However, if you are interested in knowing about {\it both}, you
must include their covariance. In our example, the large negative
covariance follows logically just from looking at a graph: if you fit
some points, all of which lie at positive $t$, then a more negative
derived slope will raise the $y$-intercept.
Specifically, the large negative covariance means that positive
departures of $A_0$ are associated with negative departures of $A_1$.
So even though the {\it individual} values $A_0=130$ and $A_1=13.5$ are
acceptable, you {\it cannot} conclude that the {\it pair} of values
$(A_0,A_1) = (130, 13.5)$ is acceptable, because this pair has large
positive departures of both. In contrast, what {\it is} acceptable here
would be something like $(A_0,A_1) = (130, -4.5)$.
We stress that the acceptable ranges of values depend on what
you are interested in. This is sort of like the observer's influence in
quantum mechanics. If you are interested in $A_1$ alone, then you can
say $A_1 = 4.5 \pm 9$ and, in making this statement, you have to realize
that, as $A_1$ varies over this range, $A_0$ can vary over (formally, at
least) the range $(\infty \rightarrow -\infty)$: you just don't give a
damn {\it what} happens to $A_0$ because you're not interested. But the
moment you become interested and restrict its possible range, that
influences the possible range for $A_1$, too.
There is no simple relationship between the covariance matrix
elements and the acceptable ranges. For two variables, the best way to
express these matters is to construct the ellipses that define the loci
of constant $\Delta \chi^2$ and present them on a graph as in NR's
Figure 14.5.4. For three variables, these ellipses become ellipsoids;
for four, they become four-dimensional volumes, etc.
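For the two-parameter case you can trace such an ellipse directly from the covariance matrix: the locus is the set of offsets $\delta {\bf a}$ satisfying $\delta {\bf a}^T [\alpha] \, \delta {\bf a} = \Delta \chi^2$. A sketch (Python/NumPy; the covariance values are invented, loosely echoing the negative covariance discussed above, and $\Delta \chi^2 = 2.30$ is the standard 68.3\% joint level for two parameters):

```python
import numpy as np

# Constant Delta chi^2 ellipse for two parameters: map the unit
# circle through a Cholesky factor of Delta_chi2 * [alpha]^-1.
# Covariance matrix values are invented for illustration.
cov = np.array([[34.0**2, -250.0],     # [alpha]^-1, negative covariance
                [-250.0, 9.0**2]])
alpha = np.linalg.inv(cov)
delta_chi2 = 2.30                      # 68.3% joint region, 2 parameters

phi = np.linspace(0.0, 2 * np.pi, 200)
circle = np.vstack([np.cos(phi), np.sin(phi)])
L = np.linalg.cholesky(delta_chi2 * cov)
ellipse = L @ circle                   # offsets (dA0, dA1) on the contour
```

Every point on `ellipse` satisfies $\delta {\bf a}^T [\alpha] \, \delta {\bf a} = 2.30$; the negative covariance tilts the ellipse so positive $\delta A_0$ pairs with negative $\delta A_1$.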
% To construct these curves/surfaces/volumes/etc., follow the
%prescription defined by the six bullets at the end of the section NR's
%{\it Probability Distribution of Parameters in the Normal Case}. This
%description is a bit curt, but it is accurate. To make sure that you
%understand it, take a numerical example and try applying the
%prescription to the case of deriving the uncertainties of a single
%coefficient (the case $\nu = 1$); that case is the case covered by
%equations~\ref{lscoefferror} and \ref{lscoefferror1} above. In this
%case, you have the usual normal distribution function for that variable,
%in which its formal error (as defined in equations~\ref{lscoefferror}
%and \ref{lscoefferror1} above) defines a range within which the derived
%value will fall $68.3\%$ of the time in the infinity of
%experiments---it's just like any other normal distribution.
\section{NONLINEAR LEAST SQUARES} \label{nonlinearls}
The least squares formulation requires that the data values
depend {\it linearly} on the unknown coefficients. For example, in
equation \ref{one}, the unknown coefficients $A$ and $B$ enter linearly.
Suppose you have a nonlinear dependence, such as for example
wanting to solve for $A$ and $B$ with equations of condition that look
like
\begin{equation} \label{basicone}
\sin(A t_m) + B t_m = \theta_m \; .
\end{equation}
\noindent What do you do here? You linearize the equation, using the
following procedure.
First, assume trial values for $A$ and $B$; call these $A_0$
and $B_0$. You should pick values that are close to the correct ones. In
our example you wouldn't need to do this for $B$, but it's easier to
treat all coefficients identically. These trial values produce {\it
predicted} values $\theta_{0,m}$:
\begin{equation} \label{trialone}
\sin(A_0 t_m) + B_0 t_m = \theta_{0,m} \; .
\end{equation}
\noindent Subtract equation \ref{trialone} from \ref{basicone}, and
express the differences as derivatives. Letting $\delta A = A - A_0$ and
$\delta B = B - B_0$, this gives
\begin{equation} \label{difference}
\delta A [t_m \cos(A_0 t_m)] + \delta B t_m = \theta_m - \theta_{0,m} \; .
\end{equation}
\noindent This is linear in $(\delta A, \delta B)$ so you can solve for
them using standard least squares. Increment the original guessed values
to calculate $A_{0,new}=A_0 + \delta A$ and $B_{0,new}=B_0 + \delta B$.
These won't be exact because higher derivatives (including cross
derivatives) come into play, so you need to use these new values to
repeat the process. This is an iterative procedure and you keep going
until the changes become ``small''. The generalization to an arbitrarily
large number of unknown coefficients is obvious.
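The whole iteration can be sketched in a few lines (Python/NumPy rather than IDL; the true coefficients, times, initial guesses, and damping factor are invented, and the data are noise-free):

```python
import numpy as np

# Linearized iteration for sin(A t) + B t = theta.
# True (A, B), times, initial guesses, and the damping factor F
# are invented for illustration.
A_true, B_true = 1.3, 0.4
t = np.linspace(0.1, 3.0, 40)
theta = np.sin(A_true * t) + B_true * t

A0, B0 = 1.1, 0.3                      # trial values
F = 0.75                               # damping factor for stability
for _ in range(100):
    theta0 = np.sin(A0 * t) + B0 * t   # predicted values
    X = np.column_stack([t * np.cos(A0 * t), t])   # derivatives
    dA, dB = np.linalg.lstsq(X, theta - theta0, rcond=None)[0]
    A0 += F * dA                       # increment the guesses
    B0 += F * dB
    if abs(dA) < 1e-10 and abs(dB) < 1e-10:
        break
```

With a reasonable initial guess this converges to the true $(A, B)$ in a few tens of iterations; the damping factor $F$ anticipates the stability remark below.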
We now offer some cautionary and practical remarks.
{\bf (1): Multiple minima:} Nonlinear problems often have
multiple minima in $\sigma^2$. A classical case is fitting multiple
Gaussians to a spectral line profile. Gaussians are most definitely not
orthogonal functions and in some cases several solutions may give almost
comparably good values of $\sigma^2$, each one being a local minimum.
For example, for the case of two blended Gaussians, one can often fit
two narrow Gaussians or the combination of a wide and narrow Gaussian,
the two fits giving almost equal $\sigma^2$. The lower of these is the
real minimum but, given the existence of systematic errors and such, not
necessarily the best solution. The best solution is often determined by
physical considerations; in this case, for example, you might have
physical reasons to fit a broad plus narrow Gaussian, so you'd choose
this one even if its $\sigma^2$ weren't the true minimum.
{\bf (2): The Initial Guess:} When there are multiple minima,
the one to which the solution converges is influenced by your initial
guess. To fully understand the range of possible solutions, you should
try different initial guesses and see what happens. If the solutions
always converge to the same answer, then you can have some confidence
(but not {\it full} confidence) that the solution is unique.
{\bf (3): Iterative stability:} If your initial guess is too far
from the true solution, then the existence of higher derivatives means
that the computed corrections can be too large and drive the iterative
solution into instability. It is often a good idea to multiply the
derived correction factors ($\delta A$ and $\delta B$ above) by a factor
${\cal F}$ less than unity, for example ${\cal F} = 0.5$ or 0.75. This
increases the number of iterations required for convergence but often
allows convergence instead of producing instability.
{\bf (4): Convergence criteria:} How do you know when the
solution has converged? One way: for each iteration, calculate the
uncertainties in the derived coefficients. If the uncertainty exceeds
the correction, then you are getting close. An alternate way, which I
usually use: if the fractional correction (e.g.~$\delta A \over A_0$)
decreases below some threshold, say $1\%$, you're close (some
parameters, such as angles, need a threshold that is absolute instead of
fractional). At this point, if you are using ${\cal F} \neq 1$, set
${\cal F} = 1$, do a few more iterations, and you're done.
{\bf (5): Numerical derivatives:} Sometimes the equations of
condition are so complicated that taking the derivatives, as in
obtaining equation \ref{difference}, is a huge job and subject to
mistakes. So you can take numerical derivatives instead of analytic
ones. Be careful, though; it's safest to use double precision and think
a bit about numerical accuracy; take a look at NM's section ******* on
evaluating numerical derivatives.
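A minimal sketch of such a numerical derivative (Python/NumPy; the model, parameter value, and step size are invented), using a central difference whose step is scaled to the parameter:

```python
import numpy as np

# Central-difference derivative of theta = sin(A t) with respect
# to A, compared with the analytic t cos(A t). The step h is a
# compromise: too small loses precision to roundoff, too large to
# truncation. Model and values are invented for illustration.
def model(A, t):
    return np.sin(A * t)

t = np.array([0.5, 1.0, 2.0])
A0 = 1.3
h = 1e-6 * max(abs(A0), 1.0)           # step scaled to the parameter
dtheta_dA = (model(A0 + h, t) - model(A0 - h, t)) / (2 * h)
analytic = t * np.cos(A0 * t)          # for comparison only
```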
{\bf (6): Canned nonlinear least squares:} Packages like IDL
offer canned nonlinear least squares routines. They are designed to work
well for a wide range of different problems. However, often you can do
better by tailoring things (such as the factor ${\cal F}$ and
convergence criteria above) for the specific problem at hand. A good
example is Gaussian fitting: IDL's fitting program doesn't converge for
some input data, while for many of these cases the program that I wrote
myself works fine.
{\bf (7): Be careful and LOOK at the solution before accepting
it!} These nonlinear problems can produce surprising results, sometimes
completely meaningless results. Don't rely on them to be automatic or
foolproof!
\section{BRUTE FORCE CHI SQUARED AND THE CURVATURE MATRIX}
\subsection{Parameter Uncertainties in Brute Force Chi square fitting for a
Single Variable}
There are times when ``brute'' least squares is appropriate. For
example, if you have a nonlinear problem in which taking derivatives is
complicated, and if the number of unknown coefficients is small, then it
might be easier to search through the coefficient parameter space,
calculate the $\sigma^2$ for each combination of parameters, and find
the minimum. That procedure certainly provides the least squares
solution. It is particularly easy if there is only one coefficient. Let
us consider this case.
How do you determine the uncertainties in the derived
coefficient with this brute force technique? Consider the case of a
simple weighted average (which is the least squares solution for a
constant). Suppose the value of the unknown coefficient that minimizes
$\chi^2$ (i.e., the weighted average) is $x_0$ and that its variance is
$\sigma_{x_0}^2$. Now consider offsets from $x_0$, namely $x_0 + \delta
x$. Because $\chi^2$ is minimized at $x_0$, it is clear that
\begin{equation}
\Delta \chi^2 = \chi^2_{x_0 + \delta x} - \chi^2_{x_0} =
{1 \over 2} {d^2\chi^2 \over d \delta x^2} \delta x^2
\end{equation}
\noindent
Generally, the curvature is the relevant element of the
curvature matrix $[\alpha]$. In the particular case of a weighted
average, there is just one element so
\begin{equation}
[\alpha_\sigma] = [\alpha_\sigma]_{00} = \sum \left( {1 \over
\sigma_{data,m}^2} \right)
\end{equation}
In this particular case, {\it the curvature is independent of
the values of the datapoints!} This independence of the data {\it
values} is {\it always the case}. Generally, the curvature depends on
the {\it locations} of the datapoints (the ensemble of $t_m$ in equation
\ref{one}) but not on the {\it measured values} (the ensemble of
$\theta_m$ in equation \ref{one}); you can see this because the
curvature is equal to $[\alpha]$, whose elements depend only on the
locations.
The curvature is important because multiplying the curvature by
$\sigma_{x_0}^2$ gives a special and unique value for $\Delta
\chi^2$. In particular, for the case of a single variable as we are now
discussing, $\Delta \chi^2 = 1$ when $\delta x^2 = \sigma_{x_0}^2$. This
is, in fact, the best way to consider the basic definition of the
variance: it is the value that, when multiplied by the curvature, gives
$\Delta \chi^2=1$. Actually, this condition only holds for the
single-variable case; for more variables, the variances correspond to $\Delta
\chi^2 > 1$. This is tied up with the pdf of $\chi^2$. We'll discuss
this more general case later.
Are you bothered by the fact that the curvature is independent
of the data values? You should be: it is obvious that the data values
affect $\sigma_{x_0}$. However, things aren't so bad. Even though the
data values don't appear explicitly, their variance is actually present
in the statement $\Delta \chi^2 = 1$. This is because $\chi^2$ is
inversely proportional to the adopted intrinsic variance
$\sigma_{data,m}^2$---so the smaller $\sigma_{data}$, the smaller
$\delta x$ needs to be to make $\Delta \chi^2 = 1$.
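The single-variable case is easy to verify numerically. This sketch (Python/NumPy; the data values and $\sigma_{data}$ are invented) scans $\chi^2$ over a grid, finds the minimum, and reads the error off the $\Delta \chi^2 \le 1$ interval; it reproduces $\sigma_{x_0}^2 = 1 / \sum (1/\sigma_{data,m}^2)$:

```python
import numpy as np

# Brute-force chi-square for a single coefficient (the weighted
# mean): scan x, locate the minimum, and take the half-width of
# the Delta chi^2 <= 1 interval as the 1-sigma error.
# Data values and sigma_data are invented for illustration.
y = np.array([9.8, 10.2, 10.1, 9.9, 10.0])
sigma_data = 0.1                       # adopted intrinsic uncertainty
xgrid = np.linspace(9.5, 10.5, 2001)
chi2 = np.array([(((y - x) / sigma_data)**2).sum() for x in xgrid])

i0 = chi2.argmin()
x0 = xgrid[i0]                         # brute-force best fit
inside = xgrid[chi2 <= chi2[i0] + 1.0] # Delta chi^2 <= 1 interval
sigma_x0 = (inside[-1] - inside[0]) / 2.0
# curvature check: sigma_x0^2 should equal 1 / sum(1/sigma_data^2)
```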
But this leaves open the big problem that $\sigma_{data}$ is the
adopted uncertainty, which is chosen beforehand by you---you're supposed
to be such a good experimentalist that you really do know the intrinsic
uncertainty in your measured values, and moreover you are assuming that
there are no other sources of uncertainty---such as ``cosmic scatter''
or an inappropriate model to which you are fitting the data. If your
adopted values of $\sigma_{data,m}$ are off by a common scale factor,
then parameter errors obtained from setting $\Delta \chi^2 = 1$ will be
incorrect by the square of that same factor. You can correct for this by
scaling the variance $\sigma_{x_0}^2$ by the reduced chi squared
$\widehat{\chi}^2$, as discussed in \S \ref{sectionchisq}. Or, to be
kosher, after having run through the problem once with the adopted
$\sigma_{data,m}$, calculate the $\widehat{\chi}^2$; multiply all
$\sigma_{data,m}$ by $\widehat{\chi}$; and redo the problem so that the
new $\widehat{\chi}^2 = 1$ and then the derived variance
$\sigma_{x_0}^2$ is also correct.
If you are doing a nonweighted least squares fit, then you can
dispense with defining $\chi^2$ and derive the uncertainty from the
datapoints themselves. In that case...
In fact, from our experience with least squares, we expect $\Delta
x_0 \propto {\sigma_{x_0} \over \sqrt N}$, where $N$ is the number of
datapoints. So we expect something like the following, which is, in
fact, the correct expression:
\begin{equation} \label{onecoeff}
\Delta x_0^2 = {\sigma_{x_0}^2 \over N {\cal C}}
\end{equation}
{\bf CHECK THIS !!!!!!!!!!!!!}
\subsection{The Curvature Matrix}
One can show that the diagonal elements of our matrix ${\bf SS}$
are the curvatures, i.e. ${\cal C}_n = {{\bf SS}_{nn} \over N}$. This
makes us think that we might denote ${\bf SS}$ by the moniker {\it
curvature matrix}. This is close to the usual definition, for example in
{\it Numerical Recipes}, where their curvature matrix is
$\mathbf{\alpha} = { {\bf SS} \over \sigma_{data}^2} $. This makes
${\cal C}_n = {\sigma_{data}^2 \mathbf{\alpha}_{nn} \over N}$.
If there were only a single variable, or if ${\bf SS}$ were
diagonal, then the diagonal elements of its inverse are just the
reciprocals. In other words, for this special case, ${\bf SSI_{nn} =
SS_{nn}^{-1}}$ and ${\cal C}_{n} = {1 \over N \ SSI_{nn}}$; also,
equation \ref{onecoeff} holds. More generally, {\it for the particular
case} in which ${\bf SS}$ is diagonal, then the uncertainty in derived
coefficient $n$ is
\begin{equation} \label{onecoeffgen}
\sigma_n^2 = {\sigma^2 \over N {\cal C}_n}
\end{equation}
However, this is {\it not} true for more than one coefficient
and a nondiagonal covariance matrix. In that case, I {\it think} it's
true that ${\bf SSI_{nn} > SS_{nn}^{-1}}$; this means that ${\cal C}_{n}
> {1 \over N \ SSI_{nn}}$. Thus, equation \ref{onecoeffgen} {\it
under}estimates the error so that
\begin{equation} \label{onecoeffgenp}
\sigma_n^2 > {\sigma^2 \over N {\cal C}_n} \ .
\end{equation}
\noindent Fundamentally this is because the off-diagonal terms add
uncertainty to the derived coefficient because of the correlations among
the coefficient uncertainties.
To derive the coefficient error in brute force least squares,
you can use the curvature and equation \ref{onecoeff} if you have only a
single variable. Otherwise, you need to evaluate the covariance matrix;
if it is diagonal, you could use equation \ref{onecoeffgen}; but having
already evaluated the covariance matrix, you might as well do the
problem using the usual least squares technique.
\section{ Comparison with Numerical Recipes}
I learned least squares from Appendix A of Chauvenet (1863). He
didn't use $\chi^2$ and didn't use matrix techniques, but \S
\ref{sectionone} and \ref{sectiontwo} follow his development quite
closely. I wrote the first version of this document before knowing of
NM's treatment, which explains my orientation towards least squares
instead of chi-square. I'm fortunate in this approach because it made me
realize the pitfalls one can get into with chi-square, as I discuss in
\S \ref{chisqsection}.
On the other hand, NM describe the least squares approach with
some disdain in the discussion of equation (14.1.6********) and warn
that it is ``dangerous'' because you aren't comparing the residuals to
the intrinsic inaccuracies of the data. In astronomy, though, more often
than not you don't have an independent assessment of $\sigma_{data}$.
But you might know the relative weights, and this is a plus for
Chi-square fitting. In any case, heed our warnings about chi-square
fitting in \S \ref{chisqsection}.
In this writeup I have revised my old notation to agree,
partially, with NM's. This effort wasn't completely successful because I
didn't read NM very carefully before starting. To make it easier to
cross-reference this document with NM, I provide the following table of
correspondences (left of the double arrow is ours, to the right is
theirs):
\begin{mathletters}
\begin {equation}
{\bf X} \longleftrightarrow {\bf A}
\end{equation}
\begin{equation}
{\bf Y} \longleftrightarrow {\bf b}
\end{equation}
\begin{equation}
{\bf X^T \cdot X} = {\bf XX} = [\alpha] \longleftrightarrow
{\bf A^T \cdot A} = [\alpha]
\end{equation}
\begin{equation}
{\bf XX}^{-1} = {\bf XXI} = [\alpha]^{-1} \longleftrightarrow
[\alpha]^{-1} = [C] = {\bf C}
\end{equation}
\end{mathletters}
\section{ADDENDUM: FITTING A LINE WHEN BOTH X AND Y HAVE
UNCERTAINTIES}
We've mentioned that one of the essential assumptions of least
squares is that the independent variables are known with high precision
and the errors occur only in the measured data. Suppose you're fitting
two variables, $t$ and $\theta$, as in equation~\ref{one}. This
essential assumption means that $t$ is known with high precision and all
the uncertainty is in $\theta$, and you are minimizing the squares of
the residuals in the $\theta$-direction only.
If {\it both} variables have uncertainties, then you have to be
careful because the essential assumption is violated. Before doing
anything, read (lightly) Isobe, Feigelson, Akritas, and Babu (1990, ApJ
364, 104). Then adopt one of the two following procedures:
{\bf (1)} You may be able to reformulate the problem by defining
two new variables, one of which has much less fractional uncertainty
than the other. The most straightforward reformulation is to define
$x=(\theta + t)$ and $y=(\theta - t)$. The sum, $x$, has smaller
fractional uncertainty than $y$ so its dispersion {\it might} not matter
as much. In such a redefinition, you should include a scaling factor on
one of the original variables to make the two original dispersions about
equal.
Before proceeding with this option, re-read Isobe et al!
{\bf (2)} More generally, use the prescription in Isobe et al.
Lucky for me, I haven't had to delve into this method---but their
formulation looks good!
\references
\reference {} Bevington, P.R. \& Robinson, D. 1992, Data Reduction and
Error Analysis for the Physical Sciences (WCB/McGraw-Hill).
\reference {} Chauvenet, W. 1863, A Manual of Spherical and Practical
Astronomy, Dover Press.
\reference {} Press, W.H., Flannery, B.P., Teukolsky, S.A., \&
Vetterling, W.T. 1992, Numerical Recipes (second edition), Cambridge
University Press.
\reference {} Taylor, J.R. 1997, An Introduction to Error Analysis,
University Science Books.
\end{document}