\documentclass[psfig,preprint]{aastex}
\makeatletter
\renewcommand\theequation{\thesection.\arabic{equation}}
\@addtoreset{equation}{section}
\renewcommand\thefigure{\thesection.\arabic{figure}}
\@addtoreset{figure}{section}
\renewcommand\thetable{\thesection.\arabic{table}}
\@addtoreset{table}{section}
\makeatother
\begin{document}
\setcounter{section}{13}
\title{LEAST-SQUARES AND CHI-SQUARE FOR THE BUDDING AFICIONADO: \\
ART AND PRACTICE }
\author{\copyright Carl Heiles \today }
\section{FITTING WHEN MORE THAN ONE MEASURED PARAMETER HAS UNCERTAINTIES}
\label{bothsection}
We've mentioned that one of the essential assumptions of least
squares is that the independent variables are known with high precision
and the errors occur only in the measured data. Suppose you're fitting
two variables, $t$ and $y$, as in equation~\ref{one}. This
essential assumption means that $t$ is known with high precision and all
the uncertainty is in $y$, and you are minimizing the squares of
the residuals in the $y$-direction only. If {\it both} variables
have uncertainties, then you have to be careful because the essential
assumption is violated. If you go ahead with a standard least-squares
fit when there are errors in both coordinates, the slope will be
systematically too small.
Thanks to Jefferys (1980), the ML formulation of this problem is
straightforward. Nevertheless, there is a lot of confusion on such
fitting and not-inconsiderable propagation of myth.
Before
reviewing Jefferys' formulation, let's see two approaches: \begin{enumerate}
\item Taylor \S 8.4 argues that you can account for $x$-variance
$\sigma_{x_m}^2$ by increasing the $y$-variance by the usual error
propagation, i.e.\ define an equivalent $y$-variance
$\sigma_{y_m}^2({\rm equiv}) = [\sigma_{y_m}^2 + (a_1\sigma_{x_m})^2]$, where
$a_1$ is the slope. This is equivalent to our results below.
\item Isobe et al.\ (1990, ApJ 364, 104) discuss the case
incorrectly. Look in particular at their Section V, where they make 5
numbered recommendations. Two of these are incorrect: \begin{enumerate}
\item Number 3 says, in essence, that if you have measurement
errors in $y$ but not in $x$, and want to predict $x$ from $y$ in some
future dataset, that you should least-squares fit the $x$ values (which
have no errors) to the $y$. This is {\it flat wrong}. Again, it leads to
a slope that is systematically too small. The proper procedure is to
fit $y$ to $x$ in the standard way, which is consistent with the ML
formulation and gives the right answer; then use the resulting
parameters, whose errors you know about, to predict $x$ from $y$ in the
future.
\item Number 4 says that if both $x$ and $y$ have errors, and your
main focus is finding the true slope, you should use their ``bisector''
method. I won't explain this method because the concept itself is wrong.
\end{enumerate}
\end{enumerate}
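Taylor's effective-variance prescription from item 1 above can be tried numerically. The sketch below is our own illustration, not from the text: all data and noise levels are invented, and the fit is iterated because the slope $a_1$ appears in the equivalent $y$-variance used for the weights.

```python
import numpy as np

rng = np.random.default_rng(1)
a0_true, a1_true = 1.0, 2.0            # invented true line y = a0 + a1 t
M = 400
t_star = np.linspace(0.0, 10.0, M)
sigma_t = rng.uniform(0.1, 1.0, M)     # invented per-point t uncertainties
sigma_y = np.full(M, 0.3)              # invented per-point y uncertainties
t = t_star + rng.normal(0.0, sigma_t)  # measured t, with errors
y = a0_true + a1_true * t_star + rng.normal(0.0, sigma_y)

a = np.zeros(2)                        # starting guess for (a0, a1)
A = np.column_stack([np.ones(M), t])
for _ in range(10):                    # iterate: the weights depend on a1
    w = 1.0 / (sigma_y**2 + (a[1] * sigma_t)**2)   # equivalent y-variance
    alpha = A.T @ (w[:, None] * A)     # weighted normal equations
    beta = A.T @ (w * y)
    a = np.linalg.solve(alpha, beta)
print(a)
```

After a few passes the weights stop changing and the coefficients settle; points with large $\sigma_t$ are simply downweighted.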
\subsection{A preliminary: Why the slope is systematically small}
Why is the derived slope systematically too small if you use the
standard least-squares technique when both variables have errors? To see
this, take a look back at equation \ref{normaltwo}, where we explicitly
write the normal equations for fitting a straight line of the form $A
s_m + B t_m = y_m$. To focus the discussion and make it easy,
replace that problem with a single-parameter solution for only the slope
$B$, and use the usual variables $(x,y)$ in place of $(t,y)$. Then
we are fitting the set of $M$ equations
\begin{mathletters}
\begin{equation}
B x_m = y_m \ .
\end{equation}
\noindent The set of two normal equations becomes just the single equation
\begin{equation}
B [x^2] = [ xy ] \ ,
\end{equation}
\noindent or, writing out the sums explicitly,
\begin{equation} \label{eqnforbone}
B = {\sum_{m=0}^{M-1} {x^*}_m y_m \over \sum_{m=0}^{M-1} {x^*}_m^2
} \ .
\end{equation}
\end{mathletters}
Here we use the star to designate the perfectly-known
independent variable ${x^*}_m$. It is important to realize that the $x_m$
that appear in this equation are the perfectly-known ones ${x^*}_m$; this
is a fundamental tenet of least-squares fitting, which comes from the
concept and principle of maximum likelihood (ML).
Because $B$ is defined by the ${x^*}_m$ and we are asking what
happens when we use the imperfectly known $x_m$ instead, let us reduce
the problem to the essence and imagine that $y_m$ is perfectly known,
i.e.\ $y_m={y^*}_m$; and that
\begin{equation} \label{deltaxeqnfirst}
{x^*}_m = x_m - \delta x_m \ ,
\end{equation}
\noindent where $\delta x_m$ is the observational error in point $m$. If
we do standard least squares on this situation, then we (incorrectly)
rewrite equation \ref{eqnforbone} to read
\begin{mathletters}
\begin{equation} \label{eqnforbtwo}
B = {\sum_{m=0}^{M-1} x_m y_m \over \sum_{m=0}^{M-1} x_m^2 } \ ,
\end{equation}
\noindent that is, using $x_m$ instead of ${x^*}_m$ (because we don't
know what ${x^*_m}$ is!). Substituting equation \ref{deltaxeqnfirst}, and
remembering that $y_m = {y^*}_m = B {x^*}_m$, we have
\begin{equation} \label{eqnforbthree}
B = {\sum_{m=0}^{M-1} {y^*}_m ({x^*}_m + \delta x_m) \over
\sum_{m=0}^{M-1} ({x^*}_m^2 + 2 {{x^*}}_m \delta x_m + \delta x_m^2)} \ .
\end{equation}
\end{mathletters}
Now all terms linear in $\delta x_m$ average to zero because the errors
are distributed symmetrically around zero. But the denominator contains
$\delta x_m^2$. The denominator is irrevocably increased by this term,
which decreases the derived value of $B$ from its true value. Yes, this
is only a second-order effect, but it matters---after all, $\chi^2$ is a
second-order quantity! Try some numerical experiments!
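Here is one such numerical experiment, a minimal sketch with invented numbers: generate perfect $y = Bx$ data, perturb only the $x$ values, and apply the single-parameter solution of equation \ref{eqnforbone} using the measured $x$'s.

```python
import numpy as np

rng = np.random.default_rng(42)
B_true = 2.0
M = 10000
x_star = rng.uniform(1.0, 10.0, M)     # perfectly known abscissae
y = B_true * x_star                    # exact ordinates (y_m = y*_m)
sigma_x = 2.0
x_meas = x_star + rng.normal(0.0, sigma_x, M)   # x with observational errors

# standard one-parameter least squares using the error-contaminated x's:
B_fit = np.sum(x_meas * y) / np.sum(x_meas ** 2)
print(B_true, B_fit)                   # B_fit comes out systematically low
```

The derived slope hovers near $B \langle {x^*}^2\rangle / (\langle {x^*}^2\rangle + \sigma_x^2)$, i.e.\ it is pulled low by exactly the $\delta x_m^2$ term in the denominator discussed above.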
\subsection{Jefferys' Method: Introduction}
Elsewhere we have regarded $y$ as the dependent variable with measured
errors and $x$ (and its multiple counterparts) as independent variables
that have no measured errors. But sometimes {\it all} of the measured
parameters have uncertainties. In this case the distinction between
``dependent'' and ``independent'' variables disappears and we need to
treat them symmetrically. Moreover, we can have an arbitrarily large
number of variables. This means we need to generalize our
notation---and our mathematical technique.
Our treatment follows Jefferys (1980) and we will adopt his notation in some
measure. To explain the method we will follow his example and present
the general case and also show the application to a specific example.
Figure \ref{jeff_fig} compares a conventional fit to $y=a_0 + a_1t$ with
a proper fit when both variables have uncertainties. The left-hand
panel is the conventional fit, in which the measurement uncertainties in
$t$ are set equal to zero; the right-hand panel includes the
uncertainties in $t$. Some important differences are apparent:
\begin{enumerate}
\item Errors on the left panel are completely specified by errorbars.
On the right panel we could use vertical errorbars for $y$ and
horizontal ones for $t$, but these would be sufficient only if the
errors were uncorrelated (as they are for points 0, 2, and 4). However,
when they are correlated we must use error {\it ellipses} that specify
$\chi^2=1$.
\item For the left panel, the best-fit points have the same $t$ values
as the datapoints but different $y$ values. For the right panel the
best-fit values differ from the data values for {\it both} variables.
\item If you look carefully, you'll see that the fitted slope on the left is a
bit smaller than that on the right. This systematic bias
arises from ignoring the errors in $t$.
\end{enumerate}
\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.8]{jeff_fig.ps}
\end{center}
\caption{Comparing a conventional fit to $y=a_0 + a_1t$ (left panel)
to a proper one when both measured variables have errors. On the right,
the ellipses denote the correlated errors in $(t,y)$; these are the
generalization of the errorbars on the left. The right-hand slope is a
bit steeper than the left-hand one.
\label{jeff_fig}}
\end{figure}
\subsection{The Data Matrix and Vector}
First, a word on notation. We distinguish between scalars, vectors, and
matrices as follows: for Roman letters, vectors are {\bf lower-case
bold} and matrices are {\bf UPPER-CASE BOLD}. For Greek letters,
scalars are the letter itself (e.g.\ $\alpha$), vectors are
single-underlined (e.g.\ $\underline{\phi}$), and matrices are
double-underlined (e.g.\
$\underline{\underline{\sigma}})$.
For the general case we have $M$ experiments. In each experiment we
measure $J$ parameters. We want to combine these $(M \times J)$
measurements to derive $N$ coefficients. We denote the matrix of
measured parameters by ${\bf X}$, which is $(M \times J)$ [using
conventional matrix notation, not IDL notation, in which the vertical
dimension (number of rows) is $M$ and the horizontal dimension (number
of columns) is $J$]. We will want to concatenate this matrix into a
one-dimensional vector ${\bf x}$ of length $(MJ)$. Specifically, the $M
\times J$ datapoint matrix is
\begin{mathletters}
\begin{eqnarray}
\mathbf {X} = \left[
\begin{array}{ccccc}
x_{0,0} & x_{0,1} & x_{0,2} & \dots & x_{0,J-1} \\
x_{1,0} & x_{1,1} & x_{1,2} & \dots & x_{1,J-1} \\
x_{2,0} & x_{2,1} & x_{2,2} & \dots & x_{2,J-1} \\
. & . & . & \dots & . \\
. & . & . & \dots & . \\
. & . & . & \dots & . \\
x_{M-1,0} & x_{M-1,1} & x_{M-1,2} & \dots & x_{M-1,J-1} \\
\end{array}
\ \right]
\end{eqnarray}
\noindent We don't use this big matrix in the solution. Instead, we
turn it into a vector in which the first $J$ elements are the data for
$m=0$, the second $J$ elements are for $m=1$, etc. So the vector has
dimensions $(MJ) \times 1$, like this:
\begin{eqnarray}
\mathbf {x} = \left[
\begin{array}{c}
x_{0,0} \\ x_{0,1} \\ x_{0,2} \\ . \\ . \\ . \\ x_{0,J-1} \\
x_{1,0} \\ x_{1,1} \\ x_{1,2} \\ . \\ . \\ . \\ x_{1,J-1} \\
x_{2,0} \\ x_{2,1} \\ x_{2,2} \\ . \\ . \\ . \\ x_{2,J-1} \\
\dots \\
\dots \\
\dots \\
x_{M-1,0} \\ x_{M-1,1} \\ x_{M-1,2} \\ . \\ . \\ . \\ x_{M-1,J-1} \\
\end{array}
\ \right]
\end{eqnarray}
\end{mathletters}
\noindent One important reason for writing the whole set of data as a
vector instead of a matrix is to make it possible to write the
covariance matrix for all measured data in the conventional form, as we
now discuss.
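As a concrete illustration (made-up numbers, with $J=2$ columns ordered $(y,t)$ as in the worked example below), this row-by-row unrolling is just a row-major reshape:

```python
import numpy as np

# invented measurements; columns ordered (y, t), J = 2
X = np.array([[1.1, 0.0],              # experiment 0: (y_d0, t_d0)
              [2.9, 1.0],              # experiment 1: (y_d1, t_d1)
              [5.2, 2.0]])             # experiment 2: (y_d2, t_d2)

# row-major unroll: the first J elements belong to m=0, the next J to m=1, ...
x = X.reshape(-1)
print(x)
```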
\subsection{ The Data Covariance Matrix and Defining Chi-Square}
The whole object of fitting is to minimize the chi-square. When all
measured parameters have uncertainties, their uncertainties can be
correlated. We have to generalize the definition of chi-square
accordingly.
First, suppose that the
observational errors in the datapoints are uncorrelated. Then the
intrinsic variance of each datapoint is described by a single number.
In our example, for uncorrelated observational errors we'd have the
variances in the $y$ values be $(\sigma_{y0}^2, \sigma_{y1}^2, \dots$),
and similarly for the $t$ values; this would give
\begin {equation}
\chi^2 = \sum_m \left( {\delta y_m^2 \over \sigma_{ym}^2} + {\delta t_m^2 \over \sigma_{tm}^2} \right) \ .
\end{equation}
However, it is common that errors are correlated. For example, if we
were fitting $y$ to a polynomial in $t$, then the errors in the various
powers of $t$ would certainly be correlated. More generally, then, the
{\it covariances} among the different measured values are nonzero.
These covariances are the off-diagonal terms in the covariance matrix.
Thus, if we denote the covariance matrix for the measured $\bf x$ values
by the $(MJ) \times (MJ)$ matrix $\underline{\underline{\sigma}}$, then
the most general case has $\underline{\underline{\sigma}}$ with no
zero elements.
Less general, but much more common, is the situation shown in Figure
\ref{jeff_fig}. Here, the covariances among the $J$ measured
parameters are nonzero for a {\it particular} experiment $m$, but the
covariances from one experiment $m$ to another are {\it zero}; in other
words, each experiment is completely independent of the others. In this
less general but very common case, the covariance matrix looks like
\begin{eqnarray} \label{sigma}
\underline{\underline{\sigma}} =
\left[
\begin{array}{cccc}
\underline{\underline{\sigma_0}} & {\mathbf 0} & {\mathbf 0} & \dots \\
{\mathbf 0} & \underline{\underline{\sigma_1}} & {\mathbf 0} & \dots \\
{\mathbf 0} & {\mathbf 0} & \underline{\underline{ \sigma_2}} & \dots \\
. \\
. \\
. \\
\end{array}
\; \right]
\end{eqnarray}
\noindent Here, each element (including the $\mathbf 0$ elements) is
itself a $J \times J$ matrix. For our specific example of \S
\ref{example}, $J=2$, so $\underline{\underline{\sigma_0}}$ is a
$2 \times 2$ covariance matrix of the form
\begin{eqnarray} \label{sigma0}
\underline{\underline{\sigma_0}} =
{\left[
\begin{array}{cc}
\sigma_{yy} & \sigma_{yt} \\
\sigma_{yt} & \sigma_{tt} \\
\end{array}
\; \right]}
\end{eqnarray}
\noindent Generally, the chi-square is given by (e.g.\ Cowan equation 2.39)
\begin{equation} \label{chisqj}
\chi^2 = \mathbf{ \delta x^T \cdot \underline{\underline {\sigma^{-1}}} \cdot \delta x}
\end{equation}
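A numerical sketch of this definition (all numbers invented): build the block-diagonal $\underline{\underline{\sigma}}$ of equation \ref{sigma} from $2 \times 2$ blocks of the form of equation \ref{sigma0}, then evaluate the quadratic form directly.

```python
import numpy as np

M, J = 3, 2                            # 3 experiments, 2 measured params each
# invented per-experiment block [[sigma_yy, sigma_yt], [sigma_yt, sigma_tt]]
block = np.array([[0.04, 0.01],
                  [0.01, 0.09]])
sigma = np.zeros((M * J, M * J))       # block-diagonal data covariance
for m in range(M):
    sigma[m*J:(m+1)*J, m*J:(m+1)*J] = block

# invented residuals delta-x, length MJ, interleaved (y, t)
dx = np.array([0.1, -0.05, 0.0, 0.2, -0.1, 0.05])
chi2 = dx @ np.linalg.inv(sigma) @ dx  # chi^2 = dx^T sigma^{-1} dx
print(chi2)
```

Because the covariance is block diagonal, the same $\chi^2$ is the sum of the per-experiment quadratic forms; the off-diagonal $\sigma_{yt}$ terms couple $\delta y_m$ and $\delta t_m$ within each experiment only.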
\subsection{Formulation of the Problem and its Solution with Lagrange
Multipliers}
We will be referring to various versions of the data parameters
$\mathbf{x}$ and derived parameters $\mathbf{a}$: {\it measured}, {\it
best-fit}, and (for the iterative solution) {\it guessed}. The
subscript $d$ denotes the set of {\it measured datapoints}, of
which there are $(JM)$. The subscript $*$ denotes the set of {\it
best-fit} quantities; these parameters include not only the datapoints
$\mathbf x$, but also the derived parameters $\mathbf a$. We will be
doing an iterative fit using {\it guessed} values of both the data and
derived parameters, represented by the subscript $g$.
We begin by writing {\it exact} equations for each measurement. The
fitted values, subscripted with stars, satisfy the {\it exact} equations
of condition
\begin{mathletters}
{\boldmath
\begin{equation} \label{star}
f(x_*, a_*) = 0
\end{equation}
}
\noindent This is an $M$-long vector of functions ${\bf f(x, a)= 0}$
(one row for each measurement). This set of $M$ equations doesn't do us
much good because we don't know the best-fit (starred)
values. Consequently, for the datapoints we define the difference
between the best-fit and measured data values
\begin{equation}
\mathbf {\delta x = x_d - x_*}
\end{equation}
\noindent This is the {\it negative} of Jefferys' definition of the
corresponding quantity $\mathbf {\hat{v}}$ in his section II.
With this, equation \ref{star} becomes
\begin{equation} \label{constraints}
\mathbf{ f(x_d - \delta x, a_*) = 0} \ .
\end{equation}
\end{mathletters}
\noindent Our goal is to solve these $M$ equations for the $(MJ)$
differences $\mathbf {\delta x}$ and the $N$ parameters $\mathbf {a_*}$
and, simultaneously, minimize $\chi^2$.
This is a classic minimization problem: we minimize
$\chi^2$ with respect to the $(MJ + N)$ values of $\mathbf{\delta x}$ and
$\mathbf{a}$, subject to the $M$ constraints of equation
\ref{constraints}. Such problems are solved using Lagrange
multipliers. Here, the $M$ Lagrange multipliers form the vector
$\underline{\lambda}$. We define the Lagrangian $\cal L$ as
\begin{equation}
{\cal L} = \left[ {1 \over 2} \mathbf {\delta x^T \cdot
\underline{\underline{\sigma^{-1}}} \cdot \delta x}\right] +
\left[ \mathbf {f^T(x_d - \delta x, a) \cdot \underline{\lambda}} \right] \ ;
\end{equation}
\noindent the $1 \over 2$ arises because, for a Gaussian pdf for the
errors, the residuals are distributed as $e^{-{\chi^2 \over 2}}$ (e.g.\
Cowan equation 2.28). We differentiate $\cal L$ with respect to each of
the unknowns $\mathbf {\delta x}$ and $\mathbf a$. The solution
provides $\bf a$, the vector of $N$ derived coefficients (so $\bf a$ is
defined as elsewhere in this tutorial) together with the $MJ$ fitted
datapoints. To proceed, we need the derivatives of $\bf f$ in equation
\ref{star} with respect to the vectors $\bf x$ and $\bf a$, as we now
discuss.
\subsection{The Derivative Matrices}
For the analysis we will need the derivatives of $\mathbf{f}$ with
respect to the vectors $\mathbf{x}$ and $\mathbf{a}$. The derivatives
will always be evaluated at the {\it guessed} values $\mathbf{x_g}$ and
$\mathbf{a_g}$. The derivative with respect to $\mathbf x$ is an $M
\times (MJ)$ matrix and looks like this [we take the case $(M=4,J=3)$ for
transparency; subscripts of $x$ are in the order $(m,j)$]:
%{\footnotesize
{\scriptsize
\begin{eqnarray}
\mathbf{ \left.{ \partial f(x, a) \over \partial x}\right|_{x_g, a_g} = }
\left[
\begin{array}{cccccccccccc}
{ \mathbf {\partial f} \over \partial x_{0,0}} &
{ \mathbf {\partial f} \over \partial x_{0,1}} &
{ \mathbf {\partial f} \over \partial x_{0,2}} &
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 &
{ \mathbf {\partial f} \over \partial x_{1,0}} &
{ \mathbf {\partial f} \over \partial x_{1,1}} &
{ \mathbf {\partial f} \over \partial x_{1,2}} &
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 &
{ \mathbf {\partial f} \over \partial x_{2,0}} &
{ \mathbf {\partial f} \over \partial x_{2,1}} &
{ \mathbf {\partial f} \over \partial x_{2,2}} &
0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &
{ \mathbf {\partial f} \over \partial x_{M-1,0}} &
{ \mathbf {\partial f} \over \partial x_{M-1,1}} &
{ \mathbf {\partial f} \over \partial x_{M-1,2}} \\
\end{array}
\right]
\end{eqnarray}
}
\noindent where all the derivatives are evaluated at $(\mathbf{ x_g,
a_g})$. Much easier is the derivative with respect to $\mathbf a$, which is an $M
\times N$ matrix and looks like
\begin{eqnarray}
\mathbf{ \left.{ \partial f(x, a) \over \partial a}\right|_{x_g, a_g} = }
\left[
\begin{array}{ccccc}
\mathbf{ \left. \partial f \over \partial a_0 \right|_{x_{g,0},a_g} } &
\mathbf{ \left. \partial f \over \partial a_1 \right|_{x_{g,0},a_g} } &
\mathbf{ \left. \partial f \over \partial a_2 \right|_{x_{g,0},a_g} } &
... &
\mathbf{ \left. \partial f \over \partial a_{N-1} \right|_{x_{g,0},a_g} } \\
\mathbf{ \left. \partial f \over \partial a_0 \right|_{x_{g,1},a_g} } &
\mathbf{ \left. \partial f \over \partial a_1 \right|_{x_{g,1},a_g} } &
\mathbf{ \left. \partial f \over \partial a_2 \right|_{x_{g,1},a_g} } &
... &
\mathbf{ \left. \partial f \over \partial a_{N-1} \right|_{x_{g,1},a_g} } \\
\mathbf{ \left. \partial f \over \partial a_0 \right|_{x_{g,2},a_g} } &
\mathbf{ \left. \partial f \over \partial a_1 \right|_{x_{g,2},a_g} } &
\mathbf{ \left. \partial f \over \partial a_2 \right|_{x_{g,2},a_g} } &
... &
\mathbf{ \left. \partial f \over \partial a_{N-1} \right|_{x_{g,2},a_g} } \\
& & . & & \\
& & . & & \\
& & . & & \\
\end{array}
\right]
\end{eqnarray}
\noindent where again, all the derivatives are evaluated at $(\mathbf{ x_g,
a_g})$.
\subsection{The Specific Example} \label{example}
We illustrate the above with this specific example, for which we fit a
first-order polynomial to $(t,y)$ of the form $[y = a_0 + a_1 t]$; the
coefficients are $(a_0, a_1)$ and $[{\bf f(x,a)}= y - a_0 - a_1t]$. For this
example, the matrix of measured parameters is
\begin{eqnarray}
\mathbf {X} = \left[
\begin{array}{cc}
y_{d0} & t_{d0} \\
y_{d1} & t_{d1} \\
y_{d2} & t_{d2} \\
. & \\
. & \\
. & \\
\end{array}
\ \right]
\end{eqnarray}
\noindent and the concatenated vector version is
\begin{eqnarray}
\mathbf {x} = \left[
\begin{array}{c}
y_{d0} \\
t_{d0} \\
y_{d1} \\
t_{d1} \\
y_{d2} \\
t_{d2} \\
. \\
. \\
. \\
\end{array}
\ \right]
\end{eqnarray}
\noindent The exact vector-of-functions equation $\mathbf{ f(x_*,a_*) =
0}$ uses the starred (best-fit) values, and is
\begin{eqnarray} \label{star0}
\mathbf{ f(x_*,a_*)} =
\left[
\begin{array}{c}
y_{*0} - a_{*0} - a_{*1} t_{*0} \\
y_{*1} - a_{*0} - a_{*1} t_{*1} \\
. \\
. \\
. \\
\end{array}
\; \right]
\; =
\left[
\begin{array}{c}
0 \\
0 \\
. \\
. \\
. \\
\end{array}
\; \right]
\end{eqnarray}
\noindent The derivative matrix of $\mathbf{f}$
with respect to the vector $\mathbf{x}$ is
\begin{eqnarray}
\mathbf{\left.{ \partial f(x, a) \over \partial x}\right|_{x_g, a_g} } =
\left[
\begin{array}{ccccccc}
1 & -a_{g1} & 0 & 0 & 0 & 0 & \dots \\
0 & 0 & 1 & -a_{g1} & 0 & 0 & \dots \\
0 & 0 & 0 & 0 & 1 & -a_{g1} & \dots \\
& & & . & & & \\
& & & . & & & \\
& & & . & & & \\
\end{array}
\; \right] \ ,
\end{eqnarray}
\noindent always evaluated at the {\it guessed} values of the parameters
(subscript $g$).
The derivative matrix of $\mathbf{f}$
with respect to the vector $\mathbf{a}$ is
\begin{eqnarray}
\mathbf{\left.{ \partial f(x, a) \over \partial a}\right|_{x_g, a_g} } =
\left[
\begin{array}{cc}
-1 & -t_{g0} \\
-1 & -t_{g1} \\
-1 & -t_{g2} \\
. \\
. \\
. \\
\end{array}
\; \right] \ ,
\end{eqnarray}
\noindent again always evaluated at the {\it guessed} values of the parameters
(subscript $g$).
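These two matrices are easy to build numerically. A sketch for the straight-line example, with invented guessed values:

```python
import numpy as np

t_g = np.array([0.0, 1.0, 2.0])        # invented guessed t values (M = 3)
a_g = np.array([1.0, 2.0])             # invented guessed (a0, a1)
M, J = len(t_g), 2

# df/dx: M x (MJ), one +1 and one -a1 per row, in that experiment's (y, t) slots
dfdx = np.zeros((M, M * J))
for m in range(M):
    dfdx[m, m * J] = 1.0               # d f_m / d y_m
    dfdx[m, m * J + 1] = -a_g[1]       # d f_m / d t_m

# df/da: M x N, columns are d f / d a0 = -1 and d f / d a1 = -t
dfda = np.column_stack([-np.ones(M), -t_g])
print(dfdx)
print(dfda)
```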
\subsection{The Solution to the Lagrangian: Two Matrix Equations}
Jefferys does all this\footnote{And more: he includes the
possibility for additional constraints, the second line in his equation
(7). This introduces additional complications and we ignore this
possibility in the interest of pedagogical simplification. For example,
his set of equations (10) has four equations, not two as we write here.}
for us and, after some algebraic manipulation, provides the two
following matrix equations (from the set of four in his equation 10):
\begin{mathletters}
\begin{equation} \label{feqn0}
\mathbf{\underline{\underline{\sigma^{-1}}} \cdot \delta x +
\left.{\partial f^T(x, a) \over \partial x} \right|_{x_d, a_*}
\cdot \underline{\lambda} = 0}
\end{equation}
\noindent This single matrix equation embodies $MJ$ individual
equations. Here we write $\mathbf{\partial f^T \over \partial x}$ instead
of $\mathbf{\partial f^T \over \partial \delta x}$ for clarity, and use
the fact that they are the negative of each other.
\begin{equation} \label{feqn1}
\mathbf{\left.{\partial f^T(x, a) \over \partial a} \right|_{x_d, a_*} \cdot
\underline{\lambda} = 0}
\end{equation}
\end{mathletters}
\noindent This matrix equation embodies $N$ individual equations. These
two matrix equations embody $MJ + N$ individual equations, which is
equal to the number of unknowns, so we can solve for them! In both of
these equations, the dimensions of the factors are ({\it Note the
transposes!}):
\begin{mathletters}
\begin{equation}
\underline{\underline{\sigma}} : (MJ) \times (MJ)
\end{equation}
\begin{equation}
\mathbf{\delta x} : MJ \times 1
\end{equation}
\begin{equation}
\mathbf{f^T} : 1 \times M
\end{equation}
\begin{equation} \label{remark0}
\mathbf{ {\partial f^T(x, a) \over \partial x} } : (MJ) \times M
\end{equation}
\begin{equation}
\underline{\lambda}: M \times 1
\end{equation}
\begin{equation} \label{remark1}
\mathbf{ {\partial f^T(x, a) \over \partial a} } : N \times M
\end{equation}
\end{mathletters}
\noindent Note that in \ref{remark0} above the dimension is $(MJ) \times M$: $M$
elements in $\mathbf f$, differentiated by $(MJ)$ different variables
$\mathbf x$; similarly for \ref{remark1} above.
\subsection{ Solving Equations \ref{feqn0} and \ref {feqn1} Iteratively}
Generally, these equations are nonlinear and we solve them iteratively using
guessed solutions. We denote the guessed values with subscript $g$, so
we have
\begin{mathletters} \label{iteration}
\begin{equation}
\mathbf{ a_g} = {\rm guessed \ values \ for} \ \mathbf{a_*}
\end{equation}
\begin{equation}
\mathbf{ x_g} = {\rm guessed \ values \ for} \ \mathbf{x_*}
\end{equation}
{\large \bf ITERATION STEP 1:} We define the difference
between the measured data quantities and their currently-guessed
counterparts
\begin{equation}
\mathbf{ \Delta x_g \equiv x_d - x_g}
\end{equation}
\end{mathletters}
\noindent Above, our $\mathbf{\Delta x_g}$ is the {\it negative} of Jefferys'
$\mathbf{\hat{v}}$. The nonlinear fit solves for corrections to these guesses,
which we denote by $\mathbf{\Delta x_{new}}$ (the negative of Jefferys'
$\mathbf{ \hat{ v}_{new}}$) and $\mathbf{\Delta a_{new}}$ (which is
identical to Jefferys' $\mathbf{ \hat{\underline{\delta}} }$).
{\large \bf ITERATION STEP 2:} We define the following: \begin{enumerate}
\begin{mathletters} \label{definitions}
\item The $M \times M$ weight matrix $\mathbf W$ is from Jefferys' equation (15):
\begin{equation}
\mathbf{ W \equiv \left[
\left.{\partial f(x, a) \over \partial x}\right|_{x_g,a_g} \cdot
\underline{\underline{\sigma}}
\cdot \left.{\partial f^T(x, a) \over \partial x}\right|_{x_g,a_g} \right] }^{-1}
\end{equation}
\item The $[0,0]$ element of his equation (17), which is equivalent to
our $\underline{\underline{\alpha}}$ (the curvature matrix) elsewhere in
this document:
\begin{equation} \label{alphaeqnj}
\mathbf{ \underline{\underline{\alpha}} \equiv
\left.{ \partial f^T(x, a) \over \partial a}\right|_{x_g, a_g} \cdot W
\cdot \left.{\partial f(x, a) \over \partial a}\right|_{x_g, a_g} }
\end{equation}
\noindent $\mathbf{ \underline{\underline{\alpha}}}$ is $N \times N$.
\item The modified equations of condition from his equation (18) (we
have a $+$ instead of his $-$ because $\mathbf{\Delta x_g = -\hat{v}}$)
\begin{equation}
\mathbf{ \underline{\phi_g}
\equiv
f(x_g, a_g) + \left(\left.{\partial f(x, a) \over \partial x}\right|_{x_g,a_g} \right)
\cdot \Delta x_g }
\end{equation}
\noindent $\mathbf{ \underline{\phi_g}}$ is $M \times 1$.
\end{mathletters}
\end{enumerate}
{\large \bf ITERATION STEP 3:} The solutions follow directly. The
matrix equation for $\mathbf{\Delta a_{new}}$ is Jefferys' equation (17)
\begin{equation}
\mathbf{ \underline{\underline{\alpha}} \cdot \Delta a_{new} =
- \left.{\partial f^T(x, a) \over \partial a}\right|_{x_g, a_g}
\cdot W \cdot \underline{\phi_g} }
\end{equation}
\noindent which is solved for $\mathbf{\Delta a_{new}}$ in the conventional
way by premultiplying both sides by
$\underline{\underline{\alpha}}^{-1}$. The matrix equation for the new,
corrected $\mathbf{\Delta x_g}$ is Jefferys' equation (19)
\begin{equation}
\mathbf{ \Delta x_{new} = -\underline{\underline{\sigma}} \cdot
\left.{\partial f^T(x, a) \over
\partial x}\right|_{x_g,a_g} \cdot W \cdot
\left( \underline{\phi_g} + \left. {\partial f \over \partial a}\right|_{x_g,a_g}
\cdot \Delta a_{new} \right) }
\end{equation}
{\large \bf ITERATION STEP 4:} The previous two equations provide
corrections to the values $\mathbf{x_d}$ and $\mathbf{a_g}$. One applies
them (Jefferys equation 20):
\begin{mathletters}
\begin{equation}
\mathbf{ a_{g, new} = a_g + \Delta a_{new}}
\end{equation}
\begin{equation}
\mathbf{ x_{g, new} = x_d + \Delta x_{new} }
\end{equation}
\end{mathletters}
\noindent {\it Note that $\mathbf{ \Delta x_{new}}$ is applied to $\mathbf{ x_d}$,
not to $\mathbf{x_g}$}.
{\large \bf ITERATION STEP 5:} Return to {\bf Iteration Step 1}, using
these new values $\bf a_{g,new}$ and $\bf x_{g,new}$ in place of
$\mathbf{a_g}$ and $\mathbf {x_g}$. Iterate until convergence. Convergence
occurs when all derived corrections $\mathbf {\Delta a_{new}}$ and
$\mathbf {\Delta x_{new}}$ become negligible. [How do you define
``negligible''\dots]
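The five iteration steps can be sketched end-to-end for the straight-line example of \S \ref{example}. This is a minimal illustration, not Jefferys' own code: the data, noise levels, and initial guesses are all invented, and the measurement errors are taken as uncorrelated, so $\underline{\underline{\sigma}}$ is diagonal.

```python
import numpy as np

rng = np.random.default_rng(7)
M, J, N = 8, 2, 2                       # experiments, params/experiment, coeffs
a_true = np.array([1.0, 2.0])           # invented true (a0, a1)
t_star = np.linspace(0.0, 7.0, M)
sig_y, sig_t = 0.2, 0.2                 # invented, uncorrelated errors
y_d = a_true[0] + a_true[1] * t_star + rng.normal(0.0, sig_y, M)
t_d = t_star + rng.normal(0.0, sig_t, M)

x_d = np.column_stack([y_d, t_d]).reshape(-1)      # data vector, interleaved (y, t)
sigma = np.diag(np.tile([sig_y**2, sig_t**2], M))  # diagonal covariance here

def f_of(x, a):                                    # equations of condition
    return x[0::2] - a[0] - a[1] * x[1::2]         # y - a0 - a1 t

a_g = np.array([0.0, 1.0])              # crude initial guess for a_*
x_g = x_d.copy()                        # initial guess for x_*
for _ in range(20):
    # STEP 1: difference between measured and guessed data
    dxg = x_d - x_g
    # STEP 2: the derivative matrices, W, alpha, phi at the current guesses
    Fx = np.zeros((M, M * J))
    Fx[np.arange(M), np.arange(M) * J] = 1.0           # d f_m / d y_m
    Fx[np.arange(M), np.arange(M) * J + 1] = -a_g[1]   # d f_m / d t_m
    Fa = np.column_stack([-np.ones(M), -x_g[1::2]])    # d f / d a
    W = np.linalg.inv(Fx @ sigma @ Fx.T)
    alpha = Fa.T @ W @ Fa                              # curvature matrix
    phi = f_of(x_g, a_g) + Fx @ dxg
    # STEP 3: the corrections
    da = -np.linalg.solve(alpha, Fa.T @ W @ phi)
    dx = -sigma @ Fx.T @ W @ (phi + Fa @ da)
    # STEP 4: apply them (note that dx is applied to x_d, not to x_g)
    a_g = a_g + da
    x_g = x_d + dx

param_errors = np.sqrt(np.diag(np.linalg.inv(alpha)))  # errors from alpha^{-1}
print(a_g, param_errors)
```

For these nearly linear equations of condition the corrections $\mathbf{\Delta a_{new}}$ shrink to roundoff within a few iterations; a production version would test them against a tolerance instead of running a fixed count.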
\subsection{Taking all those derivatives!}
Why not use numerical derivatives? With complicated functions they are
much easier than grinding out the analytic expressions!
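For example, a central-difference sketch (the function names, step size, and test values below are our own, not from Jefferys):

```python
import numpy as np

def numdiff_x(f, x, a, eps=1e-6):
    """Central-difference d f / d x: an M x (MJ) matrix, no algebra required."""
    f0 = f(x, a)
    D = np.zeros((len(f0), len(x)))
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps
        xm[j] -= eps
        D[:, j] = (f(xp, a) - f(xm, a)) / (2.0 * eps)
    return D

# straight-line example: f = y - a0 - a1 t, x interleaved as (y0, t0, y1, t1)
def f_line(x, a):
    return x[0::2] - a[0] - a[1] * x[1::2]

x = np.array([1.0, 0.0, 3.1, 1.0])     # invented guessed data values
a = np.array([1.0, 2.0])               # invented guessed coefficients
D = numdiff_x(f_line, x, a)
print(D)
```

For this linear example the central difference reproduces the analytic matrix (rows of $+1$ and $-a_1$) to roundoff; for nastier functions the step size needs a little care.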
\subsection{The Initial Guess}
Getting started is the hard part unless the problem is
relatively easy. One way is to use conventional least squares to derive
the initial guesses $\mathbf {a_g}$. To do this, you designate one
particular variable as the dependent variable and all the others as the
independent ones (for which you assume no errors). Do the conventional
least squares solution for the parameters $\mathbf a$ and use these as
the initial guesses. A good choice for the dependent variable is the
one with the largest errors, if there is one; otherwise the choice is
more-or-less arbitrary. For the initial guesses of the data parameters
$\mathbf {x_g}$, just use the measured data values $\mathbf {x_d}$. If
this scheme doesn't work, you're on your own!
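A sketch of this recipe with invented measurements: treat $y$ as the dependent variable, ignore the $t$ errors, do an ordinary least-squares line fit for the initial $\mathbf{a_g}$, and start $\mathbf{x_g}$ at the measured data.

```python
import numpy as np

# invented measurements; y is taken as the dependent variable
t_d = np.array([0.0, 1.1, 1.9, 3.2, 4.0])
y_d = np.array([1.2, 2.9, 5.1, 7.2, 9.1])

# conventional least squares for y = a0 + a1 t, ignoring the t errors
A = np.column_stack([np.ones_like(t_d), t_d])
a_g, *_ = np.linalg.lstsq(A, y_d, rcond=None)    # initial guess for (a0, a1)

x_g = np.column_stack([y_d, t_d]).reshape(-1)    # x_g starts at the measured x_d
print(a_g)
```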
\subsection{The Covariance Matrix (and errors) of the Derived Parameters}
The matrix $\underline{\underline{\alpha}}$, which is defined above in
equation \ref{alphaeqnj}, is the conventionally-defined curvature matrix
and its inverse $\underline{\underline{\alpha^{-1} }}$ is the
conventionally-defined covariance matrix. Because we have formulated
this problem as a chi-squared one, the elements of this matrix give the
covariance directly. {\it Thus, the errors in the derived parameters can
be taken as the square-root of the diagonal elements of
$\underline{\underline{\alpha^{-1} }}$}.
One usually wants to calculate the chi-square of the solution.
This involves not only the best-fit parameters $\mathbf a_*$ but also
the best-fit datapoints $\mathbf x_*$. To do this, evaluate equation
\ref{chisqj} with $\mathbf{ \Delta x_g}$ in place of $\mathbf {\delta x}$.
This is the same as Jefferys' equation (43).
The expectation value of $\chi^2$
is the number of degrees of freedom. In one sense this is just like the usual
least-squares solution: it's equal to the number of datapoints minus
the number of derived parameters. Here the number of datapoints is not
the number of experiments $M$; rather, it's $JM$. So the number of
degrees of freedom is $JM - N$.
\end{document}