
Fornew readers and those who request to be “好友 good friends”, please read my 公告栏 first.
Because of software problems, there are difficulties displaying equations and figures in this blog article. Until they are fixed, you can always open the original WORD document attached here.
Filtering Tutorial (Final).docx
Explaining Filtering (Estimation) In OneHour, Ten Minutes, One Minute, And One Sentence.
By
YuChi Ho
Harvard and Tsinghua University
Kalman filtering is arguably thecrowning achievement of modern system science and control (Kalman for his contribution received what might considered as the Nobel prize of Engineering, the Draper Prize, in 2008) More generally, estimation in the presence of noise and disturbance is a key ingredient in any scheme for the control and optimization of complex systems. In the past, I have touched on and explained various aspects of this general topics in my blog articles (http://blog.sciencenet.cn/blog156514253.html , http://blog.sciencenet.cn/blog1565426323.html ) as well as in published references years ago. However, these were done scattershot and some time ago. It seems worthwhile to collect all these material in one place and present them in a holistic manner with the latest references. At the same time it is an attempt to explain this material to the average scientific public without specialization.
For most of this article only highschool algebra background and no probability knowledge are needed for understanding. This include nonlinear filtering and estimation. As for probabilistic concepts and interpretations, the material contained in my five part tutorial on probability and stochastic processes (http://blog.sciencenet.cn/blog1565664051.html , http://blog.sciencenet.cn/blog1565665359.html ,http://blog.sciencenet.cn/blog1565666599.html ,http://blog.sciencenet.cn/blog1565669532.html ,http://blog.sciencenet.cn/blog1565671318.html ) in earlier blog articles are MORE THAN enough. While an appreciation of the elementary Bayes rule in probability theory will be useful, it will also be explainedas part of this article
We start with a nonprobabilisticand intuitive explanation Kalman filter that requires no previous knowledge or background. Consider the following simplest problem of estimation. We are givena series of measurement z_{1}, z_{2}, . . . , z_{k} (often denoted by the shorthand, of an unknown constant x. We assume the additive model
z_{i}= x + v_{i} i=1,2, . . .k (1)
where v_{i }are measurement noises. If nothing else is known, then everyone will agree that a reasonable estimate of x given the k measurements can be given by
(2)
in the sense we try to average outthe unknown measurement noises which can be either positive or negative in value (note if the noises have an unknown mean or average value, then this mean is no different than the unknown x and cannot be distinguished from x. We have what is technically called an unobservable situation. Exercise 1:can you explain the above remark to another reader? What will Eq.(2) determine if k>infinity?)
Now we can rewrite Eq.(2) by simple algebraic manipulation to get
(3)
Eq.(3) which is simply Eq.(2) expressed in recursive form has an interesting interpretation. It says that thebest estimate of x after k measurement is the best estimate of x after k1measurements plus a correction term. The correction term is the difference between what you expect to measure based on k1 measurement, i.e., and what you actually measure z_{k}. The difference can be due to two sources:
(i) is different from x, in which case you should use the difference to modify your new estimate
(ii) =x actually in which case you should ignore the difference which is due to noise v_{k}only.
The weighting factor, 1/k, for correction term expressed the proper tradeoff between these two considerations. At the beginning when k is small, one does not have good estimate of x and thus should pay attention to the correction term. As k become large, one has more and more confidence in our estimate and less attention should be paid to the correction (remember the Law of Large Numbers if you are already familiar with probabilities). In the limit when k>infinity and case (ii) is operative, we completely ignore the correction term.
If we label the correction 1/k as P_{k},then again simply algebraic manipulation can write the recursive form of P_{k }as
= (4)
Believe it or not, Eqs.(34) can berecognized as the “Kalman filtering” equations for this simple case . For those familiar with the literature on “stochastic approximation”, Eq.(3) is simplythe formula for successive approximation of the root of the function f(x) whichin this case is f(x)= x using the idea of gradient descent. Eq.(1) can also be recognized as the solution of the simple problem of least square fit
(5)
We can generalize this idea of least square (or stochastic approximation, or law of large numbers, or Kalman filter) in many, sophisticated and useful ways. Let us see how this can bedone.
First of all, suppose x is a vector, then Eqs.(34) need not be changed at all. In fact, we can generalize Eq.(1) to
=Hx+ i1, . . . , k (1’)
Where H is a matrix of scalefactors. Note here x and z need not be of the same dimension. For example, x can a sixvector of position and velocityof a vehicle in space and z is a one dimensional angular measurement of the vehicle with respect to a point on the surface of the earth (in this case H isa 1x6 row vector). Eqs.(34) now becomes
(3’)
(4’)
Where P_{k} is now a nxnmatrix and superscript “T” means “transpose” (note the validity of (3’4’) can be easily seen if we treat everything as scalars).
Eqs.(3’4’) can be further generalized by introducing dynamics. Suppose the unknownx evolves according to
x_{i}=Фx_{i1}, i=1,2, . . ., k with x_{1}=x (5)
We want to know how Eqs. (3'4') should be modified. Here the algebraic manipulation becomes a little messy but the principle (i.e. to do a least square fit) is still the same. We wish tofind a that is the least square fit to the single unknown x based on all the measurements. To simplify the manipulation we shall just solve the case when k=2, I.e., we wish to solve
( (6)
Directly minimizing for the optimalx as we get
+ ; (3”)
and
(4”)
Thus Eqs. (3'4') becomes Eqs.(3"4") Eq. (5) can also be further generalized by
x_{i}=Фx_{i1}+w_{i1 } (5’)
where w_{i} are disturbancenoises in the dynamics equation. Finally, let x be a nvector, and the noises and disturbances carry different weights “R” and “Q”, then we merely need to add transposes to some of the matrices, change the number 1 to the identity matrix, and insert Q and R weight matrices to get
+ ; (3*)
(4*)
Eqs.(3*4*) are of course the full blown Kalman filter equations. I cannot emphasize too strongly that the truth of the above statement requires no matrix theory but only messy matrix algebraic manipulations. Eqs. (3*4*) can also be derived directly again with messy matrix manipulation by minimizing the general least square criterion
(6*)
Where the matrices (or scalefactor) Q and R represent relative weighting of the noises w and v.
Solving this minimization problem(except for the involved matrix algebraic manipulation but no new concepts) again directly yield the well known general Kalman Filtering formula of Eqs. (3*4*)
A block diagram of Kalman filteringimplementation is shown below:
BLOCKDIAGRAM OF KALMAN FILTER
In this sense the essence of theKalman theory can be captured by the sentence: ”subtract out every known effect and average the left over.” This is the only thing and the best thing that can be done. Kalman filter issimply a mathematical implementation of the above sentence, a glorified least square fit, and an ultrasophisticated application of the law of large numbers. On probabilistic basis, avery satisfying Bayesian decision theoretic interpretation [Ho & Lee1965, Bryson and Ho 1969] is also possible. We shall do this below after extending the concept of least square fit to nonlinear cases.
( see Deterministic Nonlinearfiltering http://blog.sciencenet.cn/blog1565426323.html)
Now the Kalman filter strictlyspeaking is applicable only to linear systems with Gaussian disturbances and measurement noises. Since many real world problems are nonlinear and nonGaussian, the question about nonlinear filtering and estimation naturally arise and the related question as to “what shall we do in practice?”
The most popular approach to nonlinear estimation is what is known as
EXTENDED KALMAN FILTERING  Here you simply start with a guess estimate/trajectory of the nonlinear system. Linearize around this trajectory and apply the usual Kalman filtering formula to this linearized system. You canalso recursively update the estimated trajectory with new data andrelinearize. All very efficient. The only problem is that Kalman filtering is based on a unimodal (Gaussian) distribution. If your problem actually involves/ results inmultimodal distribution, and/or you have very bad nonlinearities that cannot be easily approximated by successive linearization, then this approach often will breakdown or lead to very bad estimates.
The second approach is NONLINEAR LEAST SQUARE FIT THE DATA– This approach goes back to our original thesis that optimal filtering is basically least square fit. Thus this approach is quite stable and works. The only problem is the fact that it is not real time. For each new measurement you need to solve a new two point boundary value problem. For problems where you only have few measurements and long intervals between observations, this approach is quite workable (e.g. on long space voyages). Otherwise, unless you update very infrequently, the computational load is quite high and not practical in real time.
The third approach is PARTICLE FILERING which we shall postpone its discussion to the probabilistic part of this tutorial
OK, Now in my opinion you know what are worth knowing about practical filtering using no probabilistic knowledge.
Now we shall rederive the formula above using probabilistic considerations:
Whenwe say a quantity x is a random variable, we mean that the value it actually takes on is unpredictable before the quantity is sampled (or looked at). We characterize this variability using the probability density function or pdf, p(x) which is the probability that the sample value may take on between thevalue of x and x+dx. They say probability is a precise way to deal with uncertainly. The pdf p(x) is this precise way since it is a deterministic function. Now a multivariable function is computationally cumbersome if not impossible to work with
http://blog.sciencenet.cn/blog156526889.html (This is a dirty secret of applied mathematics that is not often mentioned in papers and textbooks). To simplify description of r.v., we often use the rough characterization of a pdf by its mean and variance ( the average value and the spread of values around the mean). These basic concepts have all been previously explained in my probability tutorial (see the five tutorial articles mentioned earlier in part one).
In system studies, we are interested in what might be called inputoutput relationship. If we have a random input to a system, then the output will be random also. Given the input pdf we want to know the output pdf. In particular,we are interested in p(y) given p(x) when y=f(x). There are two probabilistic analogs of the Eqs.(3’4’) and Eqs. (5 or 5’).
(i) The propagation equation.Suppose x_{k}=f(x_{k1}, w_{k1}) which is a nonlinear input/output generalization of (5). Then given
p(x_{k1}) and p(w_{k1}),we get
(5”)
where are in principle known given p(and the relationship x_{k}=f(x_{k1}, w_{k1}).
(ii) The updating equation.Suppose z_{k} = h(x_{k}, v_{k}) which is a nonlinear generalization of Eq.(1’), then the Bayes rule says that the probabilistic analog of Eq.(4, or 4’, or 4*) is
p(x) (4”)
Where p(x) isknown as the prior pdf of x and p(xz) is the posterior pdf of x , i.e. x after or conditioned on the measurement z. Once again knowing p(v_{k}) andthe relationship z_{k} = h(x_{k}, v_{k}), we can in principle determine p(zx) and p(z)=.
Note we say “in principle” in connection with Eq. (4” and 5”). It is worthwhile to emphasize once more the dirty secret of applied mathematics http://blog.sciencenet.cn/blog156526889.html , namely it is basically computationally impossible to deal with multivariable functions in general. Eqs.(4”5”) are deceptively simple looking, but infeasible computationally if x, z, w and v are nvectors. Thus, we look for a parametric description of a pdf to enable thecomputation of Eqs. (4”5”). We introduce the concept of conjugate or reproducing pdfs.For example, if p(x_{k1}) and p(x_{k}x_{k1}) in Eq.(5”)are Gaussian distributed then p(x_{k}) in (5”) will also be Gaussian. We can then express Eq.(5”) in terms of a finite dimensional equation relating the posterior pdf parameters to that of the prior pdf. In fact, if p(x_{k1}) is Gaussian with mean ,
and x_{k}and x_{k1} are related linearly via Eq(5’) then p(x_{k}), can be verified via messy algebraic manipulation, that it is again Gaussian with mean and covariance given by Eqs.(3*4*). Similarly, we can show that the p(x)and p(xz) of Eq.(5) is also a reproducing or conjugate pair. If p(x) isGaussian, the p(xz) will also be Gaussian with the prior and posterior mean and covariance related by Eqs.(3*4*). In other words, Kalman filter is simplythe propagation and updating of equations (4”5”) specialized to the Gaussian reproducing case. In our mind’s eye we can visualize the propagation and updating of the pdf of the state of a dynamic system as illustrated in the Figure below:
A more pictorial diagram of the above figure can be found in my 45 year old textbook on pages 322328 and ingeneral in Chapter 12 of Applied OptimalControl.
In general, our dynamic system and observation equation will not be linear as in the strict Kalman filtering case. Then how do we deal with Eqs.(4”5”)? One obvious approach is to approximate the pdfs by discrete histograms (i.e., discrete number of samples) and then try to implement (4”5”) by brute force. This is known as
PARTICLE FILTERING  One of theleading practitioners of this approach is Dr. Fred Daum of Raytheon company with over two decades of real experience and success stories to his credit. So what is Particle Filtering (PF)?In one sentence – you try to propagateand update the multidimensional density function of the system state directly via samples (i.e., particles). Of course, you approximate the multidimensional density function by histograms made up by actual samples (i.e., particles) of the state vector. Propagate a histogram (i.e., the particles making up the histogram) is not that hard computationally (think about it. It is a simple Monte Carlo run involving the particles). But the problem of updating a histogram using a new measurement is not simple. First of all,to adequately represent the histogram you may need an exponentially large number of particles if the state dimension is high. This is a universal dirty secret applied mathematicians do not like to talk about (http://bbs.sciencenet.cn/home.php?mod=space&uid=1565&do=blog&id=26889 ). To make matters worse, the Bayes rule used in the updating formula requires you to multiply the prior density by the likelihood function which falls off rapidly around the actual measurement value. Thus, only a small percentage of the particles will get significant updating. Computationally this is rather inefficient. Daum’s expertise and contribution are the tricks he has developedto ameliorate this problem (in some recent work with Huang published in SPIE proceedings they have detailed their most successful techniques. See reference below) These still have their limits, namely, the curse of dimensionality. Fortunately, many real problems have only a six dimensional state vector (3 position and 3 velocity in aerospace guidance and control applications) which helps in computational load. In fact, byhindsight, the nonlinear filtering example in my half century old text book on page 381 http://www.amazon.com/AppliedOptimalControlOptimizationEstimation/dp/0891162283 is an extremely simple problem of particle filtering(it has only two possible states) except the problem/solution was not labeled as such since the name PF had not been invented yet some 50 years ago.
Thus, here it is the complete conceptual story of filtering or estimation as I know it:
1. It is nothing more than simple least square fit carry out in a sophisticated manner
2. It isbased on two simple probabilistic formula for the propagation and updating of pdfs
3. Kalmanfiltering is but cases 1 or 2 specialized to linear systems and/or reproducing Gaussian pdfs.
Depending on the audience and time available you can use select part of this article to explain the crowning achievement and Draper prize winning results of system control for the past 50+years.
Theoretically, to derive rigorously the nonlinear filtering equations in continuous time
require carefully applying measuretheoretic tools not easily accessible to the engineering public. But for practical use, discrete time computation makes such knowledge unnecessary since we do digital computation anyway. The fact we do not reference them here do not diminish their theoretical contribution.
References
Y.C. Ho, “OnStochastic Approximation Methods and Optimal
Filtering” J. of Math Anal. and Application ,Vol. 6, No. 1, Feb.
1963, pp 152154.
R.E. Kalman, Y.C. Ho, and K.S. Narendra,“Controllability of
Linear Dynamical Systems” Contributions to Diff.Equation. ,
Vol. I, Interscience , 1963, pp 189213.
Y.C. Ho and R.C.K. Lee, “A Bayesian Approach toProblems in
Stochastic Estimation and Control” , IEEE Trans.on Automatic
Control , Vol. 9, Oct. 1964, pp 333339
R.E. Kalman, “A New Approach to Linear Filteringand
Prediction Problems”,Transactions of theASME–Journal of
Basic Engineering, 82 (Series D): 1960, 3545.
R.E. Kalman, “Discovery and Invention: TheNewtonian Revolution in
System Technology“, Journal of Guidance andDynamics, Vol 26, #6,
NovDec. 2003, pp17
Fred Daum & Jim Huang,”How to avoid normalization ofparticle flow for nonlinear filters,Bayesian decisions and transport” Proceedings of SPIE 2014
Archiver手机版科学网 ( 京ICP备14006957 )
GMT+8, 2018222 18:58
Powered by ScienceNet.cn
Copyright © 20072017 中国科学报社