
(New readers and those sending friend requests: please read my bulletin board first.)

Probability and Stochastic Process Tutorial (1)

Probability is often characterized as “**a precise way to deal with our ignorance or uncertainty**”. Everyone has an intuitive understanding of the question “what are the chances of (something happening)?”. A stochastic process then deals with probabilities over time (or over some other independent, indexed variable such as distance). There exist a number of excellent or classic textbooks on probability and stochastic processes. The subject is one of my favorite oral examination questions, which I always tell students beforehand to prepare, and in my opinion it is also among the most useful tools of an applied mathematician and/or engineer (see http://blog.sciencenet.cn/home.php?mod=space&uid=1565&do=blog&id=13708 and http://blog.sciencenet.cn/home.php?mod=space&uid=1565&do=blog&id=656455).

Yet in my experience it is also one of the most confusing subjects for many students to learn. Why?

In this series of blog articles (of which this is the first) I shall try to explain the subject in my own way, drawing on my experience in learning it. **The main purpose of these articles, I hope, is that reading them will make the subject matter more approachable and less imposing. They are NOT meant to replace the many excellent textbooks on the subject.** I write not in the rigorous style required of a scholarly textbook but more in the spirit of a teacher engaged in a face-to-face session with a student. The treatment will be highly informal, but it should let the big picture come across more easily. Hopefully, it will even make it possible to read and gain insight from textbooks and articles written in measure-theoretic language. My approach will be strictly from a user's point of view, requiring nothing beyond freshman calculus and the ability to visualize n-dimensional space as a natural generalization of our familiar 3-D space. So here goes . . .

Let us start by making one simplifying assumption which, for people interested in practical applications, is not at all restrictive. This is the

**Finiteness Assumption** (FA) – We assume there is no INFINITELY large number, i.e., no infinity, but there can be very large numbers, e.g. 10^100 (a number estimated to be larger than the total number of atoms in the universe). If one deals only with real computation on digital computers, this assumption is automatically satisfied. By making this assumption we assume away all the measure-theoretic terminology that populates the theoretical probability literature and confuses the uninitiated.

With assumption FA in place, we now define a random variable.

**Random Variable (r.v.)** – a random variable is a variable that may take on any one of a finite number of values when sampled (i.e., looked at). We characterize an r.v. by specifying its **histogram**. A histogram spells out what percentage of the time the r.v. takes on each value (or falls in each sub-range of values) within its range. Fig. 1 is a typical histogram. It is actually the histogram of a random variable, namely the readership (or hits) of my blog articles over the past four years.

Fig. 1 Histogram of readership of my blog articles (2009-2013): x-axis is # of hits, y-axis is # of articles in this hit range

Note that when each bar of the histogram is expressed as a percentage of the total, the bars sum to one, or 100%, i.e., with probability one (for sure) the r.v. takes on a value somewhere in the total range. While the range of values this r.v. may take on is finite by virtue of **assumption FA**, completely specifying an r.v. can still take a great deal of data. (In fact, it took me about 3 hours to collect the data and make this graph, which is why I did not compile the data for all 5+ years of my blog life.) This is inconvenient in computation. To simplify the description (specification) we develop two common rough characterizations.
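To make the idea of a histogram concrete, here is a minimal Python sketch. The hit counts and the bin width of 2000 are invented for illustration and are not the actual readership data behind Fig. 1; the function name `histogram` is likewise just a convenient label.

```python
from collections import Counter


def histogram(samples, bin_width):
    """Map each bin's lower edge to the fraction of samples in that bin."""
    counts = Counter((s // bin_width) * bin_width for s in samples)
    total = len(samples)
    return {edge: count / total for edge, count in sorted(counts.items())}


# Hypothetical hit counts for ten articles (invented, not the real blog data).
hits = [1200, 1800, 2500, 2700, 3100, 3300, 3900, 4200, 5600, 8800]
hist = histogram(hits, bin_width=2000)

for edge, fraction in hist.items():
    print(f"{edge:>5} to {edge + 1999:<5}: {fraction:.0%} of articles")

# The bars add up to one: the r.v. surely lands somewhere in the range.
print(f"sum of bars: {sum(hist.values()):.2f}")
```

Even for this toy case the full specification is one number per bar, which is why rougher summaries are attractive.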

The **Mean** of an r.v. – Intuitively, if you imagine a cardboard cutout in the shape of the histogram, then the mean of the r.v. is the value along the x-axis at which a knife edge, placed perpendicular to the x-axis, will balance the cardboard shape. Mathematically, it is simply the average number of hits per article. ScienceNet in fact computes this value for all bloggers and displays the top 100 bloggers. My own current average happens to be 4130 per article and ranks 26th on the list.

**Variance** of an r.v. - This is a measure of the spread of the histogram. A small variance roughly means the histogram is mostly concentrated in a small range of numbers around its mean, and vice versa for a large variance. It is a measure of the variability of the values of the r.v. In stock market terminology, the β of a stock plays a similar role as a measure of its volatility (strictly speaking, β measures how strongly the stock's daily fluctuations track those of the market as a whole, rather than the variance of the stock alone). Mathematically, variance is called the **second central moment** of the histogram.

Now we can develop further rough characterizations of the histogram by defining what are called its higher central moments, such as the **skewness** of the histogram, which is based on the **third central moment**. But in practice such higher moments are rarely needed, nor are data on these moments often available.
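The mean, the variance, and the higher central moments are all weighted sums over the bars of the histogram. The following Python sketch shows the arithmetic; the four-bar histogram is a made-up example, and the function names are only illustrative.

```python
def mean(hist):
    """Balance point of the histogram: sum of value * probability."""
    return sum(value * prob for value, prob in hist.items())


def central_moment(hist, order):
    """Average of (value - mean) ** order, weighted by probability."""
    m = mean(hist)
    return sum((value - m) ** order * prob for value, prob in hist.items())


# Hypothetical histogram: value -> fraction of the time (fractions sum to 1).
hist = {1: 0.2, 2: 0.5, 3: 0.2, 6: 0.1}

print("mean                 :", round(mean(hist), 4))
print("variance (2nd moment):", round(central_moment(hist, 2), 4))
print("3rd central moment   :", round(central_moment(hist, 3), 4))
```

In this example the third central moment comes out positive, reflecting the long tail toward the large value 6 on the right of the mean.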

So much for a single r.v. But we often have to deal with more than one random variable. Let us consider two r.v.s, x and y. Now the histogram of the pair of random variables x and y becomes a 3D object. Graphically it looks like a multi-peak terrain map (think of Guilin in Guangxi province in south China, or the skyscrapers of Manhattan island in New York). But here a new concept intrudes. It is called the “**joint probability**” of the r.v.s x and y, or their “**correlation/covariance**” in the case of an approximate specification. It captures the relationship, if any, between the r.v.s. We are all familiar with the notion that smart parents tend to produce smart children. If we represent the intelligence of the parents as r.v. x and that of the child as r.v. y, then mathematically we say y is positively correlated with x. If we look down on the 3D histogram of x and y, we shall see the peaks scattered along a northeast-to-southwest direction, as illustrated in Fig. 2.

Fig. 2 Bird’s-eye view of 3D histogram with correlation

In other words, knowing the value of y will give a different idea about the probable value of x. More generally we say x and y are NOT **independent** but **correlated**. Mathematically we denote the joint probability p(x,y) (i.e., the histogram) as a general function of the two variables (a 3D surface). We also define the **conditional probability** of x given the value of y as

p(x/y) ≡ p(x,y)/p(y) or p(y/x) ≡ p(x,y)/p(x)

where p(y) and p(x), called the **marginal probabilities** of y and x respectively, are simply the 2D histograms that result when we collapse the 3D histogram onto the y or x axis respectively. Graphically, the conditional probability p(x/y) is simply the 2D histogram one sees in a cross-sectional view of the 3D histogram at the particular value of y. Mathematically we need to divide p(x,y) by p(y) to normalize the values so that p(x/y) will still have area equal to one (100%), satisfying the definition of a histogram.
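Collapsing and slicing a joint histogram can be sketched in a few lines of Python. The 2-by-2 joint table below is invented purely for illustration (x and y each take the values 0 and 1), and the helper names `marginal` and `conditional_x_given_y` are assumptions of this sketch.

```python
# Hypothetical joint histogram p(x, y): keys are (x, y), values sum to 1.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.15, (1, 1): 0.45,
}


def marginal(joint, axis):
    """Collapse the joint histogram onto one axis (0 for x, 1 for y)."""
    out = {}
    for pair, prob in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + prob
    return out


def conditional_x_given_y(joint, y_value):
    """Cross-section at y = y_value, divided by p(y) so it sums to one."""
    p_y = marginal(joint, axis=1)[y_value]
    return {x: prob / p_y for (x, y), prob in joint.items() if y == y_value}


p_x = marginal(joint, axis=0)
p_x_given_y1 = conditional_x_given_y(joint, y_value=1)
print("p(x)      :", {x: round(p, 3) for x, p in p_x.items()})
print("p(x / y=1):", {x: round(p, 3) for x, p in p_x_given_y1.items()})
```

In this made-up table, learning that y = 1 shifts the histogram of x noticeably toward x = 1, which is exactly what correlation means.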

Now it is possible that the bird’s eye view of the 3D histogram is a rectangle (vs. the view of Fig. 2). In other words, p(x/y)=p(x) no matter which value of y we choose. In this case, by the definition of p(x/y), we have p(x,y)=p(y)p(x). We say the r.v.s x and y are **independent**. Intuitively this matches the notion that knowing y does not tell us anything new about the probable values of x, and vice versa for y when knowing x. Computationally, this reduces a function of two variables to a product of single-variable functions, a great simplification when n random variables are involved.
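One way to check independence numerically is to compare every entry of the joint histogram with the product of the two marginals. The sketch below does this in Python; both tables are hypothetical, and the tolerance `tol` is an arbitrary choice to absorb floating-point rounding.

```python
def marginal(joint, axis):
    """Collapse the joint histogram onto one axis (0 for x, 1 for y)."""
    out = {}
    for pair, prob in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + prob
    return out


def is_independent(joint, tol=1e-9):
    """True if p(x, y) equals p(x) * p(y) for every pair of values."""
    p_x = marginal(joint, axis=0)
    p_y = marginal(joint, axis=1)
    return all(
        abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
        for x in p_x
        for y in p_y
    )


# Hypothetical single-variable histograms, multiplied to form a joint one.
p_x = {0: 0.4, 1: 0.6}
p_y = {0: 0.45, 1: 0.55}
independent_joint = {
    (x, y): px * py for x, px in p_x.items() for y, py in p_y.items()
}

# Hypothetical joint histogram with the same marginals but with correlation.
correlated_joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}

print("product table independent?   ", is_independent(independent_joint))
print("correlated table independent?", is_independent(correlated_joint))
```

Note that the two tables have the same marginals, so the marginals alone cannot tell them apart; only the joint histogram reveals the relationship.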

To roughly characterize two general r.v.s we have a mean vector (the means of x and y) and a 2x2 covariance matrix, with the variances of x and y on the diagonal and the symmetric covariance in the off-diagonal positions:

$$\begin{bmatrix} \sigma_{x}^{2} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{y}^{2} \end{bmatrix}$$
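The mean vector and the covariance matrix follow from the joint histogram by the same weighted-sum arithmetic used for a single r.v. Here is a small Python sketch under the assumption of a made-up 2-by-2 joint table; the function name `mean_and_covariance` is illustrative only.

```python
def mean_and_covariance(joint):
    """Return ([mean_x, mean_y], 2x2 covariance matrix) of a joint histogram."""
    mean_x = sum(x * p for (x, y), p in joint.items())
    mean_y = sum(y * p for (x, y), p in joint.items())
    var_x = sum((x - mean_x) ** 2 * p for (x, y), p in joint.items())
    var_y = sum((y - mean_y) ** 2 * p for (x, y), p in joint.items())
    cov_xy = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())
    # The matrix is symmetric: the same covariance fills both off-diagonal slots.
    return [mean_x, mean_y], [[var_x, cov_xy], [cov_xy, var_y]]


# Hypothetical joint histogram p(x, y): keys are (x, y), values sum to 1.
joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}

means, cov = mean_and_covariance(joint)
print("mean vector:", [round(m, 4) for m in means])
for row in cov:
    print([round(entry, 4) for entry in row])
```

The positive off-diagonal entry says that x and y tend to be large together, which is the numerical counterpart of peaks running from southwest to northeast in the bird's-eye view.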

To summarize, we have so far introduced the following concepts:

1. Random variable characterized by histograms

2. Rough characterization of histograms by mean and variance

3. Joint probability (3D histogram) of two r.v.s

4. Independence and conditional probability

5. Covariance matrix

Now suppose we have n r.v.s [x_{1}, x_{2}, . . . , x_{n}] instead of two. Everything I said about the two r.v.s applies. We merely have to change 2D and 3D to n and n+1 dimensions. The mean of n r.v.s becomes an n-vector and the covariance matrix is an nxn matrix. In your mind’s eye you can visualize everything in n dimensions the same way as in Fig. 1 and Fig. 2. The joint probability (histogram) p(x_{1}, x_{2}, . . . , x_{n}) is a function of n variables. And if the n variables are independent of each other, we write p(x_{1}, x_{2}, . . . , x_{n})=p(x_{1})p(x_{2}). . . p(x_{n}). No new concepts are involved.

**Concept-wise, believe it or not, these in my opinion are all you need to know about probability and stochastic processes to function in the engineering world, even if your interest is academic and theoretical.** In my 46 years of active research and engineering consulting in stochastic control and optimization, I never had to go beyond the knowledge described above. The following articles will simply illustrate and explain how to apply these ideas to more practical uses.

Computationally, because of exponential growth, dealing with an arbitrary n-variable function is impossible (http://blog.sciencenet.cn/blog-1565-26889.html). Data-wise, it also involves an astronomically large amount of data. To simplify notation, at least theoretically, we make a continuous approximation of these discrete data and introduce continuous variables and functions. To emphasize, for our purpose this is only a convenient approximation and simplification. No new ideas are involved. This will be the content of the next article. Beyond introducing continuous variables, we also need to develop various special cases of joint probability structures to simplify description and calculation; subsequent articles will address these issues. Once again, let me emphasize that from my viewpoint these simplifications and special cases are needed for computational feasibility and practicality. Nothing conceptually new is involved.
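The exponential growth is easy to see by counting table entries. As a rough Python sketch, assume each of the n r.v.s can take k distinct values (k = 10 here is an arbitrary illustrative choice): a general joint histogram needs one number per combination of values, whereas n independent r.v.s need only n separate single-variable histograms.

```python
def table_sizes(n, k):
    """Count the numbers needed to store a histogram of n r.v.s with k values each.

    Returns (entries in a general joint table, entries if all are independent).
    """
    return k ** n, n * k


for n in (2, 10, 100):
    general, independent = table_sizes(n, k=10)
    print(f"n={n:>3}: general joint table {general:.0e} entries, "
          f"independent case {independent} entries")
```

With 100 variables of 10 values each, the general table already has 10^100 entries, while the independent case needs only 1000 numbers. This is why special structures of the joint probability matter so much in practice.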

http://blog.sciencenet.cn/blog-1565-664051.html

