*I am very busy right now. I will complete this post later.*

Liu's book is indeed awesome. The author himself has made many original contributions to sampling algorithms, and many of his ideas are explained in the book. The chapters on the Gibbs Sampler (chapter 6), General Conditional Sampling (chapter 7) and multi-chain MCMC (chapters 10, 11) are excellent. But given that the book was written 10 years ago, many recent developments are missing, and some algorithms were not given enough space (in particular, Reversible Jump MCMC got only 2 pages!).

The handbook covers many recent developments such as likelihood-free MCMC and Adaptive MCMC, and treats Reversible Jump MCMC with the detail it deserves (20 pages). Likelihood-free MCMC is a computational method for Approximate Bayesian Computation (ABC), which has recently gained much attention. Instead of computing the likelihood ratio in the acceptance probability of the Metropolis–Hastings (MH) algorithm, one tries to estimate this ratio by generating simulated data. Adaptive MCMC is a general name for a class of MCMC algorithms whose parameters can be tuned automatically during the search. For example, consider an MCMC kernel that consists of a Gibbs sampler and an MH sampler: with probability $\alpha$ one performs the Gibbs move, and with probability $1-\alpha$ one performs the MH move. The problem is that $\alpha$ had to be fixed, while we often want $\alpha$ to change along the course of the search, for example larger at the beginning in order to make large transitions around the parameter space and quickly identify promising areas, and smaller later to better explore the local landscape around good candidates. One cannot change $\alpha$ carelessly, since there is no guarantee that the resulting MCMC kernel would still converge to the desired distribution.
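To make the idea concrete, here is a minimal sketch (not from the handbook; the target density and all names are illustrative) of one common adaptive scheme: a random-walk Metropolis sampler whose proposal scale is tuned on the fly toward a target acceptance rate, with the adaptation strength shrinking over time — one standard way to avoid destroying convergence.

```python
import math
import random

def adaptive_rwm(log_target, x0, n_iter=5000, target_accept=0.44, seed=0):
    """Random-walk Metropolis whose proposal scale is tuned on the fly.

    The adaptation step shrinks like 1/sqrt(t) ("diminishing adaptation"),
    so the kernel stabilizes and the chain can still converge.
    """
    rng = random.Random(seed)
    x, log_s = x0, 0.0            # log of the proposal standard deviation
    samples = []
    for t in range(1, n_iter + 1):
        prop = x + math.exp(log_s) * rng.gauss(0.0, 1.0)
        accept_prob = min(1.0, math.exp(log_target(prop) - log_target(x)))
        if rng.random() < accept_prob:
            x = prop
        # nudge the scale toward the target acceptance rate, more gently over time
        log_s += (accept_prob - target_accept) / math.sqrt(t)
        samples.append(x)
    return samples

# Example: sample from a standard normal, discarding a burn-in period
samples = adaptive_rwm(lambda x: -0.5 * x * x, x0=10.0)
mean = sum(samples[1000:]) / len(samples[1000:])
```

The `1/sqrt(t)` decay is one concrete choice satisfying the "diminishing adaptation" condition discussed in the adaptive MCMC literature.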

The book starts with theoretical chapters discussing many aspects of MCMC. Each chapter is written by distinguished researchers in the corresponding field, so one can expect to learn from the experience and perspectives of many experts in just one book. The chapters that pique my interest right now are Reversible Jump MCMC by Fan and Sisson, Adaptive MCMC by Jeffrey Rosenthal, Hamiltonian MCMC by Radford Neal, and Likelihood-free MCMC by Sisson and Fan.

The second part of the book consists of applications and case studies, with examples ranging from educational research to high-energy astrophysics. This is certainly a richer set of examples than Liu's book, which was biased toward Liu's research field. But to tell the truth, I am not that interested in reading about MCMC in educational research or astrophysics.

On a final note, the book does *not* cover all the new developments in MCMC up to 2011. The biggest missing part is perhaps Particle MCMC, which is gaining more and more attention. Some other notable algorithms that I really wish had been mentioned are the equi-energy sampler and Riemann Manifold MCMC. The equi-energy sampler is a simple-yet-powerful algorithm introduced in 2004. The sampler has a large memory to remember where it has been, and uses this information to speed up convergence. I think a detailed discussion would help beginners like me better understand its strong and weak points compared to Population MCMC. As for Particle MCMC and Riemann Manifold MCMC, I think intuitive introductory discussions would help non-experts like me gather enough courage to delve into those equation-packed papers.

1. Orthogonal projection onto optimal hyperplane:

Given $n$ $d$-dimensional column vectors $x_1, \dots, x_n$ (assume pre-processing is done so that they are centered, i.e. $\sum_{i=1}^n x_i = 0$).

Suppose that we want to project these vectors orthogonally onto a unit vector $w$, so that the projected vectors will be $\hat{x}_i = (w^\top x_i)\, w$.

If we think of the projected vectors $\hat{x}_i$ as an approximation of the original $x_i$, then a measure of how good the overall approximation is is the sum of lengths of the discrepancies $x_i - \hat{x}_i$. So it is natural to adopt the following sum of squares as the objective function, and judge $w$ based on how small it can make the objective function: $$J(w) = \sum_{i=1}^n \left\| x_i - (w^\top x_i)\, w \right\|^2.$$

Expanding (and using $w^\top w = 1$) yields $$J(w) = \sum_{i=1}^n \|x_i\|^2 - \sum_{i=1}^n (w^\top x_i)^2.$$

Let $S = \sum_{i=1}^n x_i x_i^\top$. Choosing $w$ to minimize $J(w)$ is equivalent to choosing $w$ to maximize $w^\top S w$ under the condition $\|w\| = 1$. With a change of variable, one can prove that the optimal $w$ is the (normalized) eigenvector $u_1$ of $S$ corresponding to the largest eigenvalue of $S$.
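As a quick numerical illustration (a sketch with made-up toy data; `numpy` assumed), the optimal direction is the top eigenvector of $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy centered data, stretched along the direction (1, 1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 2.0], [2.0, 3.0]])
X -= X.mean(axis=0)

S = X.T @ X                          # scatter matrix S = sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(S) # eigenvalues in ascending order
w = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
# for this data, w should line up with (1, 1) / sqrt(2)
```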

Suppose that after we have fixed $w$ at $u_1$, another unit vector $v$ with the property $v \perp u_1$ is given, and we are asked how to choose the best $v$ if the same orthogonal projection is carried out onto $v$. By the same reasoning, the best $v$ is the one maximizing $v^\top S v$ subject to $\|v\| = 1$ and $v \perp u_1$. Again with a change of variable, it can be proved that the best $v$ is the (normalized) eigenvector $u_2$ of $S$ corresponding to the second largest eigenvalue of $S$.

A natural question to ask is whether $u_1$ and $u_2$ are still the best if we consider orthogonal projection onto the plane spanned by two orthonormal unit vectors $w_1, w_2$.

In other words, we consider the following objective function $$J(w_1, w_2) = \sum_{i=1}^n \left\| x_i - (w_1^\top x_i)\, w_1 - (w_2^\top x_i)\, w_2 \right\|^2$$

and ask whether the following statement holds or not: $$(u_1, u_2) = \operatorname*{arg\,min}_{w_1,\, w_2} J(w_1, w_2)$$

subject to $\|w_1\| = \|w_2\| = 1$ and $w_1 \perp w_2$.

By the Pythagorean theorem, $J(w_1, w_2) = \sum_i \|x_i\|^2 - \sum_i (w_1^\top x_i)^2 - \sum_i (w_2^\top x_i)^2$, so minimizing $J$ amounts to maximizing the two quadratic forms over orthonormal pairs, and the statement is indeed true. So in fact the plane spanned by $u_1, u_2$ is the optimal plane among all $2$-dimensional planes.

This property can be generalized to the case of a $k$-dimensional hyperplane ($k < d$). The optimal (in the orthogonal-projection sense) hyperplane in this case is spanned by the first $k$ eigenvectors $u_1, \dots, u_k$ of $S$ (in the order of descending eigenvalues). If we define $U = [u_1 \; u_2 \; \cdots \; u_k]$ then the projection matrix onto the optimal hyperplane is $P = U U^\top$. (One can prove this generalized property by considering an arbitrary orthonormal basis $w_1, \dots, w_k$ and the corresponding projection matrix $W W^\top$, with $W$ defined as $W = [w_1 \; \cdots \; w_k]$, then expanding the objective function with the condition $W^\top W = I_k$.)

The $\hat{x}_i = U U^\top x_i$ are $d$-dimensional vectors, but are completely confined to the $k$-dimensional subspace spanned by $u_1, \dots, u_k$. In other words, the “effective” dimension of $\hat{x}_i$ in this case is only $k$. For this reason, in a typical Principal Component Analysis, one is not interested in the $\hat{x}_i$ themselves, but in the coordinates of $\hat{x}_i$ in the new coordinate system defined by the orthonormal basis $u_1, \dots, u_k$. The coordinates of $\hat{x}_i$ in this system are $u_1^\top x_i, \dots, u_k^\top x_i$. Therefore, further analyses after PCA often work directly with the $k$-dimensional column vector $z_i$ defined as $$z_i = U^\top x_i.$$

On a side note, the $z_i$ have a diagonal sample covariance matrix (i.e. the dimensions of $z_i$ are uncorrelated). Since $x$ has (sample) covariance matrix $\frac{1}{n} S$ and $z$ is a linear transformation of $x$ with $z = U^\top x$, the covariance matrix of $z$ is $\frac{1}{n} U^\top S U = \frac{1}{n} \operatorname{diag}(\lambda_1, \dots, \lambda_k)$, which is a diagonal matrix.
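The whole construction can be checked numerically (again a toy-data sketch with `numpy`): build $U$ from the top $k$ eigenvectors of the scatter matrix, form the new coordinates $z_i = U^\top x_i$, and verify that the scatter matrix of the $z_i$ is diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X -= X.mean(axis=0)              # center, as assumed throughout

S = X.T @ X                      # scatter matrix
eigvals, eigvecs = np.linalg.eigh(S)
U = eigvecs[:, ::-1][:, :2]      # first k = 2 eigenvectors, descending eigenvalues

Z = X @ U                        # rows are the new coordinates z_i = U^T x_i
cov_Z = Z.T @ Z                  # scatter of the z_i; should be diagonal
```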

2. PCA as a solution of a ridge regression problem:

[the need for sparse PCA here]

The above formulation of PCA based on orthogonal projection onto optimal hyperplanes is actually a ridge regression formulation, and the final reformulation for SparsePCA will be very close to this.

[sparse PCA ]


Paper: Introduction to Nonparametric Bayesian Models (Naonori Ueda, Takeshi Yamada)

1. Generative model: assign a probabilistic model to the process that generates the data.

Latent variable model: add latent variables to obtain a model with a higher degree of freedom.

For the clustering problem, the representative model is the mixture model with $p(x) = \sum_{k=1}^K \pi_k\, p(x \mid \theta_k)$. The latent variable $z_i = k$ if $x_i$ was generated from class $k$, $k = 1, \dots, K$. The component distribution $p(x \mid \theta_k)$ usually used is the Normal.

- Frequentist: fix $K$ in advance; the parameters are $(\pi, \theta)$. Then use EM to find the maximum of the likelihood.

- Parametric Bayesian: still fix $K$ in advance.

a. Put a prior on the component parameters $\theta_k$: each $\theta_k$ is generated according to a base distribution $G_0$.

b. Put a prior on the mixing weights $\pi$ with a Dirichlet($\alpha$) distribution (conjugate with the multinomial). As for $\alpha$, one could go one more level up if desired, but usually this is enough.

c. Having set this up, find the parameters that maximize the posterior (also by EM?).

2. **Nonparametric Bayesian model**: the *Dirichlet Process Mixture* (DPM) model works with an infinite number of classes. Broadly it is like the parametric Bayesian model: generate infinitely many $\theta_k$ according to the base distribution $G_0$, and generate $\pi_k$ so that $\sum_{k=1}^\infty \pi_k = 1$. Then $G = \sum_{k=1}^\infty \pi_k\, \delta_{\theta_k}$ will be a discrete distribution, where $\delta$ is the Kronecker delta.

Before talking about generating data from $G$, let us first look at how to choose (generate) the $\pi_k$, the probability that a data point will belong to class $k$.

a. **Stick-Breaking Process (SBP)**: since the $\pi_k$ must sum to 1, there are many ways to split; SBP is one of them. Imagine a wooden stick of length 1. SBP breaks pieces off this stick one by one.

- Generate $v_k$ according to a Beta($1, \alpha$) distribution; $v_k$ will lie in $(0, 1)$.
- Break off a piece $\pi_k$ whose length equals $v_k$ times the remaining length of the stick after $k - 1$ breaks: $\pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j)$.

Properties: when $\alpha$ is small the early pieces tend to be large, and vice versa.

$\pi_k$ decreases exponentially in $k$: at first the stick is long and easy to break; the further along, the shorter the stick and the harder it is to break.

Using SBP, it is clear that the normalizing condition $\sum_k \pi_k = 1$ is satisfied, but one has to generate enough of the $\pi_k$ up front (?).

b. **Chinese Restaurant Process (CRP)**:

There is no inherent problem in our desire to escalate our goals, as long as we enjoy the struggle along the way. The problem arises when people are so fixated on what they want to achieve that they cease to derive pleasure from the present.

*Partition function* (also called normalizing constant): $Z = \sum_{s} e^{-\beta E(s)}$, where $\beta = 1/(k_B T)$ and the sum runs over all spin configurations $s$.

Potential energy: $E = -J \sum_{\langle i,j \rangle} s_i s_j - B \sum_i s_i$, where $s_i = \pm 1$ are the spins.

The symbol $\langle i,j \rangle$ means that $i$ and $j$ are a neighboring pair, $J$ is the interaction strength, and $B$ is the external magnetic field.

In zero field ($B = 0$) we can also define the potential energy as $E = -J \sum_{\langle i,j \rangle} s_i s_j$.

We define the expectation of $E$ with respect to the Boltzmann distribution $p(s) = e^{-\beta E(s)}/Z$ as the *internal energy*: $U = \langle E \rangle$.

The *free energy* of the system is $F = -k_B T \ln Z$.

The *specific heat* of the system is $C = \partial U / \partial T$.

The *mean magnetization per spin* is $m = \frac{1}{N} \left\langle \sum_{i=1}^N s_i \right\rangle$.
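These quantities can be checked on a tiny lattice by brute-force enumeration (a sketch assuming units with $k_B = 1$, $J = 1$, zero field, and free boundary conditions; the lattice size and function name are made up):

```python
import itertools
import math

def ising_exact(L=3, J=1.0, T=2.0):
    """Exact Z, internal energy U and free energy F for a tiny L x L Ising
    lattice (zero field, free boundaries) by enumerating all 2^(L*L) states."""
    beta = 1.0 / T
    # nearest-neighbor bonds on the grid, spins indexed row-major
    bonds = []
    for i in range(L):
        for j in range(L):
            if j + 1 < L:
                bonds.append((i * L + j, i * L + j + 1))      # horizontal bond
            if i + 1 < L:
                bonds.append((i * L + j, (i + 1) * L + j))    # vertical bond
    Z, E_weighted = 0.0, 0.0
    for spins in itertools.product((-1, 1), repeat=L * L):
        E = -J * sum(spins[a] * spins[b] for a, b in bonds)
        w = math.exp(-beta * E)
        Z += w
        E_weighted += E * w
    U = E_weighted / Z          # internal energy <E>
    F = -T * math.log(Z)        # free energy, with k_B = 1
    return Z, U, F

Z, U, F = ising_exact()
```

Since $F = U - TS$ with entropy $S > 0$, the computed free energy should always sit below the internal energy; this makes a convenient sanity check for the enumeration.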

1.

Let , we have

Thus

2.


Before performing an experiment, ask yourself (or your instructor, if you dare) these questions:

+ What is this experiment all about? What are its objectives?

+ Are these objectives really worth achieving? Why do you think so?

+ By what theory or reasoning should we believe the experiment will really fulfill the objectives? Are there any reasons to fear it will **not** fulfill them?

+ Can the objectives be achieved by other methods?

If so, what is the best method (in the sense of the highest probability of achieving the full objectives), and why are we choosing **this** method?

If not, what makes the objectives so hard to achieve, and why can **this** method overcome the obstacles while others can’t?

Nature is too complicated for us to model with finite equations. All mathematical models are wrong.

Do not expect to find (new) simple mathematical models which are useful in real-world problems. If they do exist, the chances are that they have already been discovered, or that they are not so useful.

Do not expect to find relations (simple or complicated) at the small scale, or in a quantitative manner. The only choice is to look for relations in the whole picture: a large-scale relation in a qualitative manner is more feasible.

Do not cook up mathematical models for the problem without testing them against a large amount of data, or without a good theory to explain why they must be like that. Finding meaning among seemingly meaningless data is surely a big deal, but resisting the temptation to cook up an artificial meaning is no less important.

**Solution:** Let be a Markov chain on a state space consisting of states: , where the chain reaches state if and only if .

The transition matrix is for and . The claim is equivalent to

We know that if a Markov chain has an irreducible, aperiodic transition matrix $P$ and an invariant distribution $\pi$, then $P^t(i, j) \to \pi(j)$ as $t \to \infty$ for all states $i, j$ of the state space.

We will prove that the chain defined above is irreducible and aperiodic, and then find the invariant distribution of its transition matrix (it will turn out to be exactly the distribution we need).

**Exercise 7.21:** Consider a Markov chain on the states $0, 1, \dots, n$, where for $i < n$ we have $P_{i,i+1} = 1/2$ and $P_{i,0} = 1/2$. Also, $P_{n,n} = 1/2$ and $P_{n,0} = 1/2$. This process can be viewed as a random walk on a directed graph with vertices $0, \dots, n$, where each vertex has two directed edges: one that returns to $0$ and one that moves to the vertex with the next higher number (with a self-loop at vertex $n$). Find the stationary distribution of this chain. (This example shows that random walks on directed graphs are very different from random walks on undirected graphs.)

**Solution: **
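As a numerical sanity check (a sketch assuming each vertex's two outgoing edges are taken with probability 1/2 each; `numpy` assumed), the stationary distribution can be computed from the eigenvector of $P^\top$ at eigenvalue 1:

```python
import numpy as np

def stationary(n):
    """Stationary distribution of the directed-graph walk on states 0..n:
    from state i < n, go to 0 or to i+1 with probability 1/2 each;
    from state n, go to 0 or stay at n with probability 1/2 each."""
    P = np.zeros((n + 1, n + 1))
    for i in range(n):
        P[i, 0] = 0.5
        P[i, i + 1] = 0.5
    P[n, 0] = 0.5
    P[n, n] = 0.5
    # solve pi P = pi, sum(pi) = 1, via the eigenvector of P^T at eigenvalue 1
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

pi = stationary(5)
```

The numbers match the pattern $\pi_i = 2^{-(i+1)}$ for $i < n$ and $\pi_n = 2^{-n}$: every state sends half of its mass to $0$, so $\pi_0 = 1/2$, and each step up the chain halves the mass, with the self-loop at $n$ forcing $\pi_n = \pi_{n-1}$.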