Limit theorems
Chebyshev’s inequality
$$\sigma^2=\int_{-\infty}^{\infty}(x-\mu)^2 f_X(x)\,dx \;\ge\; \int_{-\infty}^{\mu-c}(x-\mu)^2 f_X(x)\,dx+\int_{\mu+c}^{\infty}(x-\mu)^2 f_X(x)\,dx \;\ge\; c^2\int_{-\infty}^{\mu-c}f_X(x)\,dx+c^2\int_{\mu+c}^{\infty}f_X(x)\,dx \;=\; c^2\,P(|X-\mu|\ge c)$$
$$P(|X-\mu|\ge c)\le\frac{\sigma^2}{c^2}\qquad\Longrightarrow\qquad P(|X-\mu|\ge k\sigma)\le\frac{1}{k^2}$$
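As a quick sanity check (not from the notes), the following sketch compares the empirical tail probability with the Chebyshev bound $1/k^2$, using an assumed exponential distribution:

```python
import numpy as np

# Monte Carlo check of Chebyshev's bound: P(|X - mu| >= k*sigma) <= 1/k^2
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # illustrative choice: mean 1, std 1
mu, sigma = x.mean(), x.std()

for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= k * sigma)
    print(f"k={k}: empirical={empirical:.4f}   Chebyshev bound={1/k**2:.4f}")
```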
Convergence of the sample mean (weak law of large numbers)
$$M_n=\frac{X_1+\cdots+X_n}{n}$$
$$E[M_n]=\frac{E[X_1]+\cdots+E[X_n]}{n}=\frac{n\mu}{n}=\mu,\qquad \mathrm{var}(M_n)=\frac{\mathrm{var}(X_1)+\cdots+\mathrm{var}(X_n)}{n^2}=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}\to 0$$
$$P(|M_n-\mu|\ge\varepsilon)\le\frac{\mathrm{var}(M_n)}{\varepsilon^2}=\frac{\sigma^2}{n\varepsilon^2}\to 0$$
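A minimal simulation sketch of this convergence, assuming i.i.d. Uniform(0,1) samples (so $\mu=1/2$, $\sigma^2=1/12$):

```python
import numpy as np

# Check that var(M_n) = sigma^2 / n and that P(|M_n - mu| >= eps) shrinks with n
rng = np.random.default_rng(1)
mu, sigma2, eps = 0.5, 1 / 12, 0.02

for n in (10, 100, 1_000):
    M = rng.uniform(size=(10_000, n)).mean(axis=1)     # 10,000 realizations of M_n
    print(f"n={n:5d}  var(M_n)={M.var():.2e}  sigma^2/n={sigma2/n:.2e}  "
          f"P(|M_n-mu|>=eps)={np.mean(np.abs(M - mu) >= eps):.4f}")
```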
Example
For a Bernoulli($p$) process, $\sigma_X^2=p(1-p)\le\tfrac14$.
We want 95% confidence that the error is smaller than 1%, i.e. $P(|M_n-\mu|\ge 0.01)\le 0.05$.
$$P(|M_n-\mu|\ge 0.01)\le\frac{\sigma_{M_n}^2}{(0.01)^2}=\frac{\sigma_X^2}{n(0.01)^2}\le\frac{1}{4n(0.01)^2}\le 0.05\quad\Longrightarrow\quad n\ge 50{,}000$$
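A sketch that computes this bound and checks, for an assumed true $p=0.5$, how conservative it is:

```python
import numpy as np

eps, alpha = 0.01, 0.05
n = int(np.ceil(1 / (4 * alpha * eps**2)))            # Chebyshev requirement: n >= 50,000
print("n required by Chebyshev:", n)

rng = np.random.default_rng(2)
p = 0.5                                               # illustrative true parameter
M = rng.binomial(n, p, size=5_000) / n                # 5,000 realizations of M_n
print("empirical P(|M_n - p| >= 0.01):", np.mean(np.abs(M - p) >= eps))
```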
Scaling of $M_n$
$$S_n=X_1+\cdots+X_n \;\Rightarrow\; \mathrm{var}(S_n)=n\sigma^2,\qquad M_n=\frac{S_n}{n}\;\Rightarrow\;\mathrm{var}(M_n)=\frac{\sigma^2}{n},\qquad \frac{S_n}{\sqrt n}\;\Rightarrow\;\mathrm{var}\!\left(\frac{S_n}{\sqrt n}\right)=\frac{\mathrm{var}(S_n)}{n}=\sigma^2$$
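The three scalings can be verified numerically; a sketch with assumed standard normal summands ($\sigma^2=1$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1_000, 10_000
S = rng.standard_normal((reps, n)).sum(axis=1)        # reps realizations of S_n, sigma^2 = 1

print("var(S_n)         ≈", round(S.var(), 1), "   (theory: n * sigma^2 =", n, ")")
print("var(M_n)         ≈", round((S / n).var(), 5), "(theory: sigma^2 / n =", 1 / n, ")")
print("var(S_n/sqrt(n)) ≈", round((S / np.sqrt(n)).var(), 3), " (theory: sigma^2 = 1)")
```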
Central limit theorem
Standardize $S_n=X_1+\cdots+X_n$:
$$Z_n=\frac{S_n-E[S_n]}{\sigma_{S_n}}=\frac{S_n-nE[X]}{\sqrt{n}\,\sigma}$$
$$P(Z_n\le c)\to P(Z\le c)=\Phi(c),\qquad Z\sim N(0,1)$$
The statement is that the CDF of $Z_n$ converges to the standard normal CDF; it says nothing about convergence of the PDF or PMF.
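A sketch of this CDF convergence with assumed exponential(1) summands, comparing $P(Z_n\le c)$ with $\Phi(c)$ (scipy assumed available):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, reps = 100, 100_000
X = rng.exponential(scale=1.0, size=(reps, n))        # mean 1, sigma 1
Z = (X.sum(axis=1) - n * 1.0) / (np.sqrt(n) * 1.0)    # standardized sums

for c in (-1.0, 0.0, 1.0, 1.96):
    print(f"c={c:+.2f}   P(Z_n <= c) ≈ {np.mean(Z <= c):.4f}   Phi(c) = {norm.cdf(c):.4f}")
```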
Example:
$n=36$, $p=0.5$: find $P(S_n\le 21)$
Exact answer:
$$P(S_n\le 21)=\sum_{k=0}^{21}\binom{36}{k}\left(\frac12\right)^{36}=0.8785$$
Method 1: Central limit theorem
$$Z_n=\frac{S_n-nE[X]}{\sqrt n\,\sigma}=\frac{S_n-np}{\sqrt{np(1-p)}}$$
$$E[S_n]=np=36\cdot 0.5=18,\qquad \sigma_{S_n}^2=np(1-p)=9$$
$$P(S_n\le 21)\approx P\!\left(\frac{S_n-18}{3}\le\frac{21-18}{3}\right)=P(Z_n\le 1)=\Phi(1)=0.8413$$
Method 2: 1/2 correction for binomial approximation
$$P(S_n\le 21)=P(S_n<22)=P(S_n\le 21.5)\qquad\text{since } S_n \text{ is integer-valued}$$
$$P(S_n\le 21.5)\approx P\!\left(Z_n\le\frac{21.5-18}{3}\right)=P(Z_n\le 1.17)=0.879$$
This is much closer to the exact answer 0.8785.
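Both methods and the exact answer can be reproduced with `scipy.stats` (a sketch):

```python
import numpy as np
from scipy.stats import binom, norm

n, p, k = 36, 0.5, 21
mu, sd = n * p, np.sqrt(n * p * (1 - p))              # 18 and 3

exact    = binom.cdf(k, n, p)                         # sum_{j<=21} C(36,j)(1/2)^36
clt      = norm.cdf((k - mu) / sd)                    # method 1: plain CLT
clt_half = norm.cdf((k + 0.5 - mu) / sd)              # method 2: 1/2 correction
print(f"exact={exact:.4f}   CLT={clt:.4f}   CLT with 1/2 correction={clt_half:.4f}")
```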
De Moivre-Laplace CLT
When 1/2 correction is used, CLT can also approximate the binomial pmf, not just the binomial CDF.
$$P(S_n=19)=P(18.5\le S_n\le 19.5)\approx P(0.17\le Z_n\le 0.5)=P(Z_n\le 0.5)-P(Z_n\le 0.17)=0.124$$
Exact answer: $P(S_n=19)=\binom{36}{19}\left(\frac12\right)^{36}=0.125$
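The PMF approximation, checked the same way (a sketch):

```python
import numpy as np
from scipy.stats import binom, norm

n, p, k = 36, 0.5, 19
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact  = binom.pmf(k, n, p)                                     # C(36,19)(1/2)^36
approx = norm.cdf((k + 0.5 - mu) / sd) - norm.cdf((k - 0.5 - mu) / sd)
print(f"exact={exact:.4f}   De Moivre-Laplace approximation={approx:.4f}")
```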
The number of Poisson arrivals during a unit interval equals the sum of $n$ independent Poisson arrival counts over intervals of length $1/n$: $X=X_1+\cdots+X_n$, with $E[X]=1$. As $n\to\infty$ the distribution of each $X_i$ changes with $n$ (its mean and variance shrink), so the CLT cannot be applied here; $X$ stays Poisson. For the Binomial$(n,p)$:
$p$ fixed, $n\to\infty$: normal approximation
$np$ fixed, $n\to\infty$: Poisson approximation
Bayesian statistical inference
Types of Inference models/approaches
X=aS+W
#1. Model building: know “signal” S, observe X, infer a
#2. Inferring unknown variables: know a, observe X, infer S
#3. Hypothesis testing: unknown takes one of few possible values, aim at small probability of incorrect decision
#4. Estimation: aim at a small estimation error
$\theta$ is an unknown parameter. The difference between classical and Bayesian statistics is that $\theta$ is an unknown constant in classical statistics, but a random variable with a prior distribution in the Bayesian approach.
$$p_{\Theta\mid X}(\theta\mid x)=\frac{p_\Theta(\theta)\,p_{X\mid\Theta}(x\mid\theta)}{p_X(x)}\quad\text{(hypothesis testing)},\qquad f_{\Theta\mid X}(\theta\mid x)=\frac{f_\Theta(\theta)\,f_{X\mid\Theta}(x\mid\theta)}{f_X(x)}\quad\text{(estimation)}$$
Maximum a posteriori probability (MAP): choose the single value $\hat\theta$ that maximizes the posterior $p_{\Theta\mid X}(\theta\mid x)$; often used in hypothesis testing, but a single point estimate can be misleading compared with the conditional expectation.
Least mean squares (LMS) estimation: find the constant $c$ that minimizes $E[(\Theta-c)^2]$. The minimizer is $c=E[\Theta]$, and the optimal mean squared error is $\mathrm{var}(\Theta)=E[(\Theta-E[\Theta])^2]$. Summarized as
$$\text{minimize } E[(\Theta-c)^2]\;\Rightarrow\;c=E[\Theta]\;\Rightarrow\;\text{optimal MSE}=\mathrm{var}(\Theta)=E[(\Theta-E[\Theta])^2]$$
$$E[(\Theta-c)^2]=E[\Theta^2]-2cE[\Theta]+c^2,\qquad \frac{d}{dc}E[(\Theta-c)^2]=-2E[\Theta]+2c=0\;\Rightarrow\;c=E[\Theta]$$
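A tiny numerical check that the quadratic is minimized at $c=E[\Theta]$, for an arbitrary assumed prior (Gamma here, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
theta = rng.gamma(shape=2.0, scale=1.5, size=100_000)    # illustrative prior: mean 3, var 4.5

cs = np.linspace(0.0, 6.0, 601)
mse = np.array([np.mean((theta - c) ** 2) for c in cs])  # E[(Theta - c)^2] for each candidate c
print(f"argmin_c MSE ≈ {cs[mse.argmin()]:.2f}   E[Theta] ≈ {theta.mean():.2f}")
print(f"min MSE ≈ {mse.min():.2f}   var(Theta) ≈ {theta.var():.2f}")
```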
LMS estimation of two random variables
$$\text{minimize } E[(\Theta-c)^2\mid X=x]\;\Rightarrow\;c=E[\Theta\mid X=x]$$
$$\mathrm{var}(\Theta\mid X=x)=E\big[(\Theta-E[\Theta\mid X=x])^2\mid X=x\big]\le E\big[(\Theta-g(x))^2\mid X=x\big]$$
$$E\big[(\Theta-E[\Theta\mid X])^2\mid X\big]\le E\big[(\Theta-g(X))^2\mid X\big],\qquad E\big[(\Theta-E[\Theta\mid X])^2\big]\le E\big[(\Theta-g(X))^2\big]$$
LMS estimation with several measurements: $E[\Theta\mid X_1,\dots,X_n]$ can be hard to compute; it generally involves multi-dimensional integrals, etc.
Estimator: $\hat\Theta=E[\Theta\mid X]$, with estimation error $\tilde\Theta=\hat\Theta-\Theta$
$$E[\tilde\Theta\mid X]=E[\hat\Theta-\Theta\mid X]=E[\hat\Theta\mid X]-E[\Theta\mid X]=\hat\Theta-\hat\Theta=0$$
$$E[\tilde\Theta\,h(X)\mid X]=h(X)\,E[\tilde\Theta\mid X]=0\;\Rightarrow\;E[\tilde\Theta\,h(X)]=0$$
$$\mathrm{cov}(\tilde\Theta,\hat\Theta)=E\big[(\tilde\Theta-E[\tilde\Theta])(\hat\Theta-E[\hat\Theta])\big]=E\big[\tilde\Theta\,(\hat\Theta-E[\hat\Theta])\big]=0$$
$$\Theta=\hat\Theta-\tilde\Theta\;\Rightarrow\;\mathrm{var}(\Theta)=\mathrm{var}(\hat\Theta)+\mathrm{var}(\tilde\Theta)$$
Linear LMS
Consider estimators of $\Theta$ of the form $\hat\Theta=aX+b$ and minimize $E[(\Theta-aX-b)^2]$
$$\hat\Theta_L=E[\Theta]+\frac{\mathrm{cov}(X,\Theta)}{\mathrm{var}(X)}\,(X-E[X])$$
$$E[(\hat\Theta_L-\Theta)^2]=(1-\rho^2)\,\sigma_\Theta^2$$
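A sketch that builds $\hat\Theta_L$ from sample moments and checks the error formula, assuming $X=\Theta+W$ with Gaussian $\Theta$ and noise (illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200_000
theta = rng.normal(2.0, 1.5, size=N)                  # Theta: mean 2, sigma 1.5
x = theta + rng.normal(0.0, 1.0, size=N)              # X = Theta + W, noise sigma 1

a = np.cov(x, theta)[0, 1] / x.var()                  # cov(X, Theta) / var(X)
theta_hat = theta.mean() + a * (x - x.mean())         # linear LMS estimator

rho = np.corrcoef(x, theta)[0, 1]
print("empirical MSE          :", np.mean((theta_hat - theta) ** 2))
print("(1 - rho^2) var(Theta) :", (1 - rho**2) * theta.var())
```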
Consider estimators of the form $\hat\Theta=a_1X_1+\cdots+a_nX_n+b$ and minimize $E[(a_1X_1+\cdots+a_nX_n+b-\Theta)^2]$. Setting the derivatives to zero gives a linear system in $b$ and the $a_i$; only the means, variances, and covariances matter.
Cleanest linear LMS example
$X_i=\Theta+W_i$, where $\Theta,W_1,\dots,W_n$ are independent, $\Theta$ has mean $\mu$ and variance $\sigma_0^2$, and $W_i$ has mean $0$ and variance $\sigma_i^2$. Then
$$\hat\Theta_L=\frac{\mu/\sigma_0^2+\sum_{i=1}^n X_i/\sigma_i^2}{\sum_{i=0}^n 1/\sigma_i^2}$$
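A sketch of this weighted-average estimator, with illustrative (assumed) Gaussian prior and noise levels:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma0 = 5.0, 2.0                                 # prior: Theta with mean mu, variance sigma0^2
sigmas = np.array([1.0, 0.5, 3.0])                    # noise std of each W_i (illustrative)

theta = rng.normal(mu, sigma0)                        # draw the unknown Theta
x = theta + rng.normal(0.0, sigmas)                   # measurements X_i = Theta + W_i

num = mu / sigma0**2 + np.sum(x / sigmas**2)
den = 1 / sigma0**2 + np.sum(1 / sigmas**2)
print(f"true Theta = {theta:.3f}   linear LMS estimate = {num / den:.3f}")
```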
Classical statistical inference
Problem types
#1: hypothesis testing H0:θ=1/2 versus H1:θ=3/4
#2: composite hypotheses H0:θ=1/2 versus H1:θ≠1/2
#3: estimation, design an estimator ˆΘ to keep estimation error ˆΘ–θ small
Maximum likelihood estimation: pick θ that makes data most likely
$$\hat\theta_{ML}=\arg\max_\theta\, p_X(x;\theta)$$
Example: i.i.d. exponential samples, with likelihood $\prod_{i=1}^n \theta e^{-\theta x_i}$
$$\max_\theta \prod_{i=1}^n \theta e^{-\theta x_i}\;\Rightarrow\;\max_\theta\Big(n\log\theta-\theta\sum_{i=1}^n x_i\Big),\qquad \frac{n}{\theta}-\sum_{i=1}^n x_i=0\;\Rightarrow\;\hat\theta_{ML}=\frac{n}{x_1+\cdots+x_n}$$
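A quick check of $\hat\theta_{ML}=n/\sum_i x_i$ against numerical maximization of the log-likelihood (scipy assumed; true rate 2.5 chosen for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
x = rng.exponential(scale=1 / 2.5, size=500)          # i.i.d. samples, true theta = 2.5

theta_ml = len(x) / x.sum()                           # closed form: n / sum(x_i)

neg_loglik = lambda t: -(len(x) * np.log(t) - t * x.sum())
res = minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), method="bounded")
print(f"closed form: {theta_ml:.4f}   numerical: {res.x:.4f}")
```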
Desirable properties of estimators
(1) unbiased: E[ˆΘn]=θ
(2) consistent: ˆΘn→θ in probability
(3) small mean squared error (MSE): $E[(\hat\Theta-\theta)^2]=\mathrm{var}(\hat\Theta-\theta)+\big(E[\hat\Theta-\theta]\big)^2=\mathrm{var}(\hat\Theta)+(\text{bias})^2$
Confidence intervals (CIs)
A $1-\alpha$ confidence interval is a random interval $[\hat\Theta_n^-,\hat\Theta_n^+]$ such that
$$P(\hat\Theta_n^-\le\theta\le\hat\Theta_n^+)\ge 1-\alpha$$
CI in estimation of the mean, $\hat\Theta_n=(X_1+\cdots+X_n)/n$:
$$\Phi(1.96)=1-0.05/2=0.975$$
$$P\!\left(\frac{|\hat\Theta_n-\theta|}{\sigma/\sqrt n}\le 1.96\right)\approx 0.95$$
$$P\!\left(\hat\Theta_n-\frac{1.96\,\sigma}{\sqrt n}\le\theta\le\hat\Theta_n+\frac{1.96\,\sigma}{\sqrt n}\right)\approx 0.95$$
more generally
$$\Phi(z)=1-\alpha/2\;\Rightarrow\;P\!\left(\hat\Theta_n-\frac{z\,\sigma}{\sqrt n}\le\theta\le\hat\Theta_n+\frac{z\,\sigma}{\sqrt n}\right)\approx 1-\alpha$$
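A sketch computing the $1-\alpha$ interval from a sample, with $\sigma$ assumed known here (estimating $\sigma$ is discussed next):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
sigma, theta = 2.0, 10.0                              # illustrative: known sigma, unknown mean theta
x = rng.normal(theta, sigma, size=100)

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                           # Phi(z) = 1 - alpha/2  ->  1.96
half = z * sigma / np.sqrt(len(x))
print(f"estimate = {x.mean():.3f}   95% CI = [{x.mean() - half:.3f}, {x.mean() + half:.3f}]")
```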
Since $\sigma$ is typically unknown, it must be estimated:
option 1: use an upper bound on $\sigma$; for a Bernoulli process, $\sigma=\sqrt{p(1-p)}\le 1/2$
option 2: use an ad hoc estimate of $\sigma$; for Bernoulli($\theta$), $\hat\sigma=\sqrt{\hat\Theta_n(1-\hat\Theta_n)}$
option 3: generic estimation of the variance
$$\sigma^2=E[(X_i-\theta)^2],\qquad \hat\sigma_n^2=\frac1n\sum_{i=1}^n(X_i-\theta)^2\to\sigma^2,\qquad \hat S_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\hat\Theta_n)^2\to\sigma^2$$
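A sketch comparing the known-mean estimator $\hat\sigma_n^2$ with the sample variance $\hat S_n^2$ (option 3), for an assumed normal sample:

```python
import numpy as np

rng = np.random.default_rng(10)
theta, sigma = 10.0, 2.0                              # illustrative true mean and std
x = rng.normal(theta, sigma, size=10_000)

var_known_mean = np.mean((x - theta) ** 2)            # (1/n) sum (X_i - theta)^2, uses the true theta
var_sample = np.sum((x - x.mean()) ** 2) / (len(x) - 1)   # S_n^2, uses the estimated mean
print(f"sigma^2 = {sigma**2}   known-mean estimate = {var_known_mean:.3f}   S_n^2 = {var_sample:.3f}")
```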