Resampling and Simulation - LingLab

233 views. Uploaded 2020-08-18 09:15:02

The following article is from 神经语用学 (Neuropragmatics)

Resampling

In statistics, resampling is any of a variety of methods for doing one of the following:

1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)

2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)

3. Validating models by using random subsets (bootstrapping, cross validation)

“Offered the choice between mastery of a five-foot shelf of analytical statistics books and middling ability at performing statistical Monte Carlo simulations, we would surely choose to have the latter skill.”

Monte Carlo simulation

The concept of Monte Carlo simulation was devised by the mathematicians Stan Ulam and Nicholas Metropolis, who were working to develop an atomic weapon as part of the Manhattan Project. They needed to compute the average distance that a neutron would travel in a substance before it collided with an atomic nucleus, but they could not compute this using standard mathematics.

“Ulam realized that these computations could be simulated using random numbers, just like a casino game. His uncle had gambled at Monte Carlo, which is apparently where the name came from for their new technique.”

Four steps to performing a Monte Carlo Simulation

1. Define a domain of possible values.

2. Generate random numbers within that domain from a probability distribution.

3. Perform a computation using the random numbers.

4. Combine the results across many repetitions.
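
The four steps above can be sketched in R with a classic toy problem that is not part of the original text: estimating pi from uniformly distributed points. The variable names, the seed, and the number of draws are illustrative assumptions, so treat this as a minimal sketch rather than a recommended recipe.

```r
# Hypothetical illustration of the four steps (not from the source):
# estimate pi by Monte Carlo.
set.seed(1)                      # make the pseudo-random draws reproducible
n_draws <- 100000

# Steps 1-2: the domain is the unit square; draw points uniformly within it.
x <- runif(n_draws, min = 0, max = 1)
y <- runif(n_draws, min = 0, max = 1)

# Step 3: compute, for each point, whether it lies inside the quarter circle.
inside <- (x^2 + y^2) <= 1

# Step 4: combine across repetitions; the hit rate approximates pi / 4.
pi_estimate <- 4 * mean(inside)
print(pi_estimate)
```

Increasing `n_draws` should bring the estimate closer to pi, at the cost of more computation.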

Randomness in Statistics

In statistics, random means unpredictable, but unpredictable doesn’t mean “not deterministic”.

People have a fairly poor sense of randomness:

We tend to see patterns when they don’t exist (“pareidolia”).

People tend to think of random processes as self-correcting (the gambler’s fallacy).

Generating random numbers

A truly random number can only be generated through a physical process. In R, we use a computer algorithm to generate pseudo-random numbers.

In R, there is a function to generate random numbers for each of the major probability distributions:

runif(): uniform distribution

rnorm(): normal distribution

rbinom(): binomial distribution
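
As a brief sketch of how these three functions might be called, the snippet below draws five values from each distribution. The seed value and the distribution parameters are arbitrary choices made for illustration, not values from the original article.

```r
set.seed(42)  # fix the seed so the pseudo-random sequence is reproducible

u <- runif(5, min = 0, max = 1)          # five draws from Uniform(0, 1)
z <- rnorm(5, mean = 0, sd = 1)          # five draws from Normal(0, 1)
k <- rbinom(5, size = 10, prob = 0.5)    # five counts of successes in 10 trials

print(u)
print(z)
print(k)

# Resetting the seed reproduces exactly the same "random" numbers,
# which is what makes them pseudo-random rather than truly random.
set.seed(42)
u_again <- runif(5, min = 0, max = 1)
print(identical(u, u_again))
```

The last line illustrates the earlier point that unpredictable does not mean non-deterministic: the sequence looks random but is fully determined by the seed.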

Using Monte Carlo Simulation

We want to know how much time to allow for an in-class quiz. Quiz completion times are normally distributed, with a mean of 5 minutes and a standard deviation of 1 minute. How long must the quiz period be so that we expect every student to finish 99% of the time? The answer depends on the distribution of the longest completion time in the class, which can be obtained in two ways:

Using mathematical theory: the statistics of extreme values.

Using Monte Carlo simulation in R.
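
A minimal sketch of the simulation approach follows. The text does not state the class size, so `n_students <- 150` is a hypothetical value chosen only for illustration, as are the seed and the number of simulated quizzes. The last lines add a closed-form check that assumes completion times are independent.

```r
set.seed(123)
n_students <- 150     # hypothetical class size (not given in the text)
n_sims <- 5000        # number of simulated quizzes (arbitrary choice)
mean_time <- 5        # mean completion time in minutes (from the text)
sd_time <- 1          # standard deviation in minutes (from the text)

# For each simulated quiz, draw every student's completion time
# and record the slowest one.
longest_time <- replicate(
  n_sims,
  max(rnorm(n_students, mean = mean_time, sd = sd_time))
)

# Time limit that would have been enough in 99% of the simulated quizzes.
cutoff_sim <- unname(quantile(longest_time, probs = 0.99))

# Closed-form check: P(max <= t) = pnorm(t)^n for independent normal times.
cutoff_theory <- qnorm(0.99^(1 / n_students), mean = mean_time, sd = sd_time)

print(c(simulated = cutoff_sim, theoretical = cutoff_theory))
```

The two numbers should agree closely; the simulated value varies slightly with the seed and with `n_sims`.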

Using Simulation for statistics

If we can’t assume that the estimates are normally distributed, or we don’t know their distribution:

The idea is to use the data themselves to estimate an answer: the bootstrap.

The idea behind the bootstrap is that we repeatedly sample from the actual dataset; importantly, we sample with replacement, such that the same data point will often end up being represented multiple times within one of the samples.
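
The resampling step described above might look like the following in R. The data here are a made-up skewed sample, and the choice of the median as the statistic, the seed, and the number of bootstrap replicates are all assumptions for illustration.

```r
set.seed(2020)

# Hypothetical skewed sample standing in for real data.
observed <- rexp(50, rate = 1 / 5)

n_boot <- 2000   # number of bootstrap replicates (arbitrary choice)

# One bootstrap sample: same size as the data, drawn with replacement,
# so some observations appear more than once and others not at all.
one_resample <- sample(observed, size = length(observed), replace = TRUE)
print(length(unique(one_resample)))   # number of distinct values drawn

# Repeat the resampling and recompute the statistic (here, the median).
boot_medians <- replicate(
  n_boot,
  median(sample(observed, size = length(observed), replace = TRUE))
)

boot_se <- sd(boot_medians)                          # bootstrap standard error
boot_ci <- quantile(boot_medians, c(0.025, 0.975))   # percentile interval
print(boot_se)
print(boot_ci)
```

The percentile interval used here is only one of several ways to turn bootstrap replicates into a confidence interval.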

Bootstrap: Discussion

Advantages:

Simplicity;

A straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution;

A convenient way to avoid the cost of repeating experiments.

We would not usually employ the bootstrap to compute confidence intervals for the mean (since we can generally assume that the normal distribution is appropriate for the sampling distribution of the mean, as long as our sample is large enough), but this example shows how the method gives us roughly the same result as the standard method based on the normal distribution. The bootstrap would more often be used to generate standard errors for estimates of other statistics where we know or suspect that the normal distribution is not appropriate.
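
The comparison described in this paragraph can be sketched as follows, computing a 95% confidence interval for the mean both ways. The sample is simulated and hypothetical (it is not the article's data), and the sample size, seed, and replicate count are illustrative assumptions.

```r
set.seed(7)

# Hypothetical sample (not from the source): 100 roughly normal measurements.
x <- rnorm(100, mean = 170, sd = 10)
n <- length(x)

# Standard interval based on the normal approximation to the sampling
# distribution of the mean.
se_mean <- sd(x) / sqrt(n)
normal_ci <- mean(x) + c(-1, 1) * qnorm(0.975) * se_mean

# Percentile bootstrap interval for the same mean.
boot_means <- replicate(5000, mean(sample(x, size = n, replace = TRUE)))
boot_ci <- unname(quantile(boot_means, c(0.025, 0.975)))

print(rbind(normal = normal_ci, bootstrap = boot_ci))
```

For a sample like this one, the two intervals should be close, which is why the bootstrap is usually reserved for statistics whose sampling distribution is not well approximated by the normal.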
