Resampling and Simulation
233 views, uploaded 2020-08-18 09:15:02
In statistics, resampling is any of a variety of methods for doing one of the following:
1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
3. Validating models by using random subsets (bootstrapping, cross validation)
"Offered the choice between mastery of a five-foot shelf of analytical statistics books and middling ability at performing statistical Monte Carlo simulations, we would surely choose to have the latter skill."
The concept of Monte Carlo simulation was devised by the mathematicians Stan Ulam and Nicholas Metropolis, who were working to develop an atomic weapon as part of the Manhattan Project. They needed to compute the average distance that a neutron would travel in a substance before it collided with an atomic nucleus, but they could not compute this using standard mathematics. Ulam realized that these computations could be simulated using random numbers, just like a casino game. His uncle had gambled at Monte Carlo, which is apparently where the name came from for their new technique.
Four steps to performing a Monte Carlo simulation:
1. Define a domain of possible values.
2. Generate random numbers within that domain from a probability distribution.
3. Perform a computation using the random numbers.
4. Combine the results across many repetitions.
In statistics, random means unpredictable, but unpredictable does not mean "not deterministic".
People have a fairly poor sense of randomness:
- We tend to see patterns when they don't exist (pareidolia).
- People tend to think of random processes as self-correcting (the gambler's fallacy).
A truly random number can only be generated through a physical process. In R, we use a computer algorithm to generate pseudo-random numbers.
R has a function to generate random numbers for each of the major probability distributions:
- runif(): uniform distribution
- rnorm(): normal distribution
- rbinom(): binomial distribution
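The generators listed above can be tried with a minimal sketch like the one below. The seed value, the sample sizes, and the distribution parameters are arbitrary choices made for illustration. `set.seed()` is base R's way of fixing the generator's starting state, which makes the point that pseudo-random numbers look unpredictable but are fully deterministic.

```r
# Fix the generator's state so the "random" draws are reproducible.
set.seed(12345)                        # arbitrary seed, chosen for illustration

u <- runif(5, min = 0, max = 1)        # 5 draws from Uniform(0, 1)
z <- rnorm(5, mean = 0, sd = 1)        # 5 draws from Normal(0, 1)
b <- rbinom(5, size = 10, prob = 0.5)  # 5 draws: successes out of 10 fair trials

# Resetting the same seed reproduces exactly the same sequence.
set.seed(12345)
u_again <- runif(5, min = 0, max = 1)

identical(u, u_again)                  # same seed, same numbers
```

Without a call to `set.seed()`, each R session would typically start from a different state, so the draws would differ from run to run.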
Using Monte Carlo Simulation
We want to know how much time to allow for an in-class quiz. Suppose quiz completion times are normally distributed, with a mean of 5 minutes and a standard deviation of 1 minute. We want the allotted time to be long enough for everyone to finish 99% of the time. There are two ways to find it:
- Using mathematical theory: the statistics of extreme values
- Using Monte Carlo simulation in R
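A minimal sketch of the simulation route, following the four Monte Carlo steps. The text does not give the class size or the number of repetitions, so `n_students` and `n_sims` below are assumptions picked for illustration, and the answer depends on both. The last lines add a closed-form check that holds only under the stated assumption of independent, normally distributed completion times.

```r
set.seed(1)          # arbitrary seed for reproducibility
n_students <- 150    # ASSUMPTION: class size is not given in the text
n_sims <- 5000       # ASSUMPTION: number of simulated quizzes

# Steps 1-3: for each simulated quiz, draw every student's completion time
# from Normal(mean = 5, sd = 1) and record the slowest student's time.
max_times <- replicate(n_sims, max(rnorm(n_students, mean = 5, sd = 1)))

# Step 4: combine across repetitions. The 99th percentile of the maximum is
# the time that lets everyone finish in about 99% of the simulated quizzes.
time_needed <- quantile(max_times, probs = 0.99, names = FALSE)

# Theory-based check: for independent draws, P(max <= t) = pnorm(t, 5, 1)^n,
# so the required time solves pnorm(t, 5, 1) = 0.99^(1 / n).
exact_time <- qnorm(0.99^(1 / n_students), mean = 5, sd = 1)

c(simulated = time_needed, exact = exact_time)
```

The simulated and exact values should agree closely. The advantage of the simulation is that it keeps working when the completion times follow a distribution for which no convenient formula is available.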
Using Simulation for Statistics
If we can't assume that the estimates are normally distributed, or we don't know their distribution, we can use the data themselves to estimate an answer. This is the idea behind the bootstrap: we repeatedly sample from the actual dataset. Importantly, we sample with replacement, so the same data point will often be represented multiple times within one of the samples. The bootstrap offers:
- a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution
- a convenient way to avoid the cost of repeating experiments
We would not usually employ the bootstrap to compute confidence intervals for the mean, since we can generally assume that the normal distribution is appropriate for the sampling distribution of the mean as long as our sample is large enough. For the mean, however, the bootstrap gives roughly the same result as the standard method based on the normal distribution, which makes the mean a useful check on the method. The bootstrap is more often used to generate standard errors for estimates of other statistics where we know or suspect that the normal distribution is not appropriate.
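The comparison for the mean can be sketched as follows. The data here are simulated stand-ins, not a real dataset, and the number of resamples `n_boot` is an assumption. The interval shown is the simple percentile bootstrap, which is one of several bootstrap interval methods.

```r
set.seed(2020)                      # arbitrary seed
x <- rnorm(100, mean = 5, sd = 1)   # HYPOTHETICAL data: 100 simulated values
n_boot <- 2000                      # ASSUMPTION: number of bootstrap resamples

# Resample the data with replacement and recompute the mean each time.
boot_means <- replicate(
  n_boot,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Percentile bootstrap interval: the middle 95% of the resampled means.
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975), names = FALSE)

# Standard normal-theory interval for comparison.
se <- sd(x) / sqrt(length(x))
normal_ci <- mean(x) + c(-1, 1) * qnorm(0.975) * se

rbind(bootstrap = boot_ci, normal = normal_ci)

# The same recipe gives a standard error for a statistic that has no simple
# standard-error formula, such as the median.
boot_medians <- replicate(
  n_boot,
  median(sample(x, size = length(x), replace = TRUE))
)
median_se <- sd(boot_medians)
```

The two intervals for the mean should be close to each other. Swapping `mean` for another statistic is the only change needed to reuse the procedure, which is what makes the bootstrap convenient for complex estimators.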