1.) Describe 5 different types of data distributions. You may include jpegs or bitmaps. Provide 2 examples of a variable that is representative of each distribution. You may not use the standard normal, t-distribution, F-distribution, Chi-Square distribution, Binomial distribution, or uniform distribution. These distributions are all covered in the course.
2.) Briefly summarize three different types/approaches to testing data for normality.
3.) In many cases, your homework problems state, "assume the data is normally distributed". Why is testing for normality important?
4.) If the data is not normally distributed, what other methods or approaches can you use?
qn 1)
1) Bernoulli Distribution
The Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So a random variable X which has a Bernoulli distribution can take the value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q = 1 - p.
Take a single toss of a fair coin. Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure. The probability of getting a head = 0.5 = the probability of getting a tail, since there are only two possible outcomes. The probability mass function is given by:
P(X = x) = p^x (1-p)^(1-x), where x ∈ {0, 1}.
It can also be written as P(X = 1) = p and P(X = 0) = 1 - p.
The probabilities of success and failure need not be equally likely, like the result of a fight between me and the Undertaker. He is pretty much certain to win. So in this case the probability of my success is 0.15 while the probability of my failure is 0.85.
Here, the probability of success (p) is not the same as the probability of failure, so a chart of the Bernoulli distribution of our fight would show a bar of height 0.15 at 1 (success) and a bar of height 0.85 at 0 (failure).
Here, the probability of success = 0.15 and the probability of failure = 0.85. The expected value is exactly what it sounds like. If I punch you, I may expect you to punch me back. Basically, the expected value of any distribution is the mean of the distribution. The expected value of a random variable X from a Bernoulli distribution is found as follows:
E(X) = 1*p + 0*(1-p) = p
The variance of a random variable from a Bernoulli distribution is:
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
There are many examples of the Bernoulli distribution, such as whether it is going to rain tomorrow or not (where rain denotes success and no rain denotes failure), and winning (success) or losing (failure) a game.
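As a quick illustration (not part of the original answer), here is a minimal Python sketch using scipy.stats.bernoulli to check the PMF, mean, and variance; p = 0.15 simply reuses the fight example above and is otherwise an assumed value.

```python
# Minimal sketch: Bernoulli PMF, mean, and variance with scipy.
from scipy.stats import bernoulli

p = 0.15                      # probability of success (assumed, from the fight example)
rv = bernoulli(p)

print(rv.pmf(1), rv.pmf(0))   # P(X = 1) = 0.15, P(X = 0) = 0.85
print(rv.mean(), rv.var())    # E(X) = p = 0.15, Var(X) = p(1 - p) = 0.1275
```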
2) Poisson Distribution
Suppose you work at a call center; approximately how many calls do you get in a day? It can be any number. The total number of calls at a call center in a day is modeled by a Poisson distribution. Some more examples are the number of emergency calls recorded at a hospital in a day, the number of thefts reported in an area in a day, or the number of customers arriving at a salon in an hour.
You can now think of many examples following the same course. Poisson Distribution is applicable in situations where events occur at random points of time and space wherein our interest lies only in the number of occurrences of the event.
A distribution is called Poisson distribution when the following assumptions are valid:
1. Any successful event should not influence the outcome of another successful event (events occur independently).
2. The probability of success in an interval is proportional to the length of the interval, i.e., the rate of occurrence is the same over short and long intervals.
3. The probability of more than one success in a very small interval is negligible (it approaches zero as the interval shrinks).
Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some notation used for the Poisson distribution: λ is the rate at which an event occurs, t is the length of a time interval, and X is the number of events in that interval.
Here, X is called a Poisson Random Variable and the probability distribution of X is called Poisson distribution.
Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.
The PMF of X following a Poisson distribution is given by:
P(X = x) = (e^(-µ) µ^x) / x!, for x = 0, 1, 2, …
The mean µ is the parameter of this distribution; µ is defined as λ times the length of the interval. The graph of a Poisson distribution peaks near the mean, and as the mean increases the curve shifts to the right.
The mean and variance of X following a Poisson distribution:
Mean -> E(X) = µ
Variance -> Var(X) = µ
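A minimal sketch of the Poisson PMF above using scipy.stats.poisson; the rate of 2 calls per minute and the 5-minute interval are assumed values for the call-center example, not figures from the question.

```python
# Minimal sketch: Poisson PMF P(X = x) = e^(-mu) * mu^x / x!
from scipy.stats import poisson

lam, t = 2, 5               # assumed rate per minute and interval length in minutes
mu = lam * t                # mean number of events in the interval (mu = 10)

rv = poisson(mu)
print(rv.pmf(10))           # probability of exactly 10 calls in 5 minutes
print(rv.mean(), rv.var())  # both equal mu, as stated above
```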
3) Exponential Distribution
Let's consider the call center example one more time. What about the interval of time between the calls? Here, the exponential distribution comes to our rescue: it models the interval of time between the calls.
Other examples are:
1. The length of time between metro arrivals
2. The length of time between arrivals at a gas station
3. The life of an air conditioner
The exponential distribution is widely used for survival analysis, from the expected life of a machine to the expected life of a human.
A random variable X is said to have an exponential distribution with PDF:
f(x) = λ e^(-λx) for x ≥ 0 (and 0 otherwise),
where the parameter λ > 0 is also called the rate.
For survival analysis, λ is called the failure rate of a device at any time t, given that it has survived up to t.
Mean and Variance of a random variable X following an exponential distribution:
Mean -> E(X) = 1/λ
Variance -> Var(X) = (1/λ)²
Also, the greater the rate, the faster the curve drops; the lower the rate, the flatter the curve.
To ease the computation, there are some formulas given below.
P{X ≤ x} = 1 - e^(-λx), which corresponds to the area under the density curve to the left of x.
P{X > x} = e^(-λx), which corresponds to the area under the density curve to the right of x.
P{x1 < X ≤ x2} = e^(-λx1) - e^(-λx2), which corresponds to the area under the density curve between x1 and x2.
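A minimal numerical check of the three formulas above using scipy.stats.expon; the rate λ = 0.5 and the evaluation points are assumed values chosen only for illustration.

```python
# Minimal sketch: verify the exponential tail/CDF formulas numerically.
import numpy as np
from scipy.stats import expon

lam = 0.5                          # assumed rate
rv = expon(scale=1 / lam)          # scipy parameterizes the exponential by scale = 1/lambda

x, x1, x2 = 2.0, 1.0, 3.0
print(rv.cdf(x), 1 - np.exp(-lam * x))        # P{X <= x}
print(rv.sf(x), np.exp(-lam * x))             # P{X > x}
print(rv.cdf(x2) - rv.cdf(x1),
      np.exp(-lam * x1) - np.exp(-lam * x2))  # P{x1 < X <= x2}
```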
Relations between the Distributions
Relation between Bernoulli and Binomial Distribution
1. Bernoulli Distribution is a special case of Binomial Distribution with a single trial.
2. There are only two possible outcomes of a Bernoulli and Binomial distribution, namely success and failure.
3. Both Bernoulli and Binomial Distributions have independent trials.
Relation between Poisson and Binomial Distribution
Poisson Distribution is a limiting case of the binomial distribution under the following conditions:
1. The number of trials is indefinitely large, i.e., n → ∞.
2. The probability of success in each trial is very small, i.e., p → 0.
3. np = λ is finite and constant.
Relation between Normal and Binomial Distribution & Normal and Poisson Distribution:
The normal distribution is another limiting form of the binomial distribution under the following conditions:
1. The number of trials n is indefinitely large.
2. Neither p nor q = 1 - p is indefinitely small (both np and nq are reasonably large).
The normal distribution is also a limiting case of Poisson distribution with the parameter λ →∞.
Relation between Exponential and Poisson Distribution:
If the times between random events follow exponential distribution with rate λ, then the total number of events in a time period of length t follows the Poisson distribution with parameter λt.
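As a small numerical check (not part of the original answer) of the Poisson-as-a-limit-of-the-binomial relation above, the sketch below compares Binomial(n, p) with Poisson(λ = np) for a large n and small p; the specific values of n and p are assumed for illustration.

```python
# Minimal sketch: Binomial(n, p) ~ Poisson(n*p) when n is large and p is small.
from scipy.stats import binom, poisson

n, p = 10_000, 0.0005          # assumed values: large n, small p
lam = n * p                    # lambda = 5

for k in range(10):
    # The two PMF columns should agree to several decimal places.
    print(k, binom.pmf(k, n, p), poisson.pmf(k, lam))
```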
4) Negative Binomial Distribution:
Returning again to the coin toss example, assume that you hold the number of successes fixed at a given number and estimate the number of tries you will have before you reach the specified number of successes. The resulting distribution is called the negative binomial and it very closely resembles the Poisson. In fact, the negative binomial distribution converges on the Poisson distribution, but will be more skewed to the right (positive values) than the Poisson distribution with similar parameters.
5) Geometric Distribution:
Consider again the coin toss example used to illustrate the binomial. Rather than focus on the number of successes in n trials, assume that you were measuring the likelihood of when the first success will occur. For instance, with a fair coin toss, there is a 50% chance that the first success will occur on the first try, a 25% chance that it will occur on the second try and a 12.5% chance that it will occur on the third try. The resulting distribution is positively skewed and looks as follows for three different probability scenarios:
Figure: Geometric Distribution
Note that the distribution is steepest with high probabilities of success and flattens out as the probability decreases. However, the distribution is always positively skewed.
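A minimal sketch of the geometric PMF, P(first success on try k) = (1 - p)^(k-1) p, which reproduces the fair-coin numbers quoted above (0.5, 0.25, 0.125); the other two probabilities (0.3 and 0.1) are assumed values standing in for the "three probability scenarios".

```python
# Minimal sketch: geometric PMF for three probabilities of success.
from scipy.stats import geom

for p in (0.5, 0.3, 0.1):      # 0.5 matches the fair coin; 0.3 and 0.1 are assumed
    # Probability that the first success occurs on try k = 1..5.
    print(p, [round(geom.pmf(k, p), 4) for k in range(1, 6)])
```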
qn 2)
In statistics, normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed.
More precisely, the tests are a form of model selection, and can be interpreted several ways, depending on one's interpretations of probability:
Frequentist tests:
Tests of univariate normality include the following:
Kolmogorov–Smirnov test
Figure: Illustration of the Kolmogorov–Smirnov statistic. The red line is the CDF, the blue line is the ECDF, and the black arrow is the K–S statistic.
In statistics, the Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). It is named after Andrey Kolmogorov and Nikolai Smirnov.
The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The null distribution of this statistic is calculated under the null hypothesis that the sample is drawn from the reference distribution (in the one-sample case) or that the samples are drawn from the same distribution (in the two-sample case). In the one-sample case, the distribution considered under the null hypothesis may be continuous, purely discrete or mixed. In the two-sample case, the distribution considered under the null hypothesis is a continuous distribution but is otherwise unrestricted.
The two-sample K–S test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.
The Kolmogorov–Smirnov test can be modified to serve as a goodness-of-fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is known that using these to define the specific reference distribution changes the null distribution of the test statistic (this corrected form is known as the Lilliefors test). Various studies have found that, even in this corrected form, the test is less powerful for testing normality than the Shapiro–Wilk test or the Anderson–Darling test. However, these other tests have their own disadvantages. For instance, the Shapiro–Wilk test is known not to work well in samples with many identical values.
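A minimal sketch of a one-sample K–S test with scipy.stats.kstest; the sample is simulated only for illustration, and the reference normal distribution is fully specified in advance (remember that estimating the mean and variance from the same sample, as in the normality version described above, requires the Lilliefors correction).

```python
# Minimal sketch: one-sample K-S test against a fully specified normal distribution.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=200)        # simulated/assumed data

stat, p_value = kstest(sample, 'norm', args=(10, 2))  # reference N(10, 2), specified in advance
print(stat, p_value)   # a large p-value gives no evidence against the reference distribution
```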
D'Agostino's K-squared test
In statistics, D’Agostino’s K² test, named for Ralph D'Agostino, is a goodness-of-fit measure of departure from normality; that is, the test aims to establish whether or not the given sample comes from a normally distributed population. The test is based on transformations of the sample kurtosis and skewness, and has power only against the alternatives that the distribution is skewed and/or kurtic.
Jarque–Bera test
In statistics, the Jarque–Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution. The test is named after Carlos Jarque and Anil K. Bera. The test statistic is always nonnegative. If it is far from zero, it signals the data do not have a normal distribution.
The test statistic JB is defined as
JB = (n/6) (S² + (1/4)(K - 3)²),
where n is the number of observations (or degrees of freedom in general), S is the sample skewness, and K is the sample kurtosis:
S = m3 / m2^(3/2), K = m4 / m2²,
where m3 and m4 are the estimates of the third and fourth central moments, respectively, x̄ is the sample mean, and m2 is the estimate of the second central moment, the variance.
If the data comes from a normal distribution, the JB statistic asymptotically has a chi-squared distribution with two degrees of freedom, so the statistic can be used to test the hypothesis that the data are from a normal distribution. The null hypothesis is a joint hypothesis of the skewness being zero and the excess kurtosis being zero. Samples from a normal distribution have an expected skewness of 0 and an expected excess kurtosis of 0 (which is the same as a kurtosis of 3). As the definition of JB shows, any deviation from this increases the JB statistic.
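A minimal sketch of the Jarque–Bera test via scipy.stats.jarque_bera; both samples are simulated only to show the statistic near zero for normal data and a large statistic (tiny p-value) for skewed data.

```python
# Minimal sketch: Jarque-Bera test on normal vs. skewed simulated data.
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(1)
normal_data = rng.normal(size=500)
skewed_data = rng.exponential(size=500)

print(jarque_bera(normal_data))   # statistic near 0, large p-value
print(jarque_bera(skewed_data))   # large statistic, very small p-value
```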
Shapiro–Wilk test
The Shapiro–Wilk test is a test of normality in frequentist statistics. It was published in 1965 by Samuel Sanford Shapiro and Martin Wilk.
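A minimal sketch of the Shapiro–Wilk test using scipy.stats.shapiro; the sample is simulated purely for illustration.

```python
# Minimal sketch: Shapiro-Wilk normality test on simulated data.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)
sample = rng.normal(loc=0, scale=1, size=50)   # simulated/assumed data

stat, p_value = shapiro(sample)
print(stat, p_value)   # a p-value above the chosen alpha means we fail to reject normality
```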
Pearson's chi-squared test
Pearson's chi-squared test (χ2) is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is the most widely used of many chi-squared tests (e.g., Yates, likelihood ratio, portmanteau test in time series, etc.) – statistical procedures whose results are evaluated by reference to the chi-squared distribution. Its properties were first investigated by Karl Pearson in 1900. In contexts where it is important to improve a distinction between the test statistic and its distribution, names similar to Pearson χ-squared test or statistic are used.
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The events considered must be mutually exclusive and have total probability 1. A common case for this is where the events each cover an outcome of a categorical variable. A simple example is the hypothesis that an ordinary six-sided die is "fair" (i.e., all six outcomes are equally likely to occur).
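A minimal sketch of Pearson's chi-squared test for the "fair die" example; the observed counts (60 rolls in total) are made up purely for illustration.

```python
# Minimal sketch: chi-squared goodness-of-fit test for a fair six-sided die.
from scipy.stats import chisquare

observed = [8, 12, 9, 11, 10, 10]   # assumed counts for faces 1..6 (60 rolls)
expected = [10] * 6                 # fair die: each face expected 60/6 = 10 times

stat, p_value = chisquare(observed, f_exp=expected)
print(stat, p_value)   # a large p-value is consistent with a fair die
```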
qn 3)
A normality test is used to determine whether sample data has been drawn from a normally distributed population (within some tolerance). A number of statistical tests, such as Student's t-test and one-way and two-way ANOVA, require a normally distributed sample population.
Normality Test | Summary |
---|---|
Shapiro-Wilk | Common normality test, but does not work well with duplicated data or large sample sizes. |
Kolmogorov-Smirnov | For testing Gaussian distributions with specific mean and variance. |
Lilliefors | Kolmogorov-Smirnov test with corrected P. Best for symmetrical distributions with small sample sizes. |
Anderson-Darling | Can give better results for some datasets than Kolmogorov-Smirnov. |
D'Agostino's K-Squared | Based on transformations of sample kurtosis and skewness. Especially effective for “non-normal” values. |
Chen-Shapiro | Extends Shapiro-Wilk test without loss of power. Supports limited sample size (10 ≤ n ≤ 2000). |
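As a small illustration of one entry from the table (not part of the original answer), here is a sketch of the Anderson–Darling test via scipy.stats.anderson; the sample is simulated only for illustration, and unlike the other tests it returns critical values rather than a p-value.

```python
# Minimal sketch: Anderson-Darling normality test on simulated data.
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(3)
sample = rng.normal(loc=5, scale=2, size=100)   # simulated/assumed data

result = anderson(sample, dist='norm')
print(result.statistic)
print(result.critical_values)   # compare the statistic to these at each significance level
```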
qn 4)
If your data are not normal, you should do a nonparametric version of the test, which does not assume normality. From my experience, I would say that if you have non-normal data, you may look at the nonparametric version of the test you are interested in running. But more importantly, if the test you are running is not sensitive to normality, you may still run it even if the data are not normal.
Several tests are "robust" to the assumption of normality, including t-tests (1-sample, 2-sample, and paired t-tests), Analysis of Variance (ANOVA), Regression, and Design of Experiments (DOE). The trick I use to remember which tests are robust to normality is to recognize that tests which make inferences about means, or about the expected average response at certain factor levels, are generally robust to normality. That is why even though normality is an underlying assumption for the tests above, they should work for nonnormal data almost as well as if the data (or residuals) were normal.
Normally distributed data is needed to use a number of statistical analysis tools, such as individuals control charts, Cp/Cpk analysis, t-tests and analysis of variance (ANOVA). When data is not normally distributed, the cause for non-normality should be determined and appropriate remedial actions should be taken.
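A minimal sketch of swapping a parametric test for a nonparametric alternative when the data are clearly non-normal: a two-sample t-test next to the Mann–Whitney U test. The skewed samples are simulated only for illustration, and the choice of Mann–Whitney is one common example, not the only option.

```python
# Minimal sketch: parametric t-test vs. nonparametric Mann-Whitney U on skewed data.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(4)
group_a = rng.exponential(scale=1.0, size=30)   # heavily skewed simulated data
group_b = rng.exponential(scale=1.5, size=30)

print(ttest_ind(group_a, group_b))     # parametric test, assumes normality
print(mannwhitneyu(group_a, group_b))  # nonparametric alternative, no normality assumption
```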