Question


1.) Describe 5 different types of data distributions. You may include jpegs or bitmaps. Provide 2 examples of a variable that is representative of each distribution. You may not use the standard normal, t-distribution, F-distribution, Chi-Square distribution, Binomial distribution, or uniform distribution, as these distributions are all covered in the course.

2.) Briefly summarize three different types/approaches to testing data for normality.

3.) In many cases, your homework problems state, "assume the data is normally distributed". Why is testing for normality important?

4.) If the data is not normally distributed, what other methods or approaches can you use?

Homework Answers

Answer #1

qn 1)

1) Bernoulli Distribution

Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.

For example, in a single toss of a fair coin, the occurrence of a head denotes success and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = probability of getting a tail, since there are only two possible outcomes.

The probability mass function is given by: P(X = x) = p^x (1 − p)^(1−x), where x ∈ {0, 1}.
It can also be written as P(X = 1) = p and P(X = 0) = 1 − p.

The probabilities of success and failure need not be equally likely, like the result of a fight between me and Undertaker: he is pretty much certain to win. In this case, the probability of my success is 0.15 while the probability of my failure is 0.85.

Here, the probability of success (p) is not the same as the probability of failure (1 − p). The Bernoulli distribution of our fight therefore places probability 0.85 on 0 (failure) and 0.15 on 1 (success).

Here, the probability of success = 0.15 and the probability of failure = 0.85. The expected value is exactly what it sounds like (if I punch you, I may expect you to punch me back); basically, the expected value of any distribution is the mean of the distribution. The expected value of a random variable X from a Bernoulli distribution is found as follows:

E(X) = 1*p + 0*(1-p) = p

The variance of a random variable from a Bernoulli distribution is:

V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)

There are many examples of the Bernoulli distribution, such as whether it is going to rain tomorrow or not (where rain denotes success and no rain denotes failure), or winning (success) versus losing (failure) a game.
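As a quick illustration, here is a minimal sketch in Python (assuming scipy is available) that checks the fight example above numerically:

```python
# Minimal sketch of the Bernoulli example above with p = 0.15 (assumes scipy is installed).
from scipy.stats import bernoulli

p = 0.15                 # probability of success
rv = bernoulli(p)

print(rv.pmf(1))         # P(X = 1) = p = 0.15
print(rv.pmf(0))         # P(X = 0) = 1 - p = 0.85
print(rv.mean())         # E(X) = p = 0.15
print(rv.var())          # Var(X) = p(1 - p) = 0.1275
```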

2) Poisson Distribution

Suppose you work at a call center: approximately how many calls do you get in a day? It can be any number. The total number of calls received at a call center in a day can be modeled by a Poisson distribution. Some more examples are:

  1. The number of emergency calls recorded at a hospital in a day.
  2. The number of thefts reported in an area on a day.
  3. The number of customers arriving at a salon in an hour.
  4. The number of suicides reported in a particular city.
  5. The number of printing errors on each page of a book.

You can now think of many examples following the same pattern. The Poisson distribution is applicable in situations where events occur at random points of time or space, and our interest lies only in the number of occurrences of the event.

A distribution is called Poisson distribution when the following assumptions are valid:

1. Any successful event should not influence the outcome of another successful event (occurrences are independent).
2. The probability of a success in a short interval is proportional to the length of the interval; the rate of occurrence is constant over time.
3. The probability of more than one success in a very short interval is negligible (it approaches zero as the interval becomes smaller).

Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some notations used in Poisson distribution are:

  • λ is the rate at which an event occurs,
  • t is the length of a time interval,
  • And X is the number of events in that time interval.

Here, X is called a Poisson Random Variable and the probability distribution of X is called Poisson distribution.

Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.

The PMF of X following a Poisson distribution is given by:

P(X = x) = e^(−µ) µ^x / x!,  for x = 0, 1, 2, …

The mean µ is the parameter of this distribution; it equals λ times the length of the interval. As the mean increases, the Poisson curve shifts to the right.

The mean and variance of X following a Poisson distribution:

Mean -> E(X) = µ
Variance -> Var(X) = µ
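A minimal sketch in Python (assuming scipy is available; the rate and interval length below are made-up example values) that mirrors the call-center example:

```python
# Hypothetical call-center example: lambda = 2 calls per hour over a 3-hour window.
from scipy.stats import poisson

lam, t = 2.0, 3.0
mu = lam * t             # mean number of events, mu = lambda * t

rv = poisson(mu)
print(rv.pmf(4))         # P(exactly 4 calls in the 3-hour window)
print(rv.mean())         # E(X) = mu = 6
print(rv.var())          # Var(X) = mu = 6
```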

3) Exponential Distribution

Let’s consider the call center example one more time. What about the interval of time between calls? Here, the exponential distribution comes to our rescue: it models the interval of time between successive calls.

Other examples are:

1. The length of time between metro arrivals
2. The length of time between arrivals at a gas station
3. The lifetime of an air conditioner

The exponential distribution is widely used for survival analysis, from the expected life of a machine to the expected life of a human.

A random variable X is said to have an exponential distribution with PDF:

f(x) = λe^(−λx) for x ≥ 0 (and f(x) = 0 otherwise),

where the parameter λ > 0 is called the rate.

For survival analysis, λ is called the failure rate of a device at any time t, given that it has survived up to t.

Mean and Variance of a random variable X following an exponential distribution:

Mean -> E(X) = 1/λ

Variance -> Var(X) = (1/λ)²

Also, the greater the rate λ, the faster the density curve drops toward zero; the lower the rate, the flatter the curve.

To ease computation, there are some formulas given below.

P{X ≤ x} = 1 − e^(−λx), which corresponds to the area under the density curve to the left of x.

P{X > x} = e^(−λx), which corresponds to the area under the density curve to the right of x.

P{x1 < X ≤ x2} = e^(−λx1) − e^(−λx2), which corresponds to the area under the density curve between x1 and x2.
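These formulas can be verified with a short Python sketch (assuming scipy; λ = 0.5 is an arbitrary example rate). Note that scipy parameterizes the exponential by scale = 1/λ:

```python
# Exponential distribution with an example rate lambda = 0.5 (scale = 1/lambda in scipy).
from scipy.stats import expon

lam = 0.5
rv = expon(scale=1 / lam)

x1, x2 = 1.0, 3.0
print(rv.cdf(x1))                 # P(X <= x1) = 1 - exp(-lam * x1)
print(rv.sf(x2))                  # P(X >  x2) = exp(-lam * x2)
print(rv.cdf(x2) - rv.cdf(x1))    # P(x1 < X <= x2) = exp(-lam*x1) - exp(-lam*x2)
print(rv.mean(), rv.var())        # 1/lam and (1/lam)**2
```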

Relations between the Distributions

Relation between Bernoulli and Binomial Distribution

1. Bernoulli Distribution is a special case of Binomial Distribution with a single trial.

2. There are only two possible outcomes of a Bernoulli and Binomial distribution, namely success and failure.

3. Both the Bernoulli and Binomial distributions have independent trials.

Relation between Poisson and Binomial Distribution

Poisson Distribution is a limiting case of binomial distribution under the following conditions:

  1. The number of trials is indefinitely large or n → ∞.
  2. The probability of success for each trial is same and indefinitely small or p →0.
  3. np = λ, is finite.

Relation between Normal and Binomial Distribution & Normal and Poisson Distribution:

Normal distribution is another limiting form of binomial distribution under the following conditions:

  1. The number of trials is indefinitely large, n → ∞.
  2. Neither p nor q is indefinitely small.

The normal distribution is also a limiting case of Poisson distribution with the parameter λ →∞.

Relation between Exponential and Poisson Distribution:

If the times between random events follow exponential distribution with rate λ, then the total number of events in a time period of length t follows the Poisson distribution with parameter λt.

4) Negative Binomial Distribution:

Returning again to the coin toss example, assume that you hold the number of successes fixed at a given number and estimate the number of tries you will have before you reach the specified number of successes. The resulting distribution is called the negative binomial and it very closely resembles the Poisson. In fact, the negative binomial distribution converges on the Poisson distribution, but will be more skewed to the right (positive values) than the Poisson distribution with similar parameters.
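A minimal sketch (assuming scipy; r and p below are illustrative values). Note that scipy's nbinom counts the number of failures before the r-th success, which is one common convention for the negative binomial:

```python
# Negative binomial: number of failures before the r-th success (r = 5, p = 0.5 are examples).
from scipy.stats import nbinom

r, p = 5, 0.5
rv = nbinom(r, p)

print(rv.pmf(3))          # P(exactly 3 failures before the 5th success)
print(rv.mean())          # expected number of failures = r * (1 - p) / p
```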

5) Geometric Distribution:

Consider again the coin toss example used to illustrate the binomial. Rather than focus on the number of successes in n trials, assume that you were measuring the likelihood of when the first success will occur. For instance, with a fair coin toss, there is a 50% chance that the first success will occur on the first try, a 25% chance that it will occur on the second try, and a 12.5% chance that it will occur on the third try. The resulting distribution is positively skewed and looks as follows for three different probability scenarios:

Figure: Geometric distribution for three different probability scenarios.

Note that the distribution is steepest with high probabilities of success and flattens out as the probability decreases. However, the distribution is always positively skewed.
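The coin-toss numbers quoted above can be reproduced with a small sketch (assuming scipy; geom models the trial on which the first success occurs):

```python
# Geometric distribution: trial number of the first success, for three probability scenarios.
from scipy.stats import geom

for p in (0.5, 0.25, 0.1):
    rv = geom(p)
    print(p, rv.pmf(1), rv.pmf(2), rv.pmf(3))
# For p = 0.5 this prints 0.5, 0.25, 0.125, matching the fair-coin example above.
```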

qn 2)

In statistics, normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed.

More precisely, the tests are a form of model selection, and can be interpreted several ways, depending on one's interpretations of probability:

  • In descriptive statistics terms, one measures a goodness of fit of a normal model to the data – if the fit is poor then the data are not well modeled in that respect by a normal distribution, without making a judgment on any underlying variable.
  • In frequentist statistics, statistical hypothesis testing is used: the data are tested against the null hypothesis that they are normally distributed.
  • In Bayesian statistics, one does not "test normality" per se, but rather computes the likelihood that the data come from a normal distribution with given parameters μ,σ (for all μ,σ), and compares that with the likelihood that the data come from other distributions under consideration, most simply using a Bayes factor (giving the relative likelihood of seeing the data given different models), or more finely taking a prior distribution on possible models and parameters and computing a posterior distribution given the computed likelihoods.

Frequentist tests:

Tests of univariate normality include the following:

  • D'Agostino's K-squared test,
  • Jarque–Bera test,
  • Anderson–Darling test,
  • Cramér–von Mises criterion,
  • Kolmogorov–Smirnov test (this one only works if the mean and the variance of the normal are assumed known under the null hypothesis),
  • Lilliefors test (based on the Kolmogorov–Smirnov test, adjusted for when also estimating the mean and variance from the data),
  • Shapiro–Wilk test, and
  • Pearson's chi-squared test

KOLMOGOROV–SMIRNOV TEST

Figure: Illustration of the Kolmogorov–Smirnov statistic. The red line is a CDF, the blue line is an ECDF, and the black arrow is the K–S statistic.

In statistics, the Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). It is named after Andrey Kolmogorov and Nikolai Smirnov.

The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The null distribution of this statistic is calculated under the null hypothesis that the sample is drawn from the reference distribution (in the one-sample case) or that the samples are drawn from the same distribution (in the two-sample case). In the one-sample case, the distribution considered under the null hypothesis may be continuous, purely discrete, or mixed. In the two-sample case, the distribution considered under the null hypothesis is a continuous distribution but is otherwise unrestricted.

The two-sample K–S test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.

The Kolmogorov–Smirnov test can be modified to serve as a goodness-of-fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and using these estimates to define the reference distribution changes the null distribution of the test statistic (this is the correction made by the Lilliefors test). Various studies have found that, even in this corrected form, the test is less powerful for testing normality than the Shapiro–Wilk test or Anderson–Darling test. However, these other tests have their own disadvantages. For instance, the Shapiro–Wilk test is known not to work well in samples with many identical values.
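A minimal sketch of a one-sample K–S test in Python (assuming numpy and scipy; the data and the reference mean and standard deviation are made-up example values):

```python
# One-sample K-S test against a fully specified normal distribution (mean and sd assumed known).
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=200)   # example data

stat, pval = kstest(x, 'norm', args=(10, 2))
print(stat, pval)
# If the mean and sd were instead estimated from the sample, the plain K-S p-value
# would no longer be valid; the Lilliefors correction should be used in that case.
```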

D'Agostino's K-squared test

In statistics, D’Agostino’s K² test, named for Ralph D'Agostino, is a goodness-of-fit measure of departure from normality; that is, the test aims to establish whether or not the given sample comes from a normally distributed population. The test is based on transformations of the sample kurtosis and skewness, and has power only against the alternatives that the distribution is skewed and/or kurtic.
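In Python, this test (combined with the kurtosis component, as in the D'Agostino–Pearson omnibus test) is available as scipy.stats.normaltest; a minimal sketch on made-up, deliberately skewed data:

```python
# D'Agostino-Pearson K^2 test, built from the sample skewness and kurtosis.
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=300)    # deliberately skewed example data

stat, pval = normaltest(x)
print(stat, pval)                           # a small p-value argues against normality
```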

Jarque–Bera test

In statistics, the Jarque–Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution. The test is named after Carlos Jarque and Anil K. Bera. The test statistic is always nonnegative. If it is far from zero, it signals the data do not have a normal distribution.

The test statistic JB is defined as

JB = (n/6) * [S² + (1/4)(K − 3)²]

where n is the number of observations (or degrees of freedom in general), S is the sample skewness, and K is the sample kurtosis:

S = m3 / m2^(3/2),  K = m4 / m2²

where m3 and m4 are the estimates of the third and fourth central moments, respectively, computed about the sample mean, and m2 is the estimate of the second central moment, the variance.

If the data comes from a normal distribution, the JB statistic asymptotically has a chi-squared distribution with two degrees of freedom, so the statistic can be used to test the hypothesis that the data are from a normal distribution. The null hypothesis is a joint hypothesis of the skewness being zero and the excess kurtosis being zero. Samples from a normal distribution have an expected skewness of 0 and an expected excess kurtosis of 0 (which is the same as a kurtosis of 3). As the definition of JB shows, any deviation from this increases the JB statistic.
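A minimal sketch (assuming numpy and scipy) computing the JB statistic and its asymptotic chi-squared p-value on example data:

```python
# Jarque-Bera test: JB is asymptotically chi-squared with 2 degrees of freedom under normality.
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(2)
x = rng.normal(size=500)                    # example data drawn from a normal

stat, pval = jarque_bera(x)
print(stat, pval)                           # JB near zero and a large p-value here
```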

Shapiro–Wilk test

The Shapiro–Wilk test is a test of normality in frequentist statistics. It was published in 1965 by Samuel Sanford Shapiro and Martin Wilk.
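A minimal sketch (assuming numpy and scipy) of the Shapiro–Wilk test on example data:

```python
# Shapiro-Wilk test of normality; the W statistic is close to 1 for normal-looking samples.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)
x = rng.normal(size=50)                     # example data

stat, pval = shapiro(x)
print(stat, pval)                           # a low p-value suggests non-normality
```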

Pearson's chi-squared test

Pearson's chi-squared test (χ2) is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is the most widely used of many chi-squared tests (e.g., Yates, likelihood ratio, portmanteau test in time series, etc.) – statistical procedures whose results are evaluated by reference to the chi-squared distribution. Its properties were first investigated by Karl Pearson in 1900. In contexts where it is important to improve a distinction between the test statistic and its distribution, names similar to Pearson χ-squared test or statistic are used.

It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The events considered must be mutually exclusive and have total probability 1. A common case for this is where the events each cover an outcome of a categorical variable. A simple example is the hypothesis that an ordinary six-sided die is "fair" (i.e., all six outcomes are equally likely to occur).
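The fair-die example can be written as a short sketch (assuming scipy; the observed counts are made-up):

```python
# Pearson's chi-squared goodness-of-fit test for a fair six-sided die.
from scipy.stats import chisquare

observed = [18, 22, 16, 25, 20, 19]   # hypothetical counts from 120 rolls
stat, pval = chisquare(observed)      # expected counts default to uniform (fair die)
print(stat, pval)                     # a large p-value is consistent with a fair die
```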

qn 3)

A normality test is used to determine whether sample data have been drawn from a normally distributed population (within some tolerance). A number of statistical tests, such as the Student's t-test and the one-way and two-way ANOVA, require a normally distributed sample population.

Normality Test Summary

  • Shapiro–Wilk: Common normality test, but does not work well with duplicated data or large sample sizes.
  • Kolmogorov–Smirnov: For testing against a Gaussian distribution with a specific (known) mean and variance.
  • Lilliefors: A Kolmogorov–Smirnov test with a corrected p-value. Best for symmetrical distributions with small sample sizes.
  • Anderson–Darling: Can give better results for some datasets than Kolmogorov–Smirnov.
  • D'Agostino's K-squared: Based on transformations of the sample kurtosis and skewness. Especially effective for detecting "non-normal" values.
  • Chen–Shapiro: Extends the Shapiro–Wilk test without loss of power. Supports a limited range of sample sizes (10 ≤ n ≤ 2000).

qn 4)

If your data are not normal, you should do a nonparametric version of the test, which does not assume normality. From my experience, I would say that if you have non-normal data, you may look at the nonparametric version of the test you are interested in running. But more importantly, if the test you are running is not sensitive to normality, you may still run it even if the data are not normal.
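For example, here is a minimal sketch (assuming numpy and scipy; the data are made-up, skewed samples) contrasting a parametric two-sample t-test with a common nonparametric counterpart, the Mann–Whitney U test:

```python
# Comparing two skewed (non-normal) samples with a parametric and a nonparametric test.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(4)
a = rng.lognormal(mean=0.0, sigma=1.0, size=40)
b = rng.lognormal(mean=0.3, sigma=1.0, size=40)

print(ttest_ind(a, b))        # two-sample t-test (assumes approximate normality)
print(mannwhitneyu(a, b))     # Mann-Whitney U test, no normality assumption
```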

Several tests are "robust" to the assumption of normality, including t-tests (1-sample, 2-sample, and paired t-tests), Analysis of Variance (ANOVA), Regression, and Design of Experiments (DOE). The trick I use to remember which tests are robust to normality is to recognize that tests which make inferences about means, or about the expected average response at certain factor levels, are generally robust to normality. That is why even though normality is an underlying assumption for the tests above, they should work for nonnormal data almost as well as if the data (or residuals) were normal.

Normally distributed data is needed to use a number of statistical analysis tools, such as individuals control charts, Cp/Cpk analysis, t-tests and analysis of variance (ANOVA). When data is not normally distributed, the cause for non-normality should be determined and appropriate remedial actions should be taken.
