How to Do Hypothesis Testing: A Beginner's Guide for Data Scientists

Hypothesis testing is the application of a statistical model to questions from the real world. In hypothesis testing, you first assume a result. This assumption is called the null hypothesis.
You then run an experiment to test this hypothesis. Based on the results of the experiment, you either reject or fail to reject the null hypothesis.

For example, suppose the results of the experiment lead you to reject the null hypothesis. You can then say that the data supports the mutually exclusive alternate hypothesis.
The wording “reject” or “fail to reject” is important to understand. In hypothesis testing, we never prove a hypothesis. We only reject or fail to reject the null hypothesis.

How to Convert a Real-World Problem to a Hypothesis?

Step 1: At the start of the experiment, you assume the null hypothesis is true. Based on the experiment, you will reject or fail to reject the null hypothesis.

Step 2: Only if the data you have collected does not support the null hypothesis do you turn to the alternative hypothesis.

Step 3: If you fail to reject the null hypothesis, the data is consistent with your initial assumption.

Let’s understand this better with a real-life example.

Suppose there is a claim that “a product has an average weight of 5.6 kg”.

Null Hypothesis: Average weight is equal to 5.6 kg. H(0): mu = 5.6
Alternate Hypothesis: Average weight is not equal to 5.6 kg. H(1): mu != 5.6

If you want to show that a claim is true, you put the opposite of the claim in the null hypothesis and test against it.
For example, take the claim “This Machine Learning course improves coding skills.”
Null Hypothesis: Old Coding Skills >= New Coding Skills
Alternate Hypothesis: Old Coding Skills < New Coding Skills

Keep in mind that the null hypothesis always contains an equality sign (=, <=, >=) and the alternate hypothesis contains a strict inequality (!=, <, >).

How to reject the Null Hypothesis?

After assuming the null hypothesis, you run an experiment and record all the results. Assume that the null hypothesis is valid. If the probability of observing results at least as extreme as these is very small (less than or equal to 0.05), then you reject the null hypothesis. Here 0.05 is the level of significance ($latex \alpha&s=2$). If the significance level is not mentioned in the problem statement, you assume the default of 0.05.

The level of significance ($latex \alpha&s=2$) is the area in the tail or tails of the distribution under the null hypothesis, that is, the rejection region.

Hypothesis Example: A Fair Coin

Let’s take as the null hypothesis that the coin is fair, with a head on one side and a tail on the other. Suppose we run an experiment, flip the coin 20 times in a row, and get heads every time. If the coin were fair, such a result would be very unlikely, so it counts as evidence against the null hypothesis.
Here the level of significance is 0.05, and it is the area inside the tail of the distribution under the null hypothesis.
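To put a number on the coin example, here is a minimal Python sketch. It assumes the experiment produced 20 heads in 20 flips (an assumption made for illustration), and the helper name `prob_at_least_k_heads` is hypothetical. It computes how likely that outcome is under the null hypothesis of a fair coin and compares it with the 0.05 level of significance.

```python
from math import comb


def prob_at_least_k_heads(n: int, k: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), i.e. at least k heads in n flips."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


alpha = 0.05  # level of significance

# Assumed outcome for illustration: all 20 flips came up heads.
p_all_heads = prob_at_least_k_heads(n=20, k=20)

# Reject the null hypothesis (fair coin) when the probability is at most alpha.
reject_fair_coin = p_all_heads <= alpha

print(f"P(20 heads in 20 flips | fair coin) = {p_all_heads:.2e}")
print("Reject null hypothesis:", reject_fair_coin)
```

The probability is 0.5 raised to the power 20, which is far below 0.05, so under these assumptions the fair-coin hypothesis would be rejected.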

If $latex \alpha&s=2$ is 0.05 and the alternative hypothesis states that the value is less than the one in the null hypothesis, H(1): < H(0), then you consider the left tail of the normal distribution, and its area is 0.05. In the same way, if $latex \alpha&s=2$ is 0.05 and the alternative hypothesis states that the value is greater than the one in the null hypothesis, H(1): > H(0), then you consider the right tail of the normal distribution, and its area is again 0.05.

Finally, if the alternative hypothesis is that the value is not equal to the one in the null hypothesis, H(1): != H(0), then the two tails share the area equally. That means an area of 0.025 in the left tail and 0.025 in the right tail.

The boundaries of these areas are the critical values, which are expressed as z scores.
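The tail areas can be turned into critical z scores with the inverse of the standard normal distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library. The function name `critical_values` and the tail labels "left", "right" and "two" are hypothetical names chosen for this illustration.

```python
from statistics import NormalDist


def critical_values(alpha: float, tail: str) -> tuple:
    """Return the critical z score(s) for a given significance level and tail type."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        # All of alpha sits in the left tail.
        return (std_normal.inv_cdf(alpha),)
    if tail == "right":
        # All of alpha sits in the right tail.
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "two":
        # alpha is split equally: alpha / 2 in each tail.
        z = std_normal.inv_cdf(1 - alpha / 2)
        return (-z, z)
    raise ValueError("tail must be 'left', 'right' or 'two'")


for tail in ("left", "right", "two"):
    rounded = [round(z, 3) for z in critical_values(0.05, tail)]
    print(tail, rounded)
```

For a level of significance of 0.05 this gives about -1.645 for the left tail, 1.645 for the right tail, and -1.96 and 1.96 for the two-tailed case.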

Before testing a hypothesis, you should be clear on these terms.

Mean and Proportion

Whenever you want to find the average or some specific value for a population, you are dealing with means. When the statement mentions a percentage, or words like “most” or “least”, you are dealing with proportions.

When you are dealing with a mean and know the population standard deviation ($latex \sigma&s=2$), the formula for the z score is:

$latex z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} &s=2$

If you are dealing with proportions, use the following formula:

$latex z = \frac{\hat{p} - p_{0}}{\sqrt{\frac{p_{0}(1 - p_{0})}{n}}} &s=2$
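As a rough sketch, the two standard z statistics can be written as small Python functions. The function names `z_for_mean` and `z_for_proportion` and the numbers in the usage lines are made up for illustration; they are not taken from the examples in this article.

```python
from math import sqrt


def z_for_mean(sample_mean: float, pop_mean: float, pop_sd: float, n: int) -> float:
    """z = (x_bar - mu) / (sigma / sqrt(n)), for a mean with known population sigma."""
    standard_error = pop_sd / sqrt(n)
    return (sample_mean - pop_mean) / standard_error


def z_for_proportion(sample_prop: float, null_prop: float, n: int) -> float:
    """z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n), for a proportion."""
    standard_error = sqrt(null_prop * (1 - null_prop) / n)
    return (sample_prop - null_prop) / standard_error


# Illustrative, made-up inputs:
print(round(z_for_mean(sample_mean=52, pop_mean=50, pop_sd=10, n=25), 3))
print(round(z_for_proportion(sample_prop=0.6, null_prop=0.5, n=100), 3))
```

In both cases the numerator is the gap between what the sample shows and what the null hypothesis claims, and the denominator is the standard error, so the z score measures that gap in standard-error units.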

There are two ways to test a hypothesis.

Traditional test
P value test

Traditional Test

You use the level of significance to determine the critical value, and then compare the test statistic with that critical value.

P value test

In the P value test, you first use the test statistic to find the P-value, and then compare it with the level of significance ($latex \alpha&s=2$).

If the P-value is lower than the level of significance, you reject the null hypothesis H0. If it is higher, you fail to reject H0.
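The P value rule can be sketched in Python as follows, assuming the test statistic is a z score from a standard normal distribution. The names `p_value_from_z` and `reject_null` are hypothetical, and the z values in the usage lines are made up for illustration. The sketch rejects when the P-value is less than or equal to the level of significance.

```python
from statistics import NormalDist


def p_value_from_z(z: float, tail: str) -> float:
    """Return the P-value of a z score for a left-, right- or two-tailed test."""
    cdf = NormalDist().cdf  # standard normal cumulative distribution function
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1 - cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two'")


def reject_null(z: float, alpha: float, tail: str) -> bool:
    """Reject H0 when the P-value is at most the level of significance."""
    return p_value_from_z(z, tail) <= alpha


# Made-up z scores for a left-tailed test at alpha = 0.05:
for z in (-1.5, -2.0):
    p = p_value_from_z(z, "left")
    print(f"z = {z}: P = {p:.4f}, reject H0: {reject_null(z, 0.05, 'left')}")
```

With these inputs, z = -1.5 gives a P-value of about 0.067, so you fail to reject H0, while z = -2.0 gives about 0.023, so you reject it.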

The following examples walk through each testing method.

Hypothesis Testing for a Mean

Suppose an e-commerce company wants to increase its sales by improving its website performance. Currently, the mean download time ($latex \mu&s=2$) for the website is 3.125 and the standard deviation ($latex \sigma&s=2$) is 0.700. The level of significance ($latex \alpha&s=2$) is 0.01. A sample of 40 new pages is tested and has a mean download time ($latex \bar{x} &s=2$) of 2.875. Are the new pages faster than before?

Step-by-Step Method for Testing

Step 1: List all the known values before testing.

Mean, $latex \mu&s=2$ = 3.125

Standard Deviation, $latex \sigma&s=2$ = 0.70

Level of Significance, $latex \alpha&s=2$ = 0.01

Sample Size, n = 40

Sample Mean, $latex \bar{x} &s=2$ = 2.875

Step 2: State the Null Hypothesis and the Alternative Hypothesis

Null Hypothesis

$latex H_{0}:\mu \geq &s=2$ 3.125

Alternative Hypothesis

$latex H_{1}:\mu <&s=2$ 3.125

Step 3: Set the level of significance.

Here it is $latex \alpha&s=2$ = 0.01

Step 4: Determine the type of the test.

Here the null hypothesis is $latex H_{0}:\mu \geq &s=2$ 3.125, so the alternate hypothesis is $latex H_{1}:\mu <&s=2$ 3.125. The alternative uses a less-than sign, so this is a left-tailed test rather than a right-tailed or two-tailed one.

In the traditional method, you compute the z score of the sample (the test statistic) using the formula below and compare it with the critical value.

$latex z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} &s=2$

Substituting the values gives Z = -2.259. Then look up the critical value for $latex \alpha&s=2$ = 0.01 in the z table. You will get Z = -2.326.

Thus you fail to reject the null hypothesis, because the Z value of the sample (-2.259) is greater than the critical Z value at the level of significance $latex \alpha&s=2$ (-2.326), so it does not fall inside the left-tail rejection region.

For the P value test, you look up Z = -2.26 in the z table and get P = 0.0119. In this example P > 0.01, so again we fail to reject the null hypothesis. You cannot say that the new pages of the website are significantly faster.
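The website example can be checked numerically with a short Python sketch. It uses only the values given in the problem statement and the standard library class `statistics.NormalDist` in place of a printed z table, so the last digits may differ slightly from table lookups.

```python
from math import sqrt
from statistics import NormalDist

# Values from the problem statement
mu_0 = 3.125    # mean download time under the null hypothesis
sigma = 0.700   # population standard deviation
alpha = 0.01    # level of significance
n = 40          # sample size
x_bar = 2.875   # sample mean

std_normal = NormalDist()

# Test statistic for a mean with known sigma
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Traditional method: left-tailed critical value at alpha
z_critical = std_normal.inv_cdf(alpha)
reject_traditional = z < z_critical

# P value method: area to the left of z
p_value = std_normal.cdf(z)
reject_p_value = p_value <= alpha

print(f"z = {z:.3f}, critical z = {z_critical:.3f}")  # z = -2.259, critical z = -2.326
print(f"P-value = {p_value:.3f}")                     # P-value = 0.012
print("Reject H0:", reject_traditional, reject_p_value)
```

Both methods agree: the test statistic is not beyond the critical value and the P-value is above 0.01, so the null hypothesis is not rejected.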

Hypothesis Testing for a Proportion

An e-commerce company surveys 400 of its customers and finds that 58% of the sample are teenagers. Is it fair to say that most of the customers are teenagers?

Step 1: Find all the values and the proportion before the testing.

The sample proportion comes from the statement: here 58% of the sample are teenagers. “Most” means more than half, so the null hypothesis uses a proportion of 50%.

Sample Size, n = 400.

Step 2 : State the Null Hypothesis and the Alternative Hypothesis

Null Hypothesis

$latex H_{0}: P \leq &s=2$ 0.5

Alternative Hypothesis

$latex H_{1}:P >&s=2$ 0.5

Step 3: Set the Significance Level.

The significance level is not mentioned in this example, so you use the default of 0.05.

Step 4: Determine the type of test.

In $latex H_{1}:P >&s=2$ 0.5, the alternate hypothesis uses a greater-than sign, so you consider the right tail of the normal distribution.

Step 5: Calculate the test statistic using the following formula:

$latex z = \frac{\hat{p} - p_{0}}{\sqrt{\frac{p_{0}(1 - p_{0})}{n}}} &s=2$

Here,

$latex \hat{p} = 0.58 &s=2$, Sample Proportion

$latex p_{0} =0.50 &s=2$, Hypothesized Proportion

Substituting the values gives Z = 3.2.

Step 6: Now look up the critical Z value at $latex \alpha =0.05 &s=2$. You will get Z = 1.645. The Z value of the tested sample is 3.2, which is greater than the critical value. So you reject the null hypothesis and can say that most customers are teenagers.
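The proportion example can be reproduced with a few lines of Python. The sketch uses the values from the problem statement and `statistics.NormalDist` from the standard library instead of a printed z table. The P-value line is an extra check that the worked steps do not compute.

```python
from math import sqrt
from statistics import NormalDist

# Values from the problem statement
p_hat = 0.58   # sample proportion of teenagers
p_0 = 0.50     # proportion under the null hypothesis
n = 400        # sample size
alpha = 0.05   # default level of significance

std_normal = NormalDist()

# Test statistic for a proportion
standard_error = sqrt(p_0 * (1 - p_0) / n)
z = (p_hat - p_0) / standard_error

# Right-tailed critical value at alpha
z_critical = std_normal.inv_cdf(1 - alpha)
reject_traditional = z > z_critical

# P value method: area to the right of z
p_value = 1 - std_normal.cdf(z)
reject_p_value = p_value <= alpha

print(f"z = {z:.2f}, critical z = {z_critical:.3f}")  # z = 3.20, critical z = 1.645
print(f"P-value = {p_value:.4f}")                     # P-value = 0.0007
print("Reject H0:", reject_traditional, reject_p_value)
```

The test statistic lies well inside the right-tail rejection region and the P-value is far below 0.05, so both methods reject the null hypothesis.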

Conclusion

Hypothesis testing is a standard method for drawing conclusions about a population from sample data. Researchers use it to finalize their analysis by testing their hypotheses and either rejecting or failing to reject them. You can also apply these tests to real-world or everyday problems.
