Summarizing Dataset Variables
The author of the previous report used several techniques to summarize the data. First, the author introduces what he plans to report. He then explains the dataset by stating what each variable represents and whether it is categorical or quantitative. For instance, the author classifies “Gender”, “Are they old? Above or under 40” and “Do they like the product” as categorical variables, while “How much they would pay for it” is classified as a quantitative variable.
For the quantitative variables, the author uses descriptive summary statistics such as the mean, median and mode, while for the categorical variables he uses frequency tables and bar graphs.
Section 2
- Pivot tables that let you investigate the relationship between the variables
“old or young” and “do they like the product? hate or like”
Count of do they like product? | Column Labels |
Row Labels | old | young | Grand Total |
hate | 15 | 10 | 25 |
like | 55 | 20 | 75 |
Grand Total | 70 | 30 | 100 |
Count of do they like product? (as % of column total) | Column Labels |
Row Labels | old | young | Grand Total |
hate | 21.43% | 33.33% | 25.00% |
like | 78.57% | 66.67% | 75.00% |
Grand Total | 100.00% | 100.00% | 100.00% |
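The column percentages in the second pivot table follow directly from the counts in the first one. The short Python sketch below is only an illustration of that arithmetic; the dictionary layout and variable names are assumptions made for this example, not part of the original Excel workbook.

```python
# Counts copied from the first pivot table (rows: response, columns: age group).
counts = {
    "hate": {"old": 15, "young": 10},
    "like": {"old": 55, "young": 20},
}
groups = ("old", "young")

# Column totals: number of old and young respondents.
col_totals = {g: sum(row[g] for row in counts.values()) for g in groups}

# Each cell as a percentage of its column total, rounded to 2 decimals.
col_pct = {
    response: {g: round(100 * row[g] / col_totals[g], 2) for g in groups}
    for response, row in counts.items()
}

for response, row in col_pct.items():
    print(response, row)
```

Dividing by the column total (rather than the grand total) is what makes the percentages comparable between the two age groups, since the groups have different sizes.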
Make a simple comment
A larger share of the young people (33.33%, n = 10) hate the product compared with the old people (21.43%, n = 15), although the majority of both groups like it.
Using your sample, what is the estimate for p1 - p2? In other words, what is the difference between the sample proportions of old and young people who like the product?
0.7857 - 0.6667 = 0.119
Section 3
- A pivot table that lets you investigate the relationship between the variables
“old or young” and “how much they would pay for the product”
sample collector id | 420 |
Row Labels | Average of how much would pay? | StdDev of how much would pay? | Count of are they old? |
old | 2.520 | 1.224 | 70 |
young | 2.183 | 1.405 | 30 |
Grand Total | 2.419 | 1.283 | 100 |
Make a simple comment about the relationship between the variables
Old people are willing to pay slightly more for the product than young people.
Using your sample, what is the estimate for µ1 - µ2? In other words, what is the difference between the sample means?
2.520 - 2.183 = 0.337
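As a quick check on the pivot table, the grand-total average should equal the size-weighted average of the two group means. The sketch below illustrates this together with the difference between the sample means; the variable names are chosen for this example only.

```python
# (mean, count) per age group, copied from the pivot table for sample 420.
groups = {"old": (2.520, 70), "young": (2.183, 30)}

total_n = sum(n for _, n in groups.values())

# Weighted average of the group means reproduces the Grand Total mean.
grand_mean = sum(mean * n for mean, n in groups.values()) / total_n

# Point estimate of mu1 - mu2 (old minus young).
diff_means = groups["old"][0] - groups["young"][0]

print(round(grand_mean, 3), round(diff_means, 3))
```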
Section 4
Scatterplot
- Make a simple comment about the relationship between the variables
- Estimated profit for the casino when there 1000 bets is
Section 5
- A) Using the answer in section 2
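Test the claim that there is a difference in the proportions; use a 5% level of significance
- State an appropriate H0 and H1
Solution
H0: p1 - p2 = 0 (the proportion who like the product is the same for old and young people)
H1: p1 - p2 ≠ 0 (the proportions differ)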
- Find the p-value using only the answers to part (A) and the webpage
https://epitools.ausvet.com.au/content.php?page=z-test-2
Results
State whether or not you reject the H0
Solution
We fail to reject the null hypothesis (H0) since the p-value > 0.05
- Give a conclusion in plain English
Solution
There is no statistically significant evidence to conclude that the proportion of old people who like the product is different from the proportion of young people who like the product.
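The test in part A can be sketched in Python from the section 2 counts alone. This is a minimal illustration that assumes the usual pooled two-proportion z-test with a two-sided alternative; the function name is hypothetical, and the online calculator may use a slightly different formula, so its figures could differ in the later decimals.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test with a two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-tailed area.
    p_value = erfc(abs(z) / sqrt(2))
    return p1 - p2, z, p_value


# 55 of 70 old people and 20 of 30 young people like the product.
diff, z, p_value = two_proportion_z_test(55, 70, 20, 30)
print(f"difference={diff:.4f}, z={z:.3f}, p-value={p_value:.3f}")
```

Under these assumptions the p-value is well above 0.05, which agrees with the decision not to reject H0.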
- B) Using the answer in section 3
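Test the claim that there is a difference between the means using a 5% level of significance
- State an appropriate H0 and H1
Solution
H0: µ1 - µ2 = 0 (old and young people would pay the same amount on average)
H1: µ1 - µ2 ≠ 0 (the average amounts differ)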
- Find the p-value using the answers to part (A) and the webpage
https://www.medcalc.org/calc/comparison_of_means.php
Solution
Results
Difference | -0.337 |
Standard error | 0.279 |
95% CI | -0.8914 to 0.2174 |
t-statistic | -1.206 |
DF | 98 |
Significance level | P = 0.2306 |
State whether or not you reject H0
Solution
We fail to reject the null hypothesis (H0) since the p-value > 0.05
Give a conclusion in plain English
Solution
There is no statistically significant evidence to conclude that the average amount old people would pay is different from the average amount young people would pay.
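The standard error, t-statistic and degrees of freedom reported in part B can be reproduced from the section 3 summary statistics. The sketch below assumes the equal-variance (pooled) two-sample t-test, which is consistent with DF = 98; the function name is hypothetical. It stops at the t-statistic because the Python standard library has no t distribution, so the p-value still has to come from a calculator or a t table.

```python
from math import sqrt


def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t-statistic from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, se, diff / se, df


# Young minus old, matching the sign of the difference in the results table.
diff, se, t_stat, df = pooled_t_statistic(2.183, 1.405, 30, 2.520, 1.224, 70)
print(f"difference={diff:.3f}, SE={se:.3f}, t={t_stat:.3f}, df={df}")
```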
Section 6
Use the dataset given below. You must use your own sample.
Suppose a business has conducted an opinion poll to find out if their customers support a change to the business.
- Use the PivotTable feature in Excel to find appropriate summary statistics for your sample. You should paste both into Word; you do not need the Excel file.
This pivot table must have the number of people that answer yes and the number of people that answer no.
Solution
Row Labels | Count of do you support proposed change? |
no | 90 |
yes | 112 |
Grand Total | 202 |
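The sample size is n = 202 and the sample proportion of people who support the change is $\hat{p} = 112/202 \approx 0.5545$.
Find 90% confidence interval for the proportion of people that support the change
standard error = $\sqrt{\hat{p}(1-\hat{p})/n} = \sqrt{0.5545 \times 0.4455 / 202} = 0.03497$
Using the z distribution, 90% of sample proportions are within 1.645 standard errors of the population proportion, so the 90% confidence interval for the population proportion is between
Lower bound: $\hat{p} - 1.645 \times 0.03497 \approx 0.497$
Upper bound: $\hat{p} + 1.645 \times 0.03497 \approx 0.612$
We are 90% confident that the population proportion of people that support the change is between about 0.497 and 0.612.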
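The interval above can be checked with a few lines of Python. This is a sketch of the standard normal-approximation interval for a proportion, using the yes count (112) and total (202) from the pivot table; the function name and the default critical value argument are assumptions for this example. Figures rounded by hand at intermediate steps can differ from the computed bounds in the third decimal place.

```python
from math import sqrt


def proportion_ci(successes, n, z_crit=1.645):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    margin = z_crit * se  # 1.645 corresponds to 90% confidence
    return p_hat, se, p_hat - margin, p_hat + margin


p_hat, se, lower, upper = proportion_ci(112, 202)
print(f"p_hat={p_hat:.4f}, SE={se:.5f}, 90% CI=({lower:.4f}, {upper:.4f})")
```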
Section 7
Histogram
The histogram below shows the relationship between the variables “win or loss” and “goal difference” for the Man United football club.
- Description of the variables
The variable “win or loss” is a categorical variable because it answers the question “Was it a win or a loss?” The variable “goal difference” is a quantitative variable because its value is a number.
- Description of the relationship
The amount people would pay for the snack food is between 0 and 6
Larger goal differences are observed for wins than for losses.
- Consider the histogram you found yourself and discussed in parts (a), (b) and (c)
Would the discussion be useful in business? Give a reason for your answer.
Solution
Yes, the discussion would be useful in business since it helps predict the goal difference the team is likely to get in a win or a loss, which prepares the manager to handle either case.
- Consider the following discussion taken from the sample report you had to read in section 1. Would the discussion be useful in business? Give a reason for your answer
Solution
The discussions in section 1 are useful since they help summarize a business case. The summaries report the mean or median values, which helps decision makers plan well.
Section 8
This section is abstract so you are encouraged to try and roughly understand the following before attempting the task
- a) Using section 2
- Find the zscore of the estimate in section 2d. Note that the average of the estimates is 0.14 with standard deviation 0.088
Solution
Count of do they like product? (as % of column total) | Column Labels |
Row Labels | old | young | Grand Total |
hate | 21.43% | 33.33% | 25.00% |
like | 78.57% | 66.67% | 75.00% |
Grand Total | 100.00% | 100.00% | 100.00% |
Using part (i), find P(Z<zscore) using wolframalpha.com
- for example, if the zscore is 0.5, type in
“P(Z<0.5)”
into wolframalpha.com
If there were a list of 1000 estimates ranked from lowest to highest, roughly what rank would you expect your estimate to have?
- Hint: just use the formula
expected rank = P(Z<zscore)*1000
Solution
- Complete the following table using https://app.box.com/s/2to195ysj0deo5wawwjp53e9jlt4peqp
Which sample | Sample collector id | Rank lowest to highest | Estimate X | Zscore=(X-mean)/stdev |
Lowest estimate | 475 | 1 | -0.14306 | -3.19465 |
Estimate from allocated sample | 420 | 422 | 0.11905 | -0.2386 |
Highest estimate | 663 | 1000 | 0.543672 | 4.570203 |
b) Using section 3
Find the zscore of the estimate in section 3c. Note that the average of the estimates is 0.408 with standard deviation 0.26
Solution
sample collector id | 420 |
Row Labels | Average of how much would pay? | StdDev of how much would pay? | Count of are they old? |
old | 2.520 | 1.224 | 70 |
young | 2.183 | 1.405 | 30 |
Grand Total | 2.419 | 1.283 | 100 |
The estimate is $\bar{x}_1 - \bar{x}_2$ = 2.520 - 2.183 = 0.337
So the zscore is $(0.337 - 0.408)/0.26 \approx -0.273$
Using part (ii), what is P(Z<zscore)? You can find the answer using wolframalpha.com
- for example, if the zscore = -1, type in
P(Z<-1)
into wolframalpha.com
- If there were a list of 1000 estimates ranked from lowest to highest, what rank do you think your estimate would be close to? Hint: just use the formula
expected rank = P(Z<zscore)*1000
- Complete the following table, using https://app.box.com/s/kiqemn0h0m3d03uygo1dhemvx4e5uf6r
Which sample | Sample collector id | Rank lowest to highest | Estimate X | Zscore=(X-mean)/stdev |
Lowest estimate | 475 | 1 | -0.43474 | -3.23897 |
Estimate from allocated sample | 420 | 416 | 0.3367 | -0.27308 |
Highest estimate | 663 | 1000 | 1.607576 | 4.613465 |
c) Using section 4
Find the zscore of the slope estimate in section 4a. Note that the average of the estimates is 0.952 with standard deviation 0.237
Solution
Using part (ii), what is P(Z<zscore)? You can find the answer using wolframalpha.com
- for example, if the zscore = -1, type in
P(Z<-1)
into wolframalpha.com
- If there were a list of 1000 estimates ranked from lowest to highest, what rank do you think your estimate would be close to? Hint: just use the formula
expected rank = P(Z<zscore)*1000
- Summary of some of the 1000 estimates; the full list of estimates is available from https://app.box.com/s/35a0x0hnxcqq2qh6krzua6qp587fke51
Which sample | Sample collector id | Rank lowest to highest | Estimate X | Zscore=(X-mean)/stdev |
Lowest estimate | 141 | 1 | -0.00348010 | -4.03134 |
Estimate from allocated sample | 420 | 471 | 0.93864267 | -0.05654 |
Highest estimate | 683 | 1000 | 3.878984 | 3.876998 |
For parts a, b and c, compare the predicted rank for your sample from part iii, found using P(Z<zscore), to the actual rank in part iv
Solution
Section | Predicted rank | Actual rank |
Section 2 | 406 | 422 |
Section 3 | 392 | 416 |
Section 4 | 478 | 471 |
As can be seen, the predicted and actual ranks differ slightly; in none of the three sections were they the same.
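The predicted ranks come from the formula expected rank = P(Z<zscore)*1000 applied to each section's estimate. The Python sketch below reproduces that calculation using the standard normal CDF written with `math.erf` in place of wolframalpha.com; the function names and the dictionary of inputs are assumptions for this example. The means and standard deviations are the rounded values quoted in the task, so the z-scores can differ slightly from those in the tables.

```python
from math import erf, sqrt


def normal_cdf(z):
    """P(Z < z) for a standard normal variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def expected_rank(estimate, mean, stdev, n_estimates=1000):
    """z-score of an estimate and its expected rank among n_estimates."""
    z = (estimate - mean) / stdev
    return z, normal_cdf(z) * n_estimates


# (estimate from sample 420, mean of estimates, stdev of estimates)
inputs = {
    "Section 2": (0.119, 0.14, 0.088),
    "Section 3": (0.337, 0.408, 0.26),
    "Section 4": (0.93864267, 0.952, 0.237),
}

ranks = {}
for section, (x, mean, stdev) in inputs.items():
    z, rank = expected_rank(x, mean, stdev)
    ranks[section] = rank
    print(f"{section}: z={z:.4f}, predicted rank={rank:.0f}")
```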
- Comment on the connection between the following facts
* “Part (d) shows that totally different populations with totally different variables have the same sampling distribution (the normal distribution)”
* “Hypothesis testing uses a sampling distribution; the p-value is a shaded area on the sampling distribution”
Solution
Yes, the predicted results differed from the actual values because they are based on samples, which are expected to come from the same population but share only approximately similar characteristics.