BUS308– Week 1 Lecture 2
Describing Data
Expected Outcomes
After reading this lecture, the student should be familiar with:
1. Basic descriptive statistics for data location
2. Basic descriptive statistics for data consistency
3. Basic descriptive statistics for data position
4. Basic approaches for describing likelihood
5. Difference between descriptive and inferential statistics
What this lecture covers
This lecture focuses on describing data and how these descriptions can be used in an
analysis. It also introduces and defines some specific descriptive statistical tools and results.
Even if we never become data detectives or run statistical tests ourselves, we will be
bombarded with statistics and statistical outcomes. We need to understand what they are telling
us and how they help uncover what the data means about the "crime," AKA the research question or issue.
How we obtain these results will be covered in Lecture 1-3.
Detecting
In our favorite detective shows, starting out always seems difficult. They have a crime,
but no real clues or suspects, no idea of what happened, no “theory of the crime,” etc. Much as
we are at this point with our question on equal pay for equal work.
The process followed is remarkably similar across the different shows. First, a case or
situation presents itself. The heroes start by understanding the background of the situation and
those involved. They move on to collecting clues and following hints, some of which do not pan
out to be helpful. They then start to build relationships between and among clues and facts,
tossing out ideas that seemed good but led to dead-ends or unhelpful insights (false leads,
etc.). Finally, a conclusion is reached and the initial question of “who done it” is solved.
Data analysis, and specifically statistical analysis, is done quite the same way as we will
see.
Descriptive Statistics
Week 1 Clues
We are interested in whether or not males and females are paid the same for doing equal
work. So, how do we go about answering this question? The “victim” in this question could be
considered the difference in pay between males and females, specifically when they are doing
equal work. An initial examination (Doc, was it murder or an accident?) involves obtaining
basic information to see if we even have cause to worry.
The first action in any analysis involves collecting the data. This generally involves
conducting a random sample from the population of employees so that we have a manageable
data set to operate from. In this case, our sample, presented in Lecture 1, gave us 25 males and
25 females spread throughout the company. A quick look at the sample by HR provided us with
assurance that the group looked representative of the company workforce we are concerned with
as a whole. Now we can confidently collect clues to see if we should be concerned or not.
As with any detective, the first issue is to understand the “who” and “what” about the
victim. In this case, we need to use our sample to understand basic information about how males
and females are paid. Understanding data sets typically involves looking at several characteristics.
These descriptive measures describe the data set. Typical descriptive measures include:
• Measures of location such as the average (AKA mean), the median (middle
point), and mode (most often occurring value if it exists).
• Measures of consistency such as range (largest value minus the smallest value),
variance, and standard deviation.
• Measure of position showing where a single data point is within the data set, such
as percentile and rank.
• Measures of likelihood showing the probability of obtaining specific outcomes.
Note: Descriptive statistics describe a particular data set and can only be used for that
data set. However, often we want to use a sample to “infer” back to a larger population. In this
case, we would use inferential statistics. Most measures, except for variance and standard
deviation, are calculated the same way. We will see the specific difference for those two later in
this lecture.
The key to whether we have descriptive statistics or inferential statistics lies with the
group we are taking the measures on. If we are only concerned with that group, we use
descriptive statistics. If, however, we want to use that group to make inferences, claims, and
conclusions about a larger population, then we take a random sample from the population and
use inferential statistics (allowing us to infer back to the population). Our class data sets – both
the lecture and homework – are random samples from a larger population, so we will basically
be using inferential statistical measures.
Note that these are not the complete list of possible descriptive statistics. Excel’s
Descriptive Statistics function (described in Lecture 3 for this week) includes a couple of
measures that focus on data distribution shape. These have some specialized uses that we will
not be getting into.
Location Measures
Perhaps the most frequently asked question about a data set is: what is the average? The intent
is to get a measure that shows us the center of the data. Unfortunately, "average" is a somewhat
imprecise term that could refer to any of the three measures of location identified above. So, as
analysts we tend to be more precise and use mean, median, and mode.
While these all tell us something about where the data might be clustered, they can
provide very different views of the data. An example of this comes from an example the author
heard back in High School. At that time, the mean per capita income for citizens of Kuwait was
about $25,000; the median income was around $125; and the mode was $25! The very high (due
to oil revenues) income of the Royal family accounted for much of this difference, but just look
at the different impressions we get about the country depending on which value we look at.
• Mean, AKA average, is the sum of all the values divided by the count. This can be
considered the "weighted center" of the data set. For example, the mean of 1, 2, 3, 4, and
5 = (1+2+3+4+5)/5 = 15/5 = 3. The mean is generally the best measure for any data set as
it uses all of the data values; it requires interval or ratio level data. Thus, while we can
average salary, compa-ratio, seniority, etc., we cannot average gender or gender1 (even if
one is coded in numbers) or grade in our data set.
• The median is the middle value in an ordered (listed from low to high) data set. This is
the “physical center” of a data set. For example, the median of 1, 2, 3, 4, and 5 = 3, the
middle value. If we have an even number of values, the median is the average of the
middle two values. Medians can be found on ordinal, interval, or ratio level data.
• The mode is the most frequently occurring value. This is more or less the “popular
center” as it is where most numbers group together. A data set may have no modes or
one or more. Modes may occur with any level of data. The data set 1,1,2,2,2,2,3,8,8,9
has a primary mode of 2, and two secondary modes of 1 and 8.
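For readers who want to check these three measures outside of Excel, here is a minimal Python sketch using only the standard library (the values are the small example sets from the bullets above):

```python
import statistics
from collections import Counter

values = [1, 2, 3, 4, 5]
print(statistics.mean(values))    # the weighted center: 3
print(statistics.median(values))  # the physical center: 3

# The lecture's multi-mode example: 2 is the primary mode, and
# Counter lets us see the secondary modes (1 and 8) as well.
grouped = [1, 1, 2, 2, 2, 2, 3, 8, 8, 9]
print(Counter(grouped).most_common(3))  # [(2, 4), (1, 2), (8, 2)]
```

In Excel, the equivalent built-in functions are AVERAGE, MEDIAN, and MODE.SNGL.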
Consistency/Variation Measures
While they do not have the popularity of their location cousins, knowing the consistency
or variation within the data is as important as, some say even more important than, knowing the
central tendency if we are to understand what the data is trying to tell us. Very consistent data, with
little variation, has a mean that is very representative of the data and is unlikely to change much
if we resample the population. Data with a large amount of variation tends to have unstable
means, meaning that these values would change a lot with multiple samples. Inconsistent data
(having large variation) is often a problem for businesses, particularly for manufacturing
operations, as it means the results they produce differ and might often not meet the quality
specifications. Predictions based on data with large variations are rarely useful. Consider
attempting to estimate how long it would take you to get to work if your route had frequent
traffic accidents that made the travel time different every day.
The key measures of variation are:
• Range, which equals the maximum value minus the minimum value. For our
example data set of 1, 2, 3, 4, and 5, the range is 5 – 1 = 4.
• Variance, which is the average of the squared differences between each value in the
data set and the mean. To get the variance, find the mean of the data, subtract this
value from each of the data points, square each result (to get rid of the negative
differences), add them up, and divide by the total count. For our example data set,
this would look like:
Value Mean Difference Squared
1 3 -2 4
2 3 -1 1
3 3 0 0
4 3 1 1
5 3 2 4
Sum = 10
Variance = 10/5 = 2
The problem with variance is that it is expressed in squared units. So, if our data set
were dollars, the variance would be 2 dollars squared. How should we interpret
dollars squared? In general, we do not; we use the next measure instead.
• Standard Deviation is the (positive) square root of the variance. It returns the
dispersion measure to the same units as the original data, so we can compare it to the
data values. For our example, the standard deviation is the square root of 2 dollars
squared, or about 1.4 dollars. This much easier to interpret measure says that, roughly,
the typical difference from the mean is 1.4 dollars (in our example above, which has a
mean or average value of 3 dollars).
• Important point about the variance and standard deviation: when we find these
values for a population, the entire group we are interested in, we divide the sum of
squared differences by the number of values (the count). However, when we have a
sample from the entire group (and want to use this sample to estimate the population
value for either variance or standard deviation), we create the inferential estimate by
dividing by (count – 1). This adjustment increases the estimate to account for the fact
that we most likely do not have the extreme low and extreme high values from the
population in our sample, so the sample's variation is less than that of the group we
are using the sample to describe.
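The divide-by-count versus divide-by-(count – 1) distinction is easy to check outside Excel. This illustrative Python sketch uses the standard library's population and sample versions on the lecture's example data set:

```python
import statistics

data = [1, 2, 3, 4, 5]  # the lecture's example data set

# Treating the five values as the entire population: divide by n.
print(statistics.pvariance(data))  # 2.0
print(statistics.pstdev(data))     # about 1.414

# Treating them as a sample used to estimate a larger population:
# divide by (n - 1), which gives a slightly larger estimate.
print(statistics.variance(data))   # 2.5
print(statistics.stdev(data))      # about 1.581
```

Excel makes the same distinction with VAR.P/STDEV.P (population) versus VAR.S/STDEV.S (sample).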
Just as detectives want to know what victims typically did and how consistent they were
in their behavior around the time of the crime (For example: Was he usually in this area, and if
not, why last night?), examining location and consistency measures provide a similar perspective
on data variables and how they behave.
Applying the Information: Equal Pay Questions
OK, we can now start looking at our data set to see what the numbers are hiding, and
develop some clues. As with all analysis, we start with questions, then identify the tools to use
for those questions, and finally apply those tools to the data. Our initial question is, do males
and females get equal pay for equal work? We also said we needed to start with the question of
whether or not we had some measures that showed pay comparisons between males and females.
Let’s take a look at some of the group and sub-group data. A couple of measures that might
answer this question are:
• What are the group averages for each variable?
• What are the average male and female compa-ratios? (Remember, you will work with
the Salary variable in the homework.)
• How consistent are the compa-ratios for each?
Note that we will be focusing on the compa-ratio data in the lectures, while you will
focus on the same questions using salary in the weekly homework assignments. As described,
compa-ratio is the result of dividing an employee’s salary by their grade midpoint. It generally
ranges from about 0.80 to 1.20 in most pay plans. The value of this measure is it removes the
impact of different grades (each of which we are assuming are different levels of work from
other grades and contain equal work for all the jobs within the grade). While not a perfect
measure, it is the start of measuring what is paid for equal work. Side note: a grade’s midpoint is
generally pegged to the average market pay needed to hire new employees into a job.
Week 1 Question 1
Question 1 asks for some summary statistics. Part A asks you to use the Excel Descriptive
Statistics function (more on this in the third lecture), while Part B asks for some specific statistics
using the Fx function list (again, how to do this is covered in Lecture 3). The purpose of these
specific requests is to let you show mastery in using these two Excel tools.
For Part A, the mean, standard deviation, and range of the entire compa-ratio data set are
highlighted. This shows us that the mean is 1.062, the standard deviation is 0.077, and the range
is 0.34. As interesting as these values are, by themselves they do not really tell us anything.
Measures generally need to be compared to provide information.
This is where part b comes in. We see that the male and female averages (1.056 and
1.069 (rounded) respectively) appear relatively close and are on opposite sides of the overall
mean. The standard deviations are also close at 0.084 and 0.07 and surround the standard
deviation from the entire data set. The ranges are both smaller than the overall range – meaning
that neither gender has both the smallest and largest value. The female compa-ratios appear to
be slightly more clustered (less variation, more consistent) than the male values from both the
range and standard deviation results.
Two things stand out. First, perhaps surprisingly, the females appear to be paid more
relative to their grade midpoints than the males. Second, measures of dispersion appear fairly
close with males being slightly more spread out than females. So far, nothing seems to create
any concerns as we expect sample results to be a bit different than the overall population values.
These differences seem to be small enough that they might be simple sampling errors – if we
resampled (such as the data set you will be working with) the male and female results might
switch.
Remember, when you do this problem in the homework, use the salary data. As practice,
you can copy the data set into a practice Excel file and try to replicate the answers shown
in the lectures. Ask a question if you are unsure of how to do this or do not get the same
results using the lecture data set.
Position Measures
Often, we are interested in where within a data set a particular measure falls. This opens
up the idea of distributions, how the data values are spread across the range of values. Our
detectives would be looking at where victims typically went and where they spent their time –
the pattern of their normal behavior.
Distributions. Location and consistency measures are important for summarizing the
data set. Important as they are, they do not always give us all the information we need. At times
we want to know how specific values fit within the data set. For example, we might want to
compare the 10th highest male and female value to get a sense of how relative positions within
the data range differ. This often means we need to examine the distribution, or shape, of the
data. This shows us how all the data values relate to all of the other values within the sample.
One important tool in analyzing data sets that we will not cover (we cannot cover
everything, alas) is graphical analysis – looking at how data sets are distributed when graphed.
One example will show how powerful these techniques can be. One very common graph is a
histogram – a count of how many times a certain value occurs. For example, if you tossed a pair
of dice 50 times, you might get the following results. The table below shows the results we got,
while the histogram (a bar chart of these counts) shows the distribution or shape of the data, with
the x-axis (horizontal) showing the sum of the numbers on the two faces and the y-axis (vertical)
showing how often we observed a particular result in our 50 tosses.

Outcomes from tossing a pair of dice
Count showing:  2  3  4  5  6  7  8  9 10 11 12
Frequency seen: 1  2  4  3  9 12  7  5  4  1  2
A couple of things we can do with distributions can be easily shown with this histogram.
First, we can find the center, in this case 7. We can see that there are two tails around the center,
one to the left showing counts for values less than the middle value of 7, and one to the right
showing how often we got values greater than 7. Visually, we can see that the further away from
the center we get, the less often – or less likely – we are to get any particular outcome. Ways to
quantify these observations are discussed below.
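The frequency table above can be turned into a rough text histogram without any charting tool; a small Python sketch using the lecture's counts:

```python
# The lecture's observed results: sum of two dice -> times seen in 50 tosses.
frequency = {2: 1, 3: 2, 4: 4, 5: 3, 6: 9, 7: 12, 8: 7,
             9: 5, 10: 4, 11: 1, 12: 2}

# A quick text histogram: one '*' per time the outcome was observed.
for outcome in sorted(frequency):
    print(f"{outcome:>2}: {'*' * frequency[outcome]}")

# The tallest bar marks the center (7); counts fall off in both tails.
print(sum(frequency.values()))  # 50 tosses in all
```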
Our detectives use this logic when they attempt to find out where all the persons of
interest were at the critical times. These approaches provide more detailed information about
how the data looks more specifically than the summaries of dispersion examined earlier.
Position Measures. Central tendency and variation are group descriptive measures –
particularly the mean and standard deviation, which use all the values in the data set in their
calculation. At times, however, we are concerned with specific values within the distribution,
such as:
• Quartiles,
• Percentiles, or centiles,
• The 5-number summary, or
• Z-score.
Quartiles and Percentiles. These measures divide the data into groups, four with the
quartile and 100 with the percentile. One example that many of you might be familiar with is
percentile (AKA percentile rank). This is often used when doctors describe a child as in the 80th
percentile in height or weight for his/her age. This means that 80% of other children at this age
are at or below this particular child's measure. Percentiles range from 1 to 100%-tile, meaning
the lowest score would be at the first (or 1%-tile) and the highest score would be at the 100%-
tile. Percentiles are very useful for comparing groups.
The general percentile formula lets us find percentiles, deciles (the 10% divisions), and/or
quartiles, although Excel will do this for us. The formula is:
Lp = (n+1) * P/100; where
Lp is the location (rank position) of the desired percentile in the ordered data (using P =
25 would give the location of the first quartile, for example)
n is the size/count of the data set
P is the desired percentile; using 25, 50, or 75 gives the quartile points, while using 10,
20, etc. would give the decile points.
Example: if we wanted to find the cut-off for the first (or lowest) quartile of the data, also
known as the 25th percentile in a data set of 50, we would use (50+1)*25/100 = 12.75, or
the 13th value from the bottom in an ordered list. By convention, we always round up to
the next whole value.
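This location rule is simple enough to sketch in a few lines of Python (an illustration of the formula, not part of the course's Excel work):

```python
import math

def percentile_location(n, p):
    """Location of the Pth percentile in an ordered list of n values,
    using the lecture's Lp = (n + 1) * P / 100 rule; fractional results
    are rounded up to the next whole position by convention."""
    return math.ceil((n + 1) * p / 100)

print(percentile_location(50, 25))  # 13 -> the 13th value from the bottom
print(percentile_location(25, 25))  # 7  -> the 7th value (each gender group)
```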
5-Number Summary. As its name suggests, the 5-number summary identifies five key
values in a data set: minimum value, 1st quartile, median or 2nd quartile, 3rd quartile, and
maximum values. For the compa-ratio data set used in the lectures, the 5-number summary can
be found from the following table results. The 1st quartile, for either gender group of 25, is
located at (25+1) * 25/100 = 6.5, or the 7th value in a rank ordered list. The 3rd quartile is
located at position 19.5, or the 20th value. For the entire sample of 50, these values are located
at the 13th and 39th rank ordered places, respectively. Here is the 5-number summary for the
overall compa-ratio values in the sample:
Compa-ratio 5-number summary: 0.870, 1.013, 1.051, 1.134, 1.210.
More on this shortly.
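Putting the pieces together, here is a minimal Python sketch of a 5-number summary using the lecture's quartile-location rule (the 11-value data set is a made-up illustration chosen so the positions come out whole):

```python
import math

def five_number_summary(values):
    """Minimum, Q1, median, Q3, and maximum, locating the quartiles with
    the lecture's (n + 1) * P / 100 rule (positions rounded up)."""
    ordered = sorted(values)
    n = len(ordered)

    def at_percentile(p):
        # Convert the 1-based location to a 0-based list index.
        return ordered[math.ceil((n + 1) * p / 100) - 1]

    return (ordered[0], at_percentile(25), at_percentile(50),
            at_percentile(75), ordered[-1])

print(five_number_summary(list(range(1, 12))))  # (1, 3, 6, 9, 11)
```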
Z-score. What is often of more value is looking at where specific measures lie within
each range. The z-score measures show how far from the mean a specific data point lies,
measured in standard deviation units. (I know that sounds strange but keep reading.) The Z-
score provides a measure of how many standard deviations a particular score lies from the mean,
and in what direction (above or below). The Z-score formula is:
Z = (individual score – mean) / (standard deviation).
Looking at this formula we can see that a score above the mean would give us a positive
z-score, a score below the mean would give us a negative z-score, and a score that exactly equals
the mean would gives us a z-score of 0. For most data sets, the z-score ranges from a -3.0 to a
+3.0.
For example, in our example data set (1, 2, 3, 4, and 5) (see above for descriptive
statistics on this data set), the z-score for 2 would be (2-3)/1.4 = -1/1.4 = -0.71. The negative
value means that 2 is below (or less than) the mean, and it lies 0.71 standard deviation units away
from the mean (0.71 times the standard deviation of 1.4 ≈ 1, the actual distance).
Using this measure, we can easily examine relative placement of scores. For example, a
compa-ratio of 1.06 would have Z-scores of 0.048 for males, -0.129 for females, and -0.03 for
the overall group. (We will see how we got these values shortly.) Thus, we can see that a person
with this compa-ratio is slightly above average for males, but below average for the overall
group and for females.
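The z-score formula translates directly into code. This illustrative Python snippet reproduces the small example above and the compa-ratio comparisons just mentioned (using the group means and standard deviations from the Question 1 results):

```python
def z_score(x, mean, std_dev):
    """How many standard deviations x lies from the mean (the sign
    shows the direction: above or below)."""
    return (x - mean) / std_dev

# The lecture's small data set: mean 3, population standard deviation ~1.4.
print(round(z_score(2, 3, 1.4), 2))  # -0.71

# A compa-ratio of 1.06 against each group (means and standard
# deviations taken from the lecture's Question 1 results):
print(round(z_score(1.06, 1.056, 0.084), 3))  # 0.048 for males
print(round(z_score(1.06, 1.069, 0.07), 3))   # -0.129 for females
```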
Applying the information
Week 1 Question 2
Question 2 asks for a 5-number summary for the overall compa-ratio data set as well as for the
male and female sub-groups within the data.
Note: Lecture 1-3 will show the same screen shot with the cell formulas displayed.
One of the first things we see confirms an earlier observation: neither the male
nor the female data set has both the largest and smallest values. The males appear to have a slightly
lower overall range of values than do the females. Some other interesting observations include
the relatively similar 3rd quartile values for all three groups and the lower midpoint for females,
meaning that more females are lower in the overall range than males. More males are in the first
quartile than females. What other observations can you make about how employees are
distributed within their respective compa-ratio ranges?
Week 1 Question 3
Often looking at how a single point lies within a data range is helpful to get some insight
into how the distributions are positioned. Question 3 asks for us to examine where the midpoint
of each gender’s data set fits within the entire compa-ratio data set. The PERCENTRANK.EXC
function returns a percentile rank, the percent of data values that fall at or below a given value. For
example, the PERCENTRANK.EXC of the median would be the 50%-tile, as half the values are
above and half below the median (as expected).
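For the curious, here is a rough Python analogue of the percentile-rank idea (an approximation of how PERCENTRANK.EXC behaves for a value that appears in the list, not Excel's exact implementation):

```python
def percentile_rank_exc(values, x):
    """Rough analogue of Excel's PERCENTRANK.EXC for a value that
    appears in the list: (count of values below x + 1) / (n + 1)."""
    below = sum(1 for v in values if v < x)
    return (below + 1) / (len(values) + 1)

# The median of a simple 5-value set sits at the 50%-tile, as expected.
print(percentile_rank_exc([1, 2, 3, 4, 5], 3))  # 0.5
```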
When we look at the male median, we see it falls at the 51st %-tile, meaning it is slightly
above the overall median. The female median (half of the female compa-ratios are below this
value remember) falls at the 33rd %-tile! This means that most of the females are in the bottom
half of the distribution, even though (from Question 2), females have the “higher” range.
Interesting.
The z score is a measure of relative placement based on the mean rather than the median.
A value that equals the mean would have a z score of 0, a value that is greater than the mean
would have a positive z score, while a value less than the mean would have a negative z score.
Both the male and female medians fall below the overall compa-ratio mean, with the female
median being relatively lower in the distribution. This is consistent with what the percentile
scores suggested. Overall, these two questions suggest that males and females are not
distributed the same within the compa-ratio data set.
Likelihood Measures
Likelihood, or probability, focuses on how often we can expect to see an outcome. In
statistics, many decisions are made based upon how likely, or more accurately, how unlikely it is
to see an outcome.
Probability
Probability is the likelihood that an event will happen. For example, if we toss a fair
coin, we have a 50/50 chance, or a probability of .5, of getting a head. If we pick a date at random
between the 1st and the 7th of the current month, we have a 1 out of 7 chance (or a probability of
1/7 = .14, or 14%) that it will fall on a Wednesday. Statisticians recognize three types of probabilities:
• Theoretical – based on a theory. For example, since a die (half of a pair of dice) has 6
sides, and our theory says each face is equally likely to show up when we toss it, we
therefore expect to see a 1 on 1/6th of our tosses (assuming we toss it a lot).
• Empirical – count based. If we see that an accident happens on our way to work 5
times (days) within every 4 weeks, we can say the probability of an accident today is 5/20,
or 25%, since there are 20 work days within a 4-week period. An empirical probability
equals the number of successes we see divided by the number of times we could have
seen the outcome.
• Subjective – a guess based on some experience or feeling.
There are some basic probability rules that will be helpful during the course. The
probability
• of something (an event) happening is called P(event),
• of two things happening together – called joint probability: P(A and B),
• of either one or the other but not both events occurring – P(A or B),
• of something occurring given that something else has occurred, conditional probability:
P(A|B) (read as probability of A given B).
• Complement rule: P(not A) = 1 - P(A).
Two other terms are needed. Mutually exclusive means that the elements of
one data set do not belong to another – for example, males and pregnant are mutually exclusive
data sets. The other term we frequently hear with probability is collectively exhaustive – this
simply means that all members of the data set are listed.
Some rules, which apply for both theoretical and empirical based probabilities, for
dealing with these different probability situations include:
• P(event) = (number of successes)/(number of attempts or possible outcomes)
• P(A and B) = P(A)*P(B) for independent events, or P(A)*P(B|A) for dependent events.
(This last term, P(B|A), is called a conditional probability: the probability of B occurring
given that A has occurred.)
• P(A or B) = P(A) + P(B) – P(A and B); if A and B cannot occur together (such as the
example of male and pregnant) then P(A and B) = 0
• P(A|B) = P(A and B)/P(B).
One of the more interesting uses of probabilities (other than forecasting the likelihood of
rain on our days off) is the comparing of outcome likelihoods for different groups.
• The probability of randomly picking a female [P(F)] is the same as that of randomly
picking a male [P(M)] from the group: 25 specified outcomes/50 possible outcomes =
0.5. This is a simple empirical probability – counts divided by counts.
• We can get a bit more complicated, such as the probability of picking a female from a
specific grade such as B – P(F|B), probability of picking a female given (from) only
grade B. Again, this is empirical – we have 7 employees in grade B, and 4 of these
are females, so P(F|B) = 4/7.
• Now, the probability of picking a female who is also in grade B (from the entire data
set) is 4 females out of 50 = 4/50 = 0.08, empirically. We can find this using the P(A
and B) formula referenced above: P(F and B) = P(F)*P(B|F), since the events of
female and grade B are not independent. So, we know P(F) = .5 and P(B|F) = 4/25 =
.16 (4 of the 25 females are in grade B), so by theory, P(Female and grade B) = .5 * .16 =
0.08, the same result.
• The complement rule is often helpful. If we want to find the probability of picking a
female from any grade EXCEPT grade B, we could figure out the probability for each of
the other grades and add them together, OR we could apply the complement within the
female group: P(not B|F) = 1 – P(B|F) = 1 – 0.16 = 0.84, so P(Female and not grade B) =
.5 * 0.84 = 0.42 (which also equals P(F) – P(F and B) = 0.5 – 0.08). We will use this
property of probabilities a lot in the rest of the class.
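These counts can be checked with a few lines of Python. The sketch below assumes the lecture's counts (25 females out of 50 employees, 4 females in grade B) and also computes "female and NOT grade B," which equals P(F) minus the joint probability P(F and B):

```python
# Counts from the lecture's sample: 50 employees, 25 female,
# and 4 of those females are in grade B.
total, females, females_in_b = 50, 25, 4

p_f = females / total                 # 0.5
p_b_given_f = females_in_b / females  # 0.16, a conditional probability
p_f_and_b = p_f * p_b_given_f         # multiplication rule, dependent events

print(round(p_f_and_b, 2))            # 0.08, matching the direct count 4/50
print(round(p_f - p_f_and_b, 2))      # female but NOT grade B: 0.42
```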
As we can see, probabilities can show us a lot and can be somewhat complex in determining
their values. The nice thing is that this is about as complicated as it gets.
Applying the information
Week 1 Question 4
Question 4 gives us some probability values: how likely are we to exceed the respective
gender midpoints in the entire data set? We are looking at the empirical and normal curve
probabilities. If the data set is normally distributed, the probabilities should be fairly close; if
not, we have a clue that the data might not be normally distributed over the entire data range.
Empirically, the probability of a male value exceeding the male midpoint within the entire
data set is 50% (close to the 51st percentile value we got above); assuming normality, it is 55% –
fairly close. The female probabilities are 68% and 60%, respectively; again, not too far off.
The data again support the idea that a lot of females are at the higher end of the compa-
ratio distribution.
Drawing Conclusions: Week 1 Question 5
As interesting as the numbers are themselves, they mean very little unless we can
interpret their meaning and apply that insight to the question(s) at hand.
Recapping our results, we see that while the overall female average compa-ratio is somewhat
higher than the males’, the probability and distribution outcomes suggest that males and females
are not distributed in the same fashion and that more of the females are relatively lower in their
range than the males.
While we have not yet accounted for equal work, it appears that there are some issues
suggesting that males and females are not paid the same within the company – at least enough
to warrant more investigation.
On our detective shows, we might say that we have some evidence, but not enough to
take it to the grand jury for an indictment yet.
Summary
This lecture looked at descriptive statistics and what they can tell us about the data set.
We reviewed the questions that are asked in the Week 1 assignment and the answers for each
question using the COMPA-RATIO variable. The focus of this lecture was on interpreting
presented results, as that is a more frequent activity for professionals than actually developing
the measures.
Specifically, we looked at developing the following information.
Note that this was created by listing each tool as we introduced it, its data requirements,
and then a typical question that would require that tool. By copying this information to a second
Excel sheet and sorting the columns, we can create a guide as to when to use each tool, as shown
below.
Now, we move on to some specific ways to set-up Excel to provide the results that we
just looked at.
Before we do, however, please respond to Discussion Thread 2 for this week with your
initial response and responses to others over a couple of days before moving on to reading the
second lecture for the week.
Please ask your instructor if you have any questions about this material.
BUS308– Week 1 Lecture 3
Setting Up Excel: Descriptive Data
Expected Outcomes
After reading this lecture, the student should be familiar with:
1. How to copy and paste data into columns on the assignment pages.
2. Excel’s Analysis Toolpak.
3. How to use the Descriptive Statistics function within the Analysis Toolpak.
4. Excel’s FX (or Formulas) functions.
5. How to use Excel formulas in cells to obtain numerical results.
What this lecture covers
This lecture focuses on using Excel’s functions and tools to develop numerical answers to
specific data related questions. Our data detectives started this week with the whiff of a possible
“crime” or issue for our company. We looked at some of the logic and reasons for using
statistics to uncover hidden meaning in data as well as clues in the first lecture. We then moved
on to look at how to examine and interpret meaning behind summary statistics (for location,
consistency, position, and probability) and found that most value comes from comparing
measures rather than simply examining them without any reference points. We are now moving
to close the loop and provide you with some tools to summarize and condense data sets into
meaningful summaries using Excel.
This lecture covers the detective tools available through Excel. If you are not familiar
with Excel, you will be amazed at what it can do for us. If you are familiar with Excel, you
might not know the statistical analytical power it has (at least that is what past classes have said).
Side note – much of what we will cover with Excel provides an excellent starting point for
classes in accounting, finance, and operations management (to name a few).
How do we start?
If you have not already done so, please download the Excel Student assignment file to
your computer. This file is in the Required Resources link. If you have not loaded the analysis
toolpak, please do that now as well. (See the BUS308 material sent out before the course started
for instructions on how to do this. This material was sent to your school email address.)
Students in previous sections recommend this video on loading the Toolpak (note, this
video continues on to show the use of descriptive statistics as well – not yet needed):
Video Link: http://www.youtube.com/watch?v=4_9vGqQaCFk
The lectures will use the data set provided in Lecture 1. As mentioned, this is similar to,
but not identical to the data set you will use in the weekly homework assignments. A note about
your homework. All the information we need to answer the weekly assignments – and,
eventually find the answer to our mystery – is located in the Student Assignment file on the Data
tab. You should download this file and save it on your computer for best results. The other best
practice is to NEVER manipulate the original data set found in the data tab. Doing so runs the
risk of “messing” up the data relationships – the links between each of the variables to specific
employees. For example, if you sort the gender variable by gender, we lose the link between
which employee are male and female – an important piece of information. What we should do is
to copy and paste the appropriate variable columns to the worksheets for each weekly
assignment, even a couple of times if we need to.
Setting up the data
The first step in describing data, for both descriptive and inferential statistics, involves
setting up the data so that Excel can access only the data variables you want to consider for each
question. This is quite important for, as an old computing expression goes, GIGO (garbage in,
garbage out). One key rule: Never mess with the original data set. Again, always copy and paste
it to the work sheet rather than doing any work with the original data set. This way, if something
goes wrong and the data gets lost or mixed up, we can always go back to the original.
In Excel, we can copy and paste data using the CTRL-C and CTRL-V keys. First,
use the mouse to highlight the range you want to copy, then press CTRL-C (the CTRL and C keys
pressed at the same time). Then go to the worksheet and cell where you want to place the data
and press CTRL-V to paste. For those not familiar with these approaches, here is a short (8:10
minutes) video on cutting and pasting data. (Note: for purposes of this video example, only a
portion of each data range is used – this allows the sorting example to be shown more clearly. In
the examples below and in your homework, use all 50 values.)
Video Link: https://screencast-o-matic.com/watch/cbQX3iII2K
(You might have to click on the arrow on the video to get it started and/or enlarge the viewing
window)
Examples
Note: The following discussion will mirror the questions and activities that you will be asked to
do in the Week 1 assignment. However, while you will be asked to analyze salary in the
assignments, this example (and the others in the succeeding weeks) will focus on the second
measure of pay, the compa-ratio. Remember, the data used in the lectures is from a
different sample and therefore is not exactly the same as the data you will use in the
assignments. However, the conclusions reached in each lecture example should be
considered when you make your conclusions about what the data tells us each week.
Question 1: Summary Statistics
A suggestion, before starting with any question read all the parts and identify the
variables needed. The first question asks for some descriptive statistics on the SALARY variable
in the overall group, and then, in Part b, for male and female subgroup specific descriptive
statistics.
Remember, your assignments work with Salary and these lecture examples work with
Compa-ratios. So, whenever the lecture examples mention “compa-ratio” you will be doing the
same thing only using “salary” in your assignments. (I know this is beginning to sound
repetitive, but it is an issue with some students who use the compa-ratio in the homework and
lose points for doing this.)
Lecture Example. Question 1 asks for some descriptive statistics, so we first need to
copy and paste the data into our worksheet. Since we need the compa-ratio and gender variables
(working with Gender1 is easier) for this question, we need to copy and paste these two columns
(compa-ratio and Gender1) from the data tab to the right of our answer space, such as columns Q
and R – don’t forget to include the labels in Row 1 when you copy. Note: It is generally
preferable to set up data for each question separately. This prevents our answers from changing
if we re-sort data for a different question.
Next, sort the two columns by the Gender1 column, using the Custom Sort option under
Sort & Filter in the Editing group. Be sure NOT to include the labels in your highlighted sort range.
The completed sort will look something like the following example on the right (this example
shows compa-ratio and gender rather than your assignment variables of salary and gender).
Now that our data is ready, we can move on to finding our desired values.
Part 1 asks for the descriptive statistics for the entire compa-ratio group and asks that you
use the Analysis Toolpak function Descriptive Statistics to find these values.
Video Link: Here is a link to a video on using the Analysis Tool Pak: https://screencast-o-matic.com/watch/cbQX35IID7
Click on the Data tab in the main ribbon, then, in the Analysis group on the right (if this is
not visible, the Analysis Toolpak has not been loaded), click on Data Analysis and scroll down to
Descriptive Statistics.
Once Descriptive Statistics is highlighted, left click on OK, and the data entry box will
open. Enter the data range for the variable of interest. See the screenshot example below.
Below is a completed example of descriptive statistics for the single variable compa-ratio for our
Excel file. (In your homework you should have Salary in your column Q, for this exact set-up to
give you the asked-for result in Part a.) In the input range, place the range that contains the
variable you want to describe. Either highlight the range or enter the numbers manually. Make
sure you have clicked on the grouped by “column” button.
It is generally a good idea to include the label in input ranges; if you do – and ONLY if you
include the label – be sure to click the Labels in first row box. (Note, if this box is checked and
the label is not included, the first data value will show up as a label. This is a strong hint that the
input range was not correct.)
Then click on the button in front of Output Range and enter the cell where you want the table to
start. For most questions, this cell will be given to you. For this question, enter K19 in the box.
Then look at your output options. This question only asks for the summary statistics button to be
checked for the output, but you are welcome to select the others as well.
Clicking on OK will produce the outcome. Part a asks for us to highlight three (3) of the
statistics, and the following shows the outcome. (Note that we changed the alignment of cell
K25 to “align right” to show the full cell name. You do not have to do this.)
Part b asks for the same information but asks us to use the Fx functions to find each value
separately. One important part of this question is how we show the results. Whereas the data
from the descriptive statistics output shows actual numerical values, the outputs using the Fx
formulas need to show the formula in the cells (example: =average(Q2:Q26)) rather than simply
the numerical outcome. This is because part of the assignment requires showing mastery using
Excel functions rather than copying the values. The values for part b will not be the same as in
part a, since we are asking for the statistics by gender group rather than for the entire sample.
In the following Excel formulas, the range would be the values in column Q that relate to
the gender being asked for; for example, for females the range would be Q2:Q26 (a range of the full
25 values).
The mean or average is found in Excel using =average(range)
The Sample standard deviation (a measure of variation or spread within the data) is found
using: =stdev.s(range)
The range is found by: =max(range)-min(range)
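If you want to double-check the Excel output outside the spreadsheet, the same three measures can be reproduced with Python's standard statistics module. This is only an illustrative sketch; the compa-ratio values below are made up for the example and are not from the course data set.

```python
import statistics

# Hypothetical compa-ratio values (illustrative only, not the course data)
ratios = [0.92, 1.00, 1.05, 1.10, 1.18]

mean = statistics.mean(ratios)            # Excel: =AVERAGE(range)
std_dev = statistics.stdev(ratios)        # Excel: =STDEV.S(range), the sample standard deviation
value_range = max(ratios) - min(ratios)   # Excel: =MAX(range)-MIN(range)

print(mean, std_dev, value_range)
```

Note that statistics.stdev is the sample standard deviation (dividing by n - 1), matching STDEV.S rather than STDEV.P.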
The range of cells will be the same in each of the three (3) questions for the Female and
Male columns (each column having a different range of cells that relate to their salary values.)
Video Link: Here is a link to a video on using the functions in Fx: https://screencast-o-matic.com/watch/cbQlnxIIUf
Position Issues
Question 2. This question moves from descriptive statistics to location measures – telling us
where within a data set we will find a value. These are good for comparison activities. Question
2 asks for a 5-number summary for the entire salary range, as well as for each gender (male and
female).
The values for each element in the 5-number summary are easily developed using the Fx (or
Formulas) function – under the Statistical group.
The Maximum value (Max) is simply: =MAX(Range), for example =MAX(Q2:Q51).
The Third Quartile (3rd Q) is: =PERCENTILE.EXC(Range, 0.75)
The Midpoint or median is: =MEDIAN(Range)
The First Quartile (1st Q) is: =PERCENTILE.EXC(Range, 0.25)
The minimum value (Min) is: =MIN(Range).
Replacing “Range” in each formula with the appropriate salary data range for the columns
labeled “Overall,” “Males,” and “Females” will complete the table.
Here is an example using compa-ratio of setting up the 5-number summary. The female column
shows the Excel formulas and ranges used to find the respective values. The only difference
existing in the other columns is the range – adjusted for overall or male values.
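As a cross-check outside Excel, Python's statistics.quantiles with method='exclusive' uses the same (n + 1) ranking convention as PERCENTILE.EXC, so a 5-number summary built with it should match the spreadsheet. The values below are made up for illustration, not taken from the course data.

```python
import statistics

# Hypothetical sorted compa-ratio values (illustrative only)
ratios = [0.90, 0.95, 1.00, 1.04, 1.08, 1.12, 1.20]

# method='exclusive' follows the same convention as Excel's PERCENTILE.EXC
q1, median, q3 = statistics.quantiles(ratios, n=4, method='exclusive')

five_number = {
    "Min": min(ratios),     # =MIN(Range)
    "1st Q": q1,            # =PERCENTILE.EXC(Range, 0.25)
    "Median": median,       # =MEDIAN(Range)
    "3rd Q": q3,            # =PERCENTILE.EXC(Range, 0.75)
    "Max": max(ratios),     # =MAX(Range)
}
print(five_number)
```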
One use of these two questions is to compare values – who has the highest values, the
lowest, where are midpoints relative to each group, etc. One interesting issue is seeing where the
values are located within the range.
Question 3. The third question asks us to examine a specific value, in this case the male
and female midpoints, and see where they are located within the entire salary data range. This
gives us a feel for where the two groups are distributed within the entire data set. Here is a
screen shot showing the cell formulas used for the female values.
Excel’s PERCENTRANK.EXC(range, specific data value) provides the rank of any
specific data point within a specified data range. The rank is shown as a 3-digit decimal that
ranges between 0 at the bottom and 1.000 at the top (exclusive of the endpoints).
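For a value that actually appears once in the data, the exclusive percent rank reduces to a simple calculation: the value's 1-based rank in the sorted list divided by (n + 1). Here is a small Python sketch of that idea; the helper function name and the data are our own illustrative assumptions, and note that Excel's PERCENTRANK.EXC also interpolates for values not in the list, which this sketch does not do.

```python
def percentrank_exc(values, x):
    """Exclusive percent rank for a value that appears once in the data,
    mirroring Excel's PERCENTRANK.EXC: 1-based rank / (n + 1)."""
    data = sorted(values)
    rank = data.index(x) + 1                  # 1-based position in the sorted list
    return round(rank / (len(data) + 1), 3)   # Excel defaults to 3 significant digits

# Hypothetical compa-ratios (illustrative only)
ratios = [0.90, 0.95, 1.00, 1.04, 1.08, 1.12, 1.20]
print(percentrank_exc(ratios, 1.04))
```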
Z-Score. The second row in this question asks for the z-score or Z-value. This is a
relative measure telling us (in standard deviations) how far from the mean a particular value lies.
It involves (a) the value we are interested in, (b) the mean of the distribution, and (c) the standard
deviation. These values for the entire data range were developed in Question 1 with the
descriptive statistics function. We can find a z-value in several ways. Letting Excel do the math
for us, we could enter into the cell for Male z-value of 1.149 the following formula: =(1.149 –
1.056)/0.08379; this gives us a value of 1.11 (rounded to two (2) decimal places, traditional for z
scores). We could also have used cell values and entered =(K35 – D29)/D30 to get a value of
1.11 rounded. However, if we look at more decimal places, we will find a slight difference. This
difference is due to Excel's rounding of the compa-ratio values; the actual mean and standard
deviation are slightly larger than the rounded values shown in the table output. The preference is
to use cell references in formulas.
The easier way, and the approach asked for in the homework, to find z-values is to use
the Fx function =STANDARDIZE(x, mean, standard_deviation).
The STANDARDIZE(value, mean, standard deviation) function provides a z-score for a specific
value with a given data set mean and standard deviation. Only the range and cell location of the
female compa-ratio information would be changed to get the female results.
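The arithmetic behind STANDARDIZE is simply (value - mean) / standard deviation. Here is a short Python equivalent using the numbers from the lecture example; the helper function name is our own, not an Excel or Python built-in.

```python
def standardize(x, mean, std_dev):
    # Mirrors Excel's =STANDARDIZE(x, mean, standard_dev): z = (x - mean) / sd
    return (x - mean) / std_dev

# Values from the lecture example: male midpoint 1.149, overall mean 1.056,
# overall standard deviation 0.08379
z = standardize(1.149, 1.056, 0.08379)
print(round(z, 2))  # z-scores are traditionally shown to two decimal places
```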
Question 4
Part a in the next question is about the empirical probability that a randomly selected
salary would exceed the male or female midpoint value found earlier. Empirical
probability is a simple count outcome: the number of successful outcomes divided by the total
number of possible outcomes.
Our lecture example uses compa-ratios. For each gender, we have 50 possible salary
values that could be selected at random. What we need to know is, for each gender, how many
employees have a compa-ratio as large as or larger than the cut-off value we are using (the
gender midpoint). The labor-intensive approach is to look at the sorted data and count (that is a
pain in the neck). Excel will do this for us.
Empirical Probability. Excel has several counting functions found in the statistical list
of Fx functions. The one we want is =COUNTIF(range, "criteria"). We know the range for the
entire salary list; what we need is the criteria. The criteria is entered as a text string using the
comparison operators =, >, <, >= (equal to or greater than), or <= (less than or equal to) along
with a specific value, for example ">=1.05". To compare against a value stored in another cell
rather than a typed-in number, the criteria needs a slightly different format.
If we use a slightly more complicated screening criterion, we can use a cell reference in a
COUNTIF function. Here is a screen shot of how we can get an empirical probability and the
related normal curve probability (a theoretical probability) for the male and female midpoints.
Note how the COUNTIF function is presented. It needs to be laid out in this format if we
use cell references. Note that we are using the entire data range (T2:T51) as we are looking for
each midpoint in relation to all salaries. The second part is a technical requirement (“>=”&) for
COUNTIF to work with a cell reference (the G40 in this example, the Female midpoint
location). This gives us a count of how many values equal or exceed our key value (Female
midpoint is located in G40). We are dividing this count by the total of 50, to find the empirical
probability (count/total) for our value.
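The COUNTIF logic above amounts to counting the values at or above the cut-off and dividing by the sample size. Here is a minimal Python sketch, with made-up compa-ratio values standing in for the T2:T51 range and the G40 midpoint cell:

```python
# Hypothetical compa-ratios for the whole sample (illustrative only)
ratios = [0.90, 0.95, 1.00, 1.04, 1.08, 1.12, 1.20, 0.98, 1.05, 1.15]
cutoff = 1.04  # plays the role of the midpoint stored in a cell such as G40

# Excel: =COUNTIF(T2:T51, ">="&G40) / 50
count_at_or_above = sum(1 for r in ratios if r >= cutoff)
empirical_prob = count_at_or_above / len(ratios)
print(count_at_or_above, empirical_prob)
```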
The second row finds our normal curve probability. NORM.S.DIST(z-score, 1) provides
the area or probability under the normal curve for a given z-score. Since our z-score represents
the location of the midpoints within the normal curve (found in Q3 and cell J60 for females),
NORM.S.DIST gives us the probability of equaling or being less than this value. So, to get the
probability of equaling or exceeding this score, we want the area to the right of our score, which is
1 - the probability of being below it. Note, since the normal curve is considered to range from minus infinity
to plus infinity, the probability of a single point is so infinitesimally small as to be 0 for all practical
purposes. This allows us to say Probability (“at or above”) = 1 - Probability (“at or below”)
and use that point in both formulas.
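Excel's NORM.S.DIST(z, 1) is the cumulative standard normal distribution, which Python's statistics.NormalDist reproduces. This sketch shows the "1 minus the probability below" step using the z-score from the lecture example:

```python
from statistics import NormalDist

z = 1.11  # z-score from the lecture example

prob_at_or_below = NormalDist().cdf(z)   # Excel: =NORM.S.DIST(z, 1)
prob_at_or_above = 1 - prob_at_or_below  # Excel: =1 - NORM.S.DIST(z, 1)
print(round(prob_at_or_above, 4))
```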
Question 5
The final question this week, as with every week, asks us to consider the meaning of our
findings. It asks for a review of the findings from the lecture’s review of the compa-ratio data, a
review of your findings on the same questions using the salary data, and then a conclusion that
incorporates both findings as they relate to our question about equal pay for equal work.
When we look at our descriptive statistics, we see that all the measure outcomes (means,
standard deviations, ranges, and probabilities) are fairly close together. This suggests that the
compa-ratio distributions might be similar and that we have no equal pay issue overall. So, at first
pass we see that, when grade differences are held constant (or eliminated) as the compa-ratio
does, the females seem to be paid slightly more relative to their midpoint than males are. At the
same time, they are a bit less spread out, or more consistent, within their data group than the
males are. We can also see that the overall average for each measure falls between the group
values. The range, high, and low values also show that the males generally have less consistency,
as well as smaller low and high values, compared to females.
Now, while possibly insightful, as any analyst would, we need to withhold making a final
determination until we find some additional information. Two other initial questions pop up at
this point. The first: are these differences meaningful or just sampling error, meaning if we took
another sample would we get values that were closer together or even reversed? This issue will
be looked at next week. The second question: are we sure we have a measure that measures
pay by equal work? If not, then average compa-ratio might not be telling us anything. This issue
is examined throughout the course.
So, what can we say at this point? It appears that males and females have about the same
range and standard deviations for compa-ratios, but that females appear to average a bit higher
than the males. However, at this point, we cannot say anything about our equal pay for equal
work question as the compa-ratios may not be the best measure of equal work. So, at this point
we have some interesting information but no conclusive results yet.
Summary
This lecture focused on the Excel tools needed to produce the results we examined in
Lecture 2. The Data Analysis function located on the Data ribbon was presented. Additionally,
several descriptive statistics tools in the Fx link or the Formulas ribbon were illustrated. These
included:
• Descriptive statistics
• Average
• Median
• Mode
• Max and Min
• Standard Deviation
• Quartile
• Percentrank.exc
• Standardize.
These tools can be incorporated into the review sheets easily.
Please ask your instructor if you have any questions about this material.
When you have finished with this lecture, please respond to Discussion Thread 3 for this
week with your initial response and responses to others over a couple of days.