Week 2, Day 2: Statistics
Chapter 13: Math Foundations – Statistics Review
Strictly speaking, probability is a subset of statistics
which measures the likelihood of events based on ratios. Statistics more
broadly construed involves taking a sample from a population and deriving
claims about the population based on the sample. More narrowly, we measure the
central tendency (averages) and dispersion (spread) of the data in that sample.
Central Tendency (Averages)
Mean – what we usually call an average, aka arithmetic mean
Weighted mean – when we calculate the average by percentages
of the whole, e.g.
Median – the middle number
Mode – the most frequent value(s)
Each has a different use, and errors will come from misuse
or misunderstanding.
Dispersion (Spread)
Range – the distance (difference) between the least and the
greatest terms
Variance – the average of the squares of differences between
each data point and the mean
Standard deviation – the square root of the variance
Each measures how far apart the data spreads out or
disperses.
We can also measure Position (by quartile, percentile, etc.)
which can be important for some of the most challenging questions. Take a given
data set and arrange it in consecutive numerical order. The median M marks the
halfway point (aka Q2 or P50, depending on whether this
is quartiles or percentiles). Halfway between M and the Least value L is the
end of Quartile 1, or Q1 (or P25). Halfway between M and
the Greatest value G (Q4 or P100) is the end of the third
quartile, Quartile 3 or Q3 (or P75).
We take any set of values and divide it into four groups for
quartiles or one hundred groups for percentiles. If the set is not easily
divisible by 4 (or 100), then we rely on the median values in each set.
The difference between Q3 and Q1 (that is, Q3 – Q1) is the
interquartile range. Outliers fall outside this range
and therefore do not affect it. Boxplots show the quartiles visually and can
also be called box-and-whiskers-plots.
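The quartile description above can be sketched in Python. This is only one common convention (the median of each half of the sorted data); other textbooks and software use slightly different rules, so treat it as an illustration. The function names and the sample scores are made up for this example.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two data points.
    """
    data = sorted(values)          # arrange in ascending order first
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + (n % 2):]  # values above the median position
    return median(lower), median(data), median(upper)

def interquartile_range(values):
    """Q3 - Q1, the spread of the middle half of the data."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

scores = [1, 3, 4, 6, 7, 8, 10, 12]  # hypothetical data set
q1, q2, q3 = quartiles(scores)
print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("IQR:", interquartile_range(scores))
```

When the set has an odd number of points, this sketch leaves the median itself out of both halves, which is the "rely on the median values in each set" idea described above.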
From Math 1:
Central tendency measures how the data in a given set
clusters or clumps together, or in other words where the middle point of the
data is. However, there are three meanings for “middle” or “average” that we
must distinguish carefully.
Mean
The mean or arithmetic mean is the most commonly used
measure of where the center of the data lies. This is what people usually mean
when they say “find the average” of several values or data points. I remember
this one with the mnemonic, “When she said I was average, she was
just being mean.”
We take the sum (represented by the capital Greek letter
sigma or Σ) and divide it by the number of data points (represented by n) to
get the mean (represented by the lowercase Greek letter mu or μ). Thus, we get
the following formula:
μ = Σ ÷ n
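As a quick check of the formula, here is a minimal Python sketch. The function name is my own, and the data set is the one used later in the standard deviation discussion.

```python
def arithmetic_mean(values):
    """Sum of the data points divided by the number of data points."""
    total = sum(values)   # the sum (Σ)
    n = len(values)       # the number of data points
    return total / n

data = [3, 7, 7, 8, 10]
print(arithmetic_mean(data))  # 35 divided by 5
```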
Weighted mean
The weighted mean is a shortcut when dealing with multiple
data points that have similar values. An example is on p. 84. You can put the
information into a table that indicates both frequency (f) and value (x):
| | f | x | computation |
| | 3 | 70 | 3 * 70 = 210 |
| | 7 | 85 | 7 * 85 = 595 |
| Total: | 10 | | 210 + 595 = 805; 805 ÷ 10 = 80.5 |
Thus the
70 is weighted by the three and the 85 by the seven. This is how teachers
compute grades regularly.
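The table computation can be sketched in Python as follows. The function name and the (frequency, value) pair layout are my own choices for illustration.

```python
def weighted_mean(pairs):
    """Weighted mean of (frequency, value) pairs.

    Each value x counts f times, so we sum f * x and divide
    by the total of the frequencies.
    """
    total_frequency = sum(f for f, _ in pairs)
    weighted_sum = sum(f * x for f, x in pairs)
    return weighted_sum / total_frequency

grades = [(3, 70), (7, 85)]  # three scores of 70, seven scores of 85
print(weighted_mean(grades))

# The shortcut agrees with writing out all ten scores and averaging them.
all_scores = [70] * 3 + [85] * 7
print(sum(all_scores) / len(all_scores))
```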
Median
The median is the middle number in a list of numbers
arranged from least to greatest (ascending order) or from greatest to least
(descending order). I remember that the middle number is the median
because they both have an “i” in them.
When there is an odd number of data points, there is one
middle number, so the median is easy to locate. For example: {1, 2, 3, 4, 5}
has five data points, an odd number. The 3 is in the middle, so the 3 is the
median.
When there is an even number of data points, you must take
the two middle numbers and find their mean. For example: {6, 7, 8, 9} has four
data points, an even number. The 7 and 8 are in the middle, so find their mean:
(7 + 8) ÷ 2 = 7.5, and that is the median for this set.
Most problems you will encounter at higher levels of
difficulty will require you to put the numbers in order before you can find the
median – and some of these numbers will repeat, so you have to include all
repetitions to do the calculation. For example: {1, 2, 1, 0, 5, 5, 3, 4, 4, 4}
is not in order, and is a set with ten data points, an even number. {0, 1, 1,
2, 3, 4, 4, 4, 5, 5} is this set arranged in ascending order. The two middle
numbers are now 3 and 4, which give us the median 3.5.
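The odd and even cases can be sketched in Python. The function name is my own, and it sorts the data first because the numbers must be in order before the middle can be found.

```python
def find_median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    data = sorted(values)   # put the numbers in ascending order first
    n = len(data)
    mid = n // 2
    if n % 2 == 1:          # odd count: one middle number
        return data[mid]
    # even count: take the mean of the two middle numbers
    return (data[mid - 1] + data[mid]) / 2

print(find_median([1, 2, 3, 4, 5]))
print(find_median([6, 7, 8, 9]))
print(find_median([1, 2, 1, 0, 5, 5, 3, 4, 4, 4]))
```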
Mode
The mode is the number that occurs with the highest
frequency in a set where values occur more than once. In other words, I
remember it because the mode is the most frequent
number – both words have an “o” in them.
In the set from our previous example: {0, 1, 1, 2, 3, 4, 4,
4, 5, 5}, the one occurs twice, the four occurs three times, and the five
occurs twice. The four occurs the most frequently, so the mode is 4.
It is possible that a data set will either have no modes or
more than one mode, so be careful when counting frequency.
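Counting frequencies by hand is error-prone, so here is a small Python sketch. It follows the convention used in this post that a set with no repeated values has no mode; the function name is my own.

```python
from collections import Counter

def find_modes(values):
    """Return every value tied for the highest frequency.

    Returns an empty list when no value repeats (no mode).
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:        # nothing occurs more than once
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(find_modes([0, 1, 1, 2, 3, 4, 4, 4, 5, 5]))  # one mode
print(find_modes([1, 1, 2, 2, 3]))                 # two modes
print(find_modes([1, 2, 3]))                       # no mode
```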
Dispersion is how spread out the data is. There are several
such measures that tell us the distances from each data point to the others.
Range
Many books treat range as a measure of central tendency
because it measures the distance from the greatest value to the least value –
but it is precisely because of this measure that I treat the range as a measure
of spread or dispersion. To calculate the range, find the smallest value (x1,
pronounced “x sub 1,” the first data point when arranged in ascending order)
and subtract it from the largest value (xn, pronounced “x sub n,”
the last data point when arranged in ascending order). Thus the range = xn
– x1 or (if G = greatest value and L = least value) G
– L.
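In Python, the same calculation is a one-liner. The function name is my own; `max` and `min` play the roles of G and L.

```python
def find_range(values):
    """Greatest value minus least value (G - L)."""
    return max(values) - min(values)

data = [1, 2, 1, 0, 5, 5, 3, 4, 4, 4]
print(find_range(data))  # 5 - 0
```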
Variance
Variance measures how far each data point varies from the
mean, and is required to determine the standard deviation. More detail comes in
the next section. I have never seen variance per se on the GRE, but it
is essential.
Standard Deviation
Standard deviation, simply stated, measures the average distance of
each data point from the mean – though it is a little more
complicated than that. In fact, it is the square root of the variance. The
formula makes use of the Greek letters and math symbols we have been using so
far, and looks very complex:
σ = √[Σ(xi – μ)² ÷ N]
The lowercase Greek letter sigma (σ) stands for standard
deviation. The symbol for each individual data point is xi, pronounced
“x sub i.” I usually write a lowercase n instead of a capital N for the number
of the total population included. Before we take the square root of the result,
we have the variance, but after finding the square root of the result we have
the standard deviation.
1. Find the mean of the data set.
μ = (x1 + x2 + … + xn) ÷ n
2. Find the difference between each data point and the mean.
(x1 – μ), (x2 – μ), … (xn – μ)
3. Square the differences.
(x1 – μ)², (x2 – μ)², … (xn – μ)²
4. Take the sum of the squares.
Σ(xi – μ)² = (x1 – μ)² + (x2 – μ)² + … + (xn – μ)²
5. Divide the sum by the number of data points. ← The quotient is the variance.
Σ(xi – μ)² ÷ n = quotient = variance
6. Take the square root of the quotient. ← The square root is the standard deviation.
√[Σ(xi – μ)² ÷ n] = square root = standard deviation = σ
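The steps above can be sketched in Python, one line per step. This is the population version (dividing by n), matching the formula in this post; the function names are my own. Python's standard library also offers `statistics.pvariance` and `statistics.pstdev` for the same calculation.

```python
import math

def population_variance(values):
    """Average of the squared differences from the mean."""
    n = len(values)
    mu = sum(values) / n                     # find the mean
    differences = [x - mu for x in values]   # subtract the mean from each point
    squares = [d ** 2 for d in differences]  # square the differences
    total = sum(squares)                     # sum the squares
    return total / n                         # divide by the number of points

def population_std_dev(values):
    """Square root of the variance."""
    return math.sqrt(population_variance(values))

data = [3, 7, 7, 8, 10]
print("variance:", population_variance(data))
print("standard deviation:", round(population_std_dev(data), 2))
```

For this data set the squared differences are 16, 0, 0, 1, and 9, which sum to 26, so the variance is 26 ÷ 5 = 5.2 and the standard deviation is about 2.28.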
A large standard deviation means most of the data spreads
out far away from the mean. A small standard deviation means most of the data
is very close to the mean.
The standard bell curve is called that
because it is shaped sort of like a bell. Many measurements when graphed will take
on this shape, such as height or weight. Most people will cluster
within one standard deviation of the mean, in the center of the curve, while
very few people will be beyond two standard deviations of the mean – at the
very, very small or the very, very large on either the left or right extreme
ends of the curve.
It is important to memorize the percentages inside the curve
because the GRE may ask you about the probability that a measure will occur
within the first two standard deviations above the mean, for example.
Usually this means that about 68.2% of the population will
exist within 1 standard deviation of the mean, while the remaining 31.8% of the
population is 1 standard deviation or more away from the mean value. Exactly
half (50%) of the population will fall before the mean, and exactly half will
fall after. Memorizing these percentages makes probabilities easier to discern
when dealing with standard deviation problems.
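If you want to see where these percentages come from, here is a hedged Python sketch that assumes a perfectly normal (bell-shaped) distribution. It uses `math.erf` from the standard library; the function names are my own.

```python
import math

def normal_cdf(z):
    """Share of a normal population lying below z standard deviations from the mean."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def share_between(a, b):
    """Share of the population between a and b standard deviations from the mean."""
    return normal_cdf(b) - normal_cdf(a)

# Within 1, 2, and 3 standard deviations on both sides of the mean
for k in (1, 2, 3):
    print(k, round(100 * share_between(-k, k), 1), "%")

# Within the first two standard deviations above the mean
print(round(100 * share_between(0, 2), 1), "%")
```

Under that assumption the shares are roughly 68%, 95%, and 99.7% for one, two, and three standard deviations, and about 47.7% for the region from the mean up to two standard deviations above it.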
When the data points increase or decrease all by the same
amount, the measures of central tendency (mean, median, and mode) all change,
but range, standard deviation, and variance do not. Adding a brand new data
point almost always changes the standard deviation and variance: a point close
to the mean pulls them down, while a point far from the mean pushes them up.
Adding a new data point only
changes the range when it is greater than the largest data point or less than
the smallest data point.
Think of it this way, using the example data set from page
89: the mean for {3, 7, 7, 8, 10} is 7. The standard deviation works out to be
about 2.28. If a new data point of 7 is discovered, right at the mean, the
standard deviation drops to about 2.08. If a new data point of 12 is
discovered, far from the mean, the standard deviation rises to about 2.79.
However, if all the data points increase by 2
(so the new set becomes {5, 9, 9, 10, 12}), then the standard deviation does
not change at all.
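The effect of shifting every data point by the same amount can be checked with a short Python sketch. It leans on the standard library's `statistics` module (`pstdev` is the population standard deviation); the `summary` helper is my own name for this illustration.

```python
from statistics import mean, median, mode, pstdev

def summary(values):
    """Collect the measures discussed in this post for one data set."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "std_dev": pstdev(values),  # population standard deviation
    }

original = [3, 7, 7, 8, 10]
shifted = [x + 2 for x in original]  # every point increases by 2

before = summary(original)
after = summary(shifted)
for name in before:
    print(name, before[name], after[name])
```

The mean, median, and mode each move up by 2, while the range and the standard deviation come out the same for both sets.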


