Week 2, Day 2: Statistics

Chapter 13: Math Foundations – Statistics Review

Strictly speaking, probability is a subset of statistics which measures the likelihood of events based on ratios. Statistics more broadly construed involves taking a sample from a population and deriving claims about the population based on the sample. More narrowly, we measure the central tendency (averages) and dispersion (spread) of the data in that sample.

 

Central Tendency (Averages)

Mean – what we usually call an average, aka arithmetic mean

Weighted mean – when we calculate the average by percentages of the whole, e.g.

Median – the middle number

Mode – the most frequent value(s)

Each has a different use, and errors will come from misuse or misunderstanding.

 

 

Dispersion (Spread)

Range – the distance (difference) between the least and the greatest terms

Variance – the average of the squares of differences between each data point and the mean

Standard deviation – the square root of the variance

Each measures how far apart the data spreads out or disperses.

 

We can also measure Position (by quartile, percentile, etc.), which can be important for some of the most challenging questions. Take a given data set and arrange it in ascending numerical order. The median M marks the halfway point (aka Q2 or P50, depending on whether we are using quartiles or percentiles). The median of the lower half, the values between the Least value L and M, marks the end of Quartile 1, or Q1 (or P25). The median of the upper half, the values between M and the Greatest value G (Q4 or P99), marks the end of the third quartile, Quartile 3 or Q3 (or P75).

We take any set of values and divide it into four groups for quartiles or one hundred groups for percentiles. If the set is not easily divisible by 4 (or 100), then we rely on the median values in each set.

The difference between Q3 and Q1 (or Q3 – Q1) is the interquartile range. It covers the middle half of the data, so outliers fall outside this range and therefore do not affect it. Boxplots show the quartiles visually and can also be called box-and-whisker plots.
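Here is a minimal Python sketch of the quartile idea, assuming the "median of each half" approach. The function name quartiles and the sample data are chosen only for illustration, and other textbooks and software use slightly different quartile conventions, so results can differ a little between methods.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the middle value itself is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([0, 1, 1, 2, 3, 4, 4, 4, 5, 5])
iqr = q3 - q1  # interquartile range = Q3 - Q1

print(q1, q2, q3, iqr)  # 1 3.5 4 3
```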

 

From Math 1:



Central Tendency

Central tendency measures how the data in a given set clusters or clumps together, or in other words where the middle point of the data is. However, there are three meanings for “middle” or “average” that we must distinguish carefully.

 

Mean

The mean or arithmetic mean is the most commonly used measure of where the center of the data lies. This is what people usually mean when they say “find the average” of several values or data points. I remember this one with the mnemonic, “When she said I was average, she was just being mean.”

 

We take the sum (represented by the capital Greek letter sigma or Σ) and divide it by the number of data points (represented by n) to get the mean (represented by the lowercase Greek letter mu or μ). Thus, we get the following formula:

μ = Σ ÷ n
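As a quick illustration of the formula, here is a short Python sketch. The scores are hypothetical values invented for this example.

```python
# Hypothetical scores, invented for this illustration only.
scores = [70, 80, 90, 100]

total = sum(scores)   # the sum, Σ
n = len(scores)       # the number of data points, n
mean = total / n      # μ = Σ ÷ n

print(mean)  # 85.0
```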

 

Weighted mean

The weighted mean is a shortcut for data sets in which the same values repeat several times. An example is on p. 84. You can put the information into a table that indicates both frequency (f) and value (x):

 

          f      x      computation
          3      70     3 * 70 = 210
          7      85     7 * 85 = 595
Total:    10            210 + 595 = 805; 805 ÷ 10 = 80.5

Thus the 70 is weighted by the three and the 85 by the seven. This is how teachers compute grades regularly.
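The same computation can be sketched in Python. The variable names are my own. The cross-check at the end writes out all ten values in full to confirm that the shortcut gives the same answer as the ordinary mean.

```python
# (frequency, value) pairs copied from the table above.
rows = [(3, 70), (7, 85)]

total_f = sum(f for f, x in rows)            # 3 + 7 = 10
weighted_sum = sum(f * x for f, x in rows)   # 210 + 595 = 805
weighted_mean = weighted_sum / total_f

# Cross-check: list every value individually and take the plain mean.
expanded = [x for f, x in rows for _ in range(f)]
plain_mean = sum(expanded) / len(expanded)

print(weighted_mean, plain_mean)  # 80.5 80.5
```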

 

Median

The median is the middle number in a list of numbers arranged from least to greatest (ascending order) or from greatest to least (descending order). I remember that the middle number is the median because they both have an “i” in them.

 

When there is an odd number of data points, there is one middle number, so the median is easy to locate. For example: {1, 2, 3, 4, 5} has five data points, an odd number. The 3 is in the middle, so the 3 is the median.

 

When there is an even number of data points, you must take the two middle numbers and find their mean. For example: {6, 7, 8, 9} has four data points, an even number. The 7 and 8 are in the middle, so find their mean: (7 + 8) ÷ 2 = 7.5, and that is the median for this set.

 

Most problems you will encounter at higher levels of difficulty will require you to put the numbers in order before you can find the median – and some of these numbers will repeat, so you have to include all repetitions to do the calculation. For example: {1, 2, 1, 0, 5, 5, 3, 4, 4, 4} is not in order, and is a set with ten data points, an even number. {0, 1, 1, 2, 3, 4, 4, 4, 5, 5} is this set arranged in ascending order. The two middle numbers are now 3 and 4, which give us the median 3.5.
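The three median examples above can be checked with a small Python sketch. The function below is my own illustration of the sort-then-pick-the-middle procedure, not a required method.

```python
def median(values):
    """Sort first, then take the middle value (or the mean of the two middle values)."""
    ordered = sorted(values)   # repeated values stay in the list
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle number.
        return ordered[mid]
    # Even count: mean of the two middle numbers.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 2, 3, 4, 5]))                    # 3
print(median([6, 7, 8, 9]))                       # 7.5
print(median([1, 2, 1, 0, 5, 5, 3, 4, 4, 4]))     # 3.5
```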

 

Mode

The mode is the number that occurs with the highest frequency in a set where values occur more than once. I remember it because the mode is the most frequent number – both words have an “o” in them.

 

In the set from our previous example: {0, 1, 1, 2, 3, 4, 4, 4, 5, 5}, the one occurs twice, the four occurs three times, and the five occurs twice. The four occurs the most frequently, so the mode is 4.

 

It is possible that a data set will either have no modes or more than one mode, so be careful when counting frequency.
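A short Python sketch can count the frequencies for you. It assumes the convention used here, that a set in which nothing repeats has no mode. The function name modes and the second and third data sets are invented for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency (empty list if nothing repeats)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []   # no value occurs more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([0, 1, 1, 2, 3, 4, 4, 4, 5, 5]))  # [4]
print(modes([1, 1, 2, 2, 3]))                 # [1, 2]  (two modes)
print(modes([1, 2, 3]))                       # []      (no mode)
```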

 

Dispersion

Dispersion is how spread out the data is. There are several such measures that tell us the distances from each data point to the others.

 

Range

Many books treat range as a measure of central tendency because it measures the distance from the greatest value to the least value – but it is precisely because of this measure that I treat the range as a measure of spread or dispersion. To calculate the range, find the smallest value (x1, pronounced “x sub 1,” the first data point when arranged in ascending order) and subtract it from the largest value (xn, pronounced “x sub n,” the last data point when arranged in ascending order). Thus the range = xn – x1 or (if G = greatest value and L = least value) G – L.
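In Python, the range calculation might look like the sketch below, which reuses the unordered ten-value set from the median section. The variable names are my own.

```python
data = [1, 2, 1, 0, 5, 5, 3, 4, 4, 4]

ordered = sorted(data)
least = ordered[0]       # x1, the first data point in ascending order
greatest = ordered[-1]   # xn, the last data point in ascending order

data_range = greatest - least   # G - L
print(data_range)  # 5
```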

 

Variance

Variance measures how far each data point varies from the mean, and is required to determine the standard deviation. More detail comes in the next section. I have never seen variance per se on the GRE, but it is essential.

 

Standard Deviation

Standard deviation, simply stated, measures the average distance of each data point from the mean – though it is a little more complicated than that. In fact, it is the square root of the variance. The formula makes use of the Greek letters and math symbols we have been using so far, and looks very complex:

σ = √[Σ(xi – μ)² ÷ N]


The lowercase Greek letter sigma (σ) stands for standard deviation. The symbol for each individual data point is xi, pronounced “x sub i.” I usually write a lowercase n instead of a capital N for the number of the total population included. Before we take the square root of the result, we have the variance, but after finding the square root of the result we have the standard deviation.

 

Let me break this process down into its component steps:
1. Find the mean of the data set.
            μ = (x1 + x2 + … + xn) ÷ n
2. Find the difference between each data point and the mean.
            (x1 – μ), (x2 – μ), … (xn – μ)
3. Square the differences.
            (x1 – μ)², (x2 – μ)², … (xn – μ)²
4. Take the sum of the squares.
            Σ(xi – μ)² = (x1 – μ)² + (x2 – μ)² + … + (xn – μ)²
5. Divide the sum by the number of data points. ← The quotient is the variance.
            Σ(xi – μ)² ÷ n = quotient = variance
6. Take the square root of the quotient. ← The square root is the standard deviation.
            √[Σ(xi – μ)² ÷ n] = square root = standard deviation = σ
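The six steps can be followed line by line in a short Python sketch. It uses the data set {3, 7, 7, 8, 10} and divides by n, as in step 5 (the population form). The variable names are my own.

```python
import math

data = [3, 7, 7, 8, 10]
n = len(data)

mu = sum(data) / n                    # Step 1: the mean
diffs = [x - mu for x in data]        # Step 2: differences from the mean
squares = [d ** 2 for d in diffs]     # Step 3: squared differences
total = sum(squares)                  # Step 4: sum of the squares
variance = total / n                  # Step 5: the variance
sigma = math.sqrt(variance)           # Step 6: the standard deviation

print(mu, total, variance, round(sigma, 2))  # 7.0 26.0 5.2 2.28
```

Python's built-in statistics module offers pvariance and pstdev, which should give the same results for this divide-by-n form.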

 

A large standard deviation means that most of the data spreads out far away from the mean. A small standard deviation means that most of the data is very close to the mean.

 

Below you will see the standard bell curve, called that because it is shaped sort of like a bell. Many common measurements, such as height or weight, will take on roughly this shape when graphed. Most people will cluster within one standard deviation of the mean, in the center of the curve, while very few people will be beyond two standard deviations of the mean – at the very, very small or the very, very large on either the left or right extreme ends of the curve.

 

It is important to memorize the percentages inside the curve because the GRE may ask you about the probability that a measure will occur within the first two standard deviations above the mean, for example.



Usually this means that about 68.2% of the population will exist within 1 standard deviation of the mean, while the remaining 31.8% of the population is 1 standard deviation or more away from the mean value. Exactly half (50%) of the population will fall before the mean, and exactly half will fall after. Memorizing these percentages makes probabilities easier to discern when dealing with standard deviation problems.
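These percentages can be reproduced with Python's statistics.NormalDist, assuming an idealized bell curve with mean 0 and standard deviation 1. The exact figure for one standard deviation is closer to 68.3%. The 68.2% above comes from rounding each half to 34.1% before adding.

```python
from statistics import NormalDist

z = NormalDist()  # idealized bell curve: mean 0, standard deviation 1

within_one = z.cdf(1) - z.cdf(-1)   # within 1 SD on either side of the mean
within_two = z.cdf(2) - z.cdf(-2)   # within 2 SDs on either side of the mean
upper_two = z.cdf(2) - z.cdf(0)     # between the mean and 2 SDs above it
below_mean = z.cdf(0)               # everything below the mean

print(f"{within_one:.1%}")  # 68.3%
print(f"{within_two:.1%}")  # 95.4%
print(f"{upper_two:.1%}")   # 47.7%
print(f"{below_mean:.1%}")  # 50.0%
```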

 


When the data points increase or decrease all by the same amount, the measures of central tendency (mean, median, and mode) all change, but range, standard deviation, and variance do not. Adding a brand new data point almost always changes the standard deviation and variance: a new point close to the mean (roughly within one standard deviation of it) pulls the standard deviation down, while a new point farther out pushes it up. Adding a new data point only changes the range when it is greater than the largest data point or less than the smallest data point.

 

Think of it this way, using the example data set from page 89: the mean for {3, 7, 7, 8, 10} is 7. The standard deviation works out to be about 2.28. If a new data point is discovered well outside 7 ± 2.28 – in other words, well below 7 – 2.28 OR well above 7 + 2.28 – the standard deviation increases; adding a 12, for example, raises it to about 2.79. If the new data point sits close to the mean, the standard deviation decreases; adding another 7 lowers it to about 2.08. However, if all the data points increase by 2 (so the new set becomes {5, 9, 9, 10, 12}), then the standard deviation does not change at all.
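A quick check in Python, using the divide-by-n standard deviation (statistics.pstdev), shows how the standard deviation of {3, 7, 7, 8, 10} responds to a shift and to a new data point. The added values 7 and 12 are my own choices for illustration.

```python
from statistics import pstdev

data = [3, 7, 7, 8, 10]

original_sd = pstdev(data)
shifted_sd = pstdev([x + 2 for x in data])   # every point increased by 2
near_sd = pstdev(data + [7])                 # new point right at the mean
far_sd = pstdev(data + [12])                 # new point far above the mean

print(round(original_sd, 2))  # 2.28
print(round(shifted_sd, 2))   # 2.28  (shifting changes nothing)
print(round(near_sd, 2))      # 2.08  (a point near the mean lowers it)
print(round(far_sd, 2))       # 2.79  (a point far from the mean raises it)
```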


 

