Understanding Grouped Data
What is Grouped Data and Why Do We Use It?
Imagine you are looking at the heights of every student in your entire school. You have a long, messy list of hundreds of numbers. It is very difficult to see any patterns or understand what the "typical" height is. This is where grouped data comes to the rescue.
Grouped data is data that has been organized into ranges or categories, called class intervals. Instead of listing every single value, we count how many values fall into each range. This process simplifies the data, making it much easier to read, visualize, and analyze.
For example, instead of listing the height of each of 50 students, we can group them:
| Height Range (cm) | Number of Students |
|---|---|
| 150 - 154 | 5 |
| 155 - 159 | 12 |
| 160 - 164 | 20 |
| 165 - 169 | 10 |
| 170 - 174 | 3 |
This table is a frequency distribution. It immediately tells us that the largest group of students (20 of 50) is between 160 cm and 164 cm tall. We lose the exact individual heights, but we gain a clear, overall picture of the dataset.
Building a Frequency Distribution Table
Creating a frequency distribution table involves a few key steps. Let's work through an example with the test scores of 30 students:
45, 52, 58, 61, 65, 67, 68, 70, 71, 72, 73, 74, 75, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 90, 92, 95
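First, find the range: Range = Highest Value - Lowest Value = $95 - 45 = 50$
Next, choose the number of classes (6 here) and divide: Class Width $= \frac{\text{Range}}{\text{Number of Classes}} = \frac{50}{6} \approx 8.33$. We round this up to 10 for convenience and start the first class at 41, so six classes cover every score.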
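The range and class-width arithmetic above can be checked with a short Python sketch. The helper name `tally`, the rounding rule (up to the next multiple of 10), and the starting value of 41 are illustrative assumptions chosen to match the class intervals used in this example. The sketch also counts how many scores fall into each interval, which is a useful check on a hand-made tally.

```python
import math

scores = [45, 52, 58, 61, 65, 67, 68, 70, 71, 72, 73, 74, 75, 75, 76,
          77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 90, 92, 95]

# Range = highest value - lowest value
data_range = max(scores) - min(scores)

# Raw class width for the chosen number of classes
num_classes = 6
raw_width = data_range / num_classes

# Assumption: round up to the next multiple of 10 for convenience
width = math.ceil(raw_width / 10) * 10


def tally(values, start, width, num_classes):
    """Count whole-number values in classes start..start+width-1, and so on."""
    counts = [0] * num_classes
    for v in values:
        index = (v - start) // width
        if not 0 <= index < num_classes:
            raise ValueError(f"{v} falls outside the class intervals")
        counts[index] += 1
    return counts


frequencies = tally(scores, start=41, width=width, num_classes=num_classes)
for i, f in enumerate(frequencies):
    lower = 41 + i * width
    print(f"{lower} - {lower + width - 1}: {f}")
```

The frequencies should always add up to the number of raw values, so `sum(frequencies) == len(scores)` is a quick sanity check.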
Our final frequency distribution table looks like this:
| Test Scores (Class Interval) | Tally Marks | Frequency (f) |
|---|---|---|
| 41 - 50 | \| | 1 |
| 51 - 60 | \|\| | 2 |
| 61 - 70 | \|\|\|\|\| | 5 |
| 71 - 80 | \|\|\|\|\| \|\|\|\|\| \| | 11 |
| 81 - 90 | \|\|\|\|\| \|\|\|\| | 9 |
| 91 - 100 | \|\| | 2 |
| Total | | 30 |
Finding the Mean from Grouped Data
When only the grouped table is available, we cannot simply add all values and divide. Instead, we use the class mark (or midpoint) to represent all values in an interval.
Class Mark $(x) = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$
The formula for the mean $(\bar{x})$ of grouped data is:
$\bar{x} = \frac{\sum (f \times x)}{\sum f}$
Where $f$ is the frequency of the class and $x$ is the class mark.
| Class Interval | Frequency (f) | Class Mark (x) | $f \times x$ |
|---|---|---|---|
| 41 - 50 | 1 | 45.5 | 45.5 |
| 51 - 60 | 2 | 55.5 | 111.0 |
| 61 - 70 | 5 | 65.5 | 327.5 |
| 71 - 80 | 11 | 75.5 | 830.5 |
| 81 - 90 | 9 | 85.5 | 769.5 |
| 91 - 100 | 2 | 95.5 | 191.0 |
| Total | 30 ($\sum f$) | | 2275.0 ($\sum f x$) |
Now, we calculate the mean:
$\bar{x} = \frac{2275.0}{30} \approx 75.83$
So, the estimated mean test score is approximately 75.83.
Estimating the Median for Grouped Data
The median is the middle value that separates the higher half from the lower half of the data. For grouped data, we find the median class and then use a formula.
The formula for the median of grouped data is:
$\text{Median} = L + \left( \frac{\frac{n}{2} - CF}{f} \right) \times w$
Where:
$L$ = Lower boundary of the median class
$n$ = Total frequency ($\sum f$)
$CF$ = Cumulative frequency of the class before the median class
$f$ = Frequency of the median class
$w$ = Class width
First, we need to find the median class. The median is at the $\frac{n}{2} = \frac{30}{2} = 15$th position. We look for the first class whose cumulative frequency reaches or exceeds 15.
| Class Interval | Frequency (f) | Cumulative Frequency (CF) |
|---|---|---|
| 41 - 50 | 1 | 1 |
| 51 - 60 | 2 | 3 |
| 61 - 70 | 5 | 8 |
| 71 - 80 | 11 | 19 |
| 81 - 90 | 9 | 28 |
| 91 - 100 | 2 | 30 |
The cumulative frequency first reaches 15 in the class 71 - 80, where it rises from 8 to 19. This is our median class.
$L = 70.5$ (the lower boundary of the median class)
$n = 30$
$CF = 8$ (the cumulative frequency before the median class)
$f = 11$ (the frequency of the median class)
$w = 10$ (the class width)
$\text{Median} = 70.5 + \left( \frac{15 - 8}{11} \right) \times 10 = 70.5 + \left( \frac{7}{11} \right) \times 10 \approx 70.5 + 6.36 = 76.86$
The estimated median test score is approximately 76.86.
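The two estimates worked through above, the grouped mean and the grouped median, can be sketched together in Python. This is a minimal illustration rather than a definitive implementation: the function names `grouped_mean` and `grouped_median` are made up for this example, and the code assumes equal-width classes listed in ascending order. To show the formulas on a second dataset, it is applied to the height table from the first section.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower_limit, upper_limit, frequency) rows."""
    total_f = sum(f for _, _, f in classes)
    # Each class is represented by its class mark (midpoint of the limits)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / total_f


def grouped_median(classes):
    """Estimate the median by interpolating inside the median class."""
    n = sum(f for _, _, f in classes)
    # Gap between one class's upper limit and the next lower limit (1 here)
    gap = classes[1][0] - classes[0][1]
    cf_before = 0  # cumulative frequency of the classes already passed
    for lo, hi, f in classes:
        # Median class: first class whose cumulative frequency reaches n/2
        if cf_before + f >= n / 2:
            lower_boundary = lo - gap / 2
            width = (hi - lo) + gap
            return lower_boundary + (n / 2 - cf_before) / f * width
        cf_before += f
    raise ValueError("no classes given")


# Height table from the first section: (lower limit, upper limit, students)
heights = [(150, 154, 5), (155, 159, 12), (160, 164, 20),
           (165, 169, 10), (170, 174, 3)]

mean_height = grouped_mean(heights)
median_height = grouped_median(heights)
print(f"Estimated mean:   {mean_height:.2f} cm")
print(f"Estimated median: {median_height:.2f} cm")
```

For the height table this gives an estimated mean of 161.40 cm and an estimated median of 161.50 cm, both inside the 160 - 164 class where the data is most concentrated.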
Analyzing a Real-World Dataset
Let's apply our knowledge to a practical scenario. An environmental science class collected data on the daily water consumption (in liters) of 40 households in their neighborhood. The raw data was messy, so they decided to group it to find the average consumption and identify the most common consumption range.
| Water Consumption (L) | Number of Households (f) | Class Mark (x) | $f \times x$ |
|---|---|---|---|
| 100 - 119 | 4 | 109.5 | 438.0 |
| 120 - 139 | 9 | 129.5 | 1165.5 |
| 140 - 159 | 15 | 149.5 | 2242.5 |
| 160 - 179 | 8 | 169.5 | 1356.0 |
| 180 - 199 | 4 | 189.5 | 758.0 |
| Total | 40 ($\sum f$) | | 5960.0 ($\sum f x$) |
Mean Calculation: $\bar{x} = \frac{5960.0}{40} = 149$ liters. The average daily water consumption is about 149 liters per household.
Modal Class: The class with the highest frequency is 140 - 159 liters. This is the most common range of water consumption.
This analysis quickly provides valuable insights for the class's report on local resource usage.
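As a cross-check on the water consumption figures, the same calculation can be sketched in a few lines of Python. The variable names are illustrative; the numbers are taken directly from the table above.

```python
# (lower limit, upper limit, number of households) from the table above
water = [(100, 119, 4), (120, 139, 9), (140, 159, 15),
         (160, 179, 8), (180, 199, 4)]

total_f = sum(f for _, _, f in water)
# Sum of frequency x class mark, where class mark = (lower + upper) / 2
total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in water)
mean_consumption = total_fx / total_f

# Modal class: the class with the highest frequency
modal_lo, modal_hi, modal_f = max(water, key=lambda row: row[2])

print(f"Sum of f:     {total_f}")
print(f"Sum of f * x: {total_fx}")
print(f"Mean:         {mean_consumption} liters")
print(f"Modal class:  {modal_lo} - {modal_hi} ({modal_f} households)")
```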
Common Mistakes and Important Questions
Q: What is the difference between a class limit and a class boundary?
A: Class limits are the stated minimum and maximum values of a class (e.g., 150 - 154 cm). Class boundaries are the precise points that separate classes without gaps. For the class 150 - 154, the lower boundary is 149.5 and the upper boundary is 154.5. We use boundaries for accurate calculations like the median.
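The conversion from limits to boundaries can be sketched in Python. The function name `class_boundaries` is hypothetical, and the sketch assumes the classes are in ascending order with the same gap between each upper limit and the next lower limit (a gap of 1 for whole-number data like the height table).

```python
def class_boundaries(classes):
    """Turn (lower_limit, upper_limit) pairs into boundaries with no gaps."""
    # Gap between one class's upper limit and the next class's lower limit
    gap = classes[1][0] - classes[0][1]
    # Split the gap evenly between the two neighbouring classes
    return [(lo - gap / 2, hi + gap / 2) for lo, hi in classes]


height_classes = [(150, 154), (155, 159), (160, 164), (165, 169), (170, 174)]
boundaries = class_boundaries(height_classes)

for (lo, hi), (b_lo, b_hi) in zip(height_classes, boundaries):
    print(f"limits {lo} - {hi}  ->  boundaries {b_lo} - {b_hi}")
```

Each upper boundary equals the next class's lower boundary, which is what removes the gaps between classes.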
Q: Why is the mean for grouped data an estimate and not an exact value?
A: When we group data, we lose the original values. To calculate the mean, we assume that all values in a class are equal to the class mark (midpoint). This is an approximation. The actual values in the class could be higher or lower than the midpoint, so the calculated mean is an estimate, usually a close one when the classes are narrow, but not the exact mean of the original raw data.
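A short worked bound shows how far off the estimate can be. The symbols $v_i$ (the raw values) and $m_i$ (the class mark of the class containing $v_i$) are introduced here only for illustration. Assuming every value lies within its class boundaries and every class has width $w$, each value is at most $\frac{w}{2}$ away from its class mark, so:

$\left| \bar{x}_{\text{grouped}} - \bar{x}_{\text{raw}} \right| = \left| \frac{1}{n} \sum_{i=1}^{n} (m_i - v_i) \right| \le \frac{1}{n} \sum_{i=1}^{n} \left| m_i - v_i \right| \le \frac{w}{2}$

This is a worst case. In practice, values above and below the midpoints tend to partly cancel, so the difference is often much smaller than $\frac{w}{2}$.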
Q: A common mistake is miscounting the cumulative frequency. How can I avoid this?
A: Always double-check your cumulative frequency column. It should always end with the total frequency ($\sum f$). A good method is to add the frequency of the current row to the cumulative frequency of the previous row. If your final cumulative frequency does not match the total, you know there is an error in your tally or addition.
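The running-total method and the final check described above can be sketched in Python with `itertools.accumulate` from the standard library. The frequencies are those of the height table from the first section; the variable names are illustrative.

```python
from itertools import accumulate

# Frequencies from the height table (150 - 154 up to 170 - 174)
frequencies = [5, 12, 20, 10, 3]

# Each entry is the current frequency plus the previous running total
cumulative = list(accumulate(frequencies))

# The last cumulative frequency must equal the total frequency
if cumulative[-1] != sum(frequencies):
    raise ValueError("cumulative frequency does not match the total")

for f, cf in zip(frequencies, cumulative):
    print(f"f = {f:2d}   CF = {cf:2d}")
```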
Footnotes
1 CF (Cumulative Frequency): The running total of frequencies. In the tables above, the cumulative frequency of a class is the number of observations that lie at or below that class's upper boundary.
2 Class Mark (or Midpoint): The central value of a class interval, calculated as (Lower Limit + Upper Limit) / 2. It is used to represent all values in that class for calculations.
3 Frequency Distribution: A statistical table that shows the number of observations (frequency) that fall into each of several specified intervals or categories.
