When we are trying to get insights from data, one of the first steps is to summarize a collection of raw numbers into a single number that represents most of them. The usual ways to do this are the mean, the median, and the mode. Each one locates the center of the data in a different way, and which one fits best depends on the context. In every case, the goal is one number that tells the story of the data being analyzed. With these measures of central tendency, we try to find the location where most of the values concentrate.
Mean
Most of us have seen the mean in sports statistics, such as points per game or shooting percentage in basketball. The mean, also called the average or arithmetic mean, is the sum of all the numbers in the dataset divided by how many numbers there are. For example, suppose you have a list of students with the following grades:
[90, 89, 76, 56, 78, 23, 88, 100, 70, 96]
To find the mean, add all the numbers together and then divide the sum by the number of grades. For the previous list, we have:
(90 + 89 + 76 + 56 + 78 + 23 + 88 + 100 + 70 + 96) / 10 = 76.6
The mean is 76.6
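The same calculation can be sketched in Python. This is a minimal illustration that assumes the grades are stored in a plain list; the variable names are chosen here for the example, and `statistics.mean` from the standard library is used only to cross-check the manual arithmetic.

```python
from statistics import mean

# Grades from the example above.
grades = [90, 89, 76, 56, 78, 23, 88, 100, 70, 96]

# Manual calculation: sum of the grades divided by how many there are.
manual_mean = sum(grades) / len(grades)

# The standard library performs the same calculation.
library_mean = mean(grades)

print(manual_mean)   # 76.6
print(library_mean)  # 76.6
```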
Median
The median is the value right in the middle of the list. It is easy to calculate, but the numbers need to be sorted first. Once they are sorted, you pick the number in the middle of the list. If the list has an even number of values, there is no single number at the center, so you take the mean of the two numbers in the middle. This is an example with eleven sorted numbers:
[18, 23, 56, 70, 76, 78, 88, 89, 90, 96, 100]
In this case, 78 is the median: it is the sixth of the eleven values, with five values on each side. If there are only 10 values in the list, you need the average of the two numbers in the middle:
[23, 56, 70, 76, 78, 88, 89, 90, 96, 100]
The median is (78 + 88)/2 = 83
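The odd and even cases can be sketched in a few lines of Python. The helper name `median_manual` is made up for this illustration, and it assumes a non-empty list of numbers; the standard library's `statistics.median` follows the same rule and is what you would normally call.

```python
from statistics import median


def median_manual(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # the median is only defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd length: a single value sits in the middle.
        return ordered[mid]
    # Even length: average the two values in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


eleven_values = [18, 23, 56, 70, 76, 78, 88, 89, 90, 96, 100]
ten_values = [23, 56, 70, 76, 78, 88, 89, 90, 96, 100]

print(median_manual(eleven_values))  # 78
print(median_manual(ten_values))     # 83.0
```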
Mode
The mode is the value that occurs most often in the list. Suppose we have the following list:
[90, 89, 76, 76, 78, 76, 23, 88, 76, 100, 76, 89]
Here the mode is 76, which appears five times, more than any other value. A dataset does not always have a mode: if every value occurs only once, no value stands out. Also, the mode might not be relevant if its frequency is not large compared to that of the other values.
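As a rough sketch, the mode of the list above can be found in Python by counting occurrences. `collections.Counter` and `statistics.multimode` (available from Python 3.8) are standard library tools; the variable names and the short `no_repeats` list are illustrative assumptions, not part of the original data.

```python
from collections import Counter
from statistics import multimode

grades = [90, 89, 76, 76, 78, 76, 23, 88, 76, 100, 76, 89]

# Count how often each value occurs and take the most frequent one.
counts = Counter(grades)
mode_value, mode_frequency = counts.most_common(1)[0]
print(mode_value, mode_frequency)  # 76 5

# Illustrative list with no repeated values: every value ties for
# "most frequent", so no single mode stands out.
no_repeats = [90, 89, 76]
print(multimode(no_repeats))  # [90, 89, 76]
```

Checking the frequency alongside the value is a simple way to judge whether the mode is meaningful for the dataset.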
Gotchas
How you interpret these measures depends on the context of the data, and getting it right leads to better business decisions. For instance, if you are calculating the average salary of all the employees of a company, you need to consider outliers. The CEO or president of the company might have a salary of 3 million dollars, which can raise the average drastically and give a misleading picture of the overall salaries.
The following list of salaries for company X has 10 people, including Julia Wright, who is the CEO:
Name | Salary |
---|---|
Jane Johnston | 35,000 |
Mark Stevens | 41,000 |
Joe Martin | 30,000 |
William Thompson | 35,000 |
Brooks Daniels | 67,000 |
Miriam Pollard | 50,000 |
Peter Simmons | 60,000 |
Mayleen Fowler | 50,000 |
Domenic Dobson | 34,000 |
Julia Wright | 580,000 |
If somebody wants to work at the company, you cannot tell them that the average salary is 98,200, because they would expect a paycheck far higher than the one they will actually receive. In reality, only one person makes more than 100,000, and she is the CEO. The average is arithmetically correct but not representative of the salary a new employee is likely to get, because the CEO's salary is an outlier that pulls the mean upward. There are two options for getting a more representative number: use the median as the center of the data, or remove the outlier before calculating the mean. The median is 45,500 and the mean without the outlier is 44,667. Both are closer to reality, since nobody other than the CEO makes more than 100,000.
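The figures for company X can be reproduced with a short Python sketch. It assumes the table is stored as a dictionary keyed by name, and it removes the outlier by name only because we already know who the CEO is; a real analysis would need an explicit rule for deciding what counts as an outlier.

```python
from statistics import mean, median

# Salaries from the table above.
salaries = {
    "Jane Johnston": 35_000,
    "Mark Stevens": 41_000,
    "Joe Martin": 30_000,
    "William Thompson": 35_000,
    "Brooks Daniels": 67_000,
    "Miriam Pollard": 50_000,
    "Peter Simmons": 60_000,
    "Mayleen Fowler": 50_000,
    "Domenic Dobson": 34_000,
    "Julia Wright": 580_000,  # CEO, the outlier
}

all_salaries = list(salaries.values())
mean_all = mean(all_salaries)      # pulled upward by the CEO's salary
median_all = median(all_salaries)  # barely affected by the outlier

# Drop the known outlier and recompute the mean.
without_ceo = [pay for name, pay in salaries.items() if name != "Julia Wright"]
mean_without_ceo = mean(without_ceo)

print(mean_all)                 # 98200
print(median_all)               # 45500.0
print(round(mean_without_ceo))  # 44667
```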
In some instances, you might have missing values that make your calculations less accurate. In that case, you can fill the gaps with the mode when that makes sense for the data, or remove the affected rows completely. Which option is better depends on the context of your data.
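Both options can be sketched in Python. The `observed` list below is hypothetical data invented for this illustration, and it assumes missing entries are marked with `None`; real datasets may mark them differently.

```python
from statistics import mode

# Hypothetical column of values; None marks a missing entry.
observed = [76, None, 89, 76, None, 90, 76]

# Option 1: remove the rows with missing values.
dropped = [value for value in observed if value is not None]

# Option 2: fill the gaps with the mode of the values that are present.
fill_value = mode(dropped)
filled = [fill_value if value is None else value for value in observed]

print(dropped)  # [76, 89, 76, 90, 76]
print(filled)   # [76, 76, 89, 76, 76, 90, 76]
```

Filling with the mode keeps every row but makes the most common value look even more common, which is one reason the choice depends on the data.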