Basic Statistical Skills
Mean
When to use: To find the mathematical average of a continuous dataset (e.g., average river depth, mean annual temperature, average pebble size).
Advantages
- Uses every value in the dataset, making it mathematically precise.
- Essential for higher-level statistics like Standard Deviation.
Disadvantages
- Can be significantly skewed by extreme values or anomalies.
- Can produce values that are impossible for discrete counts (e.g., 2.4 pedestrians), so the result may need rounding or careful interpretation.
Median
When to use: To find the middle value of an ordered dataset. Ideal when data is heavily skewed (e.g., income levels, property values).
Advantages
- Unaffected by extreme outliers, providing a more representative value in skewed datasets.
- More accurately reflects the 'typical' value when the data contains significant anomalies.
Disadvantages
- Ignores the specific values of most data points, focusing only on their ranked position.
- Requires the dataset to be ordered, which can be time-consuming for large samples.
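The contrast between the mean and the median on skewed data can be sketched in a few lines of Python. The house prices below are made-up values chosen for illustration (in thousands of pounds), with one deliberately extreme value at the end.

```python
from statistics import mean, median

# Hypothetical house prices (thousands of pounds); 1500 is an extreme outlier.
prices = [180, 195, 210, 225, 240, 1500]

mean_price = mean(prices)      # uses every value, so the outlier pulls it upward
median_price = median(prices)  # middle of the ordered values (average of 210 and 225)

print(mean_price)    # 425
print(median_price)  # 217.5
```

Here the mean is higher than five of the six values, while the median stays close to the 'typical' price.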
Mode
When to use: To identify the most frequently occurring value or category. Best suited for nominal (categorical) data, like land use types or dominant ethnic groups.
Advantages
- The only measure of central tendency applicable to nominal (unordered categorical) data.
- Unaffected by extreme numerical outliers.
Disadvantages
- A dataset may have no mode or multiple modes (bimodal/multimodal), leading to ambiguity.
- Does not consider all data points in its calculation.
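As a rough illustration, the mode of a categorical dataset, and the ambiguity of a bimodal one, might look like this in Python. Both datasets are invented for the example; `multimode` needs Python 3.8 or later.

```python
from statistics import mode, multimode

# Hypothetical land use survey: nominal data, so the mean and median do not apply.
land_use = ["residential", "retail", "residential", "industrial", "retail", "residential"]
most_common_use = mode(land_use)

# Hypothetical pebble sizes (mm) where two values tie for most frequent.
pebble_sizes_mm = [12, 15, 15, 18, 21, 21, 30]
all_modes = multimode(pebble_sizes_mm)

print(most_common_use)  # residential
print(all_modes)        # [15, 21]
```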
Statistics using Ranges
Range
When to use: To quickly calculate the difference between the highest and lowest values in a dataset.
Advantages
- Simple and quick to calculate.
- Provides an immediate, though basic, understanding of the data's total spread.
Disadvantages
- Extremely sensitive to outliers as it only uses the two most extreme values.
- Provides no information about the distribution of data between the extremes.
Interquartile Range (IQR)
When to use: To measure the spread of the middle 50% of the data. Often used to compare data distributions and identify potential outliers (e.g., using box plots).
Advantages
- Eliminates the influence of extreme upper and lower outliers.
- Effective for comparing the internal spread of different datasets.
Disadvantages
- Ignores 50% of the dataset (the lowest 25% and highest 25%).
- More complex to calculate than the standard range.
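A minimal sketch of the range and IQR on the same data, assuming a made-up set of eleven river depth readings with one extreme value. Note that textbooks and software use slightly different conventions for locating quartiles; this example uses Python's default ("exclusive") method, which may not match a hand method exactly on other datasets.

```python
from statistics import quantiles

# Hypothetical river depths (cm); 40 is an extreme value.
depths = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 40]

# Range: only the two most extreme values are used.
data_range = max(depths) - min(depths)

# Quartiles split the ordered data into four equal parts.
q1, q2, q3 = quantiles(depths, n=4)

# IQR: spread of the middle 50% of the data.
iqr = q3 - q1

print(data_range)  # 38
print(q1, q3)      # 5.0 14.0
print(iqr)         # 9.0
```

The single reading of 40 inflates the range to 38, while the IQR of 9 reflects the spread of the bulk of the readings.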
Standard Deviation
When to use: To measure how far data points typically lie from the mean. A small SD indicates data is clustered tightly around the mean, so the mean is a more reliable summary of the dataset; a large SD indicates data is widely spread out.
Advantages
- Shows how much values vary around the mean.
- Uses every value in the dataset, so a single extreme value distorts it less than it distorts the range.
Disadvantages
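- Time-consuming to calculate by hand, especially for large datasets.
- Still influenced by extreme values, and most meaningful for roughly symmetrical (normally distributed) data; for heavily skewed data the IQR is often a better measure of spread.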
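The calculation can be sketched step by step as follows. This assumes the population form of the standard deviation (dividing by n); some courses and software use the sample form (dividing by n - 1) instead, so check which one is expected. The data values are invented for the example.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical dataset (e.g., pebble sizes in cm).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: find the mean.
m = mean(values)

# Step 2: square each value's deviation from the mean.
squared_deviations = [(x - m) ** 2 for x in values]

# Step 3: average the squared deviations, then take the square root.
sd = sqrt(sum(squared_deviations) / len(values))

print(m)               # 5
print(sd)              # 2.0
print(pstdev(values))  # 2.0, the library function gives the same result
```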
Statistical Tests & Advanced Skills
Spearman's Rank
When to use: To measure the strength and direction of a relationship between two variables. Ideal for testing hypotheses such as "Does pebble size decrease with distance from the source?" or "Is there a correlation between a town's deprivation index and distance from the CBD?".
Advantages
- Provides a precise numerical value (-1 to +1) indicating the strength and direction of a correlation.
- Can be used with ranked data and does not rely on a normal data distribution. This makes it useful for checking whether an unclear pattern on a scatter graph reflects a genuine correlation.
Disadvantages
- Manual ranking of large datasets can be laborious and prone to error.
- Correlation does not imply causation; it cannot prove that one variable causes a change in the other.
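A minimal sketch of the calculation, using the common formula r_s = 1 - 6Σd² / (n(n² - 1)). It assumes there are no tied values in either variable (ties need averaged ranks, which this simple `rank` helper does not handle). The helper name and the data are made up for illustration.

```python
def rank(values):
    """Rank values from 1 (smallest) upward. Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Hypothetical fieldwork data: six sites along a river.
distance_km = [1, 2, 3, 4, 5, 6]
pebble_mm = [90, 75, 80, 50, 40, 30]

rank_x = rank(distance_km)
rank_y = rank(pebble_mm)

# d is the difference between the two ranks at each site.
sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))

n = len(distance_km)
r_s = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

print(sum_d_squared)   # 68
print(round(r_s, 3))   # -0.943
```

A value close to -1 suggests a strong negative correlation, but the result still has to be compared with a critical value for the sample size before it can be called statistically significant.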
Mann-Whitney U Test
When to use: To test for a significant difference between two independent sets of data that are not normally distributed. It is a non-parametric alternative to the independent-samples t-test. For example, "Is there a significant difference in the Environmental Quality Score between two different residential streets?"
Advantages
- Effective for skewed data as it does not require a normal distribution.
- Clearly determines if the difference between two data sets is statistically significant.
Disadvantages
- The manual calculation is lengthy and can be prone to human error.
- Unreliable with very small samples (typically fewer than 5 values in each data set).
- Can only be used to test the difference between two data sets.
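One way to sketch the U statistic is by counting pairs: for every pairing of a value from one sample with a value from the other, score 1 if the first is larger and 0.5 if they tie. This gives the same U as the rank-sum method usually taught for hand calculation. The function name and the scores are hypothetical.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return the smaller of the two U values for two independent samples."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0   # a beats b
            elif a == b:
                u_a += 0.5   # ties count as half
    # The two U values always add up to n_a * n_b.
    u_b = len(sample_a) * len(sample_b) - u_a
    return min(u_a, u_b)

# Hypothetical Environmental Quality Scores for two streets.
street_a = [12, 15, 9, 18, 14, 11]
street_b = [20, 22, 17, 25, 19, 16]

u = mann_whitney_u(street_a, street_b)
print(u)  # 2.0
```

The smaller U is then compared with the critical value from a published table for the two sample sizes; the difference is significant if U is less than or equal to that critical value.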
Chi-Squared Test
When to use: To compare observed frequencies of categorical data with the frequencies expected if there were no relationship, to see whether there is a significant association between two categorical variables. For example, "Is there a significant difference in the preferred shopping location (e.g., city centre, retail park) between different age groups?".
Advantages
- Useful for testing hypotheses using categorical data (frequency counts).
Disadvantages
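- Sensitive to sample size: small expected frequencies (commonly below 5) make the result unreliable, while very large samples can make trivial differences appear significant.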
- It only indicates whether a significant relationship exists, not how strong that relationship is.
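A minimal sketch of the calculation for a 2 x 2 table, using the usual approach: expected frequency = (row total x column total) / grand total, then summing (observed - expected)² / expected over every cell. The survey counts below are invented for illustration.

```python
# Hypothetical survey counts. Rows: age group; columns: city centre, retail park.
observed = [
    [30, 10],  # under 40
    [20, 40],  # 40 and over
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_squared = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Frequency expected in this cell if age and location were unrelated.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_squared += (obs - expected) ** 2 / expected

# Degrees of freedom = (rows - 1) x (columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi_squared, 2))  # 16.67
print(degrees_of_freedom)     # 1
```

The chi-squared value is then compared with the critical value for the given degrees of freedom; a value above the critical value indicates a significant association.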