Unlocking Insights: A Beginner's Guide to Statistics with Python
Chapter 1: Understanding Tabular Data
In our data-driven world, we frequently encounter tabular data, such as that found in Excel spreadsheets. By grasping fundamental statistical concepts, we can extract valuable insights from these datasets.
Statistics is the branch of mathematics concerned with collecting, analyzing, interpreting, and presenting numerical data. This introductory series explores how basic statistical techniques can be applied to datasets to understand them better.
Section 1.1: Key Statistical Measures
Arithmetic Mean
Definition: The arithmetic mean is the total sum of all values divided by the count of observations.
Uses: The mean is the natural choice when a straightforward average is needed and every observation should count equally. It expresses the total of the observations relative to their number.
Limitations: The mean is highly sensitive to outliers. For instance, the mean of 5, 10, and 60 is (5 + 10 + 60) / 3 = 25, a value larger than two of the three observations. Because of this sensitivity, the median is often preferred for datasets that may contain extreme values, such as salaries.
To compute the arithmetic mean in Python, we can use either the statistics or numpy library:
import statistics as st
data = [5, 10, 15, 20, 25]
x = st.mean(data)
print(x)
Alternatively, with numpy:
import numpy as np
data = [5, 10, 15, 20, 25]
x = np.mean(data)
print(x)
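To make the outlier sensitivity described under Limitations concrete, the following sketch compares the mean and the median of the values 5, 10, and 60 from that example. The variable names are illustrative only.

```python
import statistics as st

# Values from the Limitations example: 60 is the outlier
data = [5, 10, 60]

mean_value = st.mean(data)      # (5 + 10 + 60) / 3
median_value = st.median(data)  # middle value of the ordered data

print(mean_value)    # 25
print(median_value)  # 10
```

The single large value pulls the mean well above the median, which stays at the middle observation.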
Median
Definition: The median is the middle value in an ordered dataset. For datasets with an even number of entries, the median is calculated as the average of the two middle values.
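Example: For the dataset [1, 2, 3, 4] (where n = 4), the two middle values sit at positions n/2 = 2 and n/2 + 1 = 3 of the ordered data. Those values are 2 and 3, so:
Median = (2 + 3)/2 = 2.5
For a dataset with an odd number of entries, such as [5, 6, 7, 8, 100] (where n = 5):
Median position = (5 + 1)/2 = 3, so the median is the third ordered value, which is 7.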
Uses: The median is particularly useful for ordered data that may contain extreme values, because it is not influenced by them.
To find the median in Python:
import statistics as st
data = [5, 6, 9, 20] # Even dataset
x = st.median(data)
print(x)
Or, for a dataset with an odd number of entries, using numpy:
import numpy as np
data = [5, 6, 7, 8, 100]
x = np.median(data)
print(x)
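The position rule from the definition can also be written out by hand, which may help when checking the library results. This is a minimal sketch; the function name median_by_position is hypothetical and not part of any library.

```python
def median_by_position(values):
    """Return the median using the middle-position rule (illustrative helper)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value (1-based position (n + 1) / 2)
        return ordered[mid]
    # Even count: average of the two middle values (1-based positions n/2 and n/2 + 1)
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_position([1, 2, 3, 4]))       # 2.5
print(median_by_position([5, 6, 7, 8, 100]))  # 7
```

Sorting first matters: the rule applies to the ordered data, not to the values in the order they were recorded.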
Mode
Definition: The mode is the value that appears most frequently within a dataset. A dataset may have multiple modes.
Uses: This measure helps identify the most common values or categories within a dataset and is unaffected by outliers.
To calculate the mode in Python:
import statistics as st
data = [1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 10]
x = st.mode(data)
print(x)
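Because a dataset may have multiple modes, it can be worth knowing that st.mode returns only one value. The sketch below uses statistics.multimode (available in Python 3.8 and later) to list all of them; the sample data is made up for illustration.

```python
import statistics as st

# Illustrative data with two equally frequent values (2 and 8)
data = [1, 2, 2, 3, 8, 8, 9]

# mode() returns a single value: the first mode encountered (Python 3.8+)
single_mode = st.mode(data)

# multimode() returns every most-frequent value, in order of first appearance
all_modes = st.multimode(data)

print(single_mode)  # 2
print(all_modes)    # [2, 8]
```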
Weighted Mean
Definition: Unlike the arithmetic mean, which treats every value equally, the weighted mean assigns a weight to each value in the dataset.
Uses: The result reflects the importance of each observation according to its weight. The weights must total 100% (that is, sum to 1); otherwise the weighted sum has to be divided by the sum of the weights.
Example: In an investment portfolio with various returns and weights, the weighted mean can be computed as follows:
Weighted mean = (0.1 * 10) + (0.15 * 15) + (0.2 * 20) + (0.25 * 25) + (0.3 * 30) = 22.5%.
In Python, we can create a DataFrame for this:
import pandas as pd
import numpy as np
# Each row holds: asset id, portfolio weight (as a fraction), return in percent
df = pd.DataFrame(np.array([[1, 0.10, 10], [2, 0.15, 15], [3, 0.20, 20], [4, 0.25, 25], [5, 0.30, 30]]),
                  columns=['Asset', 'Weight', 'Return %'])
And then calculate the weighted mean:
# Multiply each return by its weight and add up the products
x = (df['Weight'] * df['Return %']).sum()
print(x)
# Output: 22.5
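As an alternative to multiplying the columns by hand, numpy provides np.average with a weights argument, which also divides by the sum of the weights. The sketch below assumes the same five returns as the example; the raw weights [2, 3, 4, 5, 6] are an illustrative choice with the same proportions as 10% to 30%.

```python
import numpy as np

returns = [10, 15, 20, 25, 30]            # returns in percent, from the example
weights = [0.10, 0.15, 0.20, 0.25, 0.30]  # fractions that sum to 1

# np.average computes sum(returns * weights) / sum(weights)
weighted_mean = np.average(returns, weights=weights)

# Weights that do not sum to 1 give the same result,
# because np.average normalizes by their total (here 20)
raw_weights = [2, 3, 4, 5, 6]
weighted_mean_raw = np.average(returns, weights=raw_weights)

print(round(weighted_mean, 6))      # 22.5
print(round(weighted_mean_raw, 6))  # 22.5
```

The results are rounded before printing only to hide possible floating-point noise in the last digits.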
Geometric Mean
Definition: The geometric mean is the nth root of the product of n values. It is the average to use when values are multiplied together rather than added.
Uses: It is frequently used for values that are multiplied together or that describe exponential growth, such as compound interest rates.
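For instance, to find the geometric mean growth rate for Asset 1, each rate is first converted to a growth factor (1 + rate):
import statistics as st
asset_1_growth_rate = [0.05, 0.02, -0.06]
initial = 1
# Convert each rate to a growth factor, e.g. 0.05 -> 1.05
asset_1_growth_factor = [x + initial for x in asset_1_growth_rate]
asset_1_gm = st.geometric_mean(asset_1_growth_factor)
# Convert the mean growth factor back to a percentage rate
print((asset_1_gm - 1) * 100)
# Output: approximately 0.22 (percent)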
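To see why the geometric mean rather than the arithmetic mean is used for growth rates, the sketch below checks that compounding at the geometric mean rate reproduces the actual cumulative growth of Asset 1. It uses math.prod (Python 3.8 and later); the variable names are illustrative.

```python
import math
import statistics as st

rates = [0.05, 0.02, -0.06]        # Asset 1 growth rates per period
factors = [1 + r for r in rates]   # growth factors: 1.05, 1.02, 0.94

total_growth = math.prod(factors)  # cumulative growth over all periods
gm = st.geometric_mean(factors)    # nth root of the product
am = st.mean(factors)              # arithmetic mean, for comparison

# Compounding at the geometric mean matches the real cumulative growth
print(math.isclose(gm ** len(factors), total_growth))  # True

# Compounding at the arithmetic mean overstates it
print(am ** len(factors) > total_growth)               # True
```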
A lower geometric mean may indicate greater variability or inconsistency compared to oth