Easily Calculate Python Correlation

NumPy, SciPy, and pandas: Correlation With Python

Correlation coefficients quantify the association between variables or features of a dataset. These statistics are of high importance for science and technology, and Python has great tools that you can use to calculate them. SciPy, NumPy, and pandas correlation methods are fast, comprehensive, and well-documented.

In this tutorial, you’ll learn:

What Pearson, Spearman, and Kendall correlation coefficients are
How to use SciPy, NumPy, and pandas correlation functions
How to visualize data, regression lines, and correlation matrices with Matplotlib

Correlation

Statistics and data science are often concerned with the relationships between two or more variables (or features) of a dataset. Each data point in the dataset is an observation, and the features are the properties or attributes of those observations.

Every dataset you work with uses variables and observations. For example, you might be interested in understanding the following:

How the height of basketball players is correlated to their shooting accuracy
Whether there’s a relationship between employee work experience and salary
What mathematical dependence exists between the population density and the gross domestic product of different countries

In the examples above, height, shooting accuracy, years of experience, salary, population density, and gross domestic product are the features or variables. The data related to each player, each employee, and each country are the observations.

Example: NumPy Correlation Calculation

NumPy is a third-party Python library for numerical computing. You can use its numpy.corrcoef() function to calculate the correlation matrix, which holds the Pearson correlation coefficient for every pair of variables. By default, numpy.corrcoef() treats each row of its input as one variable and each column as one observation.

Here’s an example that shows how to calculate the correlation matrix using NumPy:

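import numpy as np

# Each row is one observation, each column is one variable.
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# np.corrcoef() expects variables in rows, so transpose first.
correlation_matrix = np.corrcoef(data.T)

# The columns are exact linear functions of each other, so every entry is 1.
print(correlation_matrix)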

Example: SciPy Correlation Calculation

SciPy is a library for scientific computing in Python. It provides a function called scipy.stats.pearsonr() that takes two arrays of equal length and returns the Pearson correlation coefficient together with a p-value for the null hypothesis that the two variables are uncorrelated.

Here’s an example that shows how to calculate the Pearson correlation coefficient and p-value using SciPy:

from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

correlation_coefficient, p_value = pearsonr(x, y)

print("Pearson correlation coefficient:", correlation_coefficient)
print("p-value:", p_value)

Example: pandas Correlation Calculation

pandas is a library for data manipulation and analysis. Its DataFrame objects have a method called .corr() (pandas.DataFrame.corr()) that calculates the pairwise correlation of the columns in a DataFrame.

Here’s an example that shows how to calculate the pairwise correlation using pandas:

import pandas as pd

data = {
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],
    'z': [3, 6, 9, 12, 15]
}

df = pd.DataFrame(data)
correlation_matrix = df.corr()

print(correlation_matrix)

Linear Correlation

Linear correlation measures the strength and direction of the linear relationship between two variables. There are various methods to calculate linear correlation, including the Pearson correlation coefficient, which measures the linear relationship between two continuous variables.

Pearson Correlation Coefficient

The Pearson correlation coefficient, also known as Pearson’s r, is a measure of the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, with 1 indicating a perfect positive linear relationship, -1 indicating a perfect negative linear relationship, and 0 indicating no linear relationship.
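To make the definition concrete, here is a minimal sketch that computes Pearson’s r directly from its textbook formula (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with numpy.corrcoef(). The sample values and the names r_manual and r_numpy are illustrative assumptions, not data from this tutorial.

```python
import numpy as np

# Illustrative sample data (an assumption, chosen so the arithmetic is easy to follow)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Deviations of each value from its mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: covariance term divided by the product of the spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value taken from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

For this data the sum of products is 8 and both sums of squares are 10, so r works out to 0.8, a fairly strong positive linear relationship.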

Linear Regression: SciPy Implementation

Linear regression is a technique that fits a straight line to a set of data points in such a way that the sum of the squared distances between the observed and predicted values is minimized. SciPy provides a function called scipy.stats.linregress() that can be used to calculate the regression line that best fits the given data.
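As a rough illustration, the following sketch fits a regression line with scipy.stats.linregress() and reads the slope, intercept, and correlation coefficient from the result. The sample data is an assumption made for this example.

```python
import numpy as np
from scipy.stats import linregress

# Illustrative sample data (an assumption for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

result = linregress(x, y)

# The fitted line is y = intercept + slope * x
print("slope:", result.slope)
print("intercept:", result.intercept)
# rvalue is the Pearson correlation coefficient of x and y
print("r:", result.rvalue)
print("p-value:", result.pvalue)

# Predicted values on the regression line
y_predicted = result.intercept + result.slope * x
```

Note that the rvalue attribute of the result is the same Pearson coefficient you would get from scipy.stats.pearsonr() for these two arrays.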

Pearson Correlation: NumPy and SciPy Implementation

You can use NumPy and SciPy to calculate the Pearson correlation coefficient between two arrays of data. numpy.corrcoef() returns the full correlation matrix, in which the off-diagonal elements hold the coefficient, while scipy.stats.pearsonr() returns the coefficient itself together with a p-value.
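The sketch below runs both functions on the same pair of arrays to show how their return values relate. The data is made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative sample data (an assumption for this sketch)
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# NumPy returns a 2x2 matrix: ones on the diagonal, r off the diagonal
r_matrix = np.corrcoef(x, y)
print(r_matrix)

# SciPy returns the coefficient and a p-value
r, p_value = pearsonr(x, y)
print("r:", r)
print("p-value:", p_value)
```

If you only need the coefficient from NumPy, you can index the matrix with r_matrix[0, 1] or r_matrix[1, 0]; the matrix is symmetric, so both positions hold the same value.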

Pearson Correlation: pandas Implementation

pandas provides the DataFrame method pandas.DataFrame.corr(), which calculates the pairwise correlation of columns in a DataFrame. By default, it calculates the Pearson correlation coefficient.
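Here is a small sketch of the pandas approach, assuming a DataFrame with two illustrative columns named x and y. It uses DataFrame.corr() for the full matrix and Series.corr() for a single pair of columns.

```python
import pandas as pd

# Illustrative sample data (column names and values are assumptions)
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 1, 4, 3, 5],
})

# Pairwise Pearson correlation of all columns (method="pearson" is the default)
corr_matrix = df.corr()
print(corr_matrix)

# Correlation between two individual columns
r_xy = df["x"].corr(df["y"])
print(r_xy)
```

The result of df.corr() is itself a DataFrame labeled by column names, so you can look up a coefficient with corr_matrix.loc["x", "y"].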

Rank Correlation

Rank correlation is a measure of the strength and direction of the monotonic relationship between two variables. Unlike linear correlation, it doesn’t assume a linear relationship between the variables. Instead, it measures the consistency of the rankings of the variables.

Spearman Correlation Coefficient

The Spearman correlation coefficient, also known as Spearman’s rho, is a measure of the monotonic relationship between two variables. It ranges from -1 to 1, with 1 indicating a perfect positive monotonic relationship, -1 indicating a perfect negative monotonic relationship, and 0 indicating no monotonic relationship.
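One way to build intuition is to note that Spearman’s rho is the Pearson coefficient computed on the ranks of the data. The sketch below, which uses made-up data following y = x³, shows that a monotonic but non-linear relationship gives a Spearman coefficient of 1 while the Pearson coefficient stays below 1.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Illustrative data: y grows monotonically with x, but not linearly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson measures linear association only
r_pearson, _ = pearsonr(x, y)

# Spearman measures monotonic association
rho, _ = spearmanr(x, y)

# Spearman's rho equals Pearson's r applied to the ranks
rho_from_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print("Pearson r:", r_pearson)
print("Spearman rho:", rho)
print("Pearson r of ranks:", rho_from_ranks)
```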

Kendall Correlation Coefficient

The Kendall correlation coefficient, also known as Kendall’s tau, is a measure of the strength and direction of the ordinal relationship between two variables. It ranges from -1 to 1, with 1 indicating a perfect positive ordinal relationship, -1 indicating a perfect negative ordinal relationship, and 0 indicating no ordinal relationship.
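Kendall’s tau can be understood by counting pairs of observations. A pair is concordant when both variables order it the same way and discordant when they order it oppositely. The following sketch counts the pairs by hand for tie-free, made-up data and compares the result with scipy.stats.kendalltau(). The simple formula used here, (concordant − discordant) / number of pairs, assumes there are no ties.

```python
from itertools import combinations

from scipy.stats import kendalltau

# Illustrative data without ties (an assumption for this sketch)
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

concordant = 0
discordant = 0
# Compare every pair of observations exactly once
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    sign = (x1 - x2) * (y1 - y2)
    if sign > 0:
        concordant += 1
    elif sign < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

tau_scipy, p_value = kendalltau(x, y)

print(concordant, discordant, tau_manual, tau_scipy)
```

Here 8 of the 10 pairs are concordant and 2 are discordant, which gives tau = 0.6. When the data contains ties, SciPy’s default variant (tau-b) applies a correction, so the hand count above would no longer match.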

Rank: SciPy Implementation

SciPy provides a function called scipy.stats.rankdata() that calculates the ranks of a set of data points. You don’t need to rank the data yourself before computing a rank correlation coefficient, because scipy.stats.spearmanr() and scipy.stats.kendalltau() work on the ranks internally, but rankdata() is useful when you want to inspect the ranks or control how ties are handled.
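As a short illustration, the sketch below ranks a small made-up array that contains a tie and shows how the method parameter of rankdata() changes the result.

```python
import numpy as np
from scipy.stats import rankdata

# Illustrative data with a tie: the value 2 appears twice
values = np.array([8, 2, 0, 2])

# Default method="average": tied values share the mean of their ranks
ranks_average = rankdata(values)
print(ranks_average)   # [4.  2.5 1.  2.5]

# method="min": tied values all receive the lowest rank of the group
ranks_min = rankdata(values, method="min")
print(ranks_min)       # [4 2 1 2]

# method="ordinal": ties are broken by order of appearance
ranks_ordinal = rankdata(values, method="ordinal")
print(ranks_ordinal)   # [4 2 1 3]
```

Ranks start at 1 for the smallest value. To rank in descending order, one common trick is to rank the negated array, for example rankdata(-values).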

Rank Correlation: NumPy and SciPy Implementation

You can use NumPy arrays together with SciPy to calculate the rank correlation coefficients between two arrays of data. The scipy.stats.spearmanr() and scipy.stats.kendalltau() functions calculate the Spearman and Kendall correlation coefficients, respectively, and each also returns a p-value.
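The sketch below calls both functions on the same made-up arrays. Each call returns the coefficient and a p-value, which are unpacked as a tuple here.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Illustrative sample data (an assumption for this sketch)
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

rho, rho_p = spearmanr(x, y)
tau, tau_p = kendalltau(x, y)

print("Spearman rho:", rho, "p-value:", rho_p)
print("Kendall tau:", tau, "p-value:", tau_p)

# A strictly decreasing relationship gives -1 for both coefficients
rho_neg, _ = spearmanr(x, -x)
tau_neg, _ = kendalltau(x, -x)
print(rho_neg, tau_neg)
```

The two coefficients usually differ in magnitude for the same data, because Spearman’s rho works with rank differences while Kendall’s tau counts concordant and discordant pairs.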

Rank Correlation: pandas Implementation

The same pandas.DataFrame.corr() method also calculates rank correlations. By setting the method parameter to 'spearman' or 'kendall', you can calculate the Spearman or Kendall correlation coefficient, respectively, instead of the default Pearson coefficient.
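A minimal sketch of the method parameter in use follows. The column names and values are illustrative assumptions, and the example assumes SciPy is installed alongside pandas, since pandas relies on it for some rank correlation methods.

```python
import pandas as pd

# Illustrative sample data (column names and values are assumptions)
df = pd.DataFrame({
    "x": list(range(10, 20)),
    "y": [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
})

# Spearman rank correlation matrix
spearman_matrix = df.corr(method="spearman")
print(spearman_matrix)

# Kendall rank correlation matrix
kendall_matrix = df.corr(method="kendall")
print(kendall_matrix)
```

As with the Pearson case, each result is a DataFrame labeled by column names, so spearman_matrix.loc["x", "y"] returns a single coefficient.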

Visualization of Correlation

Visualizing the data is essential to gain insights into the correlation between variables. Two common visualizations for correlation analysis are x-y plots with a regression line and heatmaps of correlation matrices.

X-Y Plots With a Regression Line

X-Y plots with a regression line can be used to visualize the linear relationship between two continuous variables. Matplotlib provides functions like matplotlib.pyplot.plot() and matplotlib.pyplot.scatter() that can be used to create scatter plots and regression lines.
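The sketch below shows one possible way to draw such a plot. It uses the Axes methods ax.scatter() and ax.plot(), which are the object-oriented counterparts of the pyplot functions mentioned above, and it obtains the line from scipy.stats.linregress(). The data and labels are assumptions, and the non-interactive "Agg" backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress

# Illustrative sample data (an assumption for this sketch)
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Fit the regression line
fit = linregress(x, y)
line_label = f"Regression line: y = {fit.intercept:.2f} + {fit.slope:.2f}x, r = {fit.rvalue:.2f}"

fig, ax = plt.subplots()
# Scatter plot of the raw observations
ax.scatter(x, y, label="Data points")
# Regression line evaluated at the same x values
ax.plot(x, fit.intercept + fit.slope * x, color="tab:red", label=line_label)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Call plt.show() with an interactive backend to display the figure, or fig.savefig() with a file name of your choice to write it to disk.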

Heatmaps of Correlation Matrices

Heatmaps of correlation matrices provide a visual representation of the correlation between multiple variables. Matplotlib provides a function called matplotlib.pyplot.imshow() that can be used to create heatmaps, and the pandas.DataFrame.corr() method can be used to calculate the correlation matrix.
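As a rough sketch, the following code computes a correlation matrix with pandas and draws it as an annotated heatmap. It calls ax.imshow(), the Axes counterpart of matplotlib.pyplot.imshow(). The three columns, the color map, and the fixed color range from -1 to 1 are choices made for this example, and the "Agg" backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative sample data (column names and values are assumptions)
df = pd.DataFrame({
    "x": list(range(10, 20)),
    "y": [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
    "z": [5, 3, 2, 1, 0, -2, -8, -11, -15, -16],
})

corr_matrix = df.corr()
labels = list(corr_matrix.columns)

fig, ax = plt.subplots()
# Fix the color scale to the full range of a correlation coefficient
im = ax.imshow(corr_matrix.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")

# Label the axes with the column names
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write each coefficient into its cell
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{corr_matrix.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="Pearson r")
```

Fixing vmin and vmax keeps the colors comparable between different datasets, since a coefficient of 0 always maps to the middle of the color map.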

Conclusion

Correlation coefficients are important statistics for understanding the relationships between variables in a dataset. The Python ecosystem offers libraries like NumPy, SciPy, and pandas that make it easy to calculate correlations, and Matplotlib to visualize them. By using these tools, you can gain valuable insights from your data and make informed decisions.