Easily Calculate Python Correlation
NumPy, SciPy, and pandas: Correlation With Python
Correlation coefficients quantify the association between variables or features of a dataset. These statistics are of high importance for science and technology, and Python has great tools that you can use to calculate them. SciPy, NumPy, and pandas correlation methods are fast, comprehensive, and well-documented.
In this tutorial, you’ll learn:
- What Pearson, Spearman, and Kendall correlation coefficients are
- How to use SciPy, NumPy, and pandas correlation functions
- How to visualize data, regression lines, and correlation matrices with Matplotlib
Correlation
Statistics and data science are often concerned with the relationships between two or more variables (or features) of a dataset. Each data point in the dataset is an observation, and the features are the properties or attributes of those observations.
Every dataset you work with uses variables and observations. For example, you might be interested in understanding the following:
- How the height of basketball players is correlated to their shooting accuracy
- Whether there’s a relationship between employee work experience and salary
- What mathematical dependence exists between the population density and the gross domestic product of different countries
In the examples above, height, shooting accuracy, years of experience, salary, population density, and gross domestic product are the features or variables. The data related to each player, each employee, and each country are the observations.
Example: NumPy Correlation Calculation
NumPy is a widely used third-party Python library for numerical computing. You can use the numpy.corrcoef() function to calculate the correlation matrix, which holds the Pearson correlation coefficient for each pair of variables.
Here’s an example that shows how to calculate the correlation matrix using numpy:
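A minimal sketch, assuming two small made-up arrays named x and y (the values are illustrative and not taken from a real dataset):

```python
import numpy as np

# Illustrative data, made up for this sketch.
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# corrcoef() returns a symmetric 2x2 matrix with ones on the main diagonal.
corr_matrix = np.corrcoef(x, y)
print(corr_matrix)

# The off-diagonal element is the Pearson coefficient between x and y.
r_xy = corr_matrix[0, 1]
print(r_xy)
```

The two off-diagonal entries are equal because the correlation of x with y is the same as the correlation of y with x.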
Example: SciPy Correlation Calculation
SciPy is a powerful library for scientific computing in Python. It provides a function called scipy.stats.pearsonr() that calculates the Pearson correlation coefficient and p-value between two arrays of data.
Here’s an example that shows how to calculate the Pearson correlation coefficient and p-value using scipy:
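A minimal sketch, again assuming made-up arrays x and y. The result is unpacked as a pair, which should work across SciPy versions:

```python
import numpy as np
from scipy import stats

# Illustrative data, made up for this sketch.
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# pearsonr() returns the coefficient and the two-sided p-value.
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.4f}, p-value = {p_value:.4f}")
```

A small p-value suggests that a correlation this strong would be unlikely if the two variables were actually uncorrelated.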
Example: pandas Correlation Calculation
pandas is a powerful library for data manipulation and analysis. It provides a method called pandas.DataFrame.corr() that calculates the pairwise correlation of columns in a DataFrame.
Here’s an example that shows how to calculate the pairwise correlation using pandas:
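A minimal sketch, assuming a small DataFrame with three made-up numeric columns x, y, and z:

```python
import pandas as pd

# Illustrative data, made up for this sketch.
df = pd.DataFrame({
    "x": range(10, 20),
    "y": [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
    "z": [5, 3, 2, 1, 0, -2, -8, -11, -15, -16],
})

# corr() returns a DataFrame labeled with the column names on both axes.
corr = df.corr()
print(corr)

# Individual coefficients can be looked up by label.
print(corr.loc["x", "z"])
```

Because z decreases as x increases, the coefficient for that pair is negative.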
Linear Correlation
Linear correlation measures the strength and direction of the linear relationship between two variables. There are various methods to calculate linear correlation, including the Pearson correlation coefficient, which measures the linear relationship between two continuous variables.
Pearson Correlation Coefficient
The Pearson correlation coefficient, also known as Pearson’s r, is a measure of the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, with 1 indicating a perfect positive linear relationship, -1 indicating a perfect negative linear relationship, and 0 indicating no linear relationship.
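To make the definition concrete, the coefficient can be computed directly as the sum of the products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below uses a hypothetical helper named pearson_r() written in plain Python. It is meant as an illustration of the formula, not as a replacement for the library functions:

```python
import math


def pearson_r(xs, ys):
    """Compute Pearson's r for two equal-length sequences from its definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    # Sums of squared deviations (denominator terms).
    ss_x = sum((a - mean_x) ** 2 for a in xs)
    ss_y = sum((b - mean_y) ** 2 for b in ys)
    # Raises ZeroDivisionError if either sequence is constant,
    # because r is undefined in that case.
    return cov / math.sqrt(ss_x * ss_y)


r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly linear, increasing
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly linear, decreasing
print(r_up, r_down)
```

The two calls illustrate the extremes of the range: a perfectly increasing line and a perfectly decreasing line.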
Linear Regression: SciPy Implementation
Linear regression is a technique that fits a straight line to a set of data points in such a way that the sum of the squared distances between the observed and predicted values is minimized. SciPy provides a function called scipy.stats.linregress() that can be used to calculate the regression line that best fits the given data.
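A minimal sketch of linregress() on made-up data. The attribute names slope, intercept, rvalue, and pvalue belong to the result object that SciPy returns. The arrays and the y_pred variable are assumptions for illustration:

```python
import numpy as np
from scipy import stats

# Illustrative data, made up for this sketch.
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

result = stats.linregress(x, y)
print(f"slope = {result.slope:.4f}")
print(f"intercept = {result.intercept:.4f}")
print(f"r = {result.rvalue:.4f}, p-value = {result.pvalue:.4f}")

# Predicted values on the fitted line y = intercept + slope * x.
y_pred = result.intercept + result.slope * x
```

The rvalue attribute is the Pearson correlation coefficient of x and y, so a regression fit also gives you the linear correlation.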
Pearson Correlation: NumPy and SciPy Implementation
You can use NumPy and SciPy to calculate the Pearson correlation coefficient between two arrays of data. The numpy.corrcoef() function can be used to calculate the correlation matrix, and the scipy.stats.pearsonr() function can be used to calculate the Pearson correlation coefficient and p-value.
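The sketch below shows how the two functions relate when there are more than two features. It assumes a made-up 2D array with one feature per row and uses the rowvar parameter of numpy.corrcoef() to handle the transposed layout:

```python
import numpy as np
from scipy import stats

# Three illustrative features, one per row (made up for this sketch).
data = np.array([
    [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
    [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
    [5, 3, 2, 1, 0, -2, -8, -11, -15, -16],
])

# By default, corrcoef() treats each row as a variable.
corr_rows = np.corrcoef(data)

# With rowvar=False, each column is treated as a variable instead.
corr_cols = np.corrcoef(data.T, rowvar=False)

# pearsonr() on a single pair agrees with the matching matrix entry
# and additionally reports a p-value.
r_01, p_01 = stats.pearsonr(data[0], data[1])
print(np.isclose(corr_rows[0, 1], r_01))
```

In short, corrcoef() gives every pairwise coefficient at once, while pearsonr() handles one pair at a time and adds a significance test.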
Pearson Correlation: pandas Implementation
pandas provides a method called pandas.DataFrame.corr() that calculates the pairwise correlation of columns in a DataFrame. By default, it calculates the Pearson correlation coefficient.
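When you only need the coefficient for one pair, pandas Series objects have a corr() method of their own. A minimal sketch, assuming two made-up Series named x and y:

```python
import pandas as pd

# Illustrative data, made up for this sketch.
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Series.corr() uses the Pearson coefficient by default.
r_xy = x.corr(y)
r_yx = y.corr(x)
print(r_xy, r_yx)
```

Both calls give the same value because the Pearson coefficient is symmetric in its arguments.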
Rank Correlation
Rank correlation is a measure of the strength and direction of the monotonic relationship between two variables. Unlike linear correlation, it doesn’t assume a linear relationship between the variables. Instead, it measures the consistency of the rankings of the variables.
Spearman Correlation Coefficient
The Spearman correlation coefficient, also known as Spearman’s rho, is a measure of the monotonic relationship between two variables. It ranges from -1 to 1, with 1 indicating a perfect positive monotonic relationship, -1 indicating a perfect negative monotonic relationship, and 0 indicating no monotonic relationship.
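The difference between a monotonic and a linear relationship can be illustrated with data that always increases but not along a straight line. The sketch below assumes a made-up cubic relationship:

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = x ** 3  # strictly increasing, but not linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r    = {pearson_r:.4f}")
print(f"Spearman rho = {spearman_rho:.4f}")
```

Spearman's rho reaches 1 here because the ordering of y matches the ordering of x exactly, while Pearson's r stays below 1 because the points don't lie on a line.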
Kendall Correlation Coefficient
The Kendall correlation coefficient, also known as Kendall’s tau, is a measure of the strength and direction of the ordinal relationship between two variables. It ranges from -1 to 1, with 1 indicating a perfect positive ordinal relationship, -1 indicating a perfect negative ordinal relationship, and 0 indicating no ordinal relationship.
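One common way to understand Kendall's tau is through concordant and discordant pairs: a pair of observations is concordant when both variables change in the same direction and discordant when they change in opposite directions. The plain-Python sketch below counts such pairs on made-up data. It assumes there are no ties. With ties, libraries typically apply a correction that this simplified version omits:

```python
from itertools import combinations

# Illustrative data without ties, made up for this sketch.
x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
y = [2, 1, 4, 5, 8, 12, 18, 25, 96, 48]

concordant = 0
discordant = 0
for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
    direction = (x2 - x1) * (y2 - y1)
    if direction > 0:
        concordant += 1   # both variables move the same way
    elif direction < 0:
        discordant += 1   # the variables move in opposite ways

n_pairs = len(x) * (len(x) - 1) // 2
# Simplified tau; valid only when neither variable has ties.
tau = (concordant - discordant) / n_pairs
print(concordant, discordant, tau)
```

Only two of the 45 pairs are out of order in this data, so tau is close to 1.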
Rank: SciPy Implementation
SciPy provides a function called scipy.stats.rankdata() that can be used to calculate the ranks of a set of data points. You don't have to rank the data yourself before computing a rank correlation coefficient, though: the scipy.stats.spearmanr() and scipy.stats.kendalltau() functions accept the raw values and handle the ranking internally.
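A minimal sketch of ranking with rankdata(), using made-up values that include a tie. The second half illustrates the link between ranks and Spearman's rho by applying numpy.corrcoef() to the ranks of two made-up arrays:

```python
import numpy as np
from scipy import stats

values = np.array([10, 2, 8, 2])

# By default, tied values share the average of the ranks they span.
ranks_avg = stats.rankdata(values)
# With method="min", tied values share the lowest of those ranks.
ranks_min = stats.rankdata(values, method="min")
print(ranks_avg, ranks_min)

# Spearman's rho is the Pearson coefficient computed on the ranks.
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
rho_from_ranks = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
print(rho_from_ranks)
```

The method parameter also accepts other tie-handling strategies, such as "max", "dense", and "ordinal".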
Rank Correlation: NumPy and SciPy Implementation
You can use NumPy and SciPy to calculate the rank correlation coefficients between two arrays of data. The scipy.stats.spearmanr() and scipy.stats.kendalltau() functions can be used to calculate the Spearman and Kendall correlation coefficients, respectively.
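A minimal sketch of both functions on made-up arrays. The last part assumes a 2D array with one feature per column, in which case spearmanr() returns matrices of coefficients and p-values instead of single numbers:

```python
import numpy as np
from scipy import stats

# Illustrative data, made up for this sketch.
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
z = np.array([5, 3, 2, 1, 0, -2, -8, -11, -15, -16])

# Each function returns the coefficient and a p-value.
rho, rho_p = stats.spearmanr(x, y)
tau, tau_p = stats.kendalltau(x, y)
print(f"Spearman rho = {rho:.4f} (p = {rho_p:.4g})")
print(f"Kendall tau  = {tau:.4f} (p = {tau_p:.4g})")

# With a 2D array of three or more columns, spearmanr() returns
# a matrix of pairwise coefficients and a matrix of p-values.
data = np.column_stack([x, y, z])
rho_matrix, p_matrix = stats.spearmanr(data)
print(rho_matrix)
```

Kendall's tau is usually somewhat smaller in magnitude than Spearman's rho on the same data, so the two values shouldn't be compared directly.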
Rank Correlation: pandas Implementation
pandas provides a method called pandas.DataFrame.corr() that calculates the pairwise correlation of columns in a DataFrame. By setting the method parameter to "spearman" or "kendall", you can calculate the Spearman or Kendall correlation coefficient, respectively.
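A minimal sketch of the method parameter, assuming a small DataFrame with made-up columns x, y, and z:

```python
import pandas as pd

# Illustrative data, made up for this sketch.
df = pd.DataFrame({
    "x": range(10, 20),
    "y": [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
    "z": [5, 3, 2, 1, 0, -2, -8, -11, -15, -16],
})

# Same call as for Pearson, with a different method argument.
spearman = df.corr(method="spearman")
kendall = df.corr(method="kendall")

print(spearman)
print(kendall)
```

Since z strictly decreases while x strictly increases, both rank coefficients for that pair are -1 even though the relationship isn't perfectly linear.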
Visualization of Correlation
Visualizing the data is essential to gain insights into the correlation between variables. Two common visualizations for correlation analysis are x-y plots with a regression line and heatmaps of correlation matrices.
X-Y Plots With a Regression Line
X-Y plots with a regression line can be used to visualize the linear relationship between two continuous variables. Matplotlib provides matplotlib.pyplot.scatter() for drawing the data points as a scatter plot and matplotlib.pyplot.plot() for drawing the regression line over them.
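A minimal sketch of such a plot on made-up data. It uses the Axes methods scatter() and plot(), which are the object-oriented counterparts of the pyplot functions, and takes the line parameters from scipy.stats.linregress():

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Illustrative data, made up for this sketch.
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Fit the regression line.
fit = stats.linregress(x, y)
line_label = (
    f"y = {fit.intercept:.2f} + {fit.slope:.2f}x, r = {fit.rvalue:.2f}"
)

fig, ax = plt.subplots()
ax.scatter(x, y, label="Data points")                              # observations
ax.plot(x, fit.intercept + fit.slope * x, label=line_label)        # fitted line
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
# Call plt.show() to display the figure in an interactive session.
```

Putting the equation and the coefficient in the legend keeps the numeric result next to the picture it describes.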
Heatmaps of Correlation Matrices
Heatmaps of correlation matrices provide a visual representation of the correlation between multiple variables. Matplotlib provides a function called matplotlib.pyplot.imshow() that can be used to create heatmaps, and the pandas.DataFrame.corr() method can be used to calculate the correlation matrix.
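A minimal sketch of a correlation heatmap, assuming a small DataFrame with made-up columns. The color limits, the colormap name, and the cell annotations are choices made for this illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data, made up for this sketch.
df = pd.DataFrame({
    "x": range(10, 20),
    "y": [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
    "z": [5, 3, 2, 1, 0, -2, -8, -11, -15, -16],
})
corr = df.corr()

fig, ax = plt.subplots()
# Fix the color scale to the full range of a correlation coefficient.
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")

# Label both axes with the column names.
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(list(corr.columns))
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(list(corr.index))

# Write each coefficient into its cell.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax)
# Call plt.show() to display the figure in an interactive session.
```

Pinning vmin and vmax to -1 and 1 keeps colors comparable between heatmaps of different datasets.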
Conclusion
Correlation coefficients are important statistics for understanding the relationships between variables in a dataset. Python provides powerful libraries like NumPy, SciPy, and pandas that make it easy to calculate and visualize correlations. By using these tools, you can gain valuable insights from your data and make informed decisions.