Logistic Regression in Python
Logistic regression is the most basic algorithm for classification. It extends the idea of linear regression to cases where the dependent variable has only a discrete set of outcomes, also called classes. Depending on the number of possible outcomes, logistic regression models fall into two types: binary logistic regression and multinomial logistic regression.
Instead of modeling Y as a linear function of X directly, we model the probability that Y is equal to class 1, given X.
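Concretely, the model passes a linear combination of the features through the sigmoid function, which squashes any real number into (0, 1) so the result can be read as a probability. A minimal sketch (the coefficients here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1), so the output can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

# P(Y = 1 | X) = sigmoid(b0 + b1*x1 + ... + bk*xk)
beta = np.array([-1.0, 0.5])   # illustrative coefficients: (intercept, slope)
x = np.array([1.0, 2.0])       # [1, x1]: a 1 is prepended for the intercept
p = sigmoid(beta @ x)          # probability that Y equals class 1
print(round(p, 3))             # beta @ x = 0.0, and sigmoid(0) = 0.5
```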
For this tutorial, we are going to use logistic regression to predict whether a patient has a 10-year risk of future coronary heart disease. The dataset can be downloaded here.
We can import all the required libraries and our dataset:
Data preprocessing is an important initial step when doing machine learning projects. seaborn
is a great library that can provide a visualization of missing data. The cmap
here refers to the mapping from data values to color space, and RdPu
is one color scheme option among many available. An alternative method is to simply display the sum of missing values in each data column.
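Both approaches can be sketched on a toy frame (in the tutorial the calls would take the df loaded above):

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also runs headless
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [39, 46, np.nan], "glucose": [77, np.nan, np.nan]})

# Heatmap of the boolean missingness mask; RdPu is just one cmap choice.
sns.heatmap(df.isnull(), cbar=False, cmap="RdPu")
plt.savefig("missing.png")

# The plain-text alternative: count missing values per column.
print(df.isnull().sum())
```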
Now we have this pretty little graph :)
It looks like we do not have much missing data, so we can exclude the rows with missing values.
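Dropping rows with any missing value is one line; a sketch on a toy frame (in the tutorial, df would be the loaded dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [39, 46, np.nan], "glucose": [77.0, np.nan, 70.0]})
df = df.dropna()   # keep only rows with no missing values
print(len(df))     # only the first row is complete in this toy example
```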
Let’s move on and fit a logistic regression model to our data. First we need to filter out the features that really matter.
const         -9.129843
male           0.561446
age            0.065896
cigsPerDay     0.019226
totChol        0.002272
sysBP          0.017534
glucose        0.007280
The results show that const, male, age, cigsPerDay, totChol, sysBP, and glucose are the features that matter!
From the data description, we know that male is a nominal variable. To apply a regression analysis on any dataset, normally we have to first transform categorical features to dummy variables using the get_dummies() function from pandas. Dummy variables assign numerical values to the original categorical levels so that computers can compute on them :) (More about one-hot encoding)
However, for a categorical variable with only two levels there is no need to create dummy variables. As male is binary in our dataset, we can leave it as it is.
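For a multi-level categorical, the encoding looks like this (the column and its string levels here are made up for illustration; a binary 0/1 column like male skips this step entirely):

```python
import pandas as pd

# Hypothetical three-level categorical column, just to show the encoding.
df = pd.DataFrame({"education": ["primary", "secondary", "tertiary", "primary"]})
dummies = pd.get_dummies(df["education"], prefix="edu")
print(dummies)   # one 0/1 column per level: edu_primary, edu_secondary, edu_tertiary
```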
How accurate is our model?
The accuracy turns out to be 85.5%. Good enough ;)
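Accuracy is just the share of correct predictions after thresholding the predicted probabilities at 0.5. A sketch with made-up arrays (in the tutorial these would come from the fitted model's predictions on held-out data):

```python
import numpy as np

probs = np.array([0.9, 0.2, 0.7, 0.4, 0.1])   # predicted P(Y = 1) for 5 patients
y_true = np.array([1, 0, 1, 1, 0])            # actual labels

y_pred = (probs >= 0.5).astype(int)           # threshold at 0.5
accuracy = (y_pred == y_true).mean()
print(accuracy)                               # 4 of 5 correct -> 0.8
```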
References: