1. The problem statement
In this kernel, we will predict whether a tumor is “benign” (noncancerous) or “malignant” (cancerous), using information such as its radius and texture. We implement Logistic Regression with Python and Scikit-Learn.
To achieve this, we will train a Logistic Regression model on the Breast Cancer Wisconsin (Diagnostic) Data Set.
So, let’s get down to business.
2. Import dataset
The next step is to import the dataset.
3. Import libraries
- The next step in building the model is to import the necessary libraries.
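The import and loading steps could look roughly like the sketch below. One assumption to flag: instead of the Kaggle CSV, it uses the copy of the same dataset that ships with Scikit-Learn, so the column names (`mean radius` rather than `radius_mean`) and the 0/1 target encoding differ from the Kaggle file. The `diagnosis` column is rebuilt here with Kaggle-style "M"/"B" labels purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import MinMaxScaler

# Assumption: load scikit-learn's bundled copy instead of reading the Kaggle CSV.
data = load_breast_cancer(as_frame=True)
raw_df = data.data.copy()

# scikit-learn encodes the target as 0 = malignant, 1 = benign.
# Map it to Kaggle-style labels so it resembles the `diagnosis` column.
raw_df["diagnosis"] = data.target.map({0: "M", 1: "B"})

print(raw_df.shape)
print(raw_df["diagnosis"].value_counts())
```

With the Kaggle file itself, the loading step would instead be a single `pd.read_csv` call on the downloaded CSV.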
4. Exploratory data analysis
- We have imported the data.
- Now, it’s time to explore the data to gain insights about it.
From the summary information, we can see that there are no missing values in the data frame, except in an unnamed column, which we will drop.
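As a minimal sketch of this check, the snippet below uses a tiny made-up stand-in for `raw_df`. The values and the column name `Unnamed: 32` are assumptions for illustration, not taken from the kernel's output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_df (made-up rows); the last column is entirely empty.
raw_df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, 12.45, 20.57],
    "texture_mean": [10.38, 14.36, 15.70, 17.77],
    "Unnamed: 32": [np.nan] * 4,
})

raw_df.info()                     # dtypes and non-null counts per column
missing = raw_df.isna().sum()     # number of missing values per column
print(missing)

# Drop every column that contains no data at all.
raw_df = raw_df.dropna(axis=1, how="all")
print(raw_df.columns.tolist())
```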
5. Splitting data into separate training, validation and test sets
When building machine learning models, it is important to split the dataset into three parts:
- Training set: used to train the model, i.e., compute the loss and adjust the model’s weights using an optimization technique.
- Validation set: used to evaluate the model during training, tune model hyper-parameters (optimization technique, regularization etc.), and pick the best version of the model. Picking a good validation set is essential for training models that generalize well.
- Test set: used to compare different models or approaches and report the model’s final accuracy. For many datasets, test sets are provided separately. The test set should reflect the kind of data the model will encounter in the real world, as closely as feasible.
As a general rule of thumb, we use around 60% of the data for the training set, 20% for the validation set and 20% for the test set.
To split the data frame into training, validation and test sets, we will define a helper function with the help of the numpy library.
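One way such a helper might look is sketched below. The function name `split_df`, its parameters and the demo data frame are hypothetical; the kernel's actual helper may differ.

```python
import numpy as np
import pandas as pd

def split_df(df, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the rows, then cut them into training, validation and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))          # random order of row positions
    train_end = int(train_frac * len(df))
    val_end = train_end + int(val_frac * len(df))
    train_df = df.iloc[shuffled[:train_end]]
    val_df = df.iloc[shuffled[train_end:val_end]]
    test_df = df.iloc[shuffled[val_end:]]        # whatever is left over
    return train_df, val_df, test_df

# Demo on a made-up data frame with 100 rows.
demo_df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})
train_df, val_df, test_df = split_df(demo_df)
print(len(train_df), len(val_df), len(test_df))
```

Fixing the seed keeps the split reproducible between runs, which makes later accuracy numbers comparable.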
6. Identifying Input and Target Columns
Let’s create a list of input columns, and also identify the target column.
We can now create inputs and targets for the training, validation and test sets for further processing and model training.
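A small sketch of this step follows, using a made-up training frame. The column names follow the Kaggle file's naming as an assumption, and the `id` column is excluded because it carries no predictive information.

```python
import pandas as pd

# Made-up stand-in for the training split.
train_df = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "texture_mean": [10.38, 14.36, 15.70],
})

target_col = "diagnosis"
# Every column except the identifier and the target is treated as an input.
input_cols = [c for c in train_df.columns if c not in ("id", target_col)]

train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()
print(input_cols)
```

The same two lines would be repeated for the validation and test data frames.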
7. Scaling Numeric Features
Another good practice is to scale numeric features to a small range of values e.g. (0,1) or (-1,1). Scaling numeric features ensures that no particular feature has a disproportionate impact on the model’s loss. Optimization algorithms also work better in practice with smaller numbers. The numeric columns in our dataset have varying ranges.
We will use the `MinMaxScaler` from `sklearn.preprocessing` to scale values to the $(0,1)$ range.
First, we `fit` the scaler to the data, i.e., compute the range of values for each numeric column.
We can now inspect the minimum and maximum values in each column.
We can now separately scale the training, validation and test sets using the `transform` method of `scaler`.
Let’s verify that all values in each column lie in the range (0,1).
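The scaling steps can be sketched as follows on synthetic inputs. This version fits the scaler on the training set only, which is an assumption; if the scaler is fitted that way, validation and test values can fall slightly outside the range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

def make_inputs(n):
    # Synthetic columns with very different ranges (made-up values).
    return pd.DataFrame({
        "radius_mean": rng.uniform(7, 28, n),
        "area_mean": rng.uniform(150, 2500, n),
    })

train_inputs, val_inputs, test_inputs = make_inputs(60), make_inputs(20), make_inputs(20)
numeric_cols = train_inputs.columns.tolist()

scaler = MinMaxScaler()                      # default feature_range is (0, 1)
scaler.fit(train_inputs[numeric_cols])       # learn per-column min and max
print(scaler.data_min_, scaler.data_max_)

# Apply the same learned ranges to every split.
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

print(train_inputs.describe().loc[["min", "max"]])
```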
8. Training The Model Using Logistic Regression
To train a logistic regression model, we can use the LogisticRegression class from Scikit-learn.
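We can train the model using its `fit` method.
Once trained, the model stores one weight per input column in `model.coef_`, plus an intercept in `model.intercept_`. Each weight is applied to the value in a specific column of the input. The larger the magnitude of the weight, the greater the impact of the column on the prediction.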
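Here is a hedged sketch of training and inspecting the weights. It assumes Scikit-Learn's bundled copy of the dataset (where the target is 1 for benign and 0 for malignant) and the `liblinear` solver; for brevity it fits on all rows rather than on a training split.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Assumption: bundled copy of the dataset; target 1 = benign, 0 = malignant.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

model = LogisticRegression(solver="liblinear")
model.fit(X_scaled, y)

# One weight per input column; coef_ has shape (1, n_features) for two classes.
weights = pd.Series(model.coef_[0], index=X.columns)

# Rank columns by the magnitude of their weight.
order = weights.abs().sort_values(ascending=False).index
print(weights.reindex(order).head())
print(model.intercept_)
```

Because the inputs are scaled to a common range, comparing weight magnitudes across columns is reasonably fair; with unscaled inputs the comparison would be misleading.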
9. Making Predictions and Evaluating The Model
We can now use the trained model to make predictions on the training, validation and test sets.
We can output a probabilistic prediction using `predict_proba`.
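A minimal sketch of both kinds of prediction is shown below. It assumes the bundled Scikit-Learn copy of the dataset relabelled as "M"/"B", and uses `train_test_split` as a shortcut in place of the custom split helper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in data: bundled copy, relabelled to Kaggle-style "M"/"B".
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = y.map({0: "M", 1: "B"})
train_inputs, test_inputs, train_targets, test_targets = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler().fit(train_inputs)
train_inputs = scaler.transform(train_inputs)
test_inputs = scaler.transform(test_inputs)

model = LogisticRegression(solver="liblinear").fit(train_inputs, train_targets)

train_preds = model.predict(train_inputs)        # hard class labels
train_probs = model.predict_proba(train_inputs)  # one probability per class

print(model.classes_)     # column order of the probabilities
print(train_preds[:5])
print(train_probs[:5])
```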
The numbers above are the predicted probabilities for the two target classes, benign and malignant, in the order given by `model.classes_`.
We can test the accuracy of the model’s predictions by computing the percentage of matching values in `train_preds` and `train_targets`. This can be done using the `accuracy_score` function from `sklearn.metrics`.
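To make the "percentage of matching values" concrete, here is a tiny example with made-up labels that compares a manual computation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up targets and predictions; exactly one of the five disagrees.
train_targets = np.array(["B", "M", "B", "B", "M"])
train_preds = np.array(["B", "M", "M", "B", "M"])

manual_accuracy = (train_preds == train_targets).mean()   # fraction of matches
library_accuracy = accuracy_score(train_targets, train_preds)
print(manual_accuracy, library_accuracy)
```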
The model achieves an accuracy of 97% on the training set. We can visualize the breakdown of correctly and incorrectly classified inputs using a confusion matrix.
Let’s define a helper function to generate predictions, compute the accuracy score and plot a confusion matrix for a given set of inputs.
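A rough version of such a helper is sketched below. The name `predict_and_report` is hypothetical, the data comes from Scikit-Learn's bundled copy of the dataset, and the confusion matrix is printed rather than plotted so the snippet runs without a display.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = y.map({0: "M", 1: "B"})

# Two successive splits give roughly 60% / 20% / 20%.
rest_X, test_X, rest_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(rest_X, rest_y, test_size=0.25, random_state=42)

scaler = MinMaxScaler().fit(train_X)
model = LogisticRegression(solver="liblinear").fit(scaler.transform(train_X), train_y)

def predict_and_report(inputs, targets, name=""):
    """Predict, then report accuracy and a row-normalized confusion matrix."""
    preds = model.predict(scaler.transform(inputs))
    accuracy = accuracy_score(targets, preds)
    # Rows are true classes, columns are predicted classes, in model.classes_ order.
    cm = confusion_matrix(targets, preds, labels=model.classes_, normalize="true")
    print(f"{name} accuracy: {accuracy:.2%}")
    print(cm)
    return preds, accuracy, cm

val_preds, val_acc, val_cm = predict_and_report(val_X, val_y, "Validation")
test_preds, test_acc, test_cm = predict_and_report(test_X, test_y, "Test")
```

The returned matrix can be passed to a heatmap plot if a visual breakdown is preferred.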
Let’s compute the model’s accuracy on the validation and test sets too.
For the test set:
The model’s accuracy on both the validation and test sets is above 96%, which suggests that it generalizes well to data it hasn’t seen before.
But how good is 96% accuracy? A good way to verify whether a model has actually learned something useful is to compare its results to a “random” or “dumb” model.
We will create two models: one that guesses randomly and another that always predicts “benign”. Both of these models completely ignore the inputs given to them.
Let’s check the accuracies of these two models on the test set.
Our random model achieves an accuracy of 51% and our “always benign” model achieves an accuracy of 60%. Thankfully, our model is better than a “dumb” or “random” model!
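The two baseline models described above can be sketched as follows. The function names and the synthetic test targets (60 benign, 40 malignant) are made up for illustration, so the printed numbers will not exactly match the ones reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def random_guess(inputs, rng):
    # Ignores the inputs and picks a label uniformly at random.
    return rng.choice(["B", "M"], size=len(inputs))

def all_benign(inputs):
    # Ignores the inputs and always answers "B" (benign).
    return np.full(len(inputs), "B")

# Synthetic stand-in for the test set.
test_inputs = np.zeros((100, 3))
test_targets = np.array(["B"] * 60 + ["M"] * 40)

rng = np.random.default_rng(42)
random_acc = accuracy_score(test_targets, random_guess(test_inputs, rng))
benign_acc = accuracy_score(test_targets, all_benign(test_inputs))
print(random_acc, benign_acc)
```

The constant baseline's accuracy is simply the share of the majority class, which is why it is a useful floor for any real model.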
10. Making Predictions on a Single Input
Since our model has been trained to a satisfactory accuracy, it can be used to make predictions on new data.
Suppose the new input arrives as a Python dictionary that maps each column name to a value. We convert the dictionary into a Pandas dataframe, similar to `raw_df`, by passing a list containing the dictionary to the `pd.DataFrame` constructor.
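For example, with a hypothetical input that lists only three measurements (a real input would need every column used in training, and the names here assume the Kaggle file's naming):

```python
import pandas as pd

# Hypothetical measurements for a single tumor (made-up values).
new_input = {"radius_mean": 14.2, "texture_mean": 19.5, "perimeter_mean": 92.0}

# Wrapping the dictionary in a list produces a data frame with exactly one row.
new_input_df = pd.DataFrame([new_input])
print(new_input_df)
```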
We’ve now created a Pandas dataframe with the same columns as `raw_df` (except `diagnosis`, which needs to be predicted). The dataframe contains just one row of data, containing the given input. We must now apply the same transformations applied while training the model:
- Imputation of missing values using an `imputer`, if need be
- Scaling numerical features using the `scaler` created earlier
- Encoding categorical features using an `encoder`, if need be
We can now make a prediction using `model.predict`.
Our model predicts that this input is cancerous (malignant)! We can also check the probability of the prediction.
It looks like our model is quite confident about its prediction, reporting a probability of 99%.
Let’s define a helper function to make predictions for individual inputs.
We can now use this function to make predictions for individual inputs.
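A possible shape for this helper is sketched below. The name `predict_input` is hypothetical, the data is Scikit-Learn's bundled copy of the dataset, and for brevity the scaler and model are fitted on all rows rather than on the training split.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = y.map({0: "M", 1: "B"})
numeric_cols = X.columns.tolist()

scaler = MinMaxScaler().fit(X[numeric_cols])
scaled_X = pd.DataFrame(scaler.transform(X[numeric_cols]), columns=numeric_cols)
model = LogisticRegression(solver="liblinear").fit(scaled_X, y)

def predict_input(single_input):
    """Return the predicted label and its probability for one input dictionary."""
    input_df = pd.DataFrame([single_input])
    # Apply the same scaling that was used for training.
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    pred = model.predict(input_df[numeric_cols])[0]
    probs = model.predict_proba(input_df[numeric_cols])[0]
    prob = probs[list(model.classes_).index(pred)]   # probability of the predicted class
    return pred, prob

# Reuse the first row of the dataset as an example "new" input.
new_input = X.iloc[0].to_dict()
pred, prob = predict_input(new_input)
print(pred, round(float(prob), 3))
```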
11. Results and Conclusion
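- The logistic regression model’s accuracy score on the training set is 0.97067, and its accuracy on the validation and test sets is above 96%. So, the model does a very good job of predicting whether a tumor is “benign” (noncancerous) or “malignant” (cancerous).
- The model shows no signs of overfitting, since its accuracy on unseen data is close to its training accuracy.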
The work done in this project is inspired by the following:
- Machine Learning with Python: Zero to GBMs — https://jovian.ai/learn/machine-learning-with-python-zero-to-gbms
- The dataset was obtained from Kaggle —
Thank you for reading this kernel. I hope you enjoyed it.
Your comments and feedback are most welcome.