Logistic Regression: Prediction of ATTORNEY on Claimants Dataset

 

Logistic Regression

What is Logistic Regression?

Logistic Regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is a binary or dichotomous variable (only two possible outcomes).

 

How Logistic Regression Works

Logistic regression estimates the probability that a given instance belongs to a particular category. Instead of modeling this probability directly, it models the log-odds, i.e. the logarithm of the odds:

The logistic (sigmoid) function is used to model the probability of a binary outcome. Here is the formula:

p = \frac{1}{1 + e^{-z}}

where z is the log-odds, defined below as a linear combination of the input features.
In the context of a logistic regression model, the log-odds are modeled as a linear combination of the input features. The logistic regression equation can be written as:

\text{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n

where:

  • p is the probability of the positive class.
  • x_1, x_2, \ldots, x_n are the input features.
  • \beta_0, \beta_1, \ldots, \beta_n are the coefficients (weights) of the model.
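To make the equation concrete, here is a minimal numeric sketch in Python. The coefficient values are purely hypothetical (not estimated from the claimants data): the linear combination gives the log-odds, and the sigmoid, the inverse of the logit, turns it back into a probability.

```python
import numpy as np

# Hypothetical coefficients beta_0, beta_1, beta_2 (illustration only,
# not taken from the claimants model)
beta = np.array([-1.5, 0.8, 0.05])

# One instance: a leading 1 pairs with the intercept beta_0, then x1 and x2
x = np.array([1.0, 1.0, 30.0])

# The linear combination of the features is the log-odds (the logit)
log_odds = beta @ x

# The sigmoid (inverse of the logit) maps the log-odds back to a probability
p = 1.0 / (1.0 + np.exp(-log_odds))

print(f"log-odds = {log_odds:.3f}, probability = {p:.3f}")
```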

Odds

The odds of an event are the ratio of the probability that the event occurs to the probability that it does not occur. If p is the probability of the event occurring, then the odds are given by: \text{Odds} = \frac{p}{1 - p}

Log Odds (Logit)

The log odds is the natural logarithm of the odds: \text{Log Odds} = \log\left(\frac{p}{1 - p}\right)

Log odds transform probabilities, which range between 0 and 1, into a continuous scale that ranges from negative infinity to positive infinity. This transformation is useful because it allows logistic regression to use linear combinations of input features to predict the probability of a binary outcome.
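A short sketch of this transformation, using plain NumPy and nothing specific to the claimants data: probabilities near 0 or 1 map to large negative or positive log-odds, and applying the sigmoid inverts the mapping exactly.

```python
import numpy as np

p = np.array([0.01, 0.25, 0.50, 0.75, 0.99])   # probabilities in (0, 1)

odds = p / (1 - p)        # odds: probability of the event vs. probability of no event
log_odds = np.log(odds)   # logit: unbounded, symmetric around p = 0.5

# The sigmoid inverts the logit and recovers the original probabilities
p_back = 1 / (1 + np.exp(-log_odds))

for pi, oi, li, pb in zip(p, odds, log_odds, p_back):
    print(f"p = {pi:.2f}   odds = {oi:7.2f}   log-odds = {li:6.2f}   back = {pb:.2f}")
```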



Uses of Logistic Regression

Binary Classification:

Logistic regression is widely used for binary classification problems where the outcome is dichotomous (e.g., yes/no, true/false, success/failure).

Medical Fields:

 Used to predict the presence or absence of a disease based on patient characteristics.

Marketing:

To predict whether a customer will buy a product or not based on their past behavior and demographic information.

Finance:

To predict credit default risk.

Why Use the Logit Function?

Linear Relationship:

 In logistic regression, we need a linear relationship between the independent variables (predictors) and the transformed dependent variable. The logit function helps achieve this linearity.

Range:

The logit function maps probabilities (which are bounded between 0 and 1) to the entire real number line (−∞ to +∞), making it suitable for regression analysis.

Explanation of Dataset

The dataset contains information about claims and their attributes, which can be used to predict whether an attorney is involved in the claim. A minimal loading-and-fitting sketch follows the field descriptions below.

 

CASENUM: Case number (not relevant for prediction).

ATTORNEY: Binary outcome variable indicating whether an attorney is involved (0 = No, 1 = Yes).

CLMSEX: Claimant's sex (e.g., 0 = Female, 1 = Male).

CLMINSUR: Claimant's insurance status (e.g., 0 = No insurance, 1 = Has insurance).

SEATBELT: Whether the claimant was wearing a seatbelt (e.g., 0 = No, 1 = Yes).

CLMAGE: Age of the claimant.

LOSS: Amount of loss claimed.
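As a rough orientation for the analysis (the Jupyter notebook linked at the end contains the full workflow), here is a minimal fitting sketch. It assumes claimants.csv sits in the working directory with the column names listed above, and it simply drops rows with missing values; the notebook's actual preprocessing may differ.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumes claimants.csv (linked at the end of the post) is in the working directory
df = pd.read_csv("claimants.csv")

# CASENUM is only an identifier; drop it, and drop rows with missing values
df = df.drop(columns=["CASENUM"]).dropna()

X = df[["CLMSEX", "CLMINSUR", "SEATBELT", "CLMAGE", "LOSS"]]
y = df["ATTORNEY"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1 ... beta_n):", model.coef_)
```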


Evaluation Metrics:

Confusion Matrix:

A confusion matrix is a table used to evaluate the performance of a classification model. It compares the predicted values with the actual values to show how well the model is performing. It breaks down the predictions into four categories:


True Positives (TP):

The model predicted the positive class (e.g., "1" or "yes") correctly.

True Negatives (TN):

The model predicted the negative class (e.g., "0" or "no") correctly.

False Positives (FP):

The model predicted the positive class, but the actual class was negative (also known as a "Type I Error").

False Negatives (FN):

The model predicted the negative class, but the actual class was positive (also known as a "Type II Error").

For example, the model's confusion matrix is:

Fig. Confusion Matrix

                 Predicted 0     Predicted 1
    Actual 0     436 (TN)        249 (FP)
    Actual 1     139 (FN)        516 (TP)

Fig. Confusion Matrix Breakdown

Explanation of the Table:

True Negatives (TN) = 436:

The model predicted 0 (no attorney hired), and the actual value was also 0 (no attorney hired).


False Negatives (FN) = 139:

The model predicted 0 (no attorney hired), but the actual value was 1 (attorney hired).


False Positives (FP) = 249:

The model predicted 1 (attorney hired), but the actual value was 0 (no attorney hired).


True Positives (TP) = 516:

The model predicted 1 (attorney hired), and the actual value was also 1 (attorney hired).


Totals:

Total Predicted ATTORNEY = 0:  575 (436 TN + 139 FN)

Total Predicted ATTORNEY = 1:  765 (249 FP + 516 TP)

Total Actual ATTORNEY = 0:   685 (436 TN + 249 FP)

Total Actual ATTORNEY = 1:   655 (139 FN + 516 TP)

Total Data Points: 1340 (Total cases analyzed)

This table helps in interpreting the performance of your model by clearly separating the different types of correct and incorrect predictions.
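Continuing the hypothetical fitting sketch from the dataset section, such a matrix can be produced with scikit-learn's confusion_matrix. The counts in the figure sum to 1340, which suggests the notebook scored the full dataset, so this held-out-split sketch will not reproduce them exactly.

```python
from sklearn.metrics import confusion_matrix

# Predict on the held-out split from the earlier fitting sketch
y_pred = model.predict(X_test)

# scikit-learn lays the matrix out as:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
print(cm)
```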


Key Metrics:

  • Accuracy: Measures how often the model made the correct prediction.

    • \text{Accuracy} = \frac{TN + TP}{TN + FN + FP + TP} = \frac{436 + 516}{436 + 139 + 249 + 516} = \frac{952}{1340} \approx 0.71 (71%)

  • Precision (for ATTORNEY = 1): The proportion of actual attorney-hired cases among those predicted as attorney-hired.

    • \text{Precision} = \frac{TP}{TP + FP} = \frac{516}{516 + 249} \approx 0.674 (67.4%)

  • Recall (Sensitivity for ATTORNEY = 1): The proportion of actual attorney-hired cases correctly identified by the model.

    • \text{Recall} = \frac{TP}{TP + FN} = \frac{516}{516 + 139} \approx 0.788 (78.8%)

  • F1-Score: The harmonic mean of precision and recall.

    • F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \approx 0.726 (72.6%)
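These percentages can be recomputed directly from the four counts in the table, with no model or data needed:

```python
# Recompute the metrics from the four counts reported above
TN, FN, FP, TP = 436, 139, 249, 516

accuracy  = (TN + TP) / (TN + FN + FP + TP)                 # ≈ 0.71
precision = TP / (TP + FP)                                  # ≈ 0.67
recall    = TP / (TP + FN)                                  # ≈ 0.79
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.73

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  f1={f1:.3f}")
```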
Conclusions:
  • The model correctly identified 516 cases where an attorney was hired (true positives) and 436 cases where no attorney was hired (true negatives).

  • It misclassified 249 cases by predicting an attorney was hired when none was (false positives), and missed 139 cases where an attorney was hired but the model predicted none (false negatives).

  • Overall, the model has a balanced performance with an accuracy of 71%, but with room for improvement in reducing false positives and false negatives.

  • Go Through Jupyter Notebook 👇👇👇👇

    Prediction of Attorney Using Logistic Regression 


    Download Claimants Dataset (claimants.csv)
