Telegram channel datasciencefun - Data Science & Machine Learning: Unsorted

Data Science & Machine Learning

19 January 2026 06:11

✅ Data Science Project Series: Part 1 - Loan Prediction.

Project goal
Predict loan approval using applicant data.

Business value
- Faster decisions
- Lower default risk
- Clear interview story

Dataset
Use the common Loan Prediction dataset from analytics practice platforms.

Target
Loan_Status
Y approved
N rejected
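A quick way to see how the two target classes are balanced is sketched below. The `status` series is a made-up stand-in for the real `Loan_Status` column, so the shares it prints are illustrative only.

```python
import pandas as pd

# Made-up stand-in for df['Loan_Status'] (assumption, not real data)
status = pd.Series(
    ["Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "Y"], name="Loan_Status"
)

# Share of each class; a lopsided split is a hint to track recall, not only accuracy
share = status.value_counts(normalize=True)
print(share)
```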

Tech stack
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn

Step 1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2. Load data

df = pd.read_csv("loan_prediction.csv")
df.head()

Step 3. Basic checks

df.shape
df.info()
df.isnull().sum()

Step 4. Data cleaning

Fill missing values

# Assign the result back: fillna(inplace=True) on a selected column does not reliably update df in recent pandas
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])
categorical_cols = ['Gender','Married','Dependents','Self_Employed']
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

Step 5. Exploratory Data Analysis

Credit history vs approval

sns.countplot(x='Credit_History', hue='Loan_Status', data=df)
plt.show()

Income distribution

sns.histplot(df['ApplicantIncome'], kde=True)
plt.show()

Insight
Applicants with Credit_History = 1 are approved far more often than applicants with Credit_History = 0.
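The plot shows the pattern visually; the same comparison can be put into numbers. This is a minimal sketch on a made-up `toy` frame, so the rates are illustrative and not results from the real dataset.

```python
import pandas as pd

# Toy stand-in for the loan data (values are made up for illustration)
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N", "Y", "Y"],
})

# Share of approved loans within each Credit_History group
approval_rate = (
    toy["Loan_Status"].eq("Y")
    .groupby(toy["Credit_History"])
    .mean()
)
print(approval_rate)
```

On the real data, the same lines with `df` in place of `toy` give the approval rate per credit history group.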

Step 6. Feature engineering
Create total income.

df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']

# Log transform loan amount
df['LoanAmount_log'] = np.log(df['LoanAmount'])
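Why the log transform helps: loan amounts tend to have a long right tail, and the log pulls large values closer to the rest. The sketch below uses made-up amounts (an assumption, not the real column) and compares skewness before and after.

```python
import numpy as np
import pandas as pd

# Made-up loan amounts with one large outlier (illustrative only)
amounts = pd.Series([50, 80, 100, 120, 130, 150, 200, 700], name="LoanAmount")

# Natural log compresses large values; all amounts must be positive
log_amounts = np.log(amounts)

print("skew before:", amounts.skew())
print("skew after: ", log_amounts.skew())
```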

Step 7. Encode categorical variables

le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])
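LabelEncoder assigns integer codes in sorted label order, so for Loan_Status, N becomes 0 and Y becomes 1. The loop above reuses one encoder, which means only the last column's mapping survives. A possible variant, shown on a made-up `toy` frame with a hypothetical `encoders` dict, keeps one encoder per column so the codes can be looked up later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two categorical columns (made-up values)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Keep a separate fitted encoder per column (hypothetical helper dict)
encoders = {}
for col in ["Gender", "Loan_Status"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Recover the label-to-code mapping for the target
mapping = {
    str(label): code
    for code, label in enumerate(encoders["Loan_Status"].classes_)
}
print(mapping)
```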

Step 8. Split features and target

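# Loan_ID, if present, is a row identifier rather than a predictor, so drop it along with the target
X = df.drop(columns=['Loan_Status', 'Loan_ID'], errors='ignore')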
y = df['Loan_Status']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
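One optional refinement, not part of the original steps, is to pass `stratify` so the train and test sets keep a similar share of approved and rejected loans. A minimal sketch on made-up data (`X_toy` and `y_toy` are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up target: 14 approved (1) and 6 rejected (0)
y_toy = pd.Series([1] * 14 + [0] * 6)
X_toy = pd.DataFrame({"feature": range(20)})

# stratify=y_toy keeps the class ratio similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("approved share in train:", y_tr.mean())
print("approved share in test: ", y_te.mean())
```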

Step 9. Build model
Logistic Regression.

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Step 10. Predictions

y_pred = model.predict(X_test)

Step 11. Evaluation

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(confusion_matrix(y_test, y_pred))

Classification report

print(classification_report(y_test, y_pred))

Typical result
- Accuracy around 80 percent
- Strong precision for approved loans
- Recall for rejected loans is the weak spot and needs attention
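To see how accuracy can look fine while rejected loans are missed, here is a small worked example. The labels are made up (1 = approved, 0 = rejected) and are not output from the model above.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 6 approved and 4 rejected applicants
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# The predictions approve two applicants who were actually rejected
y_hat = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, ordered [0, 1]
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

# Recall for the rejected class: share of true rejections that were caught
recall_rejected = recall_score(y_true, y_hat, pos_label=0)

print(cm)
print("accuracy:", accuracy)
print("recall for rejected:", recall_rejected)
```

Here accuracy is 80 percent, yet only half of the rejected applicants are identified.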

Step 12. Model improvement ideas
- Use Random Forest
- Tune hyperparameters
- Handle class imbalance
- Track recall for rejected cases
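The ideas above can be combined in one rough sketch: a Random Forest with balanced class weights, a small grid search, and recall for the rejected class as the tuning metric. The data comes from `make_classification`, and the grid values and the `rejected_recall` scorer name are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data: about 30% class 0 (rejected), 70% class 1 (approved)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# Score candidates by recall on the rejected class (label 0)
rejected_recall = make_scorer(recall_score, pos_label=0)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    scoring=rejected_recall,
    cv=3,
)
search.fit(X_tr, y_tr)

# Recall for rejected cases on the held-out split
rec = recall_score(y_te, search.predict(X_te), pos_label=0)
print("best params:", search.best_params_)
print("recall for rejected:", rec)
```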

Resume bullet example
- Built loan approval prediction model using Logistic Regression
- Achieved ~80 percent accuracy
- Identified credit history as top approval driver

Interview explanation flow
- Start with bank risk problem
- Explain feature impact
- Justify Logistic Regression
- Discuss recall vs accuracy
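For the "explain feature impact" step, one common approach is to read the Logistic Regression coefficients after scaling the features. The sketch below uses synthetic data in which the target is driven mostly by Credit_History by construction, so it shows the mechanics only and is not evidence about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features (made-up distributions)
toy = pd.DataFrame({
    "Credit_History": rng.integers(0, 2, n),
    "TotalIncome": rng.normal(5000, 1500, n),
})

# Synthetic target that depends on Credit_History only (assumption for the demo)
logits = 3.0 * toy["Credit_History"] - 1.0 + rng.normal(0, 0.5, n)
target = (logits > 0).astype(int)

# Scale features so coefficient sizes are comparable across columns
X_scaled = StandardScaler().fit_transform(toy)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, target)

# Largest absolute coefficient first
coef = pd.Series(clf.coef_[0], index=toy.columns).sort_values(
    key=abs, ascending=False
)
print(coef)
```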
