URL-Based Phishing Detection with Machine Learning

Phishing attacks exploit user trust by creating deceptive websites that mimic legitimate ones. Traditional blacklist-based detection fails against novel phishing sites. Machine learning offers an alternative: classify URLs as phishing or legitimate based on structural features, without requiring prior knowledge of specific malicious domains.

This analysis compares six approaches—Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, XGBoost, and Neural Networks—on a dataset of 100,077 URLs characterized by 19 features.

Kaggle Notebook: Advanced URL Analysis for Phishing Detection

Dataset and Features

Each URL is represented by 19 numerical features capturing structural characteristics:

Length and Character Counts: overall URL length together with per-character counts, including dots, hyphens, underscores, slashes, question marks, equals signs, ampersands, at-symbols, asterisks, hash symbols, dollar signs, and percent signs.

Behavioral Indicators: indicators such as the number of redirections in the URL.

Target variable phishing is binary: 0 for legitimate, 1 for phishing.

The dataset contains 36.3% phishing URLs (36,346 phishing vs 63,731 legitimate), providing reasonable class balance.
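
This split can be verified directly once the data is loaded (a minimal sketch; the file name is a placeholder for the actual Kaggle dataset):

import pandas as pd

data = pd.read_csv("dataset_phishing.csv")      # placeholder file name
counts = data['phishing'].value_counts()
print(counts)                                   # 0 = legitimate, 1 = phishing
print(f"phishing share: {counts[1] / counts.sum():.1%}")   # ~36.3%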

Feature Analysis

URL Length and Dot Frequency

Phishing URLs exhibit longer lengths and higher dot counts than legitimate URLs. Both distributions are heavily right-skewed, so a log transformation is applied before comparing them:

# Side-by-side boxplots of log-transformed URL lengths show the class separation
plt.boxplot([np.log(data.loc[data['phishing'] == 0, 'url_length']),   # legitimate
             np.log(data.loc[data['phishing'] == 1, 'url_length'])],  # phishing
            labels=['Legitimate', 'Phishing'])
plt.show()

Phishing URLs stretch longer, potentially to disguise malicious intent with misleading path components. Extra dots may mimic legitimate subdomains while subtly altering domain structure.

Special Character Patterns

Character frequency analysis reveals distinct patterns:

Hyphens, Underscores, Slashes: Phishing URLs average higher counts. Attackers may use complex structures mimicking legitimate multi-level paths.

Query Parameters (?, =, &): Phishing URLs incorporate more question marks, equals signs, and ampersands. Multiple parameters can hide malicious code or create trustworthy appearances.

Rare Symbols: Asterisks, hash symbols, and dollar signs appear almost exclusively in phishing URLs. While infrequent overall, their presence is a strong signal.

Percent Signs: Legitimate URLs use percent encoding more frequently, likely for proper URL encoding of special characters.
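
These per-class differences can be checked with a grouped mean. The column names below follow an n_<character> naming pattern and are assumptions about the notebook's feature names, not confirmed identifiers:

# Assumed feature-column names for the special-character counts
char_features = ['n_hyphens', 'n_underline', 'n_slash', 'n_questionmark',
                 'n_equal', 'n_and', 'n_asterisk', 'n_hastag', 'n_percent']
# After the transpose, column 0 = legitimate and column 1 = phishing;
# phishing means should be higher for every character except the percent sign.
print(data.groupby('phishing')[char_features].mean().T)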

Redirections

The distribution of redirections shows minimal difference between classes. This feature may have limited discriminative power unless combined with other indicators.

Model Implementations

Logistic Regression

Baseline linear model with L2 regularization:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(max_iter=1000)  # default penalty is L2
model.fit(X_train, y_train)
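
The reported metrics can be reproduced with standard scikit-learn utilities (a minimal evaluation sketch, reused for the models below):

from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, y_pred))
print("AUC:     ", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))          # [[TN, FP], [FN, TP]]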

Performance: 85.7% test accuracy, AUC 0.93 (confusion-matrix counts are summarized in the comparative table below).

Coefficient Analysis:

Significant positive predictors (increase phishing probability):

Significant negative predictors (increase legitimate probability):

Surprisingly, hyphen and dot counts receive negative coefficients despite the earlier distributional observations. This suggests multicollinearity or interaction effects: high counts occur in both classes, and once the other features are included they carry the discriminative signal.
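
The ranked coefficients behind this analysis can be pulled from the fitted model (a minimal sketch, assuming X is a pandas DataFrame so feature names are available):

import pandas as pd

# Positive coefficients push predictions toward the phishing class,
# negative ones toward the legitimate class.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)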

Decision Tree

Non-linear model capturing feature interactions:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    ccp_alpha=0.0001,        # cost-complexity pruning to limit tree growth
    criterion='gini',
    max_depth=32,
    min_samples_split=5,
    random_state=1
)
model.fit(X_train, y_train)

Performance: 88.7% test accuracy, AUC 0.95 (1,566 false positives, 1,816 false negatives).

Feature Importance: url_length, n_slash, and n_dots rank highest, confirming their discriminative power.
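
These rankings come directly from the fitted tree's impurity-based importances (a sketch, again assuming X is a DataFrame):

import pandas as pd

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())  # url_length, n_slash, n_dots expected near the top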

Random Forest

Ensemble of decision trees reducing overfitting:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=80,
    max_depth=18,
    max_features='sqrt',     # random feature subset at each split
    min_samples_split=12,
    criterion='gini'
)
model.fit(X_train, y_train)

Performance: 89.5% test accuracy, AUC 0.96 (1,533 false positives, 1,628 false negatives).

Improved over the single decision tree, with fewer false positives (1,533 vs 1,566) and fewer false negatives (1,628 vs 1,816).
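
One way to confirm the reduced overfitting is to compare train and test accuracy; a large gap would indicate memorization (a sketch):

from sklearn.metrics import accuracy_score

# The gap between these two numbers should be smaller for the forest
# than for the single, deeper decision tree.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))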

Gradient Boosting

Sequential ensemble building on prior tree errors:

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.01,
    max_depth=12,
    subsample=0.8,           # stochastic boosting on 80% of rows per tree
    max_features=0.5,
    random_state=42
)
model.fit(X_train, y_train)

Performance: 89.5% test accuracy, AUC 0.96 (1,353 false positives, 1,803 false negatives).

Fewest false positives of any model (1,353), which matters for minimizing false alarms on legitimate sites.
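
Because boosting is sequential, scikit-learn can report the test AUC after each added tree, which helps confirm that 200 estimators at this learning rate are not overshooting (a sketch using staged predictions):

from sklearn.metrics import roc_auc_score

# AUC on the test set after each boosting stage
staged_auc = [roc_auc_score(y_test, proba[:, 1])
              for proba in model.staged_predict_proba(X_test)]
best_stage = staged_auc.index(max(staged_auc)) + 1
print(f"best AUC {max(staged_auc):.4f} at stage {best_stage} of {len(staged_auc)}")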

XGBoost

Optimized gradient boosting with regularization:

import xgboost as xgb

model = xgb.XGBClassifier(
    objective='binary:logistic',
    colsample_bytree=0.8,
    learning_rate=0.008,
    max_depth=10,
    alpha=10,                # L1 regularization on leaf weights
    n_estimators=400
)
model.fit(X_train, y_train)

Performance: 89.1% test accuracy, AUC 0.96 (1,482 false positives, 1,787 false negatives).

Balanced performance with controlled overfitting through regularization.

Neural Network

Deep learning with standardized features:

from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)    # fit on the training set only to avoid leakage

model = Sequential([
    Input(shape=(19,)),              # one input per URL feature
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')   # outputs phishing probability
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=25, validation_split=0.2)
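
The sigmoid output is a probability, so class labels for a confusion matrix come from thresholding at 0.5 (a sketch):

from sklearn.metrics import confusion_matrix

y_prob = model.predict(X_test)                  # probabilities from the sigmoid output
y_pred = (y_prob > 0.5).astype(int).ravel()     # threshold at 0.5 for hard labels
print(confusion_matrix(y_test, y_pred))         # [[TN, FP], [FN, TP]]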

Performance: 89.3% test accuracy (1,453 false positives, 1,679 false negatives); AUC was not reported.

A low false negative count (1,679), second only to Random Forest (1,628); false negatives are especially costly in security applications where a missed phishing site can compromise users.

Comparative Analysis

Model                 Test Acc   AUC    FP      FN      Notes
Logistic Regression   85.7%      0.93   1,566   1,816   Baseline, interpretable
Decision Tree         88.7%      0.95   1,566   1,816   Captures non-linearity
Random Forest         89.5%      0.96   1,533   1,628   Reduced overfitting, lowest FN count
Gradient Boosting     89.5%      0.96   1,353   1,803   Lowest FP count
XGBoost               89.1%      0.96   1,482   1,787   Balanced
Neural Network        89.3%      -      1,453   1,679   Low FN count

Trade-offs: Gradient Boosting yields the fewest false positives, Random Forest the fewest false negatives, and the Neural Network sits close behind on both; Logistic Regression trades several points of accuracy for interpretability.

Key Findings

Effective Features: URL length, slash count, at-symbol presence, and asterisk count strongly indicate phishing. Query parameter density (?, =, &) also signals potential attacks.

Model Selection: Ensemble methods (Random Forest, Gradient Boosting, XGBoost) consistently outperform single models. The neural network also keeps false negatives low (1,679, behind only Random Forest's 1,628), at the cost of interpretability.

Deployment Considerations: The false positive/negative trade-off depends on context. Security-critical applications should minimize missed attacks (Random Forest and the neural network have the fewest false negatives); user-facing systems may prefer gradient boosting (fewest false alarms).
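
The same trade-off can also be tuned without switching models by moving the decision threshold. The sketch below uses a hypothetical threshold of 0.3 to favour catching more phishing URLs at the cost of extra false alarms (any of the fitted scikit-learn classifiers above would work here):

from sklearn.metrics import confusion_matrix

threshold = 0.3                                 # hypothetical value; tune on a validation set
y_prob = model.predict_proba(X_test)[:, 1]      # phishing probability from a fitted classifier
y_pred = (y_prob >= threshold).astype(int)      # lower threshold -> fewer FN, more FP
print(confusion_matrix(y_test, y_pred))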

Limitations: URL features alone cannot detect sophisticated phishing using legitimate-looking structures. Hybrid approaches combining URL analysis with content inspection and reputation systems would provide stronger defense.

Conclusion

Machine learning effectively detects phishing URLs through structural feature analysis. Ensemble methods achieve ~90% accuracy with an AUC of 0.96. Gradient boosting produces the fewest false positives (1,353), while Random Forest (1,628) and the neural network (1,679) produce the fewest false negatives.

Feature engineering reveals that attackers create longer URLs with unusual special character distributions. At-symbols, asterisks, and excessive slashes serve as strong phishing indicators.

For production deployment, Random Forest offers the best balance of accuracy, speed, and interpretability while also recording the fewest missed phishing URLs. Neural networks remain a strong alternative for critical applications despite their reduced interpretability.

Full implementation: Kaggle Notebook