URL-Based Phishing Detection with Machine Learning
Phishing attacks exploit user trust by creating deceptive websites that mimic legitimate ones. Traditional blacklist-based detection fails against novel phishing sites. Machine learning offers an alternative: classify URLs as phishing or legitimate based on structural features, without requiring prior knowledge of specific malicious domains.
This analysis compares six approaches—Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, XGBoost, and Neural Networks—on a dataset of 100,077 URLs characterized by 19 features.
Kaggle Notebook: Advanced URL Analysis for Phishing Detection
Dataset and Features
Each URL is represented by 19 numerical features capturing structural characteristics:
Length and Character Counts:
- url_length: Total character count
- n_dots, n_hypens, n_underline, n_slash: Counts of common structural characters
- n_questionmark, n_equal, n_and: Query parameter indicators
- n_at, n_exclamation, n_space, n_tilde, n_comma, n_plus: Special character frequencies
- n_asterisk, n_hastag, n_dollar, n_percent: Rare symbols
Behavioral Indicators:
n_redirection: Number of URL redirects
Target variable phishing is binary: 0 for legitimate, 1 for phishing.
The dataset contains 36.3% phishing URLs (36,346 phishing vs 63,731 legitimate), providing reasonable class balance.
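As a minimal setup sketch (the file name phishing_urls.csv is an assumption; substitute the actual dataset path), the class balance can be verified directly:

import pandas as pd

data = pd.read_csv('phishing_urls.csv')  # assumed file name
X = data.drop(columns=['phishing'])      # 19 structural features
y = data['phishing']                     # 0 = legitimate, 1 = phishing
print(y.value_counts(normalize=True))    # ~0.637 legitimate, ~0.363 phishing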
Feature Analysis
URL Length and Dot Frequency
Phishing URLs tend to be longer and contain more dots than legitimate URLs. Both distributions are right-skewed, so a log transformation makes the separation visible:
import numpy as np
import matplotlib.pyplot as plt
# Log-transformed distributions show clear separation between classes
plt.boxplot([np.log(data.loc[data['phishing'] == 0, 'url_length']),   # legitimate
             np.log(data.loc[data['phishing'] == 1, 'url_length'])],  # phishing
            labels=['Legitimate', 'Phishing'])
Phishing URLs stretch longer, potentially to disguise malicious intent with misleading path components. Extra dots may mimic legitimate subdomains while subtly altering domain structure.
Special Character Patterns
Character frequency analysis reveals distinct patterns:
Hyphens, Underscores, Slashes: Phishing URLs average higher counts. Attackers may use complex structures mimicking legitimate multi-level paths.
Query Parameters (?, =, &): Phishing URLs incorporate more question marks, equals signs, and ampersands. Multiple parameters can hide malicious code or create trustworthy appearances.
Rare Symbols: Asterisks, hashtags, and dollar signs appear almost exclusively in phishing URLs. While infrequent overall, their presence is a strong signal.
Percent Signs: Legitimate URLs use percent encoding more frequently, likely for proper URL encoding of special characters.
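A minimal sketch for reproducing these per-class comparisons, assuming the data DataFrame loaded above:

# Mean count of each special character per class; phishing-heavy
# features (e.g. n_asterisk) stand out in the comparison
char_features = ['n_hypens', 'n_underline', 'n_slash', 'n_questionmark',
                 'n_equal', 'n_and', 'n_asterisk', 'n_hastag',
                 'n_dollar', 'n_percent']
means = data.groupby('phishing')[char_features].mean().T
means.columns = ['legitimate', 'phishing']
print(means.sort_values('phishing', ascending=False))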
Redirections
The distribution of redirections shows minimal difference between classes. This feature may have limited discriminative power unless combined with other indicators.
Model Implementations
Logistic Regression
Baseline linear model with L2 regularization:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(max_iter=1000)  # L2 regularization by default
model.fit(X_train, y_train)
Performance:
- Accuracy: 85.7%
- AUC: 0.93
- True Negatives: 17,593 | False Positives: 1,566
- False Negatives: 1,816 | True Positives: 9,049
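One way the reported metrics can be computed (a sketch; the notebook may organize this differently):

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def evaluate(model, X_test, y_test):
    # Confusion matrix for binary labels unpacks as TN, FP, FN, TP
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
    print(f"AUC: {roc_auc_score(y_test, y_prob):.2f}")
    print(f"TN: {tn} | FP: {fp} | FN: {fn} | TP: {tp}")

evaluate(model, X_test, y_test)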
Coefficient Analysis:
Significant positive predictors (increase phishing probability):
- n_at: coefficient = 4.36 (p < 0.001)
- n_asterisk: coefficient = 2.01 (p < 0.01)
- n_slash: coefficient = 1.03 (p < 0.001)
- n_questionmark: coefficient = 0.90 (p < 0.001)
- url_length: coefficient = 0.07 (p < 0.001)
Significant negative predictors (increase legitimate probability):
- n_plus: coefficient = -1.05 (p < 0.001)
- n_comma: coefficient = -0.90 (p < 0.001)
- n_hypens: coefficient = -0.61 (p < 0.001)
- n_and: coefficient = -0.62 (p < 0.001)
- n_dots: coefficient = -0.47 (p < 0.001)
Surprisingly, hyphens and dots carry negative coefficients despite the earlier distributional observations. This suggests multicollinearity or interaction effects: high counts may be common in both classes, with other correlated features providing the actual discrimination.
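Note that scikit-learn does not report p-values, so significance levels like those above would typically come from a parallel statsmodels fit; a sketch:

import statsmodels.api as sm

# Logit fit on the same training split; summary() lists coefficients,
# standard errors, and p-values for each feature
logit = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=0)
print(logit.summary())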
Decision Tree
Non-linear model capturing feature interactions:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    ccp_alpha=0.0001,      # cost-complexity pruning limits overfitting
    criterion='gini',
    max_depth=32,
    min_samples_split=5,
    random_state=1,
)
model.fit(X_train, y_train)
Performance:
- Train Accuracy: 89.0%
- Test Accuracy: 88.7%
- AUC: 0.95
- True Negatives: 17,593 | False Positives: 1,566
- False Negatives: 1,816 | True Positives: 9,049
Feature Importance: url_length, n_slash, and n_dots rank highest, confirming their discriminative power.
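The ranking can be read off the fitted tree's impurity-based importances, for example:

import pandas as pd

# Impurity-based importances, highest first
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())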
Random Forest
Ensemble of decision trees reducing overfitting:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=80,
    max_depth=18,
    max_features='sqrt',   # random feature subsets decorrelate the trees
    min_samples_split=12,
    criterion='gini',
)
model.fit(X_train, y_train)
Performance:
- Train Accuracy: 90.8%
- Test Accuracy: 89.5%
- AUC: 0.96
- True Negatives: 17,626 | False Positives: 1,533
- False Negatives: 1,628 | True Positives: 9,237
Improves on the single decision tree, with fewer false positives (1,533 vs 1,566) and fewer false negatives (1,628 vs 1,816).
Gradient Boosting
Sequential ensemble building on prior tree errors:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.01,
    max_depth=12,
    subsample=0.8,         # stochastic boosting: each tree sees 80% of rows
    max_features=0.5,
    random_state=42,
)
model.fit(X_train, y_train)
Performance:
- Train Accuracy: 91.1%
- Test Accuracy: 89.5%
- AUC: 0.96
- True Negatives: 17,806 | False Positives: 1,353
- False Negatives: 1,803 | True Positives: 9,062
Lowest false positive count (1,353), critical for minimizing false alarms on legitimate sites.
XGBoost
Optimized gradient boosting with regularization:
import xgboost as xgb

model = xgb.XGBClassifier(
    objective='binary:logistic',
    colsample_bytree=0.8,
    learning_rate=0.008,
    max_depth=10,
    alpha=10,              # L1 regularization (reg_alpha)
    n_estimators=400,
)
model.fit(X_train, y_train)
Performance:
- Train Accuracy: 89.3%
- Test Accuracy: 89.1%
- AUC: 0.96
- True Negatives: 17,677 | False Positives: 1,482
- False Negatives: 1,787 | True Positives: 9,078
Balanced performance with controlled overfitting through regularization.
Neural Network
Deep learning with standardized features:
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Standardize features: gradient-based training is scale-sensitive
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = Sequential([
    Input(shape=(19,)),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=25, validation_split=0.2)
Performance:
- Test Accuracy: 89.3%
- True Negatives: 17,706 | False Positives: 1,453
- False Negatives: 1,679 | True Positives: 9,186
Lowest false negative count (1,679), critical for security applications where a missed phishing site carries high cost.
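Unlike the scikit-learn models, Keras outputs sigmoid probabilities, so class labels require an explicit threshold before building the confusion matrix; a sketch using the conventional 0.5 cutoff:

from sklearn.metrics import confusion_matrix

y_prob = model.predict(X_test).ravel()  # sigmoid outputs in [0, 1]
y_pred = (y_prob > 0.5).astype(int)     # 0.5 cutoff is an assumption
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()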
Comparative Analysis
| Model | Test Acc | AUC | FP | FN | Notes |
|---|---|---|---|---|---|
| Logistic Regression | 85.7% | 0.93 | 1,566 | 1,816 | Baseline, interpretable |
| Decision Tree | 88.7% | 0.95 | 1,566 | 1,816 | Captures non-linearity |
| Random Forest | 89.5% | 0.96 | 1,533 | 1,628 | Reduced overfitting |
| Gradient Boosting | 89.5% | 0.96 | 1,353 | 1,803 | Best FP rate |
| XGBoost | 89.1% | 0.96 | 1,482 | 1,787 | Balanced |
| Neural Network | 89.3% | - | 1,453 | 1,679 | Best FN rate |
Trade-offs (a decision-threshold sketch follows this list):
- Gradient Boosting: Minimizes false positives, preferred when blocking legitimate sites is costly
- Neural Network: Minimizes false negatives, preferred when missing phishing sites is costly
- Random Forest: Best overall balance with competitive performance and faster training
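Beyond model choice, the same trade-off can be tuned by moving the decision threshold of any probabilistic model; a sketch (the threshold values are illustrative):

from sklearn.metrics import confusion_matrix

# Lower thresholds catch more phishing (fewer FN) at the cost of more FP
y_prob = model.predict_proba(X_test)[:, 1]
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(f"threshold={threshold}: FP={fp}, FN={fn}")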
Key Findings
Effective Features: URL length, slash count, at-symbol presence, and asterisk count strongly indicate phishing. Query parameter density (?, =, &) also signals potential attacks.
Model Selection: Ensemble methods (Random Forest, Gradient Boosting, XGBoost) consistently outperform single models. Neural networks achieve lowest false negative rate at the cost of interpretability.
Deployment Considerations: False positive/negative trade-off depends on context. Security-critical applications may prefer neural networks (minimize missed attacks); user-facing systems may prefer gradient boosting (minimize false alarms).
Limitations: URL features alone cannot detect sophisticated phishing using legitimate-looking structures. Hybrid approaches combining URL analysis with content inspection and reputation systems would provide stronger defense.
Conclusion
Machine learning effectively detects phishing URLs through structural feature analysis. Ensemble methods achieve ~90% accuracy with AUC of 0.96. The neural network minimizes false negatives (1,679), while gradient boosting minimizes false positives (1,353).
Feature engineering reveals that attackers create longer URLs with unusual special character distributions. At-symbols, asterisks, and excessive slashes serve as strong phishing indicators.
For production deployment, Random Forest offers the best balance of accuracy, speed, and interpretability. Critical applications should consider neural networks despite reduced interpretability, given their superior ability to detect attacks.
Full implementation: Kaggle Notebook