Leads Classification Project¶
Author: Robert Long
This is a quick project I completed as part of the MIT Professional Certificate in Data Science and Machine Learning. The task involved classifying marketing leads using a dataset of user demographics and engagement behaviors.
The goal was to explore the data, perform minimal preprocessing, and train a simple classification model to predict lead conversion outcomes. This project was completed in approximately one hour and demonstrates a fast, effective baseline analysis using Python and pandas.
As part of the project requirements, we focused on tree-based models—specifically Decision Trees and Random Forests—for their interpretability and strong baseline performance on structured data.
While the modeling is basic, the project illustrates how even a rapid analysis can yield actionable insights—and set the foundation for more robust machine learning workflows.
TL;DR¶
This is a quick classification project completed as part of the MIT Professional Certificate in Data Science and Machine Learning. Using decision trees and random forests, I explored the predictive patterns behind lead conversion for a marketing dataset.
Key takeaways:
- Time spent on website and first interaction via mobile app were the strongest predictors of conversion.
- A mild class imbalance (30% conversion) informs both modeling choices and metric selection.
- The random forest performed slightly better, but the decision tree offers faster, explainable results.
- Business recommendations include prioritizing mobile UX, encouraging profile completion, and using behavioral signals for smarter follow-up.
This project took ~1 hour and demonstrates fast, structured analysis, model selection with trade-off reasoning, and actionable business insight generation.
Load the Dataset¶
We begin by loading the dataset into a pandas DataFrame. Previewing the first few observations helps us get an initial sense of the structure and content.
The info() method reports each column's non-null count, which tells us how many values pandas treats as missing. Note, however, that this check only captures true missing values (e.g., NaN); it does not detect placeholder strings like 'missing', 'NA', or empty strings that may represent missing data in a non-standard way. As such, it can be misleading if such values exist. In this case, every value is recognized as valid by pandas.
import pandas as pd
# Load the dataset
file_path = 'ExtraaLearn.csv'
df = pd.read_csv(file_path)
# Display basic info
df.info(), df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID                     4612 non-null   object
 1   age                    4612 non-null   int64
 2   current_occupation     4612 non-null   object
 3   first_interaction      4612 non-null   object
 4   profile_completed      4612 non-null   object
 5   website_visits         4612 non-null   int64
 6   time_spent_on_website  4612 non-null   int64
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object
 9   print_media_type1      4612 non-null   object
 10  print_media_type2      4612 non-null   object
 11  digital_media          4612 non-null   object
 12  educational_channels   4612 non-null   object
 13  referral               4612 non-null   object
 14  status                 4612 non-null   int64
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB
(None,
ID age current_occupation first_interaction profile_completed \
0 EXT001 57 Unemployed Website High
1 EXT002 56 Professional Mobile App Medium
2 EXT003 52 Professional Website Medium
3 EXT004 53 Unemployed Website High
4 EXT005 23 Student Website High
website_visits time_spent_on_website page_views_per_visit \
0 7 1639 1.861
1 2 83 0.320
2 3 330 0.074
3 4 464 2.057
4 4 600 16.914
last_activity print_media_type1 print_media_type2 digital_media \
0 Website Activity Yes No Yes
1 Website Activity No No No
2 Website Activity No No Yes
3 Website Activity No No No
4 Email Activity No No No
educational_channels referral status
0 No No 1
1 Yes No 0
2 No No 0
3 No No 1
4 No No 0 )
Check for Hidden Missing Values¶
Since I don’t fully trust the non-null counts reported by info(), it’s good practice to manually inspect the data for hidden missing values—especially within string columns.
After checking for common placeholders like 'missing', 'NA', and empty strings, I didn’t find any hidden missing values. It appears that all values are properly formatted and recognized by pandas.
# Select string (object) columns
string_cols = df.select_dtypes(include='object').columns
# Check unique values per string column
unique_values = {col: df[col].unique() for col in string_cols}
# Display counts of unique values (including possible hidden missing indicators)
unique_counts = {col: df[col].value_counts(dropna=False) for col in string_cols}
unique_values
{'ID': array(['EXT001', 'EXT002', 'EXT003', ..., 'EXT4610', 'EXT4611', 'EXT4612'],
dtype=object),
'current_occupation': array(['Unemployed', 'Professional', 'Student'], dtype=object),
'first_interaction': array(['Website', 'Mobile App'], dtype=object),
'profile_completed': array(['High', 'Medium', 'Low'], dtype=object),
'last_activity': array(['Website Activity', 'Email Activity', 'Phone Activity'],
dtype=object),
'print_media_type1': array(['Yes', 'No'], dtype=object),
'print_media_type2': array(['No', 'Yes'], dtype=object),
'digital_media': array(['Yes', 'No'], dtype=object),
'educational_channels': array(['No', 'Yes'], dtype=object),
'referral': array(['No', 'Yes'], dtype=object)}
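For a more direct check than eyeballing the unique values, the short snippet below (a minimal sketch, assuming the placeholder tokens mentioned above) counts any suspicious strings that pandas would not flag as missing.
# Placeholder tokens that pandas does not treat as missing values
placeholders = {'missing', 'na', 'n/a', 'none', 'null', ''}
# Count occurrences of those tokens in each string column (case-insensitive)
suspect_counts = {
    col: df[col].astype(str).str.strip().str.lower().isin(placeholders).sum()
    for col in string_cols
}
print({col: n for col, n in suspect_counts.items() if n > 0})  # empty dict => no hidden missing values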
Train-Test Split¶
Some practitioners perform exploratory data analysis (EDA) on the full dataset before splitting, but I’m cautious about information leakage: insights drawn from data that later ends up in the test set can quietly shape feature engineering or model selection.
To mitigate that risk, we’ll first split the data—reserving 20% for testing—and perform EDA strictly on the training set. This ensures that our insights and decisions are based only on data the model will be allowed to learn from.
from sklearn.model_selection import train_test_split
# Split the data
train_df, test_df = train_test_split(df, test_size=0.20, stratify=df['status'],
random_state=42)
Exploratory Data Analysis (EDA)¶
Target Feature Balance¶
We start by examining the balance of the target variable, status. Understanding class distribution is crucial—it influences model selection and may signal the need for techniques like class weighting or resampling (e.g., SMOTE).
In this dataset, approximately 30% of observations have status = 1, indicating a mild class imbalance. While not extreme, imbalance at this level can still affect models such as logistic regression and may bias predictions toward the majority class. As such, the imbalance is worth noting and may require correction during modeling (a lightweight class-weighting option is sketched after the output below).
train_df.shape, train_df['status'].value_counts(normalize=True)
((3689, 15),
 status
 0    0.701545
 1    0.298455
 Name: proportion, dtype: float64)
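If this imbalance ever needed correcting, the lightest-touch option (a sketch shown here before reaching for resampling methods such as SMOTE) is to let the tree-based models reweight the classes.
from sklearn.tree import DecisionTreeClassifier
# class_weight='balanced' reweights classes inversely to their frequencies,
# so the minority class (status = 1) is not under-penalized during training
weighted_tree = DecisionTreeClassifier(class_weight='balanced', random_state=42)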
Pearson Correlation of Numeric Features¶
I calculated Pearson correlations among the numeric features. Notably, the target variable status is included in this analysis. Using Pearson correlation with a binary variable isn’t strictly incorrect (it reduces to the point-biserial correlation), but the usual linear interpretation doesn’t carry over cleanly, so the values should be read as a rough guide. Since status is the only binary variable in this set, we’ll keep it in for a quick overview.
The overall correlations between features are low, suggesting limited multicollinearity. The one exception is a noticeable correlation between status and time_spent_on_website, indicating that time spent may be a strong predictor of conversion. This is a valuable insight.
I briefly considered creating a new feature, time_spent_on_website / website_visits (i.e., time per visit), but decided against it. While this could capture deeper behavioral patterns, it felt like feature engineering for its own sake rather than something clearly motivated by the data. So, for now, I’ve chosen to skip it.
import matplotlib.pyplot as plt
import seaborn as sns
# Select numeric columns
numeric_cols = train_df.select_dtypes(include=['int64', 'float64']).columns
# Calculate correlation matrix
correlation_matrix = train_df[numeric_cols].corr()
# Set up the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Heatmap of Numeric Features")
plt.tight_layout()
plt.show()
Distribution of Numeric Features¶
I examined the distributions of the numeric features. None of them appear to follow a normal distribution. This may influence model choice and preprocessing steps, particularly for models sensitive to distributional assumptions.
# Numeric columns (the categorical features have not been transformed yet)
numeric_cols_updated = train_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
# Seaborn pairplot on a 500-row sample to avoid overplotting (optional, adjustable)
sample_df = train_df[numeric_cols_updated].sample(n=500, random_state=42)
# Plot
sns.pairplot(sample_df)
plt.suptitle("Pairplot of Numeric Features (Sample of 500)", y=1.02)
plt.show()
Box Plots and Outliers¶
The box plots reveal numerous outliers in features like website_visits and page_views_per_visit, both of which show strong right skew. However, none of the outlier values appear implausible or clearly erroneous.
I tend to avoid removing outliers unless they represent obvious data entry errors or reflect a population I don't intend to model. In this case, the outliers may capture real user behavior—perhaps rare, but meaningful.
While removing them might improve model performance metrics on paper, it would artificially simplify the data and could hurt generalization to real users. I’ve chosen to retain them to preserve the integrity and generalizability of the model.
# Create box-whisker plots for numeric features
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
axes = axes.flatten()
# removed status from numeric_cols
for idx, col in enumerate([col for col in numeric_cols if col != 'status']):
sns.boxplot(data=train_df, y=col, ax=axes[idx])
axes[idx].set_title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()
Data Pre-processing¶
Feature Engineering and Mapping¶
Several string-based categorical variables need to be mapped to numeric format for modeling. Most of these are binary in nature, making them straightforward to encode.
- current_occupation can be converted to a binary feature by grouping Student and Unemployed together (non-professional = 0).
- last_activity can be reclassified to reflect digital vs. phone-based engagement.
- profile_completed has an inherent ordinal structure and can be mapped accordingly.
Binary Mapping (Yes/No ➝ 1/0):¶
- first_interaction: Mobile App = 1, Website = 0
- print_media_type1: Yes = 1, No = 0
- print_media_type2: Yes = 1, No = 0
- digital_media: Yes = 1, No = 0
- educational_channels: Yes = 1, No = 0
- referral: Yes = 1, No = 0
Custom Binary Mapping:¶
- current_occupation: Professional = 1, all others = 0
- last_activity: Phone Activity = 1, all others = 0
Ordinal Mapping:¶
- profile_completed: High = 3, Medium = 2, Low = 1
These mappings simplify the categorical features and ensure compatibility with most scikit-learn models.
# Mapping functions
occupation_map = lambda x: 1 if x == 'Professional' else 0
interaction_map = lambda x: 1 if x == 'Mobile App' else 0
yes_no_map = lambda x: 1 if x == 'Yes' else 0
profile_map = {'Low': 1, 'Medium': 2, 'High': 3}
activity_map = lambda x: 1 if x == 'Phone Activity' else 0
# Apply transformations to both train and test sets
def transform(df):
df = df.copy()
df['current_occupation'] = df['current_occupation'].map(occupation_map)
df['first_interaction'] = df['first_interaction'].map(interaction_map)
df['print_media_type1'] = df['print_media_type1'].map(yes_no_map)
df['print_media_type2'] = df['print_media_type2'].map(yes_no_map)
df['digital_media'] = df['digital_media'].map(yes_no_map)
df['educational_channels'] = df['educational_channels'].map(yes_no_map)
df['referral'] = df['referral'].map(yes_no_map)
df['profile_completed'] = df['profile_completed'].map(profile_map)
df['last_activity'] = df['last_activity'].map(activity_map)
return df
train_df_transformed = transform(train_df)
test_df_transformed = transform(test_df)
train_df_transformed.head()
| | ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 647 | EXT648 | 45 | 1 | 0 | 2 | 5 | 77 | 8.676 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2201 | EXT2202 | 63 | 0 | 0 | 2 | 1 | 65 | 4.031 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3362 | EXT3363 | 54 | 0 | 0 | 3 | 2 | 90 | 3.816 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 617 | EXT618 | 56 | 1 | 1 | 3 | 4 | 1857 | 1.360 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1213 | EXT1214 | 42 | 1 | 0 | 3 | 5 | 1193 | 2.113 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Binary Feature Distributions¶
Next, we examine the distributions of the binary features. Some are fairly balanced, while others show significant skew. These imbalances can influence model performance, especially in cases where the model implicitly assumes equal class probabilities.
Depending on the model used, techniques like class weighting or feature scaling may help mitigate any downstream effects.
# Select binary columns
binary_cols = [
'current_occupation', 'first_interaction', 'print_media_type1',
'print_media_type2', 'digital_media', 'educational_channels',
'referral', 'last_activity'
]
# Calculate proportions of 0s and 1s
binary_proportions = {
col: train_df_transformed[col].value_counts(normalize=True).sort_index()
for col in binary_cols
}
# Convert to DataFrame for display
binary_proportions_df = pd.DataFrame(binary_proportions).T
binary_proportions_df.columns = ['Proportion_0', 'Proportion_1']
binary_proportions_df
| | Proportion_0 | Proportion_1 |
|---|---|---|
| current_occupation | 0.430740 | 0.569260 |
| first_interaction | 0.552182 | 0.447818 |
| print_media_type1 | 0.892383 | 0.107617 |
| print_media_type2 | 0.947682 | 0.052318 |
| digital_media | 0.888046 | 0.111954 |
| educational_channels | 0.849553 | 0.150447 |
| referral | 0.979127 | 0.020873 |
| last_activity | 0.731092 | 0.268908 |
Model: Decision Tree¶
We trained a decision tree classifier and tuned several hyperparameters via grid search (details below). Model selection used accuracy, which remains reasonable here because the class imbalance is only mild. Even so, the choice of metric should be context-dependent.
Accuracy provides a general sense of performance across the confusion matrix, but it does not differentiate between the costs of false positives and false negatives. In a real-world setting, this trade-off depends heavily on the organization’s lead strategy:
- If pursuing leads is expensive, we may want to minimize false positives to avoid wasting resources on low-quality leads.
- If following up is inexpensive and high-converting leads are high value, it may be better to accept more false negatives, knowing that catching even a subset of qualified leads is still profitable.
Metric selection should align with the business objectives—not just model performance.
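To make that trade-off concrete, accuracy could be swapped for a cost-weighted scorer in the grid search below; this is a sketch with illustrative, assumed unit costs rather than business-validated figures.
from sklearn.metrics import confusion_matrix, make_scorer
def negative_lead_cost(y_true, y_pred, fp_cost=1.0, fn_cost=3.0):
    # Negated total cost so that "greater is better" for GridSearchCV;
    # fp_cost and fn_cost are placeholder values, not real business costs
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -(fp * fp_cost + fn * fn_cost)
cost_scorer = make_scorer(negative_lead_cost)
# e.g. GridSearchCV(dtree, param_grid, cv=5, scoring=cost_scorer, n_jobs=-1)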
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
# Define features and target
X_train = train_df_transformed.drop(columns=['ID', 'status'])
y_train = train_df_transformed['status']
# Set up the Decision Tree and hyperparameter grid
dtree = DecisionTreeClassifier(random_state=42)
param_grid = {
'max_depth': [3, 5, 10, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'criterion': ['gini', 'entropy']
}
# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(dtree, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
# Best model and classification report (computed on the training set)
best_tree = grid_search.best_estimator_
y_pred = best_tree.predict(X_train)
report = classification_report(y_train, y_pred, output_dict=True)
pd.DataFrame(report).T
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.884418 | 0.934312 | 0.908681 | 2588.000000 |
| 1 | 0.821990 | 0.712988 | 0.763619 | 1101.000000 |
| accuracy | 0.868257 | 0.868257 | 0.868257 | 0.868257 |
| macro avg | 0.853204 | 0.823650 | 0.836150 | 3689.000000 |
| weighted avg | 0.865786 | 0.868257 | 0.865386 | 3689.000000 |
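For a leakage-free performance estimate, the 20% hold-out split created earlier could be scored the same way once model selection is finished; a minimal sketch follows (no test-set numbers are reported in this write-up).
# Score the tuned tree on the held-out test split
X_test = test_df_transformed.drop(columns=['ID', 'status'])
y_test = test_df_transformed['status']
print(classification_report(y_test, best_tree.predict(X_test)))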
Visualization of the Decision Tree¶
The decision tree visualization shows how the model splits across various features to classify leads. However, due to the tree’s depth and number of branches, the bottom layer (leaf nodes) becomes difficult to read.
While deeper trees can risk overfitting by capturing too much noise from the training data, we applied regularization techniques (such as limiting max depth and minimum samples per split) to help mitigate that risk. The tree remains interpretable at higher levels and reflects meaningful structure in the data.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(20, 10))
plot_tree(
best_tree,
feature_names=X_train.columns,
class_names=['Not Converted', 'Converted'],
filled=True,
rounded=True,
fontsize=10
)
plt.title("Best Decision Tree")
plt.show()
Feature Importance¶
The feature importance scores provide insight into which variables the decision tree relied on most when making predictions. These values reflect the relative contribution of each feature to reducing impurity across all splits.
In this case, time spent on the website stands out as a particularly influential predictor, highlighting that engagement behavior is a key signal for lead conversion.
import pandas as pd
import matplotlib.pyplot as plt
# Get feature importances
importances = best_tree.feature_importances_
features = X_train.columns
# Create a DataFrame for easy sorting and visualization
feat_imp_df = pd.DataFrame({
'Feature': features,
'Importance': importances
}).sort_values(by='Importance', ascending=False)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(feat_imp_df['Feature'], feat_imp_df['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importances from Decision Tree')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Hyperparameters¶
Below are the optimal hyperparameters identified through Grid Search:
- Criterion: Gini index for measuring node purity
- Max Depth: 5 levels
- Min Samples per Split: 2 (the minimum value tried, allowing even very small splits)
This combination suggests that the model benefited from some depth to capture meaningful patterns, but didn’t require aggressive regularization at the split level. The data likely contained reasonably clear structure, allowing the tree to perform well without overfitting—even when splitting on smaller subsets.
# The final params
print("Best Hyperparameters:")
print(grid_search.best_params_)
Best Hyperparameters:
{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
Model: Random Forest¶
The random forest model produced results very similar to the decision tree, with a slight improvement in recall for the status = 1 class. While this marginal gain is valuable, it comes at the cost of interpretability and increased computational complexity.
In many cases, I would favor the decision tree due to its explainability and faster processing—especially when working with stakeholders who need to understand and trust the model's behavior.
However, if the company prioritizes predictive performance over transparency, and processing time is not a concern, the random forest may be the better choice. This trade-off is something that should be discussed with stakeholders before deploying the model.
# Re-import so this cell can run standalone after a kernel restart
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Define features and target
X_train = train_df_transformed.drop(columns=['ID', 'status'])
y_train = train_df_transformed['status']
# Define Random Forest model and grid
rf = RandomForestClassifier(random_state=42)
rf_param_grid = {
'n_estimators': [100, 200],
'max_depth': [5, 10, None],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2],
'criterion': ['gini', 'entropy']
}
# Perform Grid Search
rf_grid_search = GridSearchCV(rf, rf_param_grid, cv=5, scoring='accuracy', n_jobs=-1)
rf_grid_search.fit(X_train, y_train)
# Evaluate (again on the training set, for comparability with the decision tree report)
best_rf = rf_grid_search.best_estimator_
y_pred_rf = best_rf.predict(X_train)
rf_report = classification_report(y_train, y_pred_rf, output_dict=True)
pd.DataFrame(rf_report).T
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.904534 | 0.948223 | 0.925863 | 2588.000000 |
| 1 | 0.862705 | 0.764759 | 0.810785 | 1101.000000 |
| accuracy | 0.893467 | 0.893467 | 0.893467 | 0.893467 |
| macro avg | 0.883619 | 0.856491 | 0.868324 | 3689.000000 |
| weighted avg | 0.892050 | 0.893467 | 0.891517 | 3689.000000 |
Feature Importance¶
The feature importance rankings from the random forest closely mirror those of the decision tree. This consistency reinforces the idea that certain features—such as user engagement behaviors—are strong and reliable predictors of lead conversion, regardless of model complexity.
import pandas as pd
import matplotlib.pyplot as plt
# Extract and organize
rf_importances = best_rf.feature_importances_
rf_features = X_train.columns
# Create a DataFrame for sorting and display
rf_feat_imp_df = pd.DataFrame({
'Feature': rf_features,
'Importance': rf_importances
}).sort_values(by='Importance', ascending=False)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(rf_feat_imp_df['Feature'], rf_feat_imp_df['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importances from Random Forest')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Do We Need to Prune?¶
This question came up as part of the course: should the decision tree be pruned?
Pruning is a form of regularization used to prevent overfitting by trimming back branches that capture noise instead of signal. In this case, explicit pruning after training isn’t necessary because we already applied regularization during training:
- We limited the maximum depth of the tree.
- We constrained the minimum number of samples required to split a node.
- We used 5-fold cross-validation to validate performance across different data splits.
These techniques effectively control the model’s complexity. So while pruning can be useful in more complex or overfitted trees, the current model is already well-regularized and no further pruning is needed.
Including this analysis reinforces a key takeaway: regularization doesn’t have to be post-hoc—it can (and should) be part of model design.
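For completeness, if post-hoc pruning were ever warranted, scikit-learn supports minimal cost-complexity pruning; the sketch below (reusing the existing X_train and y_train) shows one way a ccp_alpha value could be chosen.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
# Candidate pruning strengths from the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # drop the alpha that collapses the tree to a single node
# Pick the alpha with the best cross-validated accuracy (illustrative selection criterion)
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=42, ccp_alpha=a),
                    X_train, y_train, cv=5).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[scores.index(max(scores))]
print(f"Selected ccp_alpha: {best_alpha:.5f}")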
Actionable Insights & Recommendations¶
Key Takeaways from Feature Importance¶
Both the Decision Tree and Random Forest consistently identified the following as the most predictive features of successful lead conversion (status = 1):
- Time spent on website – Higher engagement time is strongly associated with increased conversion likelihood.
- First interaction channel – Leads who first engage via the mobile app tend to convert at higher rates.
- Profile completion level – More complete profiles are predictive of higher conversion potential.
- Current occupation – Professionals convert more frequently than students or the unemployed.
- Last activity type – Phone interactions are more predictive of conversion than web or email engagement.
Business Recommendations¶
- Prioritize mobile app optimization: Since mobile-first leads convert more often, investing in mobile UX and onboarding could significantly boost performance.
- Encourage profile completion: Implement nudges, tooltips, or incentives to improve profile completeness during onboarding.
- Segment by occupation: Tailor messaging and outreach for professionals, who exhibit stronger conversion behavior.
- Act on high-intent signals: Use behavioral indicators like high time-on-site or phone interaction to trigger accelerated follow-up or priority routing to sales.
- Reallocate budget from low-impact channels: Print media and some digital touchpoints showed minimal predictive power; consider redirecting those resources.
Model Choice Guidance¶
- Random Forest delivered slightly better accuracy and recall, particularly for converted leads—making it the stronger model in terms of performance.
- Decision Tree, while slightly less performant, provides greater transparency and faster computation. It’s ideal for teams that value explainability.
Additional Modeling Considerations¶
While tree-based models were effective and provided useful interpretability, I’d be interested in exploring alternative approaches as well. Methods such as logistic regression, gradient boosting (e.g., XGBoost), or even neural networks (for richer feature interaction modeling) could offer different performance profiles or insights.
Trying a wider range of models would also help test the stability of the feature importance rankings and confirm whether simpler linear methods could perform competitively in this context.
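As a concrete starting point for that comparison, a drop-in gradient boosting baseline could be cross-validated with the same folds and metric used above; this is a sketch assuming the existing X_train and y_train.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
# Cross-validated accuracy for a gradient boosting baseline, comparable to the tree and forest above
gb = HistGradientBoostingClassifier(random_state=42)
gb_scores = cross_val_score(gb, X_train, y_train, cv=5, scoring='accuracy')
print(f"Gradient boosting CV accuracy: {gb_scores.mean():.3f} (+/- {gb_scores.std():.3f})")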
Suggested Approach:
Use the Random Forest in production for its stronger performance, but leverage the Decision Tree to explain model behavior, guide strategic decisions, and support internal buy-in.
Future Work & Limitations¶
Limitations¶
- Model Scope: This analysis focused exclusively on decision trees and random forests. While these models performed well, they may not be optimal depending on future business objectives or new data.
- Feature Engineering: Minimal feature engineering was performed to keep the project lightweight. More nuanced features—such as interaction time per visit or temporal trends—could improve performance.
- Class Imbalance: While only mildly imbalanced (~30% positive class), the project did not explore techniques like SMOTE or class weighting, which could further refine recall or precision.
- Business Context: The dataset lacks explicit cost or revenue metrics. Without knowing the true cost of false positives/negatives, model evaluation relied on generalized assumptions rather than business-specific ROI.
Future Work¶
- Explore Alternative Models: Try logistic regression, gradient boosting (e.g., XGBoost), or neural networks to compare interpretability, scalability, and performance.
- Feature Enrichment: Create derived features such as time per visit or activity recency. Apply one-hot encoding or embedding approaches where needed.
- Cost-Sensitive Evaluation: Incorporate cost-weighted metrics or custom scoring to align with business goals—e.g., prioritize precision if follow-up is expensive.
- Deployment Readiness: Wrap the model into a pipeline, assess inference speed, and prepare for A/B testing or pilot deployment (a minimal pipeline sketch follows this list).
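As a first step toward that deployment-readiness item, the notebook's transform() function and the tuned random forest could be packaged into a single scikit-learn pipeline; this is a minimal sketch assuming the raw ExtraaLearn schema and the objects defined earlier.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
# Reuse the notebook's transform() to encode the categorical columns, then drop the
# identifier and target columns before the classifier sees the data
prepare = FunctionTransformer(
    lambda raw: transform(raw).drop(columns=['ID', 'status'], errors='ignore')
)
lead_scoring_pipeline = Pipeline(steps=[
    ('prepare', prepare),
    ('model', RandomForestClassifier(random_state=42, **rf_grid_search.best_params_)),
])
# Fit on the raw training frame; new raw leads can then be scored with .predict() / .predict_proba()
lead_scoring_pipeline.fit(train_df, train_df['status'])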
This project served as a fast baseline and could be expanded into a more robust lead-scoring system with additional time and context.