## Overview
This classifier analyzes open-ended customer feedback on internal software applications and assigns one or more category labels — Technical, Performance, UX, and Data/Security. Labels are not mutually exclusive; a single comment may belong to multiple categories.
The model uses a two-stage approach: a pre-trained sentence transformer (all-MiniLM-L6-v2) encodes each comment into a dense semantic embedding, and an independent binary logistic regression classifier is trained per label on those embeddings.
## Dataset

The dataset comprises 149 labeled feedback comments.

### Label Distribution (positive examples)
## Model Architecture & Training

### Pipeline

Comment (text string) → Transformer (all-MiniLM-L6-v2) → Embedding (semantic vector) → Logistic Regression (one per label) → Prediction (+ confidence)
### Base Encoder

sentence-transformers/all-MiniLM-L6-v2 is used as a frozen feature extractor. It is a 6-layer distilled BERT model (22M parameters) pre-trained on over 1 billion sentence pairs, producing 384-dimensional mean-pooled embeddings. Using a pre-trained encoder avoids the need for large amounts of labeled data to achieve good generalization.
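The mean-pooling step that produces the 384-dimensional sentence vector can be sketched in numpy. The token embeddings below are random stand-ins, not real model output; the point is only to show how padding tokens are masked out before averaging:

```python
import numpy as np

# Stand-in for the token-level output of all-MiniLM-L6-v2:
# one 384-dim vector per token (12 tokens of random data here).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 384))

# Attention mask: 1 for real tokens, 0 for padding (last 4 are padding).
mask = np.array([1] * 8 + [0] * 4)

# Mean-pool only over non-padding tokens to get one sentence vector.
sentence_embedding = (token_embeddings * mask[:, None]).sum(axis=0) / mask.sum()

print(sentence_embedding.shape)  # (384,)
```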
### Classifier Head

A MultiOutputClassifier wraps four independent LogisticRegression models (L2 regularization, C=1.0, max_iter=1000), one per label. Each outputs a probability alongside a binary prediction at a 0.5 threshold.
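A minimal sketch of this head, using random 384-dim vectors in place of real all-MiniLM-L6-v2 embeddings and random labels in place of the actual annotations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

LABELS = ["Technical", "Performance", "UX", "Data/Security"]

# Stand-in data: real embeddings come from the frozen encoder,
# real targets from the labeled feedback.
rng = np.random.default_rng(42)
X = rng.normal(size=(149, 384))
Y = rng.integers(0, 2, size=(149, 4))  # one binary column per label

clf = MultiOutputClassifier(
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
)
clf.fit(X, Y)

# predict() applies the default 0.5 threshold per label;
# predict_proba() returns one (n, 2) array per label.
preds = clf.predict(X[:5])
probs = [p[:, 1] for p in clf.predict_proba(X[:5])]  # P(label present)
print(preds.shape)  # (5, 4)
```

Because each label gets its own estimator, per-label behaviors (regularization, thresholds) can later be tuned independently.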
### Train / Test Split

Data was split 85% / 15% (stratification not applied due to multi-label constraints), yielding 126 training and 23 test examples. After evaluation, the final model is retrained on all 149 examples and saved to `models/classifier.joblib`.
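The split and the final save step can be sketched as follows; embeddings and targets are again random stand-ins, and the model is dumped to the working directory rather than the repo's `models/` path:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(149, 384))        # stand-in embeddings
Y = rng.integers(0, 2, size=(149, 4))  # stand-in multi-label targets

# 85/15 split; no stratify= argument, since a multi-label target
# matrix cannot be stratified on a single class column.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, random_state=0
)
print(len(X_train), len(X_test))  # 126 23

# After evaluation, refit on all 149 examples and persist.
final = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
joblib.dump(final, "classifier.joblib")
```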
## Test Results

Evaluated on the 23 held-out comments not seen during training. Metrics are per-label binary classification (positive class = label present).
**Technical**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.91 | 1.00 | 0.95 | 20 |
| Yes | 1.00 | 0.33 | 0.50 | 3 |
| Weighted avg | 0.92 | 0.91 | 0.89 | 23 |

**Performance**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.91 | 1.00 | 0.95 | 21 |
| Yes | 0.00 | 0.00 | 0.00 | 2 |
| Weighted avg | 0.83 | 0.91 | 0.87 | 23 |

**UX**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.00 | 0.00 | 0.00 | 6 |
| Yes | 0.74 | 1.00 | 0.85 | 17 |
| Weighted avg | 0.55 | 0.74 | 0.63 | 23 |

**Data/Security**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.90 | 1.00 | 0.95 | 19 |
| Yes | 1.00 | 0.50 | 0.67 | 4 |
| Weighted avg | 0.92 | 0.91 | 0.90 | 23 |
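Per-label tables like these come from scoring each label column as an independent binary problem. A sketch with hypothetical random targets and predictions (the real numbers come from the trained model on the held-out set):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

LABELS = ["Technical", "Performance", "UX", "Data/Security"]

# Hypothetical held-out targets and predictions: 23 rows, one
# binary column per label. Random stand-ins for illustration only.
rng = np.random.default_rng(1)
Y_true = rng.integers(0, 2, size=(23, 4))
Y_pred = rng.integers(0, 2, size=(23, 4))

for i, label in enumerate(LABELS):
    p, r, f1, _ = precision_recall_fscore_support(
        Y_true[:, i], Y_pred[:, i], average="weighted", zero_division=0
    )
    print(f"{label}: weighted P={p:.2f} R={r:.2f} F1={f1:.2f}")
```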
## Interpretation & Limitations

### What works well
- UX detection (F1 = 0.85): The dominant label is identified with perfect recall — all UX comments are caught. The trade-off is low precision on the "No" class, reflecting the model's tendency to over-predict UX.
- Data/Security precision (1.00): When the model does predict this label, it is always correct. Terminology in this category (security, compliance, audit, encryption) is semantically distinct and well-separated in embedding space.
- Technical precision (1.00): likewise zero false positives on the minority class, though recall is only 0.33, so two of the three Technical comments were missed.
### Where it struggles
- Performance recall (0.00 on 2 examples): The test set contained only 2 positive Performance examples — too few to draw strong conclusions. The model errs toward predicting "No" for rare classes.
- UX "No" recall (0.00): Because UX is so dominant in training data, the model defaults to predicting UX=Yes for ambiguous inputs.
- Small test set (n=23): With 23 held-out examples, individual misclassifications have a large impact on metrics. Results should be interpreted as directional, not definitive.
## Recommendations
- Add more labeled data, particularly for Technical and Performance categories, to improve minority-class recall.
- Adjust classification thresholds per label (e.g., lower the Performance threshold below 0.5) to trade precision for recall on rare classes.
- Consider SetFit (few-shot fine-tuning) once more labeled examples are available, to adapt the encoder weights to this domain rather than using it as a frozen extractor.
- Retrain periodically as new labeled feedback is collected to keep the model current with evolving application language and user concerns.
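The per-label threshold idea above can be sketched with plain numpy. The threshold values here are illustrative assumptions, not tuned numbers; in practice they would be chosen by sweeping a validation metric per label:

```python
import numpy as np

LABELS = ["Technical", "Performance", "UX", "Data/Security"]

# Hypothetical per-label thresholds: rare classes get cuts below
# the default 0.5 to trade precision for recall.
THRESHOLDS = {
    "Technical": 0.35,
    "Performance": 0.30,
    "UX": 0.50,
    "Data/Security": 0.40,
}

def predict_with_thresholds(probs: np.ndarray) -> np.ndarray:
    """probs: (n_samples, n_labels) array of P(label present),
    columns ordered as in LABELS."""
    cuts = np.array([THRESHOLDS[label] for label in LABELS])
    return (probs >= cuts).astype(int)

probs = np.array([[0.40, 0.32, 0.70, 0.20]])
print(predict_with_thresholds(probs))  # [[1 1 1 0]]
```

With the default 0.5 cut this example would fire only UX; the lowered cuts also surface the borderline Technical and Performance signals.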