Overview
This report covers the SetFit (Sentence Transformer Fine-tuning) version of the CBRE customer feedback classifier. Unlike the baseline approach that used a frozen encoder, SetFit fine-tunes the sentence transformer itself through contrastive learning on automatically generated sentence pairs before fitting the classifier head. This makes it particularly effective for small labeled datasets.
The classifier assigns one or more category labels — Technical, Performance, UX, and Data/Security — to open-ended customer feedback on internal software applications. Labels are not mutually exclusive.
Dataset
Label Distribution (positive examples)
Model Architecture & Training
Two-Phase Training
Phase 1: Contrastive encoder fine-tuning
- Generates sentence pairs from labeled examples
- 5,040 pairs from 126 training comments
- 20 pair-generation iterations per example · batch size 16
- Fine-tunes encoder weights via cosine similarity loss
- Pulls same-label comments together in embedding space
- ~30 seconds on Apple Silicon (MPS)
Phase 2: Classifier head fitting
- Encodes all training examples with the fine-tuned encoder
- Fits 4 independent logistic regression classifiers
- One-vs-rest strategy for multi-label output
- Outputs per-label probability + binary prediction
- <1 second to fit
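The pair arithmetic above (126 comments × 20 iterations × 2 pairs = 5,040) can be sketched in pure Python. This is an illustrative re-implementation, not SetFit's actual sampling code; `generate_pairs` and the toy dataset are hypothetical:

```python
import random

def generate_pairs(examples, num_iterations=20, seed=0):
    """Sketch of SetFit-style pair generation: per iteration, each labeled
    example is paired once with a same-label example (positive pair, target
    similarity 1.0) and once with a different-label example (negative pair,
    target similarity 0.0)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for text, label in examples:
            same = [t for t, l in examples if l == label and t != text]
            diff = [t for t, l in examples if l != label]
            if same:
                pairs.append((text, rng.choice(same), 1.0))  # pull together
            if diff:
                pairs.append((text, rng.choice(diff), 0.0))  # push apart
    return pairs

# Toy dataset: 6 comments across 2 labels
toy = [(f"comment {i}", i % 2) for i in range(6)]
pairs = generate_pairs(toy, num_iterations=20)
print(len(pairs))  # 6 examples x 20 iterations x 2 pairs = 240
```

With 126 comments and 20 iterations, the same arithmetic gives 126 × 20 × 2 = 5,040 pairs, matching the report.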
Inference Pipeline
Comment text → Encoder (all-MiniLM-L6-v2, fine-tuned) → Embedding (domain-adapted) → Logistic Regression (one-vs-rest) → Confidence scores (multi-label)
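The pipeline stages can be sketched as a minimal pure-Python inference loop. The weights and 4-dimensional "embedding" below are hand-made stand-ins for illustration; the real model uses 384-dim embeddings and learned logistic-regression heads:

```python
import math

LABELS = ["Technical", "Performance", "UX", "Data/Security"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(embedding, heads, threshold=0.5):
    """One-vs-rest inference: each label has its own (weights, bias)
    logistic head applied to the same sentence embedding, yielding an
    independent probability and binary decision per label."""
    out = {}
    for label in LABELS:
        w, b = heads[label]
        z = sum(wi * xi for wi, xi in zip(w, embedding)) + b
        p = sigmoid(z)
        out[label] = {"probability": round(p, 3), "predicted": p >= threshold}
    return out

# Toy 4-dim "embedding" and hand-made heads (hypothetical values)
emb = [0.2, -0.1, 0.7, 0.05]
heads = {
    "Technical":     ([1.0, 0.0, 0.0, 0.0], -0.5),
    "Performance":   ([0.0, 1.0, 0.0, 0.0], 0.0),
    "UX":            ([0.0, 0.0, 2.0, 0.0], -0.2),
    "Data/Security": ([0.0, 0.0, 0.0, 1.0], -1.0),
}
result = predict(emb, heads)
```

Because the four heads are independent, a comment can receive any subset of the labels, which is what makes the output multi-label.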
Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| num_epochs | 1 | Contrastive training epochs |
| num_iterations | 20 | Pair-generation iterations per training example |
| batch_size | 16 | Pairs per gradient update |
| multi_target_strategy | one-vs-rest | Independent binary classifier per label |
| base_model | all-MiniLM-L6-v2 | 22M-param distilled BERT, 384-dim output |
Train / Test Split
Data was split 85% / 15%, yielding 126 training and 23 test examples.
The evaluation model is trained on the training split only. After metrics are reported on the held-out test set, the final production model is retrained on all 149 examples and saved to models/setfit/.
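The 126/23 counts follow from truncating 149 × 0.85. A minimal sketch of the split, assuming a simple shuffled index split (the helper name and seed are hypothetical):

```python
import random

def split_indices(n, train_frac=0.85, seed=42):
    """Shuffle indices and split; int() truncation gives the train size,
    so 149 examples at 85% yield int(126.65) = 126 train and 23 test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_indices(149)
print(len(train_idx), len(test_idx))  # 126 23
```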
Test Results
Evaluated on 23 held-out comments not seen during training. Metrics are per-label binary classification (positive class = label present).
Technical
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.91 | 1.00 | 0.95 | 20 |
| Yes | 1.00 | 0.33 | 0.50 | 3 |
| Weighted avg | 0.92 | 0.91 | 0.89 | 23 |
Performance
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 1.00 | 1.00 | 1.00 | 21 |
| Yes | 1.00 | 1.00 | 1.00 | 2 |
| Weighted avg | 1.00 | 1.00 | 1.00 | 23 |
UX
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 1.00 | 0.50 | 0.67 | 6 |
| Yes | 0.85 | 1.00 | 0.92 | 17 |
| Weighted avg | 0.89 | 0.87 | 0.85 | 23 |
Data/Security
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 1.00 | 1.00 | 1.00 | 19 |
| Yes | 1.00 | 1.00 | 1.00 | 4 |
| Weighted avg | 1.00 | 1.00 | 1.00 | 23 |
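Each weighted-avg row above is the support-weighted mean of the per-class rows. A short check against the first table (No: 0.91/1.00/0.95 at support 20, Yes: 1.00/0.33/0.50 at support 3):

```python
def weighted_avg(rows):
    """Support-weighted average of per-class metrics.
    rows: list of (precision, recall, f1, support) tuples."""
    total = sum(s for *_, s in rows)
    p = sum(pi * s for pi, _, _, s in rows) / total
    r = sum(ri * s for _, ri, _, s in rows) / total
    f = sum(fi * s for _, _, fi, s in rows) / total
    return round(p, 2), round(r, 2), round(f, 2)

# First table: (0.91*20 + 1.00*3)/23 = 0.92, etc.
print(weighted_avg([(0.91, 1.00, 0.95, 20), (1.00, 0.33, 0.50, 3)]))
# (0.92, 0.91, 0.89)
```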
SetFit vs. Frozen Encoder Baseline
F1 score on the positive class (label present) across 23 held-out test examples, comparing SetFit fine-tuning against the previous frozen encoder + logistic regression approach.
| Label | Frozen Encoder F1 | SetFit F1 | Change |
|---|---|---|---|
| Technical | 0.50 | 0.50 | — no change |
| Performance | 0.00 | 1.00 | +1.00 ▲ perfect |
| UX | 0.85 | 0.92 | +0.07 ▲ |
| Data/Security | 0.67 | 1.00 | +0.33 ▲ perfect |
Interpretation & Limitations
What works well
- Performance & Data/Security (F1 = 1.00): Perfect precision and recall on both labels. Contrastive training draws clear boundaries between slowness/latency language and security/compliance language.
- UX recall (1.00): Every UX-related comment is correctly identified — the most operationally valuable result for service improvement prioritization.
- Technical precision (1.00): Zero false positives. Every comment flagged as Technical genuinely describes a technical issue.
Where it struggles
- Technical recall (0.33): Only 1 of 3 Technical test examples was correctly identified. With just 3 positive test examples, a single miss has outsized metric impact — more labeled Technical data is needed.
- UX "No" recall (0.50): 3 of 6 non-UX comments were incorrectly predicted as UX. The dominant-class bias persists, though every comment the model predicts as non-UX genuinely is non-UX ("No" precision 1.00).
- Small test set (n=23): Individual misclassifications swing metrics significantly. Results are directionally informative but not statistically robust until the dataset grows beyond ~500 examples.
Recommendations
- Expand Technical labeling — target 50+ positive examples. Technical comments are semantically diverse (integration issues, config complexity, update breakage) and benefit most from additional examples.
- Increase num_iterations to 40–60 to generate more contrastive pairs and improve encoder specialization for rare classes.
- Raise the UX classification threshold above 0.5 to reduce false positives on the dominant class without sacrificing recall.
- Retrain as data grows — SetFit retrains in ~25 seconds on this dataset, making periodic retraining as new labeled feedback arrives entirely practical.
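The threshold recommendation amounts to a simple sweep on validation data: score each candidate threshold by per-label F1 and keep the best. A pure-Python sketch; `best_threshold`, the grid, and the toy scores are hypothetical:

```python
def f1_at(probs, gold, threshold):
    """Binary F1 for a single label at a given decision threshold."""
    preds = [p >= threshold for p in probs]
    tp = sum(1 for p, g in zip(preds, gold) if p and g)
    fp = sum(1 for p, g in zip(preds, gold) if p and not g)
    fn = sum(1 for p, g in zip(preds, gold) if not p and g)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_threshold(probs, gold, grid=None):
    """Sweep candidate thresholds, keep the one with the highest F1."""
    grid = grid or [t / 100 for t in range(30, 81, 5)]
    return max(grid, key=lambda t: f1_at(probs, gold, t))

# Toy validation scores: two non-UX comments score just above 0.5,
# so a slightly higher threshold removes both false positives
probs = [0.95, 0.9, 0.85, 0.8, 0.55, 0.52, 0.3, 0.2]
gold  = [True, True, True, True, False, False, False, False]
t = best_threshold(probs, gold)
```

On this toy data the sweep picks a threshold above 0.5, cutting the false positives while keeping recall at 1.00, the same trade-off the recommendation targets for UX.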