Overview
This report covers the SetFit (Sentence Transformer Fine-tuning) version of the CBRE customer feedback classifier. Unlike the baseline approach that used a frozen encoder, SetFit fine-tunes the sentence transformer itself through contrastive learning on automatically generated sentence pairs before fitting the classifier head. This makes it particularly effective for small labeled datasets.
The classifier assigns one or more category labels — Technical, Performance, UX, and Data/Security — to open-ended customer feedback on internal software applications. Labels are not mutually exclusive.
Dataset
Label Distribution (positive examples)
Model Architecture & Training
Two-Phase Training
Phase 1: Contrastive encoder fine-tuning
- Generates sentence pairs from labeled examples
- 5,040 pairs from 126 training comments
- 20 pair-generation iterations per example · batch size 16
- Fine-tunes encoder weights via cosine similarity loss
- Pulls same-label comments together in embedding space
- ~30 seconds on Apple Silicon (MPS)
Phase 2: Classifier head fitting
- Encodes all training examples with the fine-tuned encoder
- Fits 4 independent logistic regression classifiers
- One-vs-rest strategy for multi-label output
- Outputs per-label probability + binary prediction
- <1 second to fit
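The pair arithmetic above (126 comments × 20 iterations × 2 pairs = 5,040) can be sketched in pure Python. This is an illustrative re-implementation, not SetFit's actual sampling code; `generate_pairs` and the toy dataset are hypothetical:

```python
import random

def generate_pairs(examples, num_iterations=20, seed=0):
    """Sketch of SetFit-style pair generation: per iteration, each labeled
    example is paired once with a same-label example (positive pair, target
    similarity 1.0) and once with a different-label example (negative pair,
    target similarity 0.0)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for text, label in examples:
            same = [t for t, l in examples if l == label and t != text]
            diff = [t for t, l in examples if l != label]
            if same:
                pairs.append((text, rng.choice(same), 1.0))  # pull together
            if diff:
                pairs.append((text, rng.choice(diff), 0.0))  # push apart
    return pairs

# Toy dataset: 6 comments across 2 labels
toy = [(f"comment {i}", i % 2) for i in range(6)]
pairs = generate_pairs(toy, num_iterations=20)
print(len(pairs))  # 6 examples x 20 iterations x 2 pairs = 240
```

With 126 comments and 20 iterations, the same arithmetic gives 126 × 20 × 2 = 5,040 pairs, matching the report.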
Inference Pipeline
Comment text → Encoder (all-MiniLM-L6-v2, fine-tuned) → Embedding (domain-adapted) → Logistic Regression (one-vs-rest) → Confidence scores (multi-label)
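The pipeline stages can be sketched as a minimal pure-Python inference loop. The weights and 4-dimensional "embedding" below are hand-made stand-ins for illustration; the real model uses 384-dim embeddings and learned logistic-regression heads:

```python
import math

LABELS = ["Technical", "Performance", "UX", "Data/Security"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(embedding, heads, threshold=0.5):
    """One-vs-rest inference: each label has its own (weights, bias)
    logistic head applied to the same sentence embedding, yielding an
    independent probability and binary decision per label."""
    out = {}
    for label in LABELS:
        w, b = heads[label]
        z = sum(wi * xi for wi, xi in zip(w, embedding)) + b
        p = sigmoid(z)
        out[label] = {"probability": round(p, 3), "predicted": p >= threshold}
    return out

# Toy 4-dim "embedding" and hand-made heads (hypothetical values)
emb = [0.2, -0.1, 0.7, 0.05]
heads = {
    "Technical":     ([1.0, 0.0, 0.0, 0.0], -0.5),
    "Performance":   ([0.0, 1.0, 0.0, 0.0], 0.0),
    "UX":            ([0.0, 0.0, 2.0, 0.0], -0.2),
    "Data/Security": ([0.0, 0.0, 0.0, 1.0], -1.0),
}
result = predict(emb, heads)
```

Because the four heads are independent, a comment can receive any subset of the labels, which is what makes the output multi-label.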
Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| num_epochs | 1 | Contrastive training epochs |
| num_iterations | 20 | Pair-generation iterations per training example |
| batch_size | 16 | Pairs per gradient update |
| multi_target_strategy | one-vs-rest | Independent binary classifier per label |
| base_model | all-MiniLM-L6-v2 | 22M-param distilled BERT, 384-dim output |
Train / Test Split
Data was split 85% / 15%, yielding 126 training and 23 test examples.
The evaluation model is trained on the training split only. After metrics are reported on the held-out test set, the final production model is retrained on all 149 examples and saved to models/setfit/.
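The 126/23 counts follow from truncating 149 × 0.85. A minimal sketch of the split, assuming a simple shuffled index split (the helper name and seed are hypothetical):

```python
import random

def split_indices(n, train_frac=0.85, seed=42):
    """Shuffle indices and split; int() truncation gives the train size,
    so 149 examples at 85% yield int(126.65) = 126 train and 23 test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_indices(149)
print(len(train_idx), len(test_idx))  # 126 23
```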
Test Results
Evaluated on 23 held-out comments not seen during training. Metrics are per-label binary classification (positive class = label present).
Technical
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.91 | 1.00 | 0.95 | 20 |
| Yes | 1.00 | 0.33 | 0.50 | 3 |
| Weighted avg | 0.92 | 0.91 | 0.89 | 23 |
Performance
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 1.00 | 1.00 | 1.00 | 21 |
| Yes | 1.00 | 1.00 | 1.00 | 2 |
| Weighted avg | 1.00 | 1.00 | 1.00 | 23 |
UX
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 1.00 | 0.50 | 0.67 | 6 |
| Yes | 0.85 | 1.00 | 0.92 | 17 |
| Weighted avg | 0.89 | 0.87 | 0.85 | 23 |
Data/Security
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 1.00 | 1.00 | 1.00 | 19 |
| Yes | 1.00 | 1.00 | 1.00 | 4 |
| Weighted avg | 1.00 | 1.00 | 1.00 | 23 |
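Each weighted-avg row above is the support-weighted mean of the per-class rows. A short check against the first table (No: 0.91/1.00/0.95 at support 20, Yes: 1.00/0.33/0.50 at support 3):

```python
def weighted_avg(rows):
    """Support-weighted average of per-class metrics.
    rows: list of (precision, recall, f1, support) tuples."""
    total = sum(s for *_, s in rows)
    p = sum(pi * s for pi, _, _, s in rows) / total
    r = sum(ri * s for _, ri, _, s in rows) / total
    f = sum(fi * s for _, _, fi, s in rows) / total
    return round(p, 2), round(r, 2), round(f, 2)

# First table: (0.91*20 + 1.00*3)/23 = 0.92, etc.
print(weighted_avg([(0.91, 1.00, 0.95, 20), (1.00, 0.33, 0.50, 3)]))
# (0.92, 0.91, 0.89)
```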
SetFit vs. Frozen Encoder Baseline
F1 score on the positive class (label present) across 23 held-out test examples, comparing SetFit fine-tuning against the previous frozen encoder + logistic regression approach.
| Label | Frozen Encoder F1 | SetFit F1 | Change |
|---|---|---|---|
| Technical | 0.50 | 0.50 | — no change |
| Performance | 0.00 | 1.00 | +1.00 ▲ perfect |
| UX | 0.85 | 0.92 | +0.07 ▲ |
| Data/Security | 0.67 | 1.00 | +0.33 ▲ perfect |
Interpretation & Limitations
What works well
- Performance & Data/Security (F1 = 1.00): Perfect precision and recall on both labels. Contrastive training draws clear boundaries between slowness/latency language and security/compliance language.
- UX recall (1.00): Every UX-related comment is correctly identified — the most operationally valuable result for service improvement prioritization.
- Technical precision (1.00): Zero false positives. Every comment flagged as Technical genuinely describes a technical issue.
Where it struggles
- Technical recall (0.33): Only 1 of 3 Technical test examples was correctly identified. With just 3 positive test examples, a single miss has outsized metric impact — more labeled Technical data is needed.
- UX "No" recall (0.50): 3 of 6 non-UX comments were incorrectly predicted as UX. The dominant-class bias persists, though every comment the model predicts as non-UX genuinely is non-UX ("No" precision 1.00).
- Small test set (n=23): Individual misclassifications swing metrics significantly. Results are directionally informative but not statistically robust until the dataset grows beyond ~500 examples.
Recommendations
- Expand Technical labeling — target 50+ positive examples. Technical comments are semantically diverse (integration issues, config complexity, update breakage) and benefit most from additional examples.
- Increase num_iterations to 40–60 to generate more contrastive pairs and improve encoder specialization for rare classes.
- Raise the UX classification threshold above 0.5 to reduce false positives on the dominant class without sacrificing recall.
- Retrain as data grows — SetFit retrains in ~25 seconds on this dataset, making periodic retraining as new labeled feedback arrives entirely practical.
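The threshold recommendation amounts to a simple sweep on validation data: score each candidate threshold by per-label F1 and keep the best. A pure-Python sketch; `best_threshold`, the grid, and the toy scores are hypothetical:

```python
def f1_at(probs, gold, threshold):
    """Binary F1 for a single label at a given decision threshold."""
    preds = [p >= threshold for p in probs]
    tp = sum(1 for p, g in zip(preds, gold) if p and g)
    fp = sum(1 for p, g in zip(preds, gold) if p and not g)
    fn = sum(1 for p, g in zip(preds, gold) if not p and g)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_threshold(probs, gold, grid=None):
    """Sweep candidate thresholds, keep the one with the highest F1."""
    grid = grid or [t / 100 for t in range(30, 81, 5)]
    return max(grid, key=lambda t: f1_at(probs, gold, t))

# Toy validation scores: two non-UX comments score just above 0.5,
# so a slightly higher threshold removes both false positives
probs = [0.95, 0.9, 0.85, 0.8, 0.55, 0.52, 0.3, 0.2]
gold  = [True, True, True, True, False, False, False, False]
t = best_threshold(probs, gold)
```

On this toy data the sweep picks a threshold above 0.5, cutting the false positives while keeping recall at 1.00, the same trade-off the recommendation targets for UX.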