## Overview
This classifier analyzes open-ended customer feedback on internal software applications and assigns one or more category labels — Technical, Performance, UX, and Data/Security. Labels are not mutually exclusive; a single comment may belong to multiple categories.
The model uses a two-stage approach: a pre-trained sentence transformer (all-MiniLM-L6-v2) encodes each comment into a dense semantic embedding, and an independent binary logistic regression classifier is trained per label on those embeddings.
## Dataset

The dataset comprises 149 labeled feedback comments.

### Label Distribution (positive examples)
## Model Architecture & Training

### Pipeline

Comment (text string) → Transformer (all-MiniLM-L6-v2) → Embedding (semantic vector) → Logistic Regression (one per label) → Prediction (+ confidence)
### Base Encoder

sentence-transformers/all-MiniLM-L6-v2 is used as a frozen feature extractor. It is a 6-layer distilled BERT model (22M parameters) pre-trained on over 1 billion sentence pairs, producing 384-dimensional mean-pooled embeddings. Using a pre-trained encoder avoids the need for large amounts of labeled data to achieve good generalization.
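The mean-pooling step that produces the 384-dimensional sentence vector can be sketched in numpy. The token embeddings below are random stand-ins, not real model output; the point is only to show how padding tokens are masked out before averaging:

```python
import numpy as np

# Stand-in for the token-level output of all-MiniLM-L6-v2:
# one 384-dim vector per token (12 tokens of random data here).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 384))

# Attention mask: 1 for real tokens, 0 for padding (last 4 are padding).
mask = np.array([1] * 8 + [0] * 4)

# Mean-pool only over non-padding tokens to get one sentence vector.
sentence_embedding = (token_embeddings * mask[:, None]).sum(axis=0) / mask.sum()

print(sentence_embedding.shape)  # (384,)
```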
### Classifier Head

A MultiOutputClassifier wraps four independent LogisticRegression models (L2 regularization, C=1.0, max_iter=1000), one per label. Each outputs a probability alongside a binary prediction at a 0.5 threshold.
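A minimal sketch of this head, using random 384-dim vectors in place of real all-MiniLM-L6-v2 embeddings and random labels in place of the actual annotations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

LABELS = ["Technical", "Performance", "UX", "Data/Security"]

# Stand-in data: real embeddings come from the frozen encoder,
# real targets from the labeled feedback.
rng = np.random.default_rng(42)
X = rng.normal(size=(149, 384))
Y = rng.integers(0, 2, size=(149, 4))  # one binary column per label

clf = MultiOutputClassifier(
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
)
clf.fit(X, Y)

# predict() applies the default 0.5 threshold per label;
# predict_proba() returns one (n, 2) array per label.
preds = clf.predict(X[:5])
probs = [p[:, 1] for p in clf.predict_proba(X[:5])]  # P(label present)
print(preds.shape)  # (5, 4)
```

Because each label gets its own estimator, per-label behaviors (regularization, thresholds) can later be tuned independently.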
### Train / Test Split

Data was split 85% / 15% (stratification not applied due to multi-label constraints), yielding 126 training and 23 test examples. After evaluation, the final model is retrained on all 149 examples and saved to `models/classifier.joblib`.
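The split and the final save step can be sketched as follows; embeddings and targets are again random stand-ins, and the model is dumped to the working directory rather than the repo's `models/` path:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(149, 384))        # stand-in embeddings
Y = rng.integers(0, 2, size=(149, 4))  # stand-in multi-label targets

# 85/15 split; no stratify= argument, since a multi-label target
# matrix cannot be stratified on a single class column.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, random_state=0
)
print(len(X_train), len(X_test))  # 126 23

# After evaluation, refit on all 149 examples and persist.
final = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
joblib.dump(final, "classifier.joblib")
```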
## Test Results

Evaluated on the 23 held-out comments not seen during training. Metrics are per-label binary classification (positive class = label present).
**Technical**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.91 | 1.00 | 0.95 | 20 |
| Yes | 1.00 | 0.33 | 0.50 | 3 |
| Weighted avg | 0.92 | 0.91 | 0.89 | 23 |

**Performance**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.91 | 1.00 | 0.95 | 21 |
| Yes | 0.00 | 0.00 | 0.00 | 2 |
| Weighted avg | 0.83 | 0.91 | 0.87 | 23 |

**UX**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.00 | 0.00 | 0.00 | 6 |
| Yes | 0.74 | 1.00 | 0.85 | 17 |
| Weighted avg | 0.55 | 0.74 | 0.63 | 23 |

**Data/Security**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| No | 0.90 | 1.00 | 0.95 | 19 |
| Yes | 1.00 | 0.50 | 0.67 | 4 |
| Weighted avg | 0.92 | 0.91 | 0.90 | 23 |
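Per-label tables like these come from scoring each label column as an independent binary problem. A sketch with hypothetical random targets and predictions (the real numbers come from the trained model on the held-out set):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

LABELS = ["Technical", "Performance", "UX", "Data/Security"]

# Hypothetical held-out targets and predictions: 23 rows, one
# binary column per label. Random stand-ins for illustration only.
rng = np.random.default_rng(1)
Y_true = rng.integers(0, 2, size=(23, 4))
Y_pred = rng.integers(0, 2, size=(23, 4))

for i, label in enumerate(LABELS):
    p, r, f1, _ = precision_recall_fscore_support(
        Y_true[:, i], Y_pred[:, i], average="weighted", zero_division=0
    )
    print(f"{label}: weighted P={p:.2f} R={r:.2f} F1={f1:.2f}")
```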
## Interpretation & Limitations

### What works well
- UX detection (F1 = 0.85): The dominant label is identified with perfect recall — all UX comments are caught. The trade-off is low precision on the "No" class, reflecting the model's tendency to over-predict UX.
- Data/Security precision (1.00): When the model does predict this label, it is always correct. Terminology in this category (security, compliance, audit, encryption) is semantically distinct and well-separated in embedding space.
- Technical precision (1.00): likewise zero false positives on the minority class, though recall is only 0.33, so two of the three Technical comments were missed.
### Where it struggles
- Performance recall (0.00 on 2 examples): The test set contained only 2 positive Performance examples — too few to draw strong conclusions. The model errs toward predicting "No" for rare classes.
- UX "No" recall (0.00): Because UX is so dominant in training data, the model defaults to predicting UX=Yes for ambiguous inputs.
- Small test set (n=23): With 23 held-out examples, individual misclassifications have a large impact on metrics. Results should be interpreted as directional, not definitive.
## Recommendations
- Add more labeled data, particularly for Technical and Performance categories, to improve minority-class recall.
- Adjust classification thresholds per label (e.g., lower the Performance threshold below 0.5) to trade precision for recall on rare classes.
- Consider SetFit (few-shot fine-tuning) once more labeled examples are available, to adapt the encoder weights to this domain rather than using it as a frozen extractor.
- Retrain periodically as new labeled feedback is collected to keep the model current with evolving application language and user concerns.
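The per-label threshold idea above can be sketched with plain numpy. The threshold values here are illustrative assumptions, not tuned numbers; in practice they would be chosen by sweeping a validation metric per label:

```python
import numpy as np

LABELS = ["Technical", "Performance", "UX", "Data/Security"]

# Hypothetical per-label thresholds: rare classes get cuts below
# the default 0.5 to trade precision for recall.
THRESHOLDS = {
    "Technical": 0.35,
    "Performance": 0.30,
    "UX": 0.50,
    "Data/Security": 0.40,
}

def predict_with_thresholds(probs: np.ndarray) -> np.ndarray:
    """probs: (n_samples, n_labels) array of P(label present),
    columns ordered as in LABELS."""
    cuts = np.array([THRESHOLDS[label] for label in LABELS])
    return (probs >= cuts).astype(int)

probs = np.array([[0.40, 0.32, 0.70, 0.20]])
print(predict_with_thresholds(probs))  # [[1 1 1 0]]
```

With the default 0.5 cut this example would fire only UX; the lowered cuts also surface the borderline Technical and Performance signals.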