Customer Feedback Classifier

Model Training, Testing & Results

Overview

This classifier analyzes open-ended customer feedback on internal software applications and assigns one or more category labels: Technical, Performance, UX, and Data/Security. Labels are not mutually exclusive; a single comment may belong to multiple categories.

The model uses a two-stage approach: a pre-trained sentence transformer (all-MiniLM-L6-v2) encodes each comment into a dense semantic embedding, and an independent logistic regression binary classifier is trained per label on those embeddings.

Dataset

  Labeled comments        149
  Category labels           4
  Applications             14
  Avg labels / comment   1.14

Label Distribution (positive examples)

  UX               ~122 of 149  (~82%)
  Data/Security     ~42 of 149  (~28%)
  Technical         ~27 of 149  (~18%)
  Performance       ~19 of 149  (~13%)
Class imbalance: UX dominates the dataset (~82% positive rate), while Technical, Performance, and Data/Security are minority classes. This affects recall on those labels and is the primary driver of lower minority-class F1 scores.
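The distribution above can be reproduced directly from a multi-label indicator matrix. A minimal sketch, using synthetic labels drawn at roughly the reported positive rates (the real label matrix is assumed, not shown here):

```python
import numpy as np

labels = ["Technical", "Performance", "UX", "Data/Security"]

# Hypothetical indicator matrix: one row per comment, one column per label,
# sampled at roughly the reported positive rates.
rng = np.random.default_rng(0)
Y = (rng.random((149, 4)) < [0.18, 0.13, 0.82, 0.28]).astype(int)

pos = Y.sum(axis=0)       # positive examples per label
rates = pos / len(Y)      # positive rate per label
for name, n, r in zip(labels, pos, rates):
    print(f"{name:14s} {n:4d} of {len(Y)}  ({r:.0%})")
```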

Model Architecture & Training

Pipeline

Raw Comment (text string) → Sentence Transformer (all-MiniLM-L6-v2) → 384-dim Embedding (semantic vector) → 4× Logistic Regression (one per label) → Multi-label Prediction (+ confidence)

Base Encoder

sentence-transformers/all-MiniLM-L6-v2 is used as a frozen feature extractor. It is a 6-layer distilled BERT model (22M parameters) pre-trained on over 1 billion sentence pairs, producing 384-dimensional mean-pooled embeddings. Using a pre-trained encoder avoids the need for large amounts of labeled data to achieve good generalization.
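A sketch of this frozen feature-extraction step, assuming the sentence-transformers package is installed; if the model cannot be loaded (offline, not installed), a same-shape random stand-in keeps the rest of the pipeline runnable:

```python
import numpy as np

# Frozen feature extraction: one 384-dim embedding per comment.
comments = ["The export button crashes on large files."]
try:
    from sentence_transformers import SentenceTransformer
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(comments)  # mean-pooled sentence vectors
except Exception:
    embeddings = np.random.rand(len(comments), 384)  # offline stand-in

print(embeddings.shape)  # (1, 384)
```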

Classifier Head

A MultiOutputClassifier wraps four independent LogisticRegression models (L2 regularization, C=1.0, max_iter=1000), one per label. Each outputs a probability estimate alongside a binary prediction at a 0.5 threshold.
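The classifier head can be sketched as follows; the embeddings and label matrix here are synthetic stand-ins for the real 149-example dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-ins: 149 "embeddings" (384-dim) and a 4-column label
# matrix with positive rates roughly matching the report.
rng = np.random.default_rng(42)
X = rng.standard_normal((149, 384))
Y = (rng.random((149, 4)) < [0.18, 0.13, 0.82, 0.28]).astype(int)

# One independent L2-regularized logistic regression per label.
clf = MultiOutputClassifier(
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
)
clf.fit(X, Y)

# predict_proba returns one (n_samples, 2) array per label; column 1 is
# P(label present). Thresholding at 0.5 gives the binary multi-label output.
probs = np.column_stack([p[:, 1] for p in clf.predict_proba(X[:5])])
preds = (probs >= 0.5).astype(int)
print(probs.shape, preds.shape)  # (5, 4) (5, 4)
```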

Train / Test Split

Data was split 85% / 15% (stratification not applied due to multi-label constraints), yielding 126 training and 23 test examples. After evaluation, the final model is retrained on all 149 examples and saved to models/classifier.joblib.
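The split-evaluate-retrain-save sequence can be sketched as below, again with stand-in data; the output path mirrors the report's models/classifier.joblib:

```python
import os
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Stand-in embeddings and labels with the dataset's shape (149 x 384, 4 labels).
rng = np.random.default_rng(0)
X = rng.standard_normal((149, 384))
Y = (rng.random((149, 4)) < [0.18, 0.13, 0.82, 0.28]).astype(int)

# 85/15 split; no stratification because the targets are multi-label.
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.15, random_state=0)
print(len(X_tr), len(X_te))  # 126 23

model = MultiOutputClassifier(LogisticRegression(C=1.0, max_iter=1000))
model.fit(X_tr, Y_tr)
# ... evaluate on (X_te, Y_te) here ...

# Final model is retrained on all 149 examples, then saved.
model.fit(X, Y)
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/classifier.joblib")
```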

Test Results

Evaluated on 23 held-out comments not seen during training. Metrics are per-label binary classification (positive class = label present).
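Per-label tables of this form can be produced with scikit-learn's classification_report; the ground truth and predictions below are hypothetical stand-ins for the actual 23-comment test set:

```python
import numpy as np
from sklearn.metrics import classification_report

labels = ["Technical", "Performance", "UX", "Data/Security"]

# Hypothetical held-out truths and predictions; positive rates loosely
# follow the report's test-set supports.
rng = np.random.default_rng(1)
Y_true = (rng.random((23, 4)) < [0.13, 0.09, 0.74, 0.17]).astype(int)
Y_pred = (rng.random((23, 4)) < [0.10, 0.05, 0.90, 0.15]).astype(int)

# One binary report per label (positive class = label present).
accs = {}
for i, name in enumerate(labels):
    report = classification_report(
        Y_true[:, i], Y_pred[:, i], zero_division=0, output_dict=True
    )
    accs[name] = report["accuracy"]
    print(f"{name:14s} accuracy {accs[name]:.2f}")
```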

Technical (accuracy 91%)

  Class          Precision  Recall    F1  Support
  No                  0.91    1.00  0.95       20
  Yes                 1.00    0.33  0.50        3
  Weighted avg        0.92    0.91  0.89       23

Performance (accuracy 91%)

  Class          Precision  Recall    F1  Support
  No                  0.91    1.00  0.95       21
  Yes                 0.00    0.00  0.00        2
  Weighted avg        0.83    0.91  0.87       23

UX (accuracy 74%)

  Class          Precision  Recall    F1  Support
  No                  0.00    0.00  0.00        6
  Yes                 0.74    1.00  0.85       17
  Weighted avg        0.55    0.74  0.63       23

Data/Security (accuracy 91%)

  Class          Precision  Recall    F1  Support
  No                  0.90    1.00  0.95       19
  Yes                 1.00    0.50  0.67        4
  Weighted avg        0.92    0.91  0.90       23

Interpretation & Limitations

What works well

Technical, Performance, and Data/Security each reach 91% accuracy, driven largely by reliable identification of the majority "No" class. When the model does flag Technical or Data/Security, it is precise: positive-class precision is 1.00 for both.

Where it struggles

Minority-class recall is weak: Technical catches 1 of 3 positives (recall 0.33), Data/Security 2 of 4 (0.50), and Performance misses both of its 2 positives. UX shows the opposite failure: positive recall is 1.00, but all 6 negative test examples are misclassified as UX, consistent with the ~82% positive rate in the training data.

Recommendations

Collect more labeled examples for the minority labels, Performance especially (~19 positives overall). Consider per-label class weighting (e.g. class_weight="balanced") or tuned decision thresholds in place of the uniform 0.5. For evaluation, an iteratively stratified multi-label split would keep rare labels represented in the held-out set.