Learning Patterns

Supervised Tasks and Adaptation

Noé Durandard (noe.durandard@psl.eu)

Introduction to NLP — MSc. DH EdC-PSL

November 11, 2025

1 Recap

1.0.1 The Language Modeling Problem

Language Model

A Language Model (LM) estimates the probability of pieces of text. Given a sequence of text \(w_{1},w_{2},\cdots,w_{S}\), it answers the question:

What is \(P(w_{1},w_{2},\cdots,w_{S})\)?

How to compute \(P\)?

  • Count-based (count): \(n\)-gram
  • Neural method (learn): Transformer
    • Masked Language Modeling (encoder-only, BERT)
    • Causal Language Modeling (decoder-only, GPT)

\(\to\) LMs encode:

  • grammaticality
  • semantic plausibility
  • stylistic consistency
  • knowledge (?)

1.0.2 Document Representation & Unsupervised Tasks

2 Supervised Learning

2.1 Introduction

2.1.1 Supervised Learning

Goal

Learn patterns that map inputs (texts, sentences, etc.) to outputs (labels, categories, values) based on annotated examples.

What do you need?

  • Training data: documents annotated with labels (‘ground-truth’)
  • A model that can make predictions
  • A loss function that measures the model’s performance
  • An optimisation process that nudges the model towards lower loss

\(\to\) The model will ‘learn’ from examples (and corrections), improving its predictions through training.

2.1.2 Supervised Learning NLP Tasks

  • Classification: Assign a label to an entire text or sentence
    • Sentiment analysis (positive / negative reviews)
    • Topic classification (political / cultural / literary texts)
    • Spam detection, author identification, genre prediction
  • Regression: Predict a continuous value
    • Predicting a readability score or publication year from a text
  • Sequence Labelling: Assign a label to each token in a sequence (token-level classification)
    • Named Entity Recognition (NER): tagging names of people, places, organizations
    • Part-of-Speech tagging
  • Question Answering (QA)
    • The model identifies an answer span in a text based on a given question
  • Other tasks
    • Text summarization, machine translation, paraphrase detection
      (still supervised — but outputs are more complex)

2.1.3 Idealised Supervised NLP Workflow

  1. Define the task
  2. Collect and annotate data
  3. Prepare the data
  4. Train the model
  5. Evaluate and interpret results
  6. Analyze outputs and errors
  7. (Deploy?)

2.2 Main Approaches in Modern NLP

2.2.1 Representation + standard ML

Approach

Use precomputed text embeddings as numerical representations of texts, then train a standard ML model on top.

  1. Embed text
    • BoW, TF-IDF, Word2Vec, Sentence-BERT, …
  2. Train a standard ML model on the labelled embeddings
    • Logistic Regression, SVM, Random Forest, MLP, …
  3. Predict in the embedding space
    • new document \(\to\) frozen embedding model \(\to\) trained ML model \(\to\) prediction
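The three steps above can be sketched with scikit-learn; the corpus, labels, and query document below are toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled corpus: 1 = positive review, 0 = negative review
texts = ["a wonderful, moving novel", "dull and badly written",
         "an absolute delight to read", "tedious plot, flat characters"]
labels = [1, 0, 1, 0]

# Steps 1-2: embed texts (TF-IDF) and train a standard ML model on top
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Step 3: a new document goes through the frozen embedding step,
# then the trained classifier
print(clf.predict(["a moving and wonderful story"])[0])
```

Once fitted, the vectorizer is not updated again: new documents are simply projected into the same embedding space before classification.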

2.2.1.1 Rep+ML Overview

2.2.2 Supervised Fine-tuning

Approach

Adapt a pretrained transformer model end-to-end to a specific downstream task by updating its parameters through training on labeled data.

  1. Choose a pretrained model & add a task-specific head
    • pre-trained LM already knows language structure (remember LM lecture)
    • add a task-specific (classification, regression, …) head on top of LM
  2. Fine-tune the model
    • update the model’s weights during training thanks to the labeled data
  3. Predict directly from text
    • text \(\to\) fine-tuned LM \(\to\) prediction
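A minimal NumPy sketch of the head-plus-training-loop idea, with the pretrained encoder stood in by fixed random vectors (all data here is hypothetical); real fine-tuning would also backpropagate into the transformer’s own weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder's sentence vectors (hypothetical);
# in practice these would come from a transformer such as BERT.
X = rng.normal(size=(8, 4))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X[y == 1] += 1.5  # make the two classes separable

# Task-specific classification head: one linear layer + sigmoid
w, b = np.zeros(4), 0.0
for _ in range(500):                    # training loop
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
    grad_w = X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss
    grad_b = (p - y).mean()
    w -= 0.5 * grad_w                   # nudge weights towards lower loss
    b -= 0.5 * grad_b

pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print((pred == y).mean())  # training accuracy
```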

2.2.2.1 SFT Overview

Two steps of BERT development (extracted and modified from (Alammar 2018)).

2.3 Summary

2.3.1 Takeaways

Representation + ML › Fine-Tuning
Freeze embeddings, train small model › Adapt all model weights
Fast, lightweight › Higher compute cost
Works with few labels › Needs more data
Easier to interpret & reuse › Task-specific, less interpretable
Strong for exploratory DH tasks › Best for high-performance NLP

Key point

Representation + ML: modular, interpretable, resource-light

Fine-Tuning: powerful, task-adaptive, but resource-intensive

Keep in mind

  • ‘Ground-truth’ might not be absolute
  • Model choice reflects task and data
  • Tune with intention
  • Interpret critically

3 Classification

3.1 What is a classification task?

3.1.1 Description

Goal

Assign each input text to one of a set of predefined categories or classes.

3.1.2 Overview

3.2 The Confusion Matrix & Performance Metrics

3.2.1 Confusion Matrix

Definition

Matrix where each cell counts examples by how the predictions compare to the ground truth.

\(\to\) Used to compute different performance metrics
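For a binary task, the four cells of the matrix can be counted directly from predictions (the labels below are hypothetical; 1 = positive class):

```python
from collections import Counter

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0]  # hypothetical model predictions

# Count each (truth, prediction) cell of the 2x2 matrix
cells = Counter(zip(y_true, y_pred))
TP, FN = cells[(1, 1)], cells[(1, 0)]
FP, TN = cells[(0, 1)], cells[(0, 0)]
print(TP, FN, FP, TN)  # → 2 1 2 4
```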

3.2.2 Performance Metrics

Metric Formula Interpretation
Accuracy \(\frac{TP + TN}{TP + TN + FP + FN}\) › Overall proportion of correct predictions
Positive Predictive Value \(\frac{TP}{TP + FP}\) › How accurate the positive predictions are (=precision)
True Positive Rate \(\frac{TP}{TP + FN}\) › Coverage of actual positive samples (=recall)
F1-score \(\frac{2\times\mathrm{PPV}\times\mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}}\) › Harmonic mean of precision and recall
False Positive Rate \(\frac{FP}{FP+TN}\) › Proportion of actual negatives incorrectly predicted as positive
Predicted Positive Rate \(\frac{TP+FP}{TP+TN+FP+FN}\) › Proportion of samples predicted as positive

3.2.3 Quick Example

With TP = 2, FN = 1, FP = 2, TN = 4:

  • Accuracy
    • (2+4)/(2+1+2+4) = 6/9 \(\approx\) 0.67
  • Precision
    • 2/(2+2) = 2/4 = 0.5
  • Recall
    • 2/(2+1) = 2/3 \(\approx\) 0.67
  • F1-score
    • \(\frac{2\times\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} = \frac{2\times 0.5\times 2/3}{0.5 + 2/3} \approx 0.57\)
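The same arithmetic in Python, starting from the counts above:

```python
# Confusion matrix counts from the example: TP=2, FN=1, FP=2, TN=4
TP, FN, FP, TN = 2, 1, 2, 4

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)  # Positive Predictive Value
recall = TP / (TP + FN)     # True Positive Rate
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.67 0.5 0.67 0.57
```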

3.3 Is that enough?

3.3.1 Some examples…

… reinforcing structural discrimination, through:

(a) Number of days with targeted policing for drug crimes in areas flagged by PredPol analysis of Oakland police data. (b) Targeted policing for drug crimes, by race. (c) Estimated drug use by race (extracted from (Lum and Isaac 2016)).

CAF’s suspicion score (extracted from (QdN 2023)).

3.4 Fairness Metrics

3.4.1 Conceptual Overview

Three core concepts (Barocas, Hardt, and Narayanan 2023):

  1. Independence — predictions are statistically independent of sensitive attributes
    • e.g., same positive rate for all groups
  2. Separation — error rates are similar across groups, conditional on the true label
    • e.g., same false-positive / false-negative rates for each group
  3. Sufficiency — predictions are equally reliable across groups
    • e.g., for a given predicted probability, the actual correctness is the same

3.4.2 Underlying idea

Fairness across groups

The model should perform similarly across groups.

\(\to\) compute performance metrics for different groups:
let \(\mathrm{p}\) refer to the privileged group and \(\mathrm{n}\) to the non-privileged group:

\(m_\mathrm{p}=\mathrm{score}(\mathrm{TP}_\mathrm{p},\mathrm{FP}_\mathrm{p},\mathrm{TN}_\mathrm{p},\mathrm{FN}_\mathrm{p}),\quad m_\mathrm{n}=\mathrm{score}(\mathrm{TP}_\mathrm{n},\mathrm{FP}_\mathrm{n},\mathrm{TN}_\mathrm{n},\mathrm{FN}_\mathrm{n})\)

\(\to\) compare them with a ratio. A “fair” model should ensure (rule of thumb: \(\epsilon\approx0.8\)):

\[ \epsilon \leq \frac{m_\mathrm{n}}{m_\mathrm{p}} \leq 1/\epsilon\]
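A sketch of this per-group comparison (all counts are hypothetical), using the True Positive Rate as the score:

```python
# Hypothetical per-group confusion-matrix counts
groups = {"p": dict(TP=40, FP=10, TN=40, FN=10),   # privileged group
          "n": dict(TP=20, FP=5,  TN=60, FN=15)}   # non-privileged group

def tpr(c):
    """True Positive Rate computed from one group's counts."""
    return c["TP"] / (c["TP"] + c["FN"])

m_p, m_n = tpr(groups["p"]), tpr(groups["n"])
ratio = m_n / m_p
eps = 0.8  # rule-of-thumb threshold
print(round(ratio, 2), eps <= ratio <= 1 / eps)  # → 0.71 False
```

Here the non-privileged group’s recall is only about 71% of the privileged group’s, below the 0.8 rule of thumb, so the model would be flagged as unfair on this criterion.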

3.4.3 Practical Measures of Fairness

Name › Metric › Concept › Interpretation (Wiśniewski and Biecek 2022)
Statistical Parity › PPR › Independence (equivalent) › whether both groups have equal likelihood of being predicted as positive
Equal Opportunity › TPR › Separation (relaxation) › likelihood of correctly recognising a positive is equal regardless of group
Predictive Equality › FPR › Separation (relaxation) › likelihood of incorrectly classifying a negative as positive is equal regardless of group
Predictive Parity › PPV › Sufficiency (relaxation) › whether positive predictions are equally reliable across groups
Accuracy Difference › ACC › — › whether the model performance is consistent across groups

[💡Reminder: based on per-group performance]

Metric Formula Interpretation
Predicted Positive Rate (PPR) \(\frac{TP+FP}{TP+TN+FP+FN}\) › Proportion of samples predicted as positive
True Positive Rate (TPR) \(\frac{TP}{TP + FN}\) › Coverage of actual positive samples (=recall)
False Positive Rate (FPR) \(\frac{FP}{FP+TN}\) › Proportion of actual negatives incorrectly predicted as positive
Positive Predictive Value (PPV) \(\frac{TP}{TP + FP}\) › How accurate the positive predictions are (=precision)
Accuracy (ACC) \(\frac{TP + TN}{TP + TN + FP + FN}\) › Overall proportion of correct predictions

3.5 Key Message

  • Strong Model \(\nRightarrow\) Fair Model
  • (Fair Model \(\nRightarrow\) Strong Model)

4 Hands-on

4.1 Hands-On Challenge

References

Alammar, Jay. 2018. “The Illustrated Bert, Elmo, and Co. (How NLP Cracked Transfer Learning).” Visualizing Machine Learning One Concept at a Time. https://jalammar.github.io/illustrated-bert/.
Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Barocas, Solon, Moritz Hardt, and Arvind Narayanan. 2023. Fairness and Machine Learning: Limitations and Opportunities. MIT Press. https://fairmlbook.org.
Dastin, Jeffrey. 2022. “Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women.” In Ethics of Data and Analytics, 296–99. Auerbach Publications.
Lum, Kristian, and William Isaac. 2016. “To Predict and Serve?” Significance 13 (5): 14–19. https://doi.org/10.1111/j.1740-9713.2016.00960.x.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.
QdN. 2023. “Scoring of Welfare Beneficiaries: The Indecency of CAF’s Algorithm Now Undeniable.” La Quadrature Du Net. https://www.laquadrature.net/en/2023/11/27/scoring-of-welfare-beneficiaries-the-indecency-of-cafs-algorithm-now-undeniable/.
Rotaru, Victor, Yi Huang, Timmy Li, James Evans, and Ishanu Chattopadhyay. 2022. “Event-Level Prediction of Urban Crime Reveals a Signature of Enforcement Bias in US Cities.” Nature Human Behaviour 6 (8): 1056–68. https://doi.org/10.1038/s41562-022-01372-0.
Telford, Taylor. 2019. “Apple Card Algorithm Sparks Gender Bias Allegations Against Goldman Sachs.” Washington Post 11.
Wiśniewski, Jakub, and Przemysław Biecek. 2022. “Fairmodels: A Flexible Tool for Bias Detection, Visualization, and Mitigation.” https://arxiv.org/abs/2104.00507.