Learning Patterns

Supervised Tasks and Adaptation

Noé Durandard (noe.durandard@psl.eu)

Introduction to NLP — MSc. DH EdC-PSL

November 11, 2025

1 Recap

1.0.1 The Language Modeling Problem

Language Model

A Language Model (LM) estimates the probability of pieces of text. Given a sequence of text \(w_{1},w_{2},\cdots,w_{S}\), it answers the question:

What is \(P(w_{1},w_{2},\cdots,w_{S})\)?

How to compute \(P\)?

  • Count-based (count): \(n\)-gram
  • Neural method (learn): Transformer
    • Masked Language Modeling (encoder-only, BERT)
    • Causal Language Modeling (decoder-only, GPT)

\(\to\) LMs encode:

  • grammaticality
  • semantic plausibility
  • stylistic consistency
  • knowledge (?)

1.0.2 Document Representation & Unsupervised Tasks

2 Supervised Learning

2.1 Introduction

2.1.1 Supervised Learning

Goal

Learn patterns that map inputs (texts, sentences, etc.) to outputs (labels, categories, values) based on annotated examples.

What do you need?

  • Training data: documents annotated with labels (‘ground-truth’)
  • A model that can make predictions
  • A loss function that measures the model’s performance
  • An optimisation process that nudges the model towards lower loss

\(\to\) The model will ‘learn’ from examples (and corrections), improving its predictions through training.

2.1.2 Supervised Learning NLP Tasks

  • Classification: Assign a label to an entire text or sentence
    • Sentiment analysis (positive / negative reviews)
    • Topic classification (political / cultural / literary texts)
    • Spam detection, author identification, genre prediction
  • Regression: Predict a continuous value
    • Predicting a readability score or publication year from a text
  • Sequence Labelling: Assign a label to each token in a sequence (token-level classification)
    • Named Entity Recognition (NER): tagging names of people, places, organizations
    • Part-of-Speech tagging
  • Question Answering (QA)
    • The model identifies an answer span in a text based on a given question
  • Other tasks
    • Text summarization, machine translation, paraphrase detection
      (still supervised — but outputs are more complex)

2.1.3 Idealised Supervised NLP Workflow

  1. Define the task
  2. Collect and annotate data
  3. Prepare the data
  4. Train the model
  5. Evaluate and interpret results
  6. Analyze outputs and errors
  7. (Deploy?)

2.2 Main Approaches in Modern NLP

2.2.1 Representation + standard ML

Approach

Use precomputed text embeddings as numerical representations of texts, then train a standard ML model on top.

  1. Embed text
    • BoW, TF-IDF, Word2Vec, Sentence-BERT, …
  2. Train a standard ML model on the labelled embeddings
    • Logistic Regression, SVM, Random Forest, MLP, …
  3. Predict in the embedding space
    • new document \(\to\) frozen embedding model \(\to\) trained ML model \(\to\) prediction
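The three steps above can be sketched with scikit-learn; the corpus, labels, and query document below are toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled corpus: 1 = positive review, 0 = negative review
texts = ["a wonderful, moving novel", "dull and badly written",
         "an absolute delight to read", "tedious plot, flat characters"]
labels = [1, 0, 1, 0]

# Steps 1-2: embed texts (TF-IDF) and train a standard ML model on top
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Step 3: a new document goes through the frozen embedding step,
# then the trained classifier
print(clf.predict(["a moving and wonderful story"])[0])
```

Once fitted, the vectorizer is not updated again: new documents are simply projected into the same embedding space before classification.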

2.2.1.1 Rep+ML Overview

2.2.2 Supervised Fine-tuning

Approach

Adapt a pretrained transformer model end-to-end to a specific downstream task by updating its parameters through training on labeled data.

  1. Choose a pretrained model & add a task-specific head
    • pre-trained LM already knows language structure (remember LM lecture)
    • add a task-specific (classification, regression, …) head on top of LM
  2. Fine-tune the model
    • update the model’s weights during training thanks to the labeled data
  3. Predict directly from text
    • text \(\to\) fine-tuned LM \(\to\) prediction
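A minimal NumPy sketch of the head-plus-training-loop idea, with the pretrained encoder stood in by fixed random vectors (all data here is hypothetical); real fine-tuning would also backpropagate into the transformer’s own weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder's sentence vectors (hypothetical);
# in practice these would come from a transformer such as BERT.
X = rng.normal(size=(8, 4))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X[y == 1] += 1.5  # make the two classes separable

# Task-specific classification head: one linear layer + sigmoid
w, b = np.zeros(4), 0.0
for _ in range(500):                    # training loop
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
    grad_w = X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss
    grad_b = (p - y).mean()
    w -= 0.5 * grad_w                   # nudge weights towards lower loss
    b -= 0.5 * grad_b

pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print((pred == y).mean())  # training accuracy
```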

2.2.2.1 SFT Overview

Two steps of BERT development (extracted and modified from (Alammar 2018)).

2.3 Summary

2.3.1 Takeaways

Representation + ML › Fine-Tuning
Freeze embeddings, train small model › Adapt all model weights
Fast, lightweight › Higher compute cost
Works with few labels › Needs more data
Easier to interpret & reuse › Task-specific, less interpretable
Strong for exploratory DH tasks › Best for high-performance NLP

Key point

Representation + ML: modular, interpretable, resource-light

Fine-Tuning: powerful, task-adaptive, but resource-intensive

Keep in mind

  • ‘Ground-truth’ might not be absolute
  • Model choice reflects task and data
  • Tune with intention
  • Interpret critically

3 Classification

3.1 What is a classification task?

3.1.1 Description

Goal

Assign each input text to one of a set of predefined categories or classes.

3.1.2 Overview

3.2 The Confusion Matrix & Performance Metrics

3.2.1 Confusion Matrix

Definition

Matrix where each cell counts examples by how the predictions compare to the ground truth.

\(\to\) Used to compute different performance metrics
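For a binary task, the four cells of the matrix can be counted directly from predictions (the labels below are hypothetical; 1 = positive class):

```python
from collections import Counter

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0]  # hypothetical model predictions

# Count each (truth, prediction) cell of the 2x2 matrix
cells = Counter(zip(y_true, y_pred))
TP, FN = cells[(1, 1)], cells[(1, 0)]
FP, TN = cells[(0, 1)], cells[(0, 0)]
print(TP, FN, FP, TN)  # → 2 1 2 4
```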

3.2.2 Performance Metrics

Metric Formula Interpretation
Accuracy \(\frac{TP + TN}{TP + TN + FP + FN}\) › Overall proportion of correct predictions
Positive Predictive Value \(\frac{TP}{TP + FP}\) › How accurate the positive predictions are (=precision)
True Positive Rate \(\frac{TP}{TP + FN}\) › Coverage of actual positive samples (=recall)
F1-score \(\frac{2\times\mathrm{PPV}\times\mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}}\) › Harmonic mean of precision and recall
False Positive Rate \(\frac{FP}{FP+TN}\) › Proportion of actual negatives incorrectly predicted as positive
Predicted Positive Rate \(\frac{TP+FP}{TP+TN+FP+FN}\) › Proportion of samples predicted as positive

3.2.3 Quick Example

With TP = 2, FN = 1, FP = 2, TN = 4:

  • Accuracy
    • (2+4)/(2+1+2+4) = 6/9 \(\approx\) 0.67
  • Precision
    • 2/(2+2) = 2/4 = 0.5
  • Recall
    • 2/(2+1) = 2/3 \(\approx\) 0.67
  • F1-score
    • \(\frac{2\times\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} = \frac{2\times 0.5\times 2/3}{0.5 + 2/3} \approx 0.57\)
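The same arithmetic in Python, starting from the counts above:

```python
# Confusion matrix counts from the example: TP=2, FN=1, FP=2, TN=4
TP, FN, FP, TN = 2, 1, 2, 4

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)  # Positive Predictive Value
recall = TP / (TP + FN)     # True Positive Rate
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.67 0.5 0.67 0.57
```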

3.3 Is that enough?

3.3.1 Some examples…

… reinforcing structural discrimination, through:

(a) Number of days with targeted policing for drug crimes in areas flagged by PredPol analysis of Oakland police data. (b) Targeted policing for drug crimes, by race. (c) Estimated drug use by race (extracted from (Lum and Isaac 2016)).

CAF’s suspicion score (extracted from (QdN 2023)).

3.4 Fairness Metrics

3.4.1 Conceptual Overview

Three core concepts (Barocas, Hardt, and Narayanan 2023):

  1. Independence — predictions are statistically independent of sensitive attributes
    • e.g., same positive rate for all groups
  2. Separation — error rates are similar across groups, conditional on the true label
    • e.g., same false-positive / false-negative rates for each group
  3. Sufficiency — predictions are equally reliable across groups
    • e.g., for a given predicted probability, the actual correctness is the same

3.4.2 Underlying idea

Fairness across groups

The model should perform similarly across groups.

\(\to\) compute performance metrics for different groups:
let \(\mathrm{p}\) refer to the privileged group and \(\mathrm{n}\) to the non-privileged group:

\(m_\mathrm{p}=\mathrm{score}(\mathrm{TP}_\mathrm{p},\mathrm{FP}_\mathrm{p},\mathrm{TN}_\mathrm{p},\mathrm{FN}_\mathrm{p}),\quad m_\mathrm{n}=\mathrm{score}(\mathrm{TP}_\mathrm{n},\mathrm{FP}_\mathrm{n},\mathrm{TN}_\mathrm{n},\mathrm{FN}_\mathrm{n})\)

\(\to\) compare them with a ratio. A “fair” model should ensure (rule of thumb: \(\epsilon\approx0.8\)):

\[ \epsilon \leq \frac{m_\mathrm{n}}{m_\mathrm{p}} \leq 1/\epsilon\]
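A sketch of this per-group comparison (all counts are hypothetical), using the True Positive Rate as the score:

```python
# Hypothetical per-group confusion-matrix counts
groups = {"p": dict(TP=40, FP=10, TN=40, FN=10),   # privileged group
          "n": dict(TP=20, FP=5,  TN=60, FN=15)}   # non-privileged group

def tpr(c):
    """True Positive Rate computed from one group's counts."""
    return c["TP"] / (c["TP"] + c["FN"])

m_p, m_n = tpr(groups["p"]), tpr(groups["n"])
ratio = m_n / m_p
eps = 0.8  # rule-of-thumb threshold
print(round(ratio, 2), eps <= ratio <= 1 / eps)  # → 0.71 False
```

Here the non-privileged group’s recall is only about 71% of the privileged group’s, below the 0.8 rule of thumb, so the model would be flagged as unfair on this criterion.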

3.4.3 Practical Measures of Fairness

Name › Metric › Concept › Interpretation (Wiśniewski and Biecek 2022)
Statistical Parity › PPR › Independence (equivalent) › whether both groups have equal likelihood of being predicted as positive
Equal Opportunity › TPR › Separation (relaxation) › likelihood of correctly recognising a positive is equal regardless of group
Predictive Equality › FPR › Separation (relaxation) › likelihood of incorrectly classifying a negative as positive is equal regardless of group
Predictive Parity › PPV › Sufficiency (relaxation) › whether positive predictions are equally reliable across groups
Accuracy Difference › ACC › — › whether the model performance is consistent across groups

[💡Reminder: based on per-group performance]

Metric Formula Interpretation
Predicted Positive Rate (PPR) \(\frac{TP+FP}{TP+TN+FP+FN}\) › Proportion of samples predicted as positive
True Positive Rate (TPR) \(\frac{TP}{TP + FN}\) › Coverage of actual positive samples (=recall)
False Positive Rate (FPR) \(\frac{FP}{FP+TN}\) › Proportion of actual negatives incorrectly predicted as positive
Positive Predictive Value (PPV) \(\frac{TP}{TP + FP}\) › How accurate the positive predictions are (=precision)
Accuracy (ACC) \(\frac{TP + TN}{TP + TN + FP + FN}\) › Overall proportion of correct predictions

3.5 Key Message

  • Strong Model \(\nRightarrow\) Fair Model
  • (Fair Model \(\nRightarrow\) Strong Model)

4 Hands-on

4.1 Hands-On Challenge

References

Alammar, Jay. 2018. “The Illustrated Bert, Elmo, and Co. (How NLP Cracked Transfer Learning).” Visualizing Machine Learning One Concept at a Time. https://jalammar.github.io/illustrated-bert/.
Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Barocas, Solon, Moritz Hardt, and Arvind Narayanan. 2023. Fairness and Machine Learning: Limitations and Opportunities. MIT Press. https://fairmlbook.org.
Dastin, Jeffrey. 2022. “Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women.” In Ethics of Data and Analytics, 296–99. Auerbach Publications.
Lum, Kristian, and William Isaac. 2016. “To Predict and Serve?” Significance 13 (5): 14–19. https://doi.org/10.1111/j.1740-9713.2016.00960.x.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.
QdN. 2023. “Scoring of Welfare Beneficiaries: The Indecency of CAF’s Algorithm Now Undeniable.” La Quadrature Du Net. https://www.laquadrature.net/en/2023/11/27/scoring-of-welfare-beneficiaries-the-indecency-of-cafs-algorithm-now-undeniable/.
Rotaru, Victor, Yi Huang, Timmy Li, James Evans, and Ishanu Chattopadhyay. 2022. “Event-Level Prediction of Urban Crime Reveals a Signature of Enforcement Bias in US Cities.” Nature Human Behaviour 6 (8): 1056–68. https://doi.org/10.1038/s41562-022-01372-0.
Telford, Taylor. 2019. “Apple Card Algorithm Sparks Gender Bias Allegations Against Goldman Sachs.” Washington Post 11.
Wiśniewski, Jakub, and Przemysław Biecek. 2022. “Fairmodels: A Flexible Tool for Bias Detection, Visualization, and Mitigation.” https://arxiv.org/abs/2104.00507.