Large Language Models

LLMs: Foundations, Generation & Beyond

Noé Durandard (noe.durandard@psl.eu)

Introduction to NLP — MSc. DH EdC-PSL

November 18, 2025

…in the real world

0.1 Evidence of ChatGPT’s impact

LLMs influence the way we communicate (Yakura et al. 2024; Geng et al. 2024; Anderson, Shah, and Kreminski 2024).

Prevalence of words associated with ChatGPT. Extracted from (Yakura et al. 2024).

0.2 Influence besides vocabulary

  • LLMs influence users and participate in shaping opinion

Participants interacting with a model supportive of social media were more likely to say that social media is good for society in a later survey (and vice versa) — from Jakesch et al. (2023).

0.3 Propagation of models’ biases

Language is loaded with sociocultural characteristics (Jiang 2000).

\(\to\) LLMs biases can have harmful consequences, and spread in downstream applications:

1 Framing Large Language Models

1.0.1 Definition(?)

Large Language Model - from (Merriam-Webster 2025)

A language model that utilizes deep [learning] methods on an extremely large data set as a basis for predicting and constructing natural-sounding text.

1.1 Large

1.1.1 What is a *Large* Language Model?

Extracted from (Thompson 2025).

1.1.2 Scaling Laws & The Bitter Lesson

The Bitter Lesson (Sutton 2019)

“general methods that leverage computation are ultimately the most effective, and by a large margin”…

  • Knowledge integration helps improve models in the short term (rewarding to researchers, but plateaus and can even inhibit further progress)
  • Breakthrough progress eventually arrives from an opposing approach: scaling computation through learning and search

Language modeling performance improves smoothly as the model size, dataset size, and amount of compute used for training increase (extracted from (Kaplan et al. 2020)).

1.2 Causal

1.2.1 Autoregressive

Definition

An autoregressive (AR) model is a representation of a type of stochastic process. It specifies that the output variable (or current state) depends linearly on its own \(p\) previous values: \(X_{t}=\sum _{i=1}^{p}\varphi _{i}X_{t-i}+\varepsilon _{t}\).
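The classical linear definition above can be illustrated in a few lines. A minimal sketch (the function name `simulate_ar` and the toy coefficients are illustrative, not from the slides):

```python
import random

random.seed(0)

def simulate_ar(phi, n_steps, sigma=1.0):
    """Simulate X_t = sum_i phi_i * X_{t-i} + eps_t with Gaussian noise eps_t."""
    p = len(phi)
    x = [0.0] * p                                      # p zero-valued initial states
    for _ in range(n_steps):
        eps = random.gauss(0.0, sigma)                 # innovation term
        x.append(sum(c * x[-i - 1] for i, c in enumerate(phi)) + eps)
    return x[p:]

series = simulate_ar([0.8], n_steps=200)               # AR(1) with phi_1 = 0.8
```

Each new value is a linear combination of the previous \(p\) values plus noise; in NLP the same "predict the next element from the past" structure remains, but the linear form is replaced by a neural network.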

In NLP

In NLP, AR models generally refer to next-token prediction models, and the linear assumption can be disregarded.

\(n\)-grams \(\in\) LLMs ??

1.2.2 Decoder-only

  • Only a decoder stack (no encoder)
  • Causal / masked self-attention
    • Each token can only attend to previous tokens
  • Naturally suited for generation
    • Trained for next token prediction
    • Produce text left-to-right, one token at a time

Transformer Architecture, based on Vaswani (2017), reworked from (Zhang et al. 2023).

1.3 Language Models

1.3.1 The Language Modeling Problem (again)

Language Model

A Language Model (LM) estimates the probability of pieces of text. Given a sequence of text \(w_{1},w_{2},\cdots,w_{S}\), it answers the question:

What is \(P(w_{1},w_{2},\cdots,w_{S})\)?
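By the chain rule, \(P(w_{1},\cdots,w_{S})=\prod_{t} P(w_{t}\mid w_{1},\cdots,w_{t-1})\). A toy count-based bigram sketch of this factorization (the three-sentence corpus and the helper `sentence_prob` are illustrative assumptions):

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]   # toy corpus (assumption)
bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split()                      # <s> marks sentence start
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def sentence_prob(sentence):
    """P(w_1..w_S) = prod_t P(w_t | w_{t-1}), with MLE bigram estimates."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

sentence_prob("the cat sat")  # = P(the|<s>) * P(cat|the) * P(sat|cat)
```

Neural LMs replace the count-based conditional estimates with a learned model, but estimate the same factorized probabilities.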

How to compute \(P\)?

  • Count-based (count): n-gram
  • Neural method (learn): Transformer
    • Masked Language Modeling (encoder-only, BERT)
    • Causal Language Modeling (decoder-only, GPT)

\(\to\) encode:

  • grammaticality
  • semantic plausibility
  • stylistic consistency
  • knowledge (?)

  • Large (? enough)
  • Autoregressive / Causal (→ decoder-only)
  • Language Models (parameters + data)

… yes, but only more or less.

2 Training

2.1 LLMs training workflow

Overview of LLMs Training Pipeline. Extracted from (Wolfe 2023), based on (Ouyang et al. 2022).

2.2 Pre-training

2.2.1 Next token prediction task

\[P(x_{i,t+1} | x_{i,1},\cdots,x_{i,t})\]

Example generated with allenai/OLMo-2-0425-1B (e.g., training).

Example generated with allenai/OLMo-2-0425-1B (e.g., generation).

2.2.2 Self-Supervision

Self-Supervised Learning (Wikipedia 2025)

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels.
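For next-token prediction, the supervisory signal really is just the data shifted by one position. A minimal sketch (the function name `make_next_token_pairs` and the toy token ids are illustrative):

```python
def make_next_token_pairs(token_ids):
    """Build (context, target) training pairs from a raw token sequence:
    the 'label' for position t is simply the token at position t+1."""
    return [(token_ids[: t + 1], token_ids[t + 1])
            for t in range(len(token_ids) - 1)]

ids = [12, 7, 4, 9]                    # pretend token ids (assumption)
pairs = make_next_token_pairs(ids)
# pairs[0] == ([12], 7); pairs[-1] == ([12, 7, 4], 9)
```

No external labels are needed: every position in every document provides one training example.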

2.2.3 Training Objective: Cross-Entropy

\(\mathcal{L}[\phi] = - \sum_{i=1}^{I}\sum_{t=1}^{T}\log\left[ P(x_{i,t+1} | x_{i,1},\cdots,x_{i,t},\phi) \right]\)


  • Penalizes confident wrong predictions heavily
    • Low probability to ‘true’ next token \(\Rightarrow\) high loss: \(\displaystyle{\lim_{p_i \to 0}}\left(-\log(p_i)\right)=\infty\)
  • Rewards accurate predictions
    • High probability to ‘true’ next token \(\Rightarrow\) low loss: \(\displaystyle{\lim_{p_i \to 1}}\left(-\log(p_i)\right)=0\)
  • Maintains Differentiability
    • smooth mathematical properties allow effective gradient-based optimization
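The limiting behaviors above can be made concrete in a few lines; the helper name `token_nll` and the toy probabilities are illustrative:

```python
import math

def token_nll(p_true):
    """Per-token loss: negative log of the probability assigned to the true next token."""
    return -math.log(p_true)

token_nll(0.99)   # confident and correct: near-zero loss
token_nll(0.01)   # confident but wrong: large loss

# Sequence loss = sum of per-token NLLs (the double sum in the objective above)
probs_on_true = [0.9, 0.8, 0.05]  # toy per-step probabilities on the true tokens
loss = sum(token_nll(p) for p in probs_on_true)
```

A single badly mispredicted token (here 0.05) dominates the sequence loss, which is exactly the "penalizes confident wrong predictions heavily" property.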

2.3 Data

2.3.1 Data Sources

Pre-training: learning from the distribution of texts on the internet.

  • Primarily large-scale web crawls
    • Common Crawl, C4, CCNet, Dolma, RefinedWeb, RedPajama
  • Curated “high-quality” reference works
    • Wikipedia, books, textbooks, scientific papers
  • Code repositories
    • GitHub, StackOverflow

The distribution of data types in pre-training corpora used by different LLMs. Extracted from Yang Liu et al. (2025).

2.3.2 Data Filtering

Is that so?

python train_my_llm.py --dataset=extremely_large_dataset

Not that simple…

Overview of web data processing pipeline for Dolma (Ai2). Extracted from (Soldaini 2023).

2.3.3 Quality data?

“Quality” filtering choices…

… have consequences!

  • Marginalization of low-resource languages, dialects, sociolects
  • Overrepresentation of techno-scientific, Western, elite knowledge
  • “Highbrow” epistemic bias: formal, rational, encyclopedic styles favored
  • Missing registers: oral, vernacular, informal, multilingual mixing, community languages
  • Potentially distorted representations of social groups or cultural practices

2.4 Post(-pre)-training

2.4.1 Supervised (Instruction) Fine-Tuning

Goal: Teach model to follow instructions.

  • Supervised setting
    • instructions & responses
  • Uses curated input → output examples
    • e.g., “Summarize…”, “Translate…”, “Explain like I’m 5…”

Task categories present in Super-NaturalInstructions dataset (Wang et al. 2022) (extracted from original article).

2.4.2 RL (from Human Feedback)

Goal: Align model with desired behaviors (see e.g., (Bai et al. 2022; Ouyang et al. 2022)).

Extracted from (Lambert et al. 2022).

3 Inference → Tutorial

3.0.1 Use-cases

LLMs Capabilities. Extracted from (Minaee et al. 2025).

3.1 Prompt Structure

3.1.1 Elements of a Prompt

Natural Language Interaction:

  • System prompt
    • set of instructions, guidelines, and contextual information provided to AI models before they engage with user queries.
  • User prompt
    • actual message of the user
      • Instruction: definition of the task
      • Examples: provide examples (as in few-shot)
      • Context: provide additional context to address the problem
      • Question: prompt the model to solve the task

3.1.2 Example: Under the Hood

Example for OLMo-2-Instruct:

<|endoftext|><|system|>
You are a helpful assistant.
<|user|>
Who are you?
<|assistant|>

Example for Qwen2.5-Instruct:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant

Example for Llama-3.1-Instruct:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you? <|eot_id|><|start_header_id|>assistant<|end_header_id|>

messages = [ # contains chat history as a list of dict where each instance has:
    {
      "role": "system",                          # role (*system*|user|assistant)
      "content": "You are a helpful assistant."  # content of the message
    },
    {
      "role": "user",                            # role (system|*user*|assistant)
      "content": "Who are you?"                  # content of the message
    },
]
inputs = tokenizer.apply_chat_template( # Apply chat template + tokenize
    messages,                             # provide chat history
    add_generation_prompt=True,           # add generation prompt (assistant:...)
    tokenize=True,                        # do tokenize the input
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

OLMo-2-Instruct style:

<|endoftext|><|system|>
You are a helpful assistant.
<|user|>
Who are you?
<|assistant|>
I am an AI assistant.
<|user|>
What can you do for me?
<|assistant|>
I can do all sorts of things.
<|user|>
What exactly?
<|assistant|>

messages = [ # contains chat history as a list of dict where each instance has:
    {
      "role": "system",                          # role (*system*|user|assistant)
      "content": "You are a helpful assistant."  # content of the message
    },
    {
      "role": "user",                            # role (system|*user*|assistant)
      "content": "Who are you?"                  # content of the message
    },
    {
      "role": "assistant",                       # role (system|user|*assistant*)
      "content": "I am an AI assistant."         # content of the message
    },
    {
      "role": "user",                            # role (system|*user*|assistant)
      "content": "What can you do for me?"       # content of the message
    },
    {
      "role": "assistant",                       # role (system|user|*assistant*)
      "content": "I can do all sorts of things." # content of the message
    },
    {
      "role": "user",                            # role (system|*user*|assistant)
      "content": "What exactly?"                 # content of the message
    },
    # and so on ...
]
inputs = tokenizer.apply_chat_template( # Apply chat template + tokenize
    messages,                             # provide chat history
    add_generation_prompt=True,           # add generation prompt (assistant:...)
    tokenize=True,                        # do tokenize the input
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

3.1.3 Nice visualization here (Cho et al. 2024):

3.2 Sampling

3.2.1 Next-token probability distribution

\[P_i=\frac{e^{\frac{z_i}{T}}}{\displaystyle\sum_{j=1}^{N_\mathrm{voc}}e^{\frac{z_j}{T}}}\]
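The temperature-scaled softmax above can be sketched directly (pure-Python, with the standard max-subtraction trick for numerical stability; the toy logits are an assumption):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """P_i = exp(z_i / T) / sum_j exp(z_j / T), stabilized by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)                              # subtract max before exponentiating
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # toy next-token logits (assumption)
softmax_with_temperature(logits, T=0.5)          # sharper: mass concentrates on argmax
softmax_with_temperature(logits, T=2.0)          # flatter: closer to uniform
```

Lowering \(T\) sharpens the distribution (more deterministic sampling); raising it flattens the distribution (more diverse sampling).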

3.2.2 Some Other Generation Parameters

  • temperature
    • Controls “creativity” vs. determinism.
      • ↓ low (0–0.7): predictable, focused, factual
      • ↑ high (1.0+): diverse, surprising, risk of incoherence
  • top-k
    • Keep only the k most likely next tokens.
      • small k → conservative, avoids rare tokens
      • large k → more expressive, less stable
  • top-p (nucleus sampling)
    • Keep the smallest set of tokens whose cumulative probability ≥ p. 
      • p≈0.0: only most likely token
      • p≈0.9: diverse but meaningful
      • p≈1.0: no filtering
  • length_penalty
    • Encourages longer or shorter outputs.
      • \(>\) 1: longer answers
      • \(<\) 1: concise answers
  • repetition_penalty
    • Penalizes reusing tokens/sequences.
      • \(>\) 1.0 reduces loops and obsessive repetition
  • no_repeat_ngram_size
    • Hard constraint: forbids repeating n-grams of size n.
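Top-k and top-p truncation can be sketched as simple filters over the next-token distribution (function names and toy probabilities are illustrative, not any library's API):

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize; zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set with cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution (assumption)
top_k_filter(probs, k=2)        # ≈ [0.625, 0.375, 0, 0]
top_p_filter(probs, p=0.9)      # keeps the first three tokens (0.5 + 0.3 + 0.15 >= 0.9)
```

In practice one samples from the filtered, renormalized distribution rather than reading it off directly.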

4 Evaluation

4.1 Challenges

4.1.1 What’s different?

LLMs are stochastic generative models (~intrinsic evaluation),

  • natural language quality (fluency, coherence, relevance)
  • correctness / factuality
  • robustness
  • safety / toxicity

and general-purpose models (~extrinsic evaluation) (Radford et al. 2019).

  • general knowledge
  • task-specific abilities
  • calibration & reliability
  • societal impact (bias, fairness, harms, …)

\(\to\) no single metric can conveniently evaluate LLMs capabilities and risks (Liang et al. 2023; Bommasani et al. 2022)!

4.1.2 What to evaluate?

LLM evaluation is typically organised around (following and adapted from (Guo et al. 2023; Chang et al. 2024)):

  • NLP Abilities
    • NLU, NLG, fluency, coherence, grammaticality, …
  • (Specific) Knowledge and Abilities
    • QA, reasoning, classification, …
    • Domain specific, e.g.: education, finance, health, …
  • Safety and Social Impact
    • misinformation, bias, representational harms, disparate treatment, …

Proposed taxonomy of major categories and sub-categories of LLM evaluation, extracted from (Guo et al. 2023).

4.1.3 Evaluation Approaches (overview)

  • Automated
    • metric-based (perplexity, accuracy, BLEU, etc.)
    • scalable, reproducible
    • limited for open-ended tasks
  • Human
    • comparative assessment
    • qualitative insights
    • “vibe check”
  • Benchmark-Based
    • standardized datasets & tasks
    • often MCQA or constrained formats
    • useful for comparison but may not generalize
  • Adversarial
    • robustness against prompts, attacks, edge cases
    • surfaces safety issues and model brittleness

4.1.4 Leaderboard Example (Fourrier et al. 2024)

4.2 Machine Behavior

4.2.1 Machine Behavior (Rahwan et al. 2019)

“Machine Behavior is concerned with the scientific study of intelligent machines, not as engineering artefacts, but as a class of actors with particular behavioural patterns and ecology. This field overlaps with, but is distinct from, computer science and robotics. It treats machine behaviour empirically. This is akin to how ethology and behavioural ecology study animal behaviour […]”

See also Machine Psychology and related literature, e.g., (Bommasani et al. 2022; Binz and Schulz 2023; Hagendorff et al. 2024; Ye et al. 2025).

4.2.2 Goals

  • Systematic assessment of behaviors
  • Studying ML-systems as a “class of actors with particular behavioural patterns and ecology” / socio-cultural artefacts
    • Which kinds of (human-relevant) capabilities are present?
    • What do they tell about the models?
  • Using experimental methods inspired from behavioral psychology (Ye et al. 2025) and HSS (Bommasani et al. 2022)

4.2.3 [Side note:] Relation to bias?

Bias (definition attempt)

A systematic tendency in model outputs or behaviors that produces, reflects, or amplifies disparate, unfair, or harmful treatment of individuals or social groups (Barocas and Selbst 2016). This encompasses a broad range of discriminatory patterns, inter alia: allocational harms, representational harms, stereotyping, unequal system performance, or questionable correlations (Bender et al. 2021; Blodgett et al. 2020).

Caution

This is a definition attempt. In NLP, the notion of “bias” is often used as an (ill-defined) umbrella term covering diverse ranges of “biased” behaviors. Blodgett et al. (2020) call for more rigorous contextual and operational definitions.

4.2.4 Examples: MCQ-based Evaluation

  • Political Leaning
    • Political Compass Test (PCT)*
  • Cultural norms and values
    • World Values Survey (WVS)*
    • Global Attitude Survey
  • Moral values and personality-like traits
    • Moral Foundations Questionnaire (MFQ)
    • Myers-Briggs Type Indicator
    • Big-5*

Sex outside marriage is usually immoral.

When jobs are scarce, employers should give priority to people of this country over immigrants.

You make friends easily.

*: Example shown.
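A common MCQ protocol scores each answer option by the likelihood the model assigns to it and picks the argmax. A minimal sketch with a stand-in scorer (the helper `pick_mcq_answer`, the toy scores, and the lambda are illustrative assumptions, not any cited study's setup):

```python
def pick_mcq_answer(question, options, score_fn):
    """Score each 'question + option' continuation and return the best option.
    `score_fn` stands in for a model's log-likelihood of the continuation."""
    scored = {opt: score_fn(question + " " + opt) for opt in options}
    return max(scored, key=scored.get)

# Stand-in scores; a real setup would sum the LM's log-probs over the option tokens.
toy_scores = {
    "Sex outside marriage is usually immoral. Agree": -5.2,
    "Sex outside marriage is usually immoral. Disagree": -3.1,
}
answer = pick_mcq_answer("Sex outside marriage is usually immoral.",
                         ["Agree", "Disagree"],
                         lambda text: toy_scores[text])
# answer == "Disagree" (the higher stand-in log-likelihood)
```

Note that such forced-choice scoring is itself contested: Röttger et al. (2024) show MCQ answers can diverge from the model's behavior in open-ended settings.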

4.2.5 Examples: MCQ-based Results

  • Political Leaning
    • Center-left (Ceron et al. (2024), and others)
    • Social-Democrat leaning
  • Cultural norms and values
    • Mainly Anglo-Saxon world
    • W.E.I.R.D.
  • Moral values and personality-like traits
    • ↑ Agreeableness
    • ↓ Neuroticism

Extracted from (Röttger et al. 2024).

Extracted from (Atari et al. 2023).

Extracted from (Rutinowski et al. 2024).

5 Summary

  • Decoder-only models trained for next-token prediction
  • Pre-training data is not “impartial”
  • Deployed models are not only pre-trained \(\to\) further aligned
  • Multi-task learner and multiple use cases
  • Evaluation is hard!
    • & biases run deep and have implications!

Not the pinnacle!

LLM Iceberg… ©Sasha Luccioni (@SashaMTL).

Even from a technical perspective… Slide from Yann LeCun (AMS Josiah Willard Gibbs Lecture, “Mathematical Obstacles on the Way to Human-Level AI.” (2025)).

6 4 weeks / 4 slides

6.1 Language modeling and transformer-based models

6.1.1 Foundations: The Transformer Architecture

Transformer Architecture, based on (Vaswani 2017), reworked from (Zhang et al. 2023).

6.1.2 Feature Extraction

6.1.3 Classification and Fine-tuning

6.1.4 Language Generation

6.2 In practice

6.2.1 Minimal (pseudo-)code for diverse tasks

Feature Extraction

from transformers import (
  BertTokenizer,
  BertModel
)
docs = ["YOUR", "DOCS", "HERE"]
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(DEVICE)
tokenized_sentences = tokenizer( # Tokenize the docs
  docs,
  truncation=True,               # Truncate inputs overflowing context length
  padding=True,                  # Pad inputs to maximal sequence length
  return_tensors="pt"
).to(DEVICE)
outputs = model(**tokenized_sentences)  # Run (tokenized) inputs through BERT
embeddings = outputs["last_hidden_state"]  # (n_samples, seq_len, embed_dim) vector representations (you could do something more fancy)
# Retrieve embeddings at the position of the tokens of interest

Document Representation

- BoW / TF-IDF
from sklearn.feature_extraction.text import (
  TfidfVectorizer,
  CountVectorizer
)
docs = ["YOUR", "DOCS", "HERE"]
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(docs)
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(docs)
- SentenceTransformer
from sentence_transformers import SentenceTransformer
docs = ["YOUR", "DOCS", "HERE"]
model_name = "<YOUR/MODEL_NAME>"       # local or from HF hub
model = SentenceTransformer(model_name)
embeddings = model.encode(docs)

Cosine Similarity

from sklearn.metrics.pairwise import cosine_similarity
docs_representations = ... # pre-computed document representations
pairwise_cosine_sims = cosine_similarity(docs_representations) # shape N_docs x N_docs
pairwise_cosine_sims[0,1] # cosine sim between docs 0 and 1

Topic Modeling:

- BERTopic
from bertopic import BERTopic                    # alternatively: use yours! `from <PATH_TO_SCRIPT>.mybertopic import MyBERTopic`
docs = ["YOUR", "DOCS", "HERE"]
topic_model = BERTopic()                         # instantiate the BERTopic object (you can add arguments / choose modules)
topics, probs = topic_model.fit_transform(docs)  # represent docs -> find clusters = topics -> assign 'labels'
- LDA
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import LdaModel

docs = ["YOUR", "DOCS", "HERE"]                       # your texts here (might be a good idea to do a bit of cleaning at some point...)
texts = [simple_preprocess(doc) for doc in docs]      # tokenize & lowercase each document
dictionary = corpora.Dictionary(texts)                # extracts vocabulary
corpus = [dictionary.doc2bow(text) for text in texts] # documents BoW representations

lda_model = LdaModel(             # Instantiate (and run) LDA algorithm
    corpus=corpus,                # pre-processed documents
    id2word=dictionary,           # corresponding dict. {id:word}
    num_topics=n_topics,          # ! THE NUMBER OF TOPICS HERE !
)

Rep+ML (& eval)

from sklearn.metrics import classification_report
from sklearn.svm import SVC # or any other ML method
X_train, X_test = ... # pre-computed document representations
y_train, y_test = ... # associated labels
clf = SVC()                # + add arguments
clf.fit(X_train, y_train)  # train the ML method
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))  # or the metric(s) you are interested in

SFT (for sequence classification)

from datasets import load_dataset
from transformers import (
  AutoTokenizer, AutoModelForSequenceClassification,
  TrainingArguments, Trainer,
)

raw_dataset = load_dataset("<YOUR_DATASET>")
model_name = "YOUR_MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
  model_name,
   num_labels=2 # Number of labels for the classification task (in the 'dataset')
)

def tokenize_function(example, text_column="text", max_length=512):
    return tokenizer(example[text_column], truncation=True,
                     padding="max_length", max_length=max_length)
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments("test-trainer")     # put the training arguments here!
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    processing_class=tokenizer,
)
trainer.train()

Generation

from transformers import pipeline
model_name = "<YOUR/MODEL_NAME>"       # local or from HF hub
prompts = ["YOUR", "PROMPTS", "HERE"]  # list of prompts
generator = pipeline(      # (note: you can use `pipeline` for diverse tasks)
    "text-generation",     # state the task
    model=model_name,      # give the name of the LM to use
    device_map="auto",     # automatically moves model to GPU if available
)
outputs = generator(
    prompts,                       # list of prompts
    max_new_tokens=max_new_tokens, # you can give the pipeline some generation arguments
    batch_size=batch_size          # pipeline handles the batch internally
)
responses = [o[0]["generated_text"] for o in outputs]

7 Hands-on

7.1 Hands-On Proposition

References

Anderson, Barrett R, Jash Hemant Shah, and Max Kreminski. 2024. “Homogenization Effects of Large Language Models on Human Creative Ideation.” In Creativity and Cognition, 413–25. C&c ’24. ACM. https://doi.org/10.1145/3635636.3656204.
Atari, Mohammad, Mona J Xue, Peter S Park, Damián Blasi, and Joseph Henrich. 2023. “Which Humans?”
Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, et al. 2022. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” https://arxiv.org/abs/2204.05862.
Barocas, Solon, and Andrew D Selbst. 2016. “Big Data’s Disparate Impact.” Calif. L. Rev. 104: 671.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922.
Binz, Marcel, and Eric Schulz. 2023. “Turning Large Language Models into Cognitive Models.” https://arxiv.org/abs/2306.03917.
Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. “Language (Technology) Is Power: A Critical Survey of ‘Bias’ in NLP.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, edited by Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, 5454–76. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.485.
Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, et al. 2022. “On the Opportunities and Risks of Foundation Models.” https://arxiv.org/abs/2108.07258.
Ceron, Tanise, Neele Falk, Ana Barić, Dmitry Nikolaev, and Sebastian Padó. 2024. “Beyond Prompt Brittleness: Evaluating the Reliability and Consistency of Political Worldviews in Llms.” Transactions of the Association for Computational Linguistics 12: 1378–1400.
Chang, Yupeng, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, et al. 2024. “A Survey on Evaluation of Large Language Models.” ACM Trans. Intell. Syst. Technol. 15 (3). https://doi.org/10.1145/3641289.
Cho, Aeree, Grace C. Kim, Alexander Karpekov, Alec Helbling, Zijie J. Wang, Seongmin Lee, Benjamin Hoover, and Duen Horng Chau. 2024. “Transformer Explainer: Interactive Learning of Text-Generative Models.” https://arxiv.org/abs/2408.04619.
Cunningham, Jay, Su Lin Blodgett, Michael Madaio, Hal Daumé Iii, Christina Harrington, and Hanna Wallach. 2024. “Understanding the Impacts of Language Technologies Performance Disparities on African American Language Speakers.” In Findings of the Association for Computational Linguistics: ACL 2024, edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 12826–33. Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.761.
Fourrier, Clementine, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. 2024. “Open-LLM Performances Are Plateauing, Let’s Make the Leaderboard Steep Again.” Hugging Face – The AI Community Building the Future. https://huggingface.co/spaces/open-llm-leaderboard/blog.
Geng, Mingmeng, Caixi Chen, Yanru Wu, Dongping Chen, Yao Wan, and Pan Zhou. 2024. “The Impact of Large Language Models in Academia: From Writing to Speaking.” https://arxiv.org/abs/2409.13686.
Gunasekar, Suriya, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, et al. 2023. “Textbooks Are All You Need.” https://arxiv.org/abs/2306.11644.
Guo, Zishan, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, et al. 2023. “Evaluating Large Language Models: A Comprehensive Survey.” https://arxiv.org/abs/2310.19736.
Gururangan, Suchin, Dallas Card, Sarah Dreier, Emily Gade, Leroy Wang, Zeyu Wang, Luke Zettlemoyer, and Noah A. Smith. 2022. “Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, edited by Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, 2562–80. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.emnlp-main.165.
Hadar-Shoval, Dorit, Kfir Asraf, Yonathan Mizrachi, Yuval Haber, and Zohar Elyoseph. 2023. “The Invisible Embedded ‘Values’ Within Large Language Models: Implications for Mental Health Use.”
Hagendorff, Thilo, Ishita Dasgupta, Marcel Binz, Stephanie C. Y. Chan, Andrew Lampinen, Jane X. Wang, Zeynep Akata, and Eric Schulz. 2024. “Machine Psychology.” https://arxiv.org/abs/2303.13988.
Jakesch, Maurice, Advait Bhat, Daniel Buschek, Lior Zalmanson, and Mor Naaman. 2023. “Co-Writing with Opinionated Language Models Affects Users’ Views.” In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. CHI ’23. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3544548.3581196.
Jiang, Wenying. 2000. “The Relationship Between Culture and Language.” ELT Journal 54 (4): 328–34.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” https://arxiv.org/abs/2001.08361.
Lambert, Nathan, Louis Castricato, Leandro von Werra, and Alex Havrilla. 2022. “Illustrating Reinforcement Learning from Human Feedback (RLHF).” Hugging Face Blog.
Li, Chao, Xing Su, Haoying Han, Cong Xue, Chunmo Zheng, and Chao Fan. 2023. “Quantifying the Impact of Large Language Models on Collective Opinion Dynamics.” arXiv Preprint arXiv:2308.03313.
Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, et al. 2023. “Holistic Evaluation of Language Models.” https://arxiv.org/abs/2211.09110.
Liu, Yang, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. 2025. “Datasets for Large Language Models: A Comprehensive Survey.” Artificial Intelligence Review 58 (12). https://doi.org/10.1007/s10462-025-11403-7.
Liu, Yuhan, Shangbin Feng, Xiaochuang Han, Vidhisha Balachandran, Chan Young Park, Sachin Kumar, and Yulia Tsvetkov. 2024. P\(^3\)Sum: Preserving Authors Perspective in News Summarization with Diffusion Language Models.” In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), edited by Kevin Duh, Helena Gomez, and Steven Bethard, 2154–73. Mexico City, Mexico: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.naacl-long.119.
Lucy, Li, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, and Jesse Dodge. 2024. AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 7393–7420. Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.400.
Merriam-Webster. 2025. “Large Language Model.” In Merriam-Webster.com Dictionary. https://www.merriam-webster.com/dictionary/large%20language%20model.
Minaee, Shervin, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2025. “Large Language Models: A Survey.” https://arxiv.org/abs/2402.06196.
Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” In Advances in Neural Information Processing Systems, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, 35:27730–44. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
Poole-Dayan, Elinor, Deb Roy, and Jad Kabbara. 2024. “LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users.” https://arxiv.org/abs/2406.17737.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog 1 (8): 9. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Rahwan, Iyad, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W. Crandall, et al. 2019. “Machine Behaviour.” Nature 568 (7753): 477–86. https://doi.org/10.1038/s41586-019-1138-y.
Röttger, Paul, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Kirk, Hinrich Schuetze, and Dirk Hovy. 2024. “Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 15295–311. Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.816.
Rutinowski, Jérôme, Sven Franke, Jan Endendyk, Ina Dormuth, Moritz Roidl, and Markus Pauly. 2024. “The Self-Perception and Political Biases of ChatGPT.” Human Behavior and Emerging Technologies 2024 (1): 7115633.
Salinas, Abel, Parth Shah, Yuzhong Huang, Robert McCormack, and Fred Morstatter. 2023. “The Unequal Opportunities of Large Language Models: Examining Demographic Biases in Job Recommendations by ChatGPT and LLaMA.” In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. EAAMO ’23. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3617694.3623257.
Soldaini, Luca. 2023. “Ai2 Dolma: 3 Trillion Token Open Corpus for Language Model Pretraining: AI2.” Ai2 RSS. https://allenai.org/blog/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64.
Sutton, Richard. 2019. “The Bitter Lesson.” Incomplete Ideas (Blog) 13 (1): 38.
Thompson, Alan D. 2025. “Models Table (10,000+ LLM Data Points).” LifeArchitect.ai. https://lifearchitect.ai/models-table/.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” https://arxiv.org/abs/2302.13971.
Vaswani, A. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems. https://dl.acm.org/doi/10.5555/3295222.3295349.
Wang, Yizhong, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, et al. 2022. “Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, edited by Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, 5085–5109. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.emnlp-main.340.
Wenzek, Guillaume, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. “CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data.” https://arxiv.org/abs/1911.00359.
Wikipedia. 2025. “Self-Supervised Learning.” In Wikipedia - the Free Encyclopedia. https://en.wikipedia.org/wiki/Self-supervised_learning.
Williams-Ceci, Sterling, Maurice Jakesch, Advait Bhat, Kowe Kadoma, Lior Zalmanson, and Mor Naaman. 2024. “Bias in AI Autocomplete Suggestions Leads to Attitude Shift on Societal Issues.”
Wolfe, Cameron R. 2023. “The Story of RLHF: Origins, Motivations, Techniques, and Modern Applications.” The Story of RLHF: Origins, Motivations, Techniques, and Modern Applications. Deep (Learning) Focus. https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations.
Yakura, Hiromu, Ezequiel Lopez-Lopez, Levin Brinkmann, Ignacio Serna, Prateek Gupta, and Iyad Rahwan. 2024. “Empirical Evidence of Large Language Model’s Influence on Human Spoken Communication.” https://arxiv.org/abs/2409.01754.
Ye, Haoran, Jing Jin, Yuhang Xie, Xin Zhang, and Guojie Song. 2025. “Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement.” arXiv Preprint arXiv:2505.08245.
Zhang, Aston, Zachary C. Lipton, Mu Li, and Alexander J. Smola. 2023. Dive into Deep Learning. Cambridge University Press.