LLMs: Foundations, Generation & Beyond
Introduction to NLP — MSc. DH EdC-PSL
November 18, 2025
LLMs influence the way we communicate (Yakura et al. 2024; Geng et al. 2024; Anderson, Shah, and Kreminski 2024).
Prevalence of words associated with ChatGPT. Extracted from (Yakura et al. 2024).
Language is loaded with sociocultural characteristics (Jiang 2000).
\(\to\) LLM biases can have harmful consequences and propagate to downstream applications:
Large Language Model - from (Merriam-Webster 2025)
A language model that utilizes deep [learning] methods on an extremely large data set as a basis for predicting and constructing natural-sounding text.
Extracted from (Thompson 2025).
The Bitter Lesson (Sutton 2019)
“general methods that leverage computation are ultimately the most effective, and by a large margin”…
Definition
An autoregressive (AR) model is a representation of a type of stochastic process. It specifies that the output variable (or current state) depends linearly on its own \(p\) previous values: \(X_{t}=\sum _{i=1}^{p}\varphi _{i}X_{t-i}+\varepsilon _{t}\).
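To make the definition concrete, here is a minimal sketch that simulates an AR(\(p\)) process from the formula above. The function name `simulate_ar`, the coefficient values, the Gaussian noise and the zero initial history are illustrative assumptions, not part of the original definition.

```python
import random


def simulate_ar(phi, n_steps, noise_std=1.0, seed=0):
    """Simulate X_t = sum_i phi[i-1] * X_{t-i} + eps_t, starting from a zero history."""
    rng = random.Random(seed)
    history = [0.0] * len(phi)  # X_{t-1}, ..., X_{t-p}, most recent first (assumed zeros)
    series = []
    for _ in range(n_steps):
        eps = rng.gauss(0.0, noise_std)  # assumption: Gaussian noise term
        x_t = sum(c * x for c, x in zip(phi, history)) + eps
        series.append(x_t)
        history = [x_t] + history[:-1]  # shift the window of previous values
    return series


# Hypothetical coefficients for an AR(2) process
series = simulate_ar(phi=[0.6, -0.2], n_steps=5)
print(series)
```

Each new value is a linear combination of the \(p\) previous values plus noise; next-token prediction keeps the "condition on the past" part and drops the linearity.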
In NLP
In NLP, AR models generally refer to next-token prediction models, and the linear assumption can be disregarded.
→ \(n\)-grams \(\in\) LLMs ??
Language Model
A Language Model (LM) estimates the probability of pieces of text. Given a sequence of text \(w_{1},w_{2},\cdots,w_{S}\), it answers the question:
What is \(P(w_{1},w_{2},\cdots,w_{S})\)?
How to compute \(P\)?
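One standard answer is the chain rule, \(P(w_{1},\cdots,w_{S})=\prod_{s=1}^{S}P(w_{s} | w_{1},\cdots,w_{s-1})\), where each factor is a next-token prediction. The sketch below approximates each factor with bigram counts on a toy corpus. The corpus, the `<s>` start symbol and the helper names are assumptions made for illustration, and the model has no smoothing or end-of-sequence token.

```python
import math
from collections import Counter

# Toy corpus (assumption): three already-tokenized sentences
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
BOS = "<s>"  # hypothetical start-of-sequence symbol

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = [BOS] + sentence
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def bigram_prob(prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]


def sequence_log_prob(sentence):
    """Chain rule with a one-token context: sum of log P(w_s | w_{s-1})."""
    tokens = [BOS] + sentence
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, cur)
        if p == 0.0:
            return float("-inf")  # unseen bigram: zero probability without smoothing
        total += math.log(p)
    return total


print(math.exp(sequence_log_prob(["the", "cat", "sat"])))
```

An \(n\)-gram model is autoregressive in this sense: it predicts the next token from (a truncated view of) the previous ones.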
\(\to\) encode:
Large
(? enough)
Autoregressive
Causal
(→ decoder-only)
Language Models
parameters + data
… yes but …
as in
(more or less)
\[P(x_{i,t+1} | x_{i,1},\cdots,x_{i,t})\]
Self-Supervised Learning (Wikipedia 2025)
Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels.
\(\mathcal{L}[\phi] = - \sum_{i=1}^{I}\sum_{t=1}^{T}\log\left[ P(x_{i,t+1} | x_{i,1},\cdots,x_{i,t},\phi) \right]\)
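The loss above can be read as "sum the negative log-probability the model assigns to each true next token". A minimal sketch, assuming a hypothetical `prob_fn(context, token)` that stands in for the model \(P(\cdot | \cdot, \phi)\); here it is a uniform distribution over an assumed vocabulary of 4 tokens.

```python
import math


def next_token_nll(sequences, prob_fn):
    """Negative log-likelihood of every next token given its preceding tokens."""
    total = 0.0
    for tokens in sequences:  # outer sum over sequences i
        for t in range(1, len(tokens)):  # inner sum over positions t
            context, target = tokens[:t], tokens[t]
            total -= math.log(prob_fn(context, target))
    return total


def uniform_prob(context, token, vocab_size=4):
    """Stand-in 'model' (assumption): ignores the context, uniform over the vocabulary."""
    return 1.0 / vocab_size


# Two toy sequences of token ids, i.e. 2 next-token predictions each
batch = [[0, 1, 2], [3, 1, 0]]
loss = next_token_nll(batch, uniform_prob)
print(loss)
```

The labels are the tokens themselves, shifted by one position, which is what makes the objective self-supervised.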
Pre-training ≈ learning from the distribution of texts on the internet.
Is that so?
Not that simple…
“Quality” filtering choices…
… have consequences!
Goal: Teach model to follow instructions.
Goal: Align model with desired behaviors (see e.g., (Bai et al. 2022; Ouyang et al. 2022)).
Extracted from (Lambert et al. 2022).
LLM capabilities. Extracted from (Minaee et al. 2025).
Natural Language Interaction:
Example for OLMo-2-Instruct:
<|endoftext|><|system|>
You are a helpful assistant.
<|user|>
Who are you?
<|assistant|>
Example for Qwen2.5-Instruct:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
Example for Llama-3.1-Instruct:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "<YOUR/MODEL_NAME>"  # placeholder: any instruct model, local or from the HF hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [  # chat history: a list of dicts, one per message
    {
        "role": "system",  # role (*system*|user|assistant)
        "content": "You are a helpful assistant.",  # content of the message
    },
    {
        "role": "user",  # role (system|*user*|assistant)
        "content": "Who are you?",  # content of the message
    },
]
inputs = tokenizer.apply_chat_template(  # apply chat template + tokenize
    messages,  # provide chat history
    add_generation_prompt=True,  # append the assistant header so the model answers next
    tokenize=True,  # do tokenize the input
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens (skip the prompt)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
OLMo-2-Instruct style:
<|endoftext|><|system|>
You are a helpful assistant.
<|user|>
Who are you?
<|assistant|>
I am an AI assistant.
<|user|>
What can you do for me?
<|assistant|>
I can do all sorts of things.
<|user|>
What exactly?
<|assistant|>
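A chat template is essentially string formatting over the message list. The sketch below reproduces the OLMo-2-Instruct-style layout displayed above in plain Python. It is a hand-written approximation: `render_olmo_style` is a hypothetical helper, and the template shipped with the real tokenizer may differ in whitespace and end-of-turn tokens.

```python
def render_olmo_style(messages, add_generation_prompt=True):
    """Flatten a chat history into one prompt string (approximate layout, see lead-in)."""
    text = "<|endoftext|>"
    for message in messages:
        # Each turn: a role tag on its own line, followed by the message content
        text += f"<|{message['role']}|>\n{message['content']}\n"
    if add_generation_prompt:
        text += "<|assistant|>\n"  # open the assistant turn so the model continues from here
    return text


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an AI assistant."},
    {"role": "user", "content": "What can you do for me?"},
]
prompt = render_olmo_style(history)
print(prompt)
```

Multi-turn chat is therefore still next-token prediction: the whole history is re-serialized into a single prompt at every turn.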
messages = [  # chat history: a list of dicts, one per message
    {
        "role": "system",  # role (*system*|user|assistant)
        "content": "You are a helpful assistant.",  # content of the message
    },
    {
        "role": "user",  # role (system|*user*|assistant)
        "content": "Who are you?",  # content of the message
    },
    {
        "role": "assistant",  # role (system|user|*assistant*)
        "content": "I am an AI assistant.",  # content of the message
    },
    {
        "role": "user",  # role (system|*user*|assistant)
        "content": "What can you do for me?",  # content of the message
    },
    {
        "role": "assistant",  # role (system|user|*assistant*)
        "content": "I can do all sorts of things.",  # content of the message
    },
    {
        "role": "user",  # role (system|*user*|assistant)
        "content": "What exactly?",  # content of the message
    },
    # and so on ...
]
inputs = tokenizer.apply_chat_template(  # apply chat template + tokenize
    messages,  # provide chat history
    add_generation_prompt=True,  # add generation prompt (assistant:...)
    tokenize=True,  # do tokenize the input
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
\[P_i=\frac{e^{\frac{z_i}{T}}}{\displaystyle\sum_{j=1}^{N_\mathrm{voc}}e^{\frac{z_j}{T}}}\]
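The formula is a softmax over the logits \(z_j\) with temperature \(T\): low \(T\) concentrates probability on the top token, high \(T\) flattens the distribution. A minimal sketch in plain Python; the logit values and helper names are illustrative assumptions, and real decoders apply this over the full vocabulary at every generation step.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """P_i = exp(z_i / T) / sum_j exp(z_j / T), computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=None):
    """Draw one token index from the temperature-scaled distribution."""
    rng = rng or random.Random(0)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for a 3-token vocabulary
for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(toy_logits, T))
print(sample_token(toy_logits, temperature=1.0))
```

Because generation samples from this distribution, the same prompt can yield different outputs from one run to the next.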
LLMs are stochastic generative (~intrinsic),
general-purpose models (~extrinsic) (Radford et al. 2019).
\(\to\) no single metric can conveniently evaluate LLM capabilities and risks (Liang et al. 2023; Bommasani et al. 2022)!
LLM evaluation is typically organised around (following and adapted from (Guo et al. 2023; Chang et al. 2024)):
“Machine Behavior is concerned with the scientific study of intelligent machines, not as engineering artefacts, but as a class of actors with particular behavioural patterns and ecology. This field overlaps with, but is distinct from, computer science and robotics. It treats machine behaviour empirically. This is akin to how ethology and behavioural ecology study animal behaviour […]”
See also Machine Psychology and related literature, e.g., (Bommasani et al. 2022; Binz and Schulz 2023; Hagendorff et al. 2024; Ye et al. 2025).
Bias (definition attempt)
A systematic tendency in model outputs or behaviors that produces, reflects or amplifies disparate, unfair or harmful treatment of individuals or social groups (Barocas and Selbst 2016). This encompasses a broad range of discriminatory patterns, inter alia: allocational harms, representational harms, stereotyping, unequal system performance, and questionable correlations (Bender et al. 2021; Blodgett et al. 2020).
Caution
This is a definition attempt. In NLP, the notion of “bias” is often used as an (ill-defined) umbrella term covering diverse ranges of “biased” behaviors. Blodgett et al. (2020) call for more rigorous contextual and operational definitions.
Sex outside marriage is usually immoral.
When jobs are scarce, employers should give priority to people of this country over immigrants.
You make friends easily.
*: Example shown.
Feature Extraction
import torch
from transformers import BertModel, BertTokenizer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU if one is available
docs = ["YOUR", "DOCS", "HERE"]
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(DEVICE)
tokenized_sentences = tokenizer(  # Tokenize the docs
    docs,
    truncation=True,  # Truncate inputs overflowing context length
    padding=True,  # Pad inputs to the longest sequence in the batch
    return_tensors="pt",
).to(DEVICE)
with torch.no_grad():  # inference only: no gradients needed
    outputs = model(**tokenized_sentences)  # Run (tokenized) inputs through BERT
# Vector representations, shape: (N samples, sequence length, embedding dimension)
embeddings = outputs["last_hidden_state"]
# Retrieve embeddings at the position of the tokens of interest (you could do something more fancy)
Document Representation
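A common simple way to turn per-token vectors (such as the `last_hidden_state` of a BERT-style encoder) into one vector per document is masked mean pooling: average the token vectors while ignoring padding positions. The sketch below uses plain nested lists so that it runs without any dependency; `mean_pool` and the toy values are illustrative assumptions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per document, skipping positions where the mask is 0."""
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        dim = len(vectors[0])
        summed = [0.0] * dim
        for vec, keep in zip(vectors, mask):
            if keep:  # 1 = real token, 0 = padding
                for d in range(dim):
                    summed[d] += vec[d]
        n_real = max(sum(mask), 1)  # guard against an all-padding row
        pooled.append([s / n_real for s in summed])
    return pooled


# One toy document: 3 positions, 2 dimensions, last position is padding
toy_embeddings = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
toy_mask = [[1, 1, 0]]
doc_vectors = mean_pool(toy_embeddings, toy_mask)
print(doc_vectors)
```

With tensors, the same computation is usually written as a masked sum divided by the mask count along the sequence axis.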
- BoW / TF-IDF
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfVectorizer,
)

docs = ["YOUR", "DOCS", "HERE"]
bow_vectorizer = CountVectorizer()  # raw term counts (bag of words)
X_bow = bow_vectorizer.fit_transform(docs)
tfidf_vectorizer = TfidfVectorizer()  # counts reweighted by inverse document frequency
X_tfidf = tfidf_vectorizer.fit_transform(docs)
- SentenceTransformer
Cosine Similarity
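Cosine similarity between two vectors \(u\) and \(v\) is their dot product divided by the product of their norms, \(\frac{u\cdot v}{\lVert u\rVert \lVert v\rVert}\). A minimal pure-Python sketch for a single pair; the helper `cosine` and the convention of returning 0.0 for a zero vector are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for degenerate inputs
    return dot / (norm_u * norm_v)


print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

The measure depends only on direction, not on vector length, which is why it is a popular choice for comparing document representations.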
from sklearn.metrics.pairwise import cosine_similarity
docs_representations = ... # pre-computed document representations
pairwise_cosine_sims = cosine_similarity(docs_representations) # shape N_docs x N_docs
pairwise_cosine_sims[0, 1]  # cosine sim between docs 0 and 1
Topic Modeling:
- BERTopic
from bertopic import BERTopic  # alternatively: use yours! `from <PATH_TO_SCRIPT>.mybertopic import MyBERTopic`

docs = ["YOUR", "DOCS", "HERE"]
topic_model = BERTopic()  # instantiate the BERTopic object (you can add arguments / choose modules)
topics, probs = topic_model.fit_transform(docs)  # represent docs -> find clusters = topics -> assign 'labels'
- LDA
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

docs = ["YOUR", "DOCS", "HERE"]  # your texts here (might be a good idea to do a bit of cleaning at some point...)
texts = [simple_preprocess(doc) for doc in docs]  # tokenize: Dictionary expects lists of tokens
dictionary = corpora.Dictionary(texts)  # extracts vocabulary
corpus = [dictionary.doc2bow(text) for text in texts]  # documents BoW representations
n_topics = 10  # placeholder value: ! SET THE NUMBER OF TOPICS HERE !
lda_model = LdaModel(  # Instantiate (and run) LDA algorithm
    corpus=corpus,  # pre-processed documents
    id2word=dictionary,  # corresponding dict. {id: word}
    num_topics=n_topics,
)
Rep+ML (& eval)
from sklearn.metrics import classification_report
from sklearn.svm import SVC # or any other ML method
X_train, X_test = ... # pre-computed document representations
y_train, y_test = ... # associated labels
clf = SVC() # + add arguments
clf.fit(X_train, y_train) # train the ML method
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))  # or the metric(s) you are interested in
SFT (for sequence classification)
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer,
    DataCollatorWithPadding, Trainer, TrainingArguments,
)

raw_dataset = load_dataset("<YOUR_DATASET>")
model_name = "YOUR_MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # Number of labels for the classification task (in the 'dataset')
)

def tokenize_function(example, text_column="text", max_length=512):
    # Truncate and pad every example to `max_length` tokens
    return tokenizer(example[text_column], truncation=True, padding="max_length", max_length=max_length)

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # assembles tokenized examples into batches
training_args = TrainingArguments("test-trainer")  # put the training arguments here!
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
)
trainer.train()
Generation
from transformers import pipeline

model_name = "<YOUR/MODEL_NAME>"  # local or from HF hub
prompts = ["YOUR", "PROMPTS", "HERE"]  # list of prompts
max_new_tokens = 40  # placeholder value: maximum number of tokens generated per prompt
batch_size = 8  # placeholder value: number of prompts processed together
generator = pipeline(  # (note: you can use `pipeline` for diverse tasks)
    "text-generation",  # state the task
    model=model_name,  # give the name of the LM to use
    device_map="auto",  # automatically moves model to GPU if available
)
outputs = generator(
    prompts,  # list of prompts
    max_new_tokens=max_new_tokens,  # you can give the pipeline some generation arguments
    batch_size=batch_size,  # pipeline handles the batch internally
)
responses = [o[0]["generated_text"] for o in outputs]  # first generation for each prompt
Large Language Models