Large Language Models

LLMs: Foundations, Generation & Beyond

Noé Durandard (noe.durandard@psl.eu)

Introduction to NLP — MSc. DH EdC-PSL

November 18, 2025

…in the real world

0.1 Evidence of ChatGPT’s impact

LLMs influence the way we communicate (Yakura et al. 2024; Geng et al. 2024; Anderson, Shah, and Kreminski 2024).

Prevalence of words associated with ChatGPT. Extracted from (Yakura et al. 2024).

0.2 Influence besides vocabulary

  • LLMs influence users and participate in shaping opinion

Participants interacting with a model supportive of social media were more likely to say that social media is good for society in a later survey (and vice versa) — from Jakesch et al. (2023).

0.3 Propagation of models’ biases

Language is loaded with sociocultural characteristics (Jiang 2000).

\(\to\) LLMs biases can have harmful consequences, and spread in downstream applications:

1 Framing Large Language Models

1.0.1 Definition(?)

Large Language Model - from (Merriam-Webster 2025)

A language model that utilizes deep [learning] methods on an extremely large data set as a basis for predicting and constructing natural-sounding text.

1.1 Large

1.1.1 What is a *Large* Language Model?

Extracted from (Thompson 2025).

1.1.2 Scaling Laws & The Bitter Lesson

The Bitter Lesson (Sutton 2019)

“general methods that leverage computation are ultimately the most effective, and by a large margin”…

  • Knowledge integration helps improve models in the short term (rewarding to researchers, but plateaus and can even inhibit further progress)
  • Breakthrough progress eventually arrives from an opposing approach: scaling computation through learning and search

Language modeling performance improves smoothly as the model size, dataset size, and amount of compute used for training increase (extracted from (Kaplan et al. 2020)).

1.2 Causal

1.2.1 Autoregressive

Definition

An autoregressive (AR) model is a representation of a type of stochastic process. It specifies that the output variable (or current state) depends linearly on its own \(p\) previous values: \(X_{t}=\sum _{i=1}^{p}\varphi _{i}X_{t-i}+\varepsilon _{t}\).
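The classical linear definition above can be illustrated in a few lines. A minimal sketch (the function name `simulate_ar` and the toy coefficients are illustrative, not from the slides):

```python
import random

random.seed(0)

def simulate_ar(phi, n_steps, sigma=1.0):
    """Simulate X_t = sum_i phi_i * X_{t-i} + eps_t with Gaussian noise eps_t."""
    p = len(phi)
    x = [0.0] * p                                      # p zero-valued initial states
    for _ in range(n_steps):
        eps = random.gauss(0.0, sigma)                 # innovation term
        x.append(sum(c * x[-i - 1] for i, c in enumerate(phi)) + eps)
    return x[p:]

series = simulate_ar([0.8], n_steps=200)               # AR(1) with phi_1 = 0.8
```

Each new value is a linear combination of the previous \(p\) values plus noise; in NLP the same "predict the next element from the past" structure remains, but the linear form is replaced by a neural network.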

In NLP

In NLP, AR models generally refer to next-token prediction models, and the linear assumption can be disregarded.

\(n\)-grams \(\in\) LLMs ??

1.2.2 Decoder-only

  • Only a decoder stack (no encoder)
  • Causal / masked self-attention
    • Each token can only attend to previous tokens
  • Naturally suited for generation
    • Trained for next token prediction
    • Produce text left-to-right, one token at a time

Transformer Architecture, based on Vaswani (2017), reworked from (Zhang et al. 2023).

1.3 Language Models

1.3.1 The Language Modeling Problem (again)

Language Model

A Language Model (LM) estimates the probability of pieces of text. Given a sequence of text \(w_{1},w_{2},\cdots,w_{S}\), it answers the question:

What is \(P(w_{1},w_{2},\cdots,w_{S})\)?
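By the chain rule, \(P(w_{1},\cdots,w_{S})=\prod_{t} P(w_{t}\mid w_{1},\cdots,w_{t-1})\). A toy count-based bigram sketch of this factorization (the three-sentence corpus and the helper `sentence_prob` are illustrative assumptions):

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]   # toy corpus (assumption)
bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split()                      # <s> marks sentence start
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def sentence_prob(sentence):
    """P(w_1..w_S) = prod_t P(w_t | w_{t-1}), with MLE bigram estimates."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

sentence_prob("the cat sat")  # = P(the|<s>) * P(cat|the) * P(sat|cat)
```

Neural LMs replace the count-based conditional estimates with a learned model, but estimate the same factorized probabilities.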

How to compute \(P\)?

  • Count-based (count): n-gram
  • Neural method (learn): Transformer
    • Masked Language Modeling (encoder-only, BERT)
    • Causal Language Modeling (decoder-only, GPT)

\(\to\) encode:

  • grammaticality
  • semantic plausibility
  • stylistic consistency
  • knowledge (?)

  • Large (? enough)
  • Autoregressive / Causal (→ decoder-only)
  • Language Models (parameters + data)

… yes, but only more or less.

2 Training

2.1 LLMs training workflow

Overview of LLMs Training Pipeline. Extracted from (Wolfe 2023), based on (Ouyang et al. 2022).

2.2 Pre-training

2.2.1 Next token prediction task

\[P(x_{i,t+1} | x_{i,1},\cdots,x_{i,t})\]

Example generated with allenai/OLMo-2-0425-1B (e.g., training).

Example generated with allenai/OLMo-2-0425-1B (e.g., generation).

2.2.2 Self-Supervision

Self-Supervised Learning (Wikipedia 2025)

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels.
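For next-token prediction, the supervisory signal really is just the data shifted by one position. A minimal sketch (the function name `make_next_token_pairs` and the toy token ids are illustrative):

```python
def make_next_token_pairs(token_ids):
    """Build (context, target) training pairs from a raw token sequence:
    the 'label' for position t is simply the token at position t+1."""
    return [(token_ids[: t + 1], token_ids[t + 1])
            for t in range(len(token_ids) - 1)]

ids = [12, 7, 4, 9]                    # pretend token ids (assumption)
pairs = make_next_token_pairs(ids)
# pairs[0] == ([12], 7); pairs[-1] == ([12, 7, 4], 9)
```

No external labels are needed: every position in every document provides one training example.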

2.2.3 Training Objective: Cross-Entropy

\(\mathcal{L}[\phi] = - \sum_{i=1}^{I}\sum_{t=1}^{T}\log\left[ P(x_{i,t+1} | x_{i,1},\cdots,x_{i,t},\phi) \right]\)


  • Penalizes confident wrong predictions heavily
    • Low probability to ‘true’ next token \(\Rightarrow\) high loss: \(\displaystyle{\lim_{p_i \to 0}}\left(-\log(p_i)\right)=\infty\)
  • Rewards accurate predictions
    • High probability to ‘true’ next token \(\Rightarrow\) low loss: \(\displaystyle{\lim_{p_i \to 1}}\left(-\log(p_i)\right)=0\)
  • Maintains Differentiability
    • smooth mathematical properties allow effective gradient-based optimization
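The limiting behaviors above can be made concrete in a few lines; the helper name `token_nll` and the toy probabilities are illustrative:

```python
import math

def token_nll(p_true):
    """Per-token loss: negative log of the probability assigned to the true next token."""
    return -math.log(p_true)

token_nll(0.99)   # confident and correct: near-zero loss
token_nll(0.01)   # confident but wrong: large loss

# Sequence loss = sum of per-token NLLs (the double sum in the objective above)
probs_on_true = [0.9, 0.8, 0.05]  # toy per-step probabilities on the true tokens
loss = sum(token_nll(p) for p in probs_on_true)
```

A single badly mispredicted token (here 0.05) dominates the sequence loss, which is exactly the "penalizes confident wrong predictions heavily" property.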

2.3 Data

2.3.1 Data Sources

Pre-training: learning from the distribution of texts on the internet.

  • Primarily large-scale web crawls
    • Common Crawl, C4, CCNet, Dolma, RefinedWeb, RedPajama
  • Curated “high-quality” reference works
    • Wikipedia, books, textbooks, scientific papers
  • Code repositories
    • GitHub, StackOverflow

The distribution of data types in pre-training corpora used by different LLMs. Extracted from Yang Liu et al. (2025).

2.3.2 Data Filtering

Is that so?

python train_my_llm.py --dataset=extremely_large_dataset

Not that simple…

Overview of web data processing pipeline for Dolma (Ai2). Extracted from (Soldaini 2023).

2.3.3 Quality data?

“Quality” filtering choices…

… have consequences!

  • Marginalization of low-resource languages, dialects, sociolects
  • Overrepresentation of techno-scientific, Western, elite knowledge
  • “Highbrow” epistemic bias: formal, rational, encyclopedic styles favored
  • Missing registers: oral, vernacular, informal, multilingual mixing, community languages
  • Potentially distorted representations of social groups or cultural practices

2.4 Post(-pre)-training

2.4.1 Supervised (Instruction) Fine-Tuning

Goal: Teach model to follow instructions.

  • Supervised setting
    • instructions & responses
  • Uses curated input → output examples
    • e.g., “Summarize…”, “Translate…”, “Explain like I’m 5…”

Task categories present in Super-NaturalInstructions dataset (Wang et al. 2022) (extracted from original article).

2.4.2 RL (from Human Feedback)

Goal: Align model with desired behaviors (see e.g., (Bai et al. 2022; Ouyang et al. 2022)).

Extracted from (Lambert et al. 2022).

3 Inference → Tutorial

3.0.1 Use-cases

LLMs Capabilities. Extracted from (Minaee et al. 2025).

3.1 Prompt Structure

3.1.1 Elements of a Prompt

Natural Language Interaction:

  • System prompt
    • set of instructions, guidelines, and contextual information provided to AI models before they engage with user queries.
  • User prompt
    • actual message of the user
      • Instruction: definition of the task
      • Examples: provide examples (as in few-shot)
      • Context: provide additional context to address the problem
      • Question: prompt the model to solve the task

3.1.2 Example: Under the Hood

Example for OLMo-2-Instruct:

<|endoftext|><|system|>
You are a helpful assistant.
<|user|>
Who are you?
<|assistant|>

Example for Qwen2.5-Instruct:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant

Example for Llama-3.1-Instruct:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you? <|eot_id|><|start_header_id|>assistant<|end_header_id|>

messages = [ # contains chat history as a list of dict where each instance has:
    {
      "role": "system",                          # role (*system*|user|assistant)
      "content": "You are a helpful assistant."  # content of the message
    },
    {
      "role": "user",                            # role (system|*user*|assistant)
      "content": "Who are you?"                  # content of the message
    },
]
inputs = tokenizer.apply_chat_template( # Apply chat template + tokenize
    messages,                             # provide chat history
    add_generation_prompt=True,           # add generation prompt (assistant:...)
    tokenize=True,                        # do tokenize the input
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

OLMo-2-Instruct style:

<|endoftext|><|system|>
You are a helpful assistant.
<|user|>
Who are you?
<|assistant|>
I am an AI assistant.
<|user|>
What can you do for me?
<|assistant|>
I can do all sorts of things.
<|user|>
What exactly?
<|assistant|>

messages = [ # contains chat history as a list of dict where each instance has:
    {
      "role": "system",                          # role (*system*|user|assistant)
      "content": "You are a helpful assistant."  # content of the message
    },
    {
      "role": "user",                            # role (system|*user*|assistant)
      "content": "Who are you?"                  # content of the message
    },
    {
      "role": "assistant",                       # role (system|user|*assistant*)
      "content": "I am an AI assistant."         # content of the message
    },
    {
      "role": "user",                            # role (system|*user*|assistant)
      "content": "What can you do for me?"       # content of the message
    },
    {
      "role": "assistant",                       # role (system|user|*assistant*)
      "content": "I can do all sorts of things." # content of the message
    },
    {
      "role": "user",                            # role (system|*user*|assistant)
      "content": "What exactly?"                 # content of the message
    },
    # and so on ...
]
inputs = tokenizer.apply_chat_template( # Apply chat template + tokenize
    messages,                             # provide chat history
    add_generation_prompt=True,           # add generation prompt (assistant:...)
    tokenize=True,                        # do tokenize the input
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

3.1.3 Nice visualization here (Cho et al. 2024):

3.2 Sampling

3.2.1 Next-token probability distribution

\[P_i=\frac{e^{\frac{z_i}{T}}}{\displaystyle\sum_{j=1}^{N_\mathrm{voc}}e^{\frac{z_j}{T}}}\]
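The temperature-scaled softmax above can be sketched directly (pure-Python, with the standard max-subtraction trick for numerical stability; the toy logits are an assumption):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """P_i = exp(z_i / T) / sum_j exp(z_j / T), stabilized by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)                              # subtract max before exponentiating
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # toy next-token logits (assumption)
softmax_with_temperature(logits, T=0.5)          # sharper: mass concentrates on argmax
softmax_with_temperature(logits, T=2.0)          # flatter: closer to uniform
```

Lowering \(T\) sharpens the distribution (more deterministic sampling); raising it flattens the distribution (more diverse sampling).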

3.2.2 Some Other Generation Parameters

  • temperature
    • Controls “creativity” vs. determinism.
      • ↓ low (0–0.7): predictable, focused, factual
      • ↑ high (1.0+): diverse, surprising, risk of incoherence
  • top-k
    • Keep only the k most likely next tokens.
      • small k → conservative, avoids rare tokens
      • large k → more expressive, less stable
  • top-p (nucleus sampling)
    • Keep the smallest set of tokens whose cumulative probability ≥ p. 
      • p≈0.0: only most likely token
      • p≈0.9: diverse but meaningful
      • p≈1.0: no filtering
  • length_penalty
    • Encourages longer or shorter outputs.
      • \(>\) 1: longer answers
      • \(<\) 1: concise answers
  • repetition_penalty
    • Penalizes reusing tokens/sequences.
      • \(>\) 1.0 reduces loops and obsessive repetition
  • no_repeat_ngram_size
    • Hard constraint: forbids repeating n-grams of size n.
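Top-k and top-p truncation can be sketched as simple filters over the next-token distribution (function names and toy probabilities are illustrative, not any library's API):

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize; zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set with cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution (assumption)
top_k_filter(probs, k=2)        # ≈ [0.625, 0.375, 0, 0]
top_p_filter(probs, p=0.9)      # keeps the first three tokens (0.5 + 0.3 + 0.15 >= 0.9)
```

In practice one samples from the filtered, renormalized distribution rather than reading it off directly.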

4 Evaluation

4.1 Challenges

4.1.1 What’s different?

LLMs are stochastic generative models (~intrinsic evaluation),

  • natural language quality (fluency, coherence, relevance)
  • correctness / factuality
  • robustness
  • safety / toxicity

and general-purpose models (~extrinsic evaluation) (Radford et al. 2019).

  • general knowledge
  • task-specific abilities
  • calibration & reliability
  • societal impact (bias, fairness, harms, …)

\(\to\) no single metric can conveniently evaluate LLMs capabilities and risks (Liang et al. 2023; Bommasani et al. 2022)!

4.1.2 What to evaluate?

LLM evaluation is typically organised around (following and adapted from (Guo et al. 2023; Chang et al. 2024)):

  • NLP Abilities
    • NLU, NLG, fluency, coherence, grammaticality, …
  • (Specific) Knowledge and Abilities
    • QA, reasoning, classification, …
    • Domain specific, e.g.: education, finance, health, …
  • Safety and Social Impact
    • misinformation, bias, representational harms, disparate treatment, …

Proposed taxonomy of major categories and sub-categories of LLM evaluation, extracted from (Guo et al. 2023).

4.1.3 Evaluation Approaches (overview)

  • Automated
    • metric-based (perplexity, accuracy, BLEU, etc.)
    • scalable, reproducible
    • limited for open-ended tasks
  • Human
    • comparative assessment
    • qualitative insights
    • “vibe check”
  • Benchmark-Based
    • standardized datasets & tasks
    • often MCQA or constrained formats
    • useful for comparison but may not generalize
  • Adversarial
    • robustness against prompts, attacks, edge cases
    • surfaces safety issues and model brittleness

4.1.4 Leaderboard Example (Fourrier et al. 2024)

4.2 Machine Behavior

4.2.1 Machine Behavior (Rahwan et al. 2019)

“Machine Behavior is concerned with the scientific study of intelligent machines, not as engineering artefacts, but as a class of actors with particular behavioural patterns and ecology. This field overlaps with, but is distinct from, computer science and robotics. It treats machine behaviour empirically. This is akin to how ethology and behavioural ecology study animal behaviour […]”

See also Machine Psychology and related literature, e.g., (Bommasani et al. 2022; Binz and Schulz 2023; Hagendorff et al. 2024; Ye et al. 2025).

4.2.2 Goals

  • Systematic assessment of behaviors
  • Studying ML-systems as a “class of actors with particular behavioural patterns and ecology” / socio-cultural artefacts
    • Which kinds of (human-relevant) capabilities are present?
    • What do they tell about the models?
  • Using experimental methods inspired from behavioral psychology (Ye et al. 2025) and HSS (Bommasani et al. 2022)

4.2.3 [Side note:] Relation to bias?

Bias (definition attempt)

A systematic tendency in model outputs or behaviors that produces, reflects, or amplifies disparate, unfair, or harmful treatment of individuals or social groups (Barocas and Selbst 2016). This encompasses a broad range of discriminatory patterns, inter alia: allocational harms, representational harms, stereotyping, unequal system performance, or questionable correlations (Bender et al. 2021; Blodgett et al. 2020).

Caution

This is a definition attempt. In NLP, the notion of “bias” is often used as an (ill-defined) umbrella term covering diverse ranges of “biased” behaviors. Blodgett et al. (2020) call for more rigorous contextual and operational definitions.

4.2.4 Examples: MCQ-based Evaluation

  • Political Leaning
    • Political Compass Test (PCT)*
  • Cultural norms and values
    • World Values Survey (WVS)*
    • Global Attitude Survey
  • Moral values and personality-like traits
    • Moral Foundations Questionnaire (MFQ)
    • Myers-Briggs Type Indicator
    • Big-5*

Sex outside marriage is usually immoral.

When jobs are scarce, employers should give priority to people of this country over immigrants.

You make friends easily.

*: Example shown.
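A common MCQ protocol scores each answer option by the likelihood the model assigns to it and picks the argmax. A minimal sketch with a stand-in scorer (the helper `pick_mcq_answer`, the toy scores, and the lambda are illustrative assumptions, not any cited study's setup):

```python
def pick_mcq_answer(question, options, score_fn):
    """Score each 'question + option' continuation and return the best option.
    `score_fn` stands in for a model's log-likelihood of the continuation."""
    scored = {opt: score_fn(question + " " + opt) for opt in options}
    return max(scored, key=scored.get)

# Stand-in scores; a real setup would sum the LM's log-probs over the option tokens.
toy_scores = {
    "Sex outside marriage is usually immoral. Agree": -5.2,
    "Sex outside marriage is usually immoral. Disagree": -3.1,
}
answer = pick_mcq_answer("Sex outside marriage is usually immoral.",
                         ["Agree", "Disagree"],
                         lambda text: toy_scores[text])
# answer == "Disagree" (the higher stand-in log-likelihood)
```

Note that such forced-choice scoring is itself contested: Röttger et al. (2024) show MCQ answers can diverge from the model's behavior in open-ended settings.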

4.2.5 Examples: MCQ-based Results

  • Political Leaning
    • Center-left (Ceron et al. (2024), and others)
    • Social-Democrat leaning
  • Cultural norms and values
    • Mainly Anglo-Saxon world
    • W.E.I.R.D.
  • Moral values and personality-like traits
    • ↑ Agreeableness
    • ↓ Neuroticism

Extracted from (Röttger et al. 2024).

Extracted from (Atari et al. 2023).

Extracted from (Rutinowski et al. 2024).

5 Summary

  • Decoder-only models trained for next-token prediction
  • Pre-training data is not “impartial”
  • Deployed models are not only pre-trained \(\to\) further aligned
  • Multi-task learner and multiple use cases
  • Evaluation is hard!
    • & biases run deep and have implications!

Not the pinnacle!

LLM Iceberg… ©Sasha Luccioni (@SashaMTL).

Even from a technical perspective… Slide from Yann LeCun (AMS Josiah Willard Gibbs Lecture, “Mathematical Obstacles on the Way to Human-Level AI.” (2025)).

6 4 weeks / 4 slides

6.1 Language modeling and transformer-based models

6.1.1 Foundations: The Transformer Architecture

Transformer Architecture, based on (Vaswani 2017), reworked from (Zhang et al. 2023).

6.1.2 Feature Extraction

6.1.3 Classification and Fine-tuning

6.1.4 Language Generation

6.2 In practice

6.2.1 Minimal (pseudo-)code for diverse tasks

Feature Extraction

from transformers import (
  BertTokenizer,
  BertModel
)
docs = ["YOUR", "DOCS", "HERE"]
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(DEVICE)
tokenized_sentences = tokenizer( # Tokenize the docs
  docs,
  truncation=True,               # Truncate inputs overflowing context length
  padding=True,                  # Pad inputs to maximal sequence length
  return_tensors="pt"
).to(DEVICE)
outputs = model(**tokenized_sentences)  # Run (tokenized) inputs through BERT
embeddings = outputs["last_hidden_state"]  # (n_samples, seq_len, embed_dim) vector representations (you could do something more fancy)
# Retrieve embeddings at the position of the tokens of interest

Document Representation

- BoW / TF-IDF
from sklearn.feature_extraction.text import (
  TfidfVectorizer,
  CountVectorizer
)
docs = ["YOUR", "DOCS", "HERE"]
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(docs)
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(docs)
- SentenceTransformer
from sentence_transformers import SentenceTransformer
docs = ["YOUR", "DOCS", "HERE"]
model_name = "<YOUR/MODEL_NAME>"       # local or from HF hub
model = SentenceTransformer(model_name)
embeddings = model.encode(docs)

Cosine Similarity

from sklearn.metrics.pairwise import cosine_similarity
docs_representations = ... # pre-computed document representations
pairwise_cosine_sims = cosine_similarity(docs_representations) # shape N_docs x N_docs
pairwise_cosine_sims[0,1] # cosine sim between docs 0 and 1

Topic Modeling:

- BERTopic
from bertopic import BERTopic                    # alternatively: use yours! `from <PATH_TO_SCRIPT>.mybertopic import MyBERTopic`
docs = ["YOUR", "DOCS", "HERE"]
topic_model = BERTopic()                         # instantiate the BERTopic object (you can add arguments / choose modules)
topics, probs = topic_model.fit_transform(docs)  # represent docs -> find clusters = topics -> assign 'labels'
- LDA
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import LdaModel

docs = ["YOUR", "DOCS", "HERE"]                       # your texts here (might be a good idea to do a bit of cleaning at some point...)
texts = [simple_preprocess(doc) for doc in docs]      # tokenize & lowercase each document
dictionary = corpora.Dictionary(texts)                # extracts vocabulary
corpus = [dictionary.doc2bow(text) for text in texts] # documents BoW representations

lda_model = LdaModel(             # Instantiate (and run) LDA algorithm
    corpus=corpus,                # pre-processed documents
    id2word=dictionary,           # corresponding dict. {id:word}
    num_topics=n_topics,          # ! THE NUMBER OF TOPICS HERE !
)

Rep+ML (& eval)

from sklearn.metrics import classification_report
from sklearn.svm import SVC # or any other ML method
X_train, X_test = ... # pre-computed document representations
y_train, y_test = ... # associated labels
clf = SVC()                # + add arguments
clf.fit(X_train, y_train)  # train the ML method
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))  # or the metric(s) you are interested in

SFT (for sequence classification)

from datasets import load_dataset
from transformers import (
  AutoTokenizer, AutoModelForSequenceClassification,
  TrainingArguments, Trainer,
)

raw_dataset = load_dataset("<YOUR_DATASET>")
model_name = "YOUR_MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
  model_name,
   num_labels=2 # Number of labels for the classification task (in the 'dataset')
)

def tokenize_function(example, text_column="text", max_length=512):
    return tokenizer(example[text_column], truncation=True,
                     padding="max_length", max_length=max_length)
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments("test-trainer")     # put the training arguments here!
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    processing_class=tokenizer,
)
trainer.train()

Generation

from transformers import pipeline
model_name = "<YOUR/MODEL_NAME>"       # local or from HF hub
prompts = ["YOUR", "PROMPTS", "HERE"]  # list of prompts
generator = pipeline(      # (note: you can use `pipeline` for diverse tasks)
    "text-generation",     # state the task
    model=model_name,      # give the name of the LM to use
    device_map="auto",     # automatically moves model to GPU if available
)
outputs = generator(
    prompts,                       # list of prompts
    max_new_tokens=max_new_tokens, # you can give the pipeline some generation arguments
    batch_size=batch_size          # pipeline handles the batch internally
)
responses = [o[0]["generated_text"] for o in outputs]

7 Hands-on

7.1 Hands-On Proposition

References

Anderson, Barrett R, Jash Hemant Shah, and Max Kreminski. 2024. “Homogenization Effects of Large Language Models on Human Creative Ideation.” In Creativity and Cognition, 413–25. C&c ’24. ACM. https://doi.org/10.1145/3635636.3656204.
Atari, Mohammad, Mona J Xue, Peter S Park, Damián Blasi, and Joseph Henrich. 2023. “Which Humans?”
Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, et al. 2022. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” https://arxiv.org/abs/2204.05862.
Barocas, Solon, and Andrew D Selbst. 2016. “Big Data’s Disparate Impact.” Calif. L. Rev. 104: 671.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922.
Binz, Marcel, and Eric Schulz. 2023. “Turning Large Language Models into Cognitive Models.” https://arxiv.org/abs/2306.03917.
Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. “Language (Technology) Is Power: A Critical Survey of ‘Bias’ in NLP.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, edited by Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, 5454–76. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.485.
Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, et al. 2022. “On the Opportunities and Risks of Foundation Models.” https://arxiv.org/abs/2108.07258.
Ceron, Tanise, Neele Falk, Ana Barić, Dmitry Nikolaev, and Sebastian Padó. 2024. “Beyond Prompt Brittleness: Evaluating the Reliability and Consistency of Political Worldviews in Llms.” Transactions of the Association for Computational Linguistics 12: 1378–1400.
Chang, Yupeng, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, et al. 2024. “A Survey on Evaluation of Large Language Models.” ACM Trans. Intell. Syst. Technol. 15 (3). https://doi.org/10.1145/3641289.
Cho, Aeree, Grace C. Kim, Alexander Karpekov, Alec Helbling, Zijie J. Wang, Seongmin Lee, Benjamin Hoover, and Duen Horng Chau. 2024. “Transformer Explainer: Interactive Learning of Text-Generative Models.” https://arxiv.org/abs/2408.04619.
Cunningham, Jay, Su Lin Blodgett, Michael Madaio, Hal Daumé Iii, Christina Harrington, and Hanna Wallach. 2024. “Understanding the Impacts of Language Technologies Performance Disparities on African American Language Speakers.” In Findings of the Association for Computational Linguistics: ACL 2024, edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 12826–33. Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.761.
Fourrier, Clementine, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. 2024. “Open-LLM Performances Are Plateauing, Let’s Make the Leaderboard Steep Again.” Hugging Face – The AI Community Building the Future. https://huggingface.co/spaces/open-llm-leaderboard/blog.
Geng, Mingmeng, Caixi Chen, Yanru Wu, Dongping Chen, Yao Wan, and Pan Zhou. 2024. “The Impact of Large Language Models in Academia: From Writing to Speaking.” https://arxiv.org/abs/2409.13686.
Gunasekar, Suriya, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, et al. 2023. “Textbooks Are All You Need.” https://arxiv.org/abs/2306.11644.
Guo, Zishan, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, et al. 2023. “Evaluating Large Language Models: A Comprehensive Survey.” https://arxiv.org/abs/2310.19736.
Gururangan, Suchin, Dallas Card, Sarah Dreier, Emily Gade, Leroy Wang, Zeyu Wang, Luke Zettlemoyer, and Noah A. Smith. 2022. “Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, edited by Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, 2562–80. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.emnlp-main.165.
Hadar-Shoval, Dorit, Kfir Asraf, Yonathan Mizrachi, Yuval Haber, and Zohar Elyoseph. 2023. “The Invisible Embedded ‘Values’ Within Large Language Models: Implications for Mental Health Use.”
Hagendorff, Thilo, Ishita Dasgupta, Marcel Binz, Stephanie C. Y. Chan, Andrew Lampinen, Jane X. Wang, Zeynep Akata, and Eric Schulz. 2024. “Machine Psychology.” https://arxiv.org/abs/2303.13988.
Jakesch, Maurice, Advait Bhat, Daniel Buschek, Lior Zalmanson, and Mor Naaman. 2023. “Co-Writing with Opinionated Language Models Affects Users’ Views.” In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. CHI ’23. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3544548.3581196.
Jiang, Wenying. 2000. “The Relationship Between Culture and Language.” ELT Journal 54 (4): 328–34.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” https://arxiv.org/abs/2001.08361.
Lambert, Nathan, Louis Castricato, Leandro von Werra, and Alex Havrilla. 2022. “Illustrating Reinforcement Learning from Human Feedback (RLHF).” Hugging Face Blog.
Li, Chao, Xing Su, Haoying Han, Cong Xue, Chunmo Zheng, and Chao Fan. 2023. “Quantifying the Impact of Large Language Models on Collective Opinion Dynamics.” arXiv Preprint arXiv:2308.03313.
Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, et al. 2023. “Holistic Evaluation of Language Models.” https://arxiv.org/abs/2211.09110.
Liu, Yang, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. 2025. “Datasets for Large Language Models: A Comprehensive Survey.” Artificial Intelligence Review 58 (12). https://doi.org/10.1007/s10462-025-11403-7.
Liu, Yuhan, Shangbin Feng, Xiaochuang Han, Vidhisha Balachandran, Chan Young Park, Sachin Kumar, and Yulia Tsvetkov. 2024. P\(^3\)Sum: Preserving Authors Perspective in News Summarization with Diffusion Language Models.” In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), edited by Kevin Duh, Helena Gomez, and Steven Bethard, 2154–73. Mexico City, Mexico: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.naacl-long.119.
Lucy, Li, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, and Jesse Dodge. 2024. AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 7393–7420. Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.400.
Merriam-Webster. 2025. “Large Language Model.” In Merriam-Webster.com Dictionary. https://www.merriam-webster.com/dictionary/large%20language%20model.
Minaee, Shervin, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2025. “Large Language Models: A Survey.” https://arxiv.org/abs/2402.06196.
Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” In Advances in Neural Information Processing Systems, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, 35:27730–44. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
Poole-Dayan, Elinor, Deb Roy, and Jad Kabbara. 2024. “LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users.” https://arxiv.org/abs/2406.17737.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog 1 (8): 9. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Rahwan, Iyad, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W. Crandall, et al. 2019. “Machine Behaviour.” Nature 568 (7753): 477–86. https://doi.org/10.1038/s41586-019-1138-y.
Röttger, Paul, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Kirk, Hinrich Schuetze, and Dirk Hovy. 2024. “Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 15295–311. Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.816.
Rutinowski, Jérôme, Sven Franke, Jan Endendyk, Ina Dormuth, Moritz Roidl, and Markus Pauly. 2024. “The Self-Perception and Political Biases of ChatGPT.” Human Behavior and Emerging Technologies 2024 (1): 7115633.
Salinas, Abel, Parth Shah, Yuzhong Huang, Robert McCormack, and Fred Morstatter. 2023. “The Unequal Opportunities of Large Language Models: Examining Demographic Biases in Job Recommendations by ChatGPT and LLaMA.” In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. EAAMO ’23. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3617694.3623257.
Soldaini, Luca. 2023. “Ai2 Dolma: 3 Trillion Token Open Corpus for Language Model Pretraining: AI2.” Ai2 RSS. https://allenai.org/blog/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64.
Sutton, Richard. 2019. “The Bitter Lesson.” Incomplete Ideas (Blog) 13 (1): 38.
Thompson, Alan D. 2025. “Models Table (10,000+ LLM Data Points).” LifeArchitect.ai. https://lifearchitect.ai/models-table/.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” https://arxiv.org/abs/2302.13971.
Vaswani, A. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems. https://dl.acm.org/doi/10.5555/3295222.3295349.
Wang, Yizhong, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, et al. 2022. “Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, edited by Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, 5085–5109. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.emnlp-main.340.
Wenzek, Guillaume, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. “CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data.” https://arxiv.org/abs/1911.00359.
Wikipedia. 2025. “Self-Supervised Learning.” In Wikipedia - the Free Encyclopedia. https://en.wikipedia.org/wiki/Self-supervised_learning.
Williams-Ceci, Sterling, Maurice Jakesch, Advait Bhat, Kowe Kadoma, Lior Zalmanson, and Mor Naaman. 2024. “Bias in AI Autocomplete Suggestions Leads to Attitude Shift on Societal Issues.”
Wolfe, Cameron R. 2023. “The Story of RLHF: Origins, Motivations, Techniques, and Modern Applications.” The Story of RLHF: Origins, Motivations, Techniques, and Modern Applications. Deep (Learning) Focus. https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations.
Yakura, Hiromu, Ezequiel Lopez-Lopez, Levin Brinkmann, Ignacio Serna, Prateek Gupta, and Iyad Rahwan. 2024. “Empirical Evidence of Large Language Model’s Influence on Human Spoken Communication.” https://arxiv.org/abs/2409.01754.
Ye, Haoran, Jing Jin, Yuhang Xie, Xin Zhang, and Guojie Song. 2025. “Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement.” arXiv Preprint arXiv:2505.08245.
Zhang, Aston, Zachary C. Lipton, Mu Li, and Alexander J. Smola. 2023. Dive into Deep Learning. Cambridge University Press.