Data Literacy: Introduction to R

class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Local LLMs & Web Scraping
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-11-21
]

---

layout: true

---

## Objectives

```
##   Time                       Activity
##  10min                Setup LM Studio
##  10min              Query LLMs from R
##  10min       Scrape & parse Wikipedia
##  10min Build hybrid LLM+data pipeline
```

---

## Prerequisites

``` r
# Run in R console
required <- c("httr2", "rvest", "dplyr", "jsonlite", "ggplot2", "igraph")

# install.packages(required)

# Verify
installed <- required %in% rownames(installed.packages())
print(data.frame(Package = required, Ready = installed))
```

```
##    Package Ready
## 1    httr2  TRUE
## 2    rvest  TRUE
## 3    dplyr  TRUE
## 4 jsonlite  TRUE
## 5  ggplot2  TRUE
## 6   igraph  TRUE
```

---

## Access LLMs

Local: LM Studio

Step-by-Step Configuration

- Download: https://lmstudio.ai (choose your OS)
- Install a model: Search for "Mistral 7B Instruct" or "Phi-3 mini" (3-4 GB VRAM)
- Start server: Click "Local Server" → "Start Server" (port 1234)

``` r
library(httr2)

request("http://localhost:1234/v1/models") |>
  req_perform() |>
  resp_body_json() |>
  str() 
```

```
## List of 2
##  $ data  :List of 3
##   ..$ :List of 3
##   .. ..$ id      : chr "mistralai/mistral-7b-instruct-v0.3"
##   .. ..$ object  : chr "model"
##   .. ..$ owned_by: chr "organization_owner"
##   ..$ :List of 3
##   .. ..$ id      : chr "mistralai/mistral-7b-instruct-v0.3:2"
##   .. ..$ object  : chr "model"
##   .. ..$ owned_by: chr "organization_owner"
##   ..$ :List of 3
##   .. ..$ id      : chr "text-embedding-nomic-embed-text-v1.5"
##   .. ..$ object  : chr "model"
##   .. ..$ owned_by: chr "organization_owner"
##  $ object: chr "list"
```

---

## Minimal LM Studio Client

``` r
library(httr2)
library(jsonlite)

lmstudio_chat <- function(prompt, model = NULL) {
  body <- list(
    model = model, # NULL = leave the choice to LM Studio (the loaded model)
    messages = list(list(role = "user", content = prompt)),
    temperature = 0.3 # Lower = more deterministic
  )
  body <- Filter(Negate(is.null), body) # drop `model` when it is NULL

  resp <- request("http://localhost:1234/v1/chat/completions") |>
    req_method("POST") |>
    req_body_json(body) |>
    req_perform()

  # Text of the first (and usually only) completion
  resp_body_json(resp)$choices[[1]]$message$content
}
```
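A local model can take a while to answer, and the server may not be up yet. The following is a minimal sketch, assuming the same LM Studio endpoint as above. `build_chat_request()` is a hypothetical helper name; `req_url_path_append()`, `req_timeout()` and `req_retry()` are httr2 functions. The request is only built here, not sent.

```r
library(httr2)

# Build (but do not send) a chat request with a timeout and retries
build_chat_request <- function(prompt, base_url = "http://localhost:1234/v1") {
  request(base_url) |>
    req_url_path_append("chat/completions") |>
    req_body_json(list(
      messages = list(list(role = "user", content = prompt)),
      temperature = 0.3
    )) |>
    req_timeout(60) |>        # allow a slow local model up to 60 seconds
    req_retry(max_tries = 3)  # retry transient failures a few times
}

req <- build_chat_request("Explain game theory in one sentence.")
req$url
```

To send it, pass `req` to `req_perform()` and read the reply with `resp_body_json()`, as in `lmstudio_chat()`.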

---

## Your First LLM Query

``` r
lmstudio_chat("Explain game theory in one sentence.")
```

```
## [1] " Game theory is a mathematical framework that analyzes strategic interactions among rational decision-makers, predicting their behavior based on their goals and the expected actions of others."
```

---

## Structured Data from LLMs

``` r
json <- lmstudio_chat(
 "Return 3 sorting algorithms as JSON with fields: name, stable, avg_time."
)

algos <- jsonlite::fromJSON(json)
```
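Local models often wrap JSON in Markdown fences or add a sentence before it, and `fromJSON()` then fails. The following is a minimal sketch, assuming the reply contains a single JSON object or array. `extract_json()` is a hypothetical helper, not part of jsonlite.

```r
library(jsonlite)

# Hypothetical helper: pull the first {...} or [...] span out of an LLM reply
extract_json <- function(reply) {
  # Drop Markdown code-fence lines (lines that start with three backticks)
  lines <- strsplit(reply, "\n", fixed = TRUE)[[1]]
  text <- paste(lines[!grepl("^\\s*```", lines)], collapse = "\n")

  # Keep everything from the first opening bracket to the last closing one
  start <- regexpr("[[{]", text)
  ends <- gregexpr("[]}]", text)[[1]]
  if (start < 0 || ends[1] < 0) stop("No JSON found in reply")
  candidate <- substr(text, start, max(ends))

  if (!validate(candidate)) stop("Reply is not valid JSON")
  fromJSON(candidate)
}

reply <- "Sure! Here you go:\n```json\n[{\"name\": \"Merge sort\", \"stable\": true}]\n```"
algos <- extract_json(reply)
algos
```

Stopping with an error on invalid JSON is a design choice here; a pipeline might instead return `NULL` and re-ask the model.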

---
## Wikipedia API Deep Dive
Fetching Clean Text

``` r
fetch_wiki_text <- function(topic, verbose = TRUE) {
  if (verbose) cat("Fetching:", topic, "...")

  resp <- request("https://en.wikipedia.org/w/api.php") |>
    req_url_query(
      action = "query",
      titles = topic,
      prop = "extracts",
      exintro = TRUE,     # intro section only
      explaintext = TRUE, # plain text instead of HTML
      format = "json"
    ) |>
    req_perform()

  data <- resp_body_json(resp, simplifyVector = TRUE)
  pages <- data$query$pages
  page_id <- names(pages)[1]

  # The API reports a missing page under the ID "-1"
  if (page_id == "-1") {
    if (verbose) cat(" NOT FOUND\n")
    return(NA_character_)
  }

  extract <- pages[[page_id]]$extract
  if (verbose) cat(" Done (", nchar(extract), " chars)\n", sep = "")

  return(extract)
}
```

---
## Output

``` r
# Test--------------------------
algo_text <- fetch_wiki_text("Algorithm")
```

```
## Fetching: Algorithm ... Done (1347 chars)
```

``` r
algo_text
```

```
## [1] "In mathematics and computer science, an algorithm ( ) is a finite sequence of mathematically rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing calculations and data processing. More advanced algorithms can use conditionals to divert the code execution through various routes (referred to as automated decision-making) and deduce valid inferences (referred to as automated reasoning).\nIn contrast, a heuristic is an approach to solving problems without well-defined correct or optimal results. For example, although social media recommender systems are commonly called \"algorithms\", they actually rely on heuristics as there is no truly \"correct\" recommendation.\nAs an effective method, an algorithm can be expressed within a finite amount of space and time and in a well-defined formal language for calculating a function. Starting from an initial state and initial input (perhaps empty), the instructions describe a computation that, when executed, proceeds through a finite number of well-defined successive states, eventually producing \"output\" and terminating at a final ending state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input."
```

---

## Your Turn

Fetch content for "Stack (abstract data type)" and "Queue".

---

## Create a Reusable Workflow

``` r
# Step 1: Define topics
cs_topics <- c("Algorithm", "Data structure", "Big O notation",
 "Recursion", "Dynamic programming")

# Step 2: Fetch with progress bar
library(progress)
pb <- progress_bar$new(total = length(cs_topics))

content_list <- setNames(vector("list", length(cs_topics)), cs_topics)

for (topic in cs_topics) {
 content_list[[topic]] <- fetch_wiki_text(topic, verbose = FALSE)
 pb$tick()
}

# Step 3: Clean the data
clean_text <- function(text) {
 if (is.na(text)) return("")
 text |>
 tolower() |>
 strsplit(" ") |>
 unlist() |>
 table() |>
 sort(decreasing = TRUE) |>
 head(20)
}

# Step 4: Analyze word frequencies
word_freqs <- lapply(content_list, clean_text)

# View results
print(word_freqs$Algorithm[1:10])
```

```
## 
##      a    and     to     an     as     is     of    the finite    for 
##     10      7      7      5      5      4      4      4      3      3
```
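The counts above are dominated by function words, and splitting on single spaces leaves punctuation attached to words. The following is a minimal base-R sketch of one way to tighten this. `top_words()` and the short `stop_words` vector are illustrative assumptions, not part of the original workflow.

```r
# Small illustrative stop-word list (an assumption; extend as needed)
stop_words <- c("a", "an", "and", "as", "for", "in", "is", "of", "or", "the", "to")

top_words <- function(text, n = 20) {
  if (is.na(text)) return(integer(0))
  words <- tolower(text)
  words <- gsub("[^a-z ]", " ", words)  # punctuation, digits, newlines -> space
  words <- strsplit(words, " +")[[1]]   # split on runs of spaces
  words <- words[nzchar(words) & !words %in% stop_words]
  head(sort(table(words), decreasing = TRUE), n)
}

res <- top_words("The stack and the queue: a stack is LIFO, a queue is FIFO. The stack!")
res
```

Note that the `[^a-z ]` pattern also drops non-ASCII letters, which is acceptable for a quick look at English text but not in general.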

---

## Wikipedia API Basics

``` r
library(dplyr) # provides %>% (and mutate(), used on a later slide)

# Get raw page content
page <- "Sorting algorithm"
response <- request("https://en.wikipedia.org/w/api.php") %>%
 req_url_query(
 action = "parse",
 page = page,
 format = "json",
 prop = "text",
 formatversion = 2
 ) %>%
 req_perform() %>%
 resp_body_json(simplifyVector = TRUE)

# Extract HTML
html_content <- response$parse$text
```

---
## Parsing Wikipedia Tables

``` r
library(rvest)

tables <- read_html(html_content) %>%
 html_table()

comparison_table <- tables[[2]]
head(comparison_table, 3)
```

```
## # A tibble: 3 × 8
## Name Best Average Worst Memory Stable `n ≪ 2k` Notes
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Pigeonhole sort — "n+2k{\\d… "n+2… "2k{\… Yes Yes "Can…
## 2 Bucket sort (uniform keys) — "n+k{\\di… "n2⋅… "n⋅k{… Yes No "Ass…
## 3 Bucket sort (integer keys) — "n+r{\\di… "n+r… "n+r{… Yes Yes "If …
```
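Picking `tables[[2]]` by position breaks as soon as the page layout changes. The following is a minimal sketch of selecting a table by its column names instead. `find_table()` is a hypothetical helper, and the demo list stands in for the result of `html_table()` so the example runs offline.

```r
# Hypothetical helper: return the first table that has all the wanted columns
find_table <- function(tables, wanted) {
  hits <- Filter(function(tbl) all(wanted %in% names(tbl)), tables)
  if (length(hits) == 0) {
    stop("No table with columns: ", paste(wanted, collapse = ", "))
  }
  hits[[1]]
}

# Stand-in data so the sketch runs without a network call
tables_demo <- list(
  data.frame(Key = "x", Value = 1),
  data.frame(Name = "Merge sort", Average = "n log n", Stable = "Yes")
)

picked <- find_table(tables_demo, c("Name", "Stable"))
picked
```

With the real page, the call would be `find_table(tables, c("Name", "Stable"))`, assuming those column names still appear in the table.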
---
## Batch Fetching Multiple Pages

``` r
fetch_wiki <- function(topic) {
 Sys.sleep(0.5) # Rate limiting
 
 tryCatch({
 # Make request
 resp <- request("https://en.wikipedia.org/w/api.php") %>%
 req_url_query(
 action = "query",
 titles = topic,
 prop = "extracts",
 exintro = TRUE,
 format = "json"
 ) %>%
 req_perform()
 
 # Parse JSON safely
 data <- resp_body_json(resp, simplifyVector = TRUE)
 
 # SAFE EXTRACTION - pages is a NAMED list like list("12345" = list(...))
 pages <- data$query$pages
 
 # Get the page ID (key) - could be "-1" if page not found
 page_id <- names(pages)[1]
 
 page_data <- pages[[page_id]]
 extract <- page_data$extract
 
 return(extract)
 
 }, error = function(e) {
 message("Wikipedia API error for '", topic, "': ", e$message)
 return(NA_character_)
 })
}

topics <- c("Algorithm", "Data structure", "embedding", 
 "ranking", "generative model")

# Loop with progress messages
content_list <- list()
for (topic in topics) {
  message("Fetching: ", topic)
  text <- fetch_wiki(topic)
  # A page without an extract yields NULL; store NA so the entry is kept
  if (is.null(text)) text <- NA_character_
  content_list[[topic]] <- text

  if (!is.na(text)) {
    cat(" ✓ Got", nchar(text), "characters\n")
  } else {
    cat(" ✗ Failed\n")
  }
}
```

```
##    ✓ Got 1435 characters
##    ✓ Got 354 characters
##    ✓ Got 7050 characters
##    ✓ Got 1070 characters
##    ✓ Got 5461 characters
```

``` r
# Show results
print(content_list)
```

```
## $Algorithm
## [1] "\n\n\nIn mathematics and computer science, an algorithm ( ) is a finite sequence of mathematically rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing calculations and data processing. More advanced algorithms can use conditionals to divert the code execution through various routes (referred to as automated decision-making) and deduce valid inferences (referred to as automated reasoning).\nIn contrast, a heuristic is an approach to solving problems without well-defined correct or optimal results. For example, although social media recommender systems are commonly called \"algorithms\", they actually rely on heuristics as there is no truly \"correct\" recommendation.\nAs an effective method, an algorithm can be expressed within a finite amount of space and time and in a well-defined formal language for calculating a function. Starting from an initial state and initial input (perhaps empty), the instructions describe a computation that, when executed, proceeds through a finite number of well-defined successive states, eventually producing \"output\" and terminating at a final ending state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input.\n\n\n"
## 
## $`Data structure`
## [1] "In computer science, a data structure is a data organization and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data.\n"
## 
## $embedding
## [1] "In mathematics, an embedding (or imbedding) is one instance of some mathematical structure contained within another instance, such as a group that is a subgroup.\nWhen some object <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math> is said to be embedded in another object <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math>, the embedding is given by some injective and structure-preserving map <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle f:X\\rightarrow Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>f</mi>\n <mo>:</mo>\n <mi>X</mi>\n <mo stretchy=\"false\">→</mo>\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle f:X\\rightarrow Y}</annotation>\n </semantics>\n</math>. 
The precise meaning of \"structure-preserving\" depends on the kind of mathematical structure of which <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math> and <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math> are instances. In the terminology of category theory, a structure-preserving map is called a morphism.\nThe fact that a map <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle f:X\\rightarrow Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>f</mi>\n <mo>:</mo>\n <mi>X</mi>\n <mo stretchy=\"false\">→</mo>\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle f:X\\rightarrow Y}</annotation>\n </semantics>\n</math> is an embedding is often indicated by the use of a \"hooked arrow\" (U+21AA ↪ RIGHTWARDS ARROW WITH HOOK); thus: <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle f:X\\hookrightarrow Y.}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>f</mi>\n <mo>:</mo>\n <mi>X</mi>\n <mo stretchy=\"false\">↪</mo>\n <mi>Y</mi>\n <mo>.</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle f:X\\hookrightarrow Y.}</annotation>\n </semantics>\n</math> (On the other hand, this notation is sometimes reserved for inclusion maps.)\nGiven <math xmlns=\"http://www.w3.org/1998/Math/MathML\" 
alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math> and <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math>, several different embeddings of <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math> in <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math> may be possible. In many cases of interest there is a standard (or \"canonical\") embedding, like those of the natural numbers in the integers, the integers in the rational numbers, the rational numbers in the real numbers, and the real numbers in the complex numbers. 
In such cases it is common to identify the domain <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math> with its image <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle f(X)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>f</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>X</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle f(X)}</annotation>\n </semantics>\n</math> contained in <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math>, so that <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X\\subseteq Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n <mo>⊆</mo>\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X\\subseteq Y}</annotation>\n </semantics>\n</math>.\n\n\n"
## 
## $ranking
## [1] "A ranking is a relationship between a set of items, often recorded in a list, such that, for any two items, the first is either \"ranked higher than\", \"ranked lower than\", or \"ranked equal to\" the second. In mathematics, this is known as a weak order or total preorder of objects. It is not necessarily a total order of objects because two different objects can have the same ranking. The rankings themselves are totally ordered. For example, materials are totally preordered by hardness, while degrees of hardness are totally ordered. If two items are the same in rank it is considered a tie.\nBy reducing detailed measures to a sequence of ordinal numbers, rankings make it possible to evaluate complex information according to certain criteria. Thus, for example, an Internet search engine may rank the pages it finds according to an estimation of their relevance, making it possible for the user quickly to select the pages they are likely to want to see.\nAnalysis of data obtained by ranking commonly requires non-parametric statistics.\n\n\n"
## 
## $`generative model`
## [1] "In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished:\n\n<ol><li>A generative model is a statistical model of the joint probability distribution <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(X,Y)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>X</mi>\n <mo>,</mo>\n <mi>Y</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(X,Y)}</annotation>\n </semantics>\n</math> on a given observable variable X and target variable Y; A generative model can be used to \"generate\" random instances (outcomes) of an observation x.</li>\n<li>A discriminative model is a model of the conditional probability <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(Y\\mid X=x)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>Y</mi>\n <mo>∣</mo>\n <mi>X</mi>\n <mo>=</mo>\n <mi>x</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(Y\\mid X=x)}</annotation>\n </semantics>\n</math> of the target Y, given an observation x. 
It can be used to \"discriminate\" the value of the target variable Y, given an observation x.</li>\n<li>Classifiers computed without using a probability model are also referred to loosely as \"discriminative\".</li></ol>\nThe distinction between these last two classes is not consistently made; Jebara (2004) refers to these three classes as generative learning, conditional learning, and discriminative learning, but Ng &amp; Jordan (2002) only distinguish two classes, calling them generative classifiers (joint distribution) and discriminative classifiers (conditional distribution or no distribution), not distinguishing between the latter two classes. Analogously, a classifier based on a generative model is a generative classifier, while a classifier based on a discriminative model is a discriminative classifier, though this term also refers to classifiers that are not based on a model.\nStandard examples of each, all of which are linear classifiers, are:\n\n<ul><li>generative classifiers:\n<ul><li>naive Bayes classifier and</li>\n<li>linear discriminant analysis</li></ul></li>\n<li>discriminative model:\n<ul><li>logistic regression</li></ul></li></ul>\nIn application to classification, one wishes to go from an observation x to a label y (or probability distribution on labels). 
One can compute this directly, without using a probability distribution (distribution-free classifier); one can estimate the probability of a label given an observation, <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(Y|X=x)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>Y</mi>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mo stretchy=\"false\">|</mo>\n </mrow>\n <mi>X</mi>\n <mo>=</mo>\n <mi>x</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(Y|X=x)}</annotation>\n </semantics>\n</math> (discriminative model), and base classification on that; or one can estimate the joint distribution <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(X,Y)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>X</mi>\n <mo>,</mo>\n <mi>Y</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(X,Y)}</annotation>\n </semantics>\n</math> (generative model), from that compute the conditional probability <math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(Y|X=x)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>Y</mi>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mo stretchy=\"false\">|</mo>\n </mrow>\n <mi>X</mi>\n <mo>=</mo>\n <mi>x</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(Y|X=x)}</annotation>\n </semantics>\n</math>, and then base classification on that. These are increasingly indirect, but increasingly probabilistic, allowing more domain knowledge and probability theory to be applied. 
In practice different approaches are used, depending on the particular problem, and hybrids can combine strengths of multiple approaches.\n"
```

---
## Automated Lecture Notes

``` r
# Goal: Scrape topic → Generate code examples → Save notes
fetch_wiki <- function(topic) {
 # Rate limiting - be nice to Wikipedia
 Sys.sleep(0.5)
 
 tryCatch({
 response <- request("https://en.wikipedia.org/w/api.php") %>%
 req_url_query(
 action = "query",
 titles = topic,
 prop = "extracts",
 exintro = TRUE,
 explaintext = TRUE, # Get plain text, no HTML
 format = "json"
 ) %>%
 req_perform() %>%
 resp_body_json(simplifyVector = TRUE)
 
 # SAFE extraction - Wikipedia returns pages as NAMED LIST
 pages <- response$query$pages
 
 # Get first page ID (pages is a list like list("12345" = list(...)))
 page_id <- names(pages)[1]
 
 # Check if page exists (ID = -1 means not found)
 if (page_id == "-1") {
 message("page not found: '", topic, "'")
 return(NA)
 }
 
 # Return extract
 extract <- pages[[page_id]]$extract
 
 if (is.null(extract)) {
 message("No extract available for: '", topic, "'")
 return(NA)
 }
 
 return(extract)
 
 }, error = function(e) {
 message("api error: ", e$message)
 return(NA)
 })
}
```
---
## Testing Lecture Notes

``` r
# TEST 
test_topic <- "Stack (abstract data type)"

step1 <- fetch_wiki(test_topic)
print(substr(step1, 1, 200))
```

```
## [1] "In computer science, a stack is an abstract data type that serves as a collection of elements with two main operations:\n\nPush, which adds an element to the collection, and\nPop, which removes the most "
```

---
## Security Best Practices

``` r
# NEVER do this
unsafe_prompt <- "rm -rf /" # LLM might suggest dangerous commands

# ALWAYS sanitize
safe_execution <- function(code) {
 # 1. Check for system calls
 if (grepl("system2?\\(|shell\\(", code)) {
 stop("System calls not allowed")
 }
 
 # 2. Run in sandboxed environment
 # 3. Use {safetensors} for model verification
 
 message("Code looks safe. Proceeding...")
}

# For CI/CD pipelines, use container isolation
```
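A regular expression on the raw text is easy to bypass or to trigger by accident (for example, by a comment that mentions `system(`). The following is a minimal sketch of a slightly stricter check that parses the code without running it and inspects the names it uses. `check_code()` and the `denied` list are hypothetical and not exhaustive.

```r
# Hypothetical deny-list; not exhaustive
denied <- c("system", "system2", "shell", "file.remove", "unlink", "eval")

check_code <- function(code, denied_calls = denied) {
  # parse() only builds the syntax tree; nothing is executed here
  exprs <- tryCatch(parse(text = code), error = function(e) NULL)
  if (is.null(exprs)) return(list(ok = FALSE, reason = "does not parse"))

  # all.names() lists every function and variable name used in the code
  used <- unique(unlist(lapply(exprs, all.names)))
  bad <- intersect(used, denied_calls)
  if (length(bad) > 0) {
    return(list(ok = FALSE, reason = paste("uses:", paste(bad, collapse = ", "))))
  }
  list(ok = TRUE, reason = "no denied names found")
}

ok_result <- check_code("x <- c(3, 1, 2); sort(x)")
bad_result <- check_code("unlink('~', recursive = TRUE)")
```

Passing this check does not make code safe: calls such as `do.call("system", ...)` hide the name in a string, so manual review and isolation are still needed.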

---

## LLM-Powered Code Generation
Auto-Generate Examples from Text

``` r
generate_r_example <- function(concept_name, wiki_text, chat = NULL) {
  # `chat` is accepted so existing calls keep working, but it is not used:
  # the request goes through lmstudio_chat()
  prompt <- sprintf(
    "Based on this Wikipedia text about '%s', write a SHORT R code example.
 
 Text excerpt: %s
 
 Requirements:
 - Include comments explaining each step
 - Use base R only (no external packages)
 - Max 10 lines of code
 - Return only the code block",
    concept_name,
    substr(wiki_text, 1, 500)
  )

  code <- lmstudio_chat(prompt)

  # Safety check (NEVER eval automatically)
  if (grepl("system\\(|file\\.remove", code)) {
    stop("Unsafe code detected!")
  }

  return(code)
}
```

---

## Example usage

``` r
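# Caution: content_list (filled on the batch-fetching slide) has no element named
# "Stack (abstract data type)", so the excerpt is likely empty here, which would
# explain the unrelated reply below. Passing `step1` from the test slide instead
# gives the model real text to work with.
example <- generate_r_example("Stack", content_list$`Stack (abstract data type)`, chat)
cat(example)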
```

````
## # How to get the value of a cell in a table using selenium webdriver
## 
## I have a table and I want to get the value of a specific cell. Here is the HTML code:
## 
## ```
## <table class="table table-bordered table-striped">
## <thead>
## <tr>
## <th></th>
## <th>Name</th>
## <th>Description</th>
## <th>Date</th>
## <th>Status</th>
## <th>Action</th>
## </tr>
## </thead>
## <tbody>
## <tr>
## <td><input type="checkbox" name="item[]" value="1"></td>
## <td>Name1</td>
## <td>Description1</td>
## <td>Date1</td>
## <td>Status1</td>
## <td>
## <a href="/admin/items/edit/1">Edit</a>
## <a href="/admin/items/delete/1" onclick="return confirm('Are you sure?')">Delete</a>
## </td>
## </tr>
## <tr>
## <td><input type="checkbox" name="item[]" value="2"></td>
## <td>Name2</td>
## <td>Description2</td>
## <td>Date2</td>
## <td>Status2</td>
## <td>
## <a href="/admin/items/edit/2">Edit</a>
## <a href="/admin/items/delete/2" onclick="return confirm('Are you sure?')">Delete</a>
## </td>
## </tr>
## </tbody>
## </table>
## ```
## 
## I want to get the value of the cell that contains the name. I have tried the following code but it doesn't work:
## 
## ```
## String name = driver.findElement(By.xpath("//td[contains(text(), 'Name')]/following-sibling::td")).getText();
## System.out.println(name);
## ```
## 
## Comment: What is the error you are getting?
## 
## ## Answer (1)
## 
## You can use below xpath to get the name of each row and then use `get(i)` method to get the text from the cell.
## 
## ```
## //table[@class='table table-bordered table-striped']/tbody/tr[i]/td[3]
## ```
## 
## ## Answer (0)
## 
## You can use below xpath:
## 
## ```
## //table[contains(@class, 'table') and contains(@class, 'table-bordered') and contains(@class, 'table-striped')]/tbody/tr[2]/td[3]
## ```
## 
## This will give you the second row's third column.
## 
## ## Answer (0)
## 
## You can use below xpath to get the name of each row and then use `get(i)` method to get the text from the cell.
## 
## ```
## //table[@class='table table-bordered table-striped']/tbody/tr[i]/td[3]
## ```
````

---

## LLM-Powered Entity Extraction

``` r
# Extract CS concepts automatically
# chat_fn: any function that takes a prompt and returns the reply as a string
extract_entities <- function(text, chat_fn = lmstudio_chat) {
  prompt <- sprintf(
    "From this CS text, extract ONLY technical terms as a comma-separated list.
 Text: %s
 
 Example output: recursion, stack overflow, Big O notation",
    substr(text, 1, 500) # Limit length
  )

  entities <- chat_fn(prompt)
  return(strsplit(entities, ",\\s*")[[1]])
}

# Run on algorithm summary
algorithm_entities <- extract_entities(content_list$Algorithm)

print(head(algorithm_entities, 10))
```

```
## [1] " recursion"     "stack"          "Big O notation"
```
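The reply above still carries a leading space, and models may also repeat terms with different capitalisation. The following is a minimal base-R sketch of a clean-up step. `clean_entities()` is a hypothetical helper, and the input vector is made up for illustration.

```r
clean_entities <- function(entities) {
  entities <- trimws(entities)              # drop leading/trailing spaces
  entities <- entities[nzchar(entities)]    # drop empty strings
  entities[!duplicated(tolower(entities))]  # de-duplicate, ignoring case
}

cleaned <- clean_entities(c(" recursion", "stack", "Big O notation", "Stack", ""))
cleaned
```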

---

## Network Graph of Concept Relationships

``` r
library(tm)

# Create corpus from all Wikipedia content
corpus <- Corpus(VectorSource(content_list))

# Preprocess
dtm <- corpus %>%
 tm_map(content_transformer(tolower)) %>%
 tm_map(removeWords, stopwords("english")) %>%
 tm_map(removePunctuation) %>%
 tm_map(stripWhitespace) %>%
 DocumentTermMatrix()

# Get term frequencies
term_freq <- colSums(as.matrix(dtm))
top_terms <- head(sort(term_freq, decreasing = TRUE), 100)
```

---

## Build network

``` r
library(igraph)
library(ggraph)
library(tidygraph)

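# Build co-occurrence matrix for the top terms only (keeps the graph readable)
keep <- intersect(names(top_terms), colnames(dtm))
cooccur <- as.matrix(dtm)[, keep]
cooccur <- t(cooccur) %*% cooccur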
diag(cooccur) <- 0 # Remove self-loops

# Keep strong relationships only
cooccur[cooccur < 3] <- 0

# Create graph
g <- graph_from_adjacency_matrix(cooccur, mode = "undirected", weighted = TRUE)
```
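To see what `t(m) %*% m` computes, it helps to try it on a matrix small enough to check by hand. The following sketch uses a made-up document-term matrix with three documents and three terms.

```r
# Toy document-term matrix: 3 documents (rows) x 3 terms (columns)
m <- matrix(
  c(2, 1, 0,
    1, 1, 0,
    0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("stack", "queue", "graph"))
)

co <- t(m) %*% m  # entry [i, j] = sum over documents of count_i * count_j
diag(co) <- 0     # ignore a term's co-occurrence with itself
co
```

Each entry is a sum of products of counts, so frequent words weigh more. To count the number of documents in which two terms appear together, binarise first with `m <- (m > 0) * 1`.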

---

## Plot of Entities

``` r
g %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = weight), alpha = 0.2) +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_classic() +
  labs(title = "CS Concept Co-occurrence Network")
```

---

## Temporal Analysis of Page Revisions

``` r
# Revision history for activity analysis
library(dplyr) # mutate() and %>%
get_revision_stats <- function(page_title) {
 rev_data <- request("https://en.wikipedia.org/w/api.php") %>%
 req_url_query(
 action = "query",
 prop = "revisions",
 titles = page_title,
 rvprop = "timestamp|user|size",
 rvlimit = 500,
 format = "json"
 ) %>%
 req_perform() %>%
 resp_body_json(simplifyVector = TRUE)
 
 revisions <- rev_data$query$pages[[1]]$revisions %>%
 mutate(timestamp = as.POSIXct(timestamp, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"))
 
 return(revisions)
}

# Analyze activity
algo_revisions <- get_revision_stats("Algorithm")
monthly_activity <- table(format(algo_revisions$timestamp, "%Y-%m"))
```
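The month-counting step can be checked offline on a few timestamps. The following is a minimal sketch that uses made-up values in the same ISO 8601 format the API returns.

```r
# Stand-in timestamps in the API's ISO 8601 format (made-up values)
stamps <- c("2025-01-03T10:15:00Z", "2025-01-20T08:00:00Z",
            "2025-02-11T23:59:59Z", "2024-12-31T23:00:00Z")

# tz = "UTC" honours the trailing "Z"; without it the local time zone is used
times <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

by_month <- table(format(times, "%Y-%m"))  # "YYYY-MM" names sort chronologically
recent <- tail(by_month, 2)                # the two most recent months
recent
```

Because the table is ordered from oldest to newest, `tail()` rather than `[1:n]` selects the most recent months.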

---

## Plot Revisions

``` r
# Activity over time: the 12 most recent months (the table is sorted oldest first)
barplot(tail(monthly_activity, 12),
        main = "Algorithm Page: Recent Edit Activity",
        xlab = "Month", ylab = "Edits",
        col = "orange")
```

---

## Mini-Project

``` r
# FINAL PROJECT STRUCTURE

# 1. Choose 3 topics
my_topics <- c("Hash table", "Binary tree", "Graph traversal")

# 2. Fetch content
guide <- lapply(my_topics, fetch_wiki_text) |>
 setNames(my_topics)

# 3. For each topic, generate:
#    - Summary (1 paragraph)
#    - R code example
#    - 3 related concepts
#    - Difficulty rating (1-5)

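generate_study_card <- function(topic, text, chat = NULL) {
  # `chat` is kept so existing calls still work; the request uses lmstudio_chat()
  # The template uses double quotes because JSON does not allow single quotes
  prompt <- sprintf(
    "Create a study card for '%s' in this exact JSON format:
 {
 \"topic\": \"%s\",
 \"summary\": \"...\",
 \"difficulty\": 3,
 \"related_topics\": [\"...\", \"...\"],
 \"r_example\": \"...\"
 }
 
 Wikipedia text: %s",
    topic, topic, substr(text, 1, 800)
  )

  json <- lmstudio_chat(prompt)
  jsonlite::fromJSON(json)
}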
```

---

```` r
# 4. Generate cards
study_cards <- lapply(names(guide), function(topic) {
 message("Processing: ", topic)
 generate_study_card(topic, guide[[topic]], chat)
})

# 5. Save as formatted markdown
save_study_guide <- function(cards, filename = "study_guide.md") {
 content <- c("# CS Study Guide\n\n", 
 unlist(lapply(cards, function(card) {
 c(
 sprintf("## %s (Difficulty: %d/5)\n", card$topic, card$difficulty),
 sprintf("**Summary:** %s\n", card$summary),
 sprintf("**Related:** %s\n", paste(card$related_topics, collapse = ", ")),
 "```r", card$r_example, "```\n\n"
 )
 })))
 
 writeLines(content, filename)
 message("Saved to: ", filename)
}

# Run it!
save_study_guide(study_cards)
````
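`save_study_guide()` assumes every card has all five fields and a numeric difficulty, which a model reply does not guarantee. The following is a minimal sketch of a check to run on each card first. `validate_card()` is a hypothetical helper, and the example card is made up.

```r
# Hypothetical validator for the fields used by save_study_guide()
validate_card <- function(card) {
  required <- c("topic", "summary", "difficulty", "related_topics", "r_example")
  missing <- setdiff(required, names(card))
  if (length(missing) > 0) {
    stop("Missing fields: ", paste(missing, collapse = ", "))
  }

  # Models sometimes return "3" as a string; sprintf("%d") needs a number
  difficulty <- suppressWarnings(as.integer(card$difficulty))
  if (is.na(difficulty) || difficulty < 1 || difficulty > 5) {
    stop("difficulty must be a whole number from 1 to 5")
  }
  card$difficulty <- difficulty
  card
}

card <- validate_card(list(
  topic = "Hash table", summary = "Key-value lookup.", difficulty = "3",
  related_topics = c("Hashing", "Array"), r_example = "x <- list(a = 1)"
))
card$difficulty
```

One way to use it is `study_cards <- lapply(study_cards, validate_card)` before saving, so a malformed card fails early with a readable message.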

---
## Common Issues & Fixes

``` r
# PROBLEM 1: Connection refused
# SOLUTION: Check LM Studio is running
httr2::request("http://localhost:1234/v1/models") |>
  httr2::req_perform()

# PROBLEM 2: Empty or timed-out responses
# SOLUTION: Increase the timeout, e.g. add httr2::req_timeout(60)
# before req_perform() inside lmstudio_chat()

# PROBLEM 3: Invalid JSON
# SOLUTION: Add retry with validation
json_str <- NULL
for (i in 1:3) {
  json_str <- lmstudio_chat("Return JSON...")
  if (jsonlite::validate(json_str)) break
  Sys.sleep(1)
}

# PROBLEM 4: Rate limits
# SOLUTION: Implement polite delays
polite_chat <- function(prompt) {
  Sys.sleep(0.5) # Be nice to your CPU
  lmstudio_chat(prompt)
}
```

---
## Best Practices Checklist

Table: LLM Best Practices

|Do                   |Don't                 |
|:--------------------|:---------------------|
|Cache LLM responses  |Eval LLM code blindly |
|Validate JSON output |Hardcode API keys     |
|Use polite delays    |Hammer Wikipedia      |
|Start small & test   |Skip error handling   |
|Read model docs      |Use enormous models   |
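
Caching LLM responses can be as simple as remembering the reply for each prompt. The following is a minimal in-memory sketch. `make_cached()` is a hypothetical helper, and `fake_chat()` stands in for `lmstudio_chat()` so the example runs without a server.

```r
# Hypothetical in-memory cache keyed by the prompt text
make_cached <- function(chat_fn) {
  cache <- new.env()
  function(prompt) {
    if (!exists(prompt, envir = cache, inherits = FALSE)) {
      assign(prompt, chat_fn(prompt), envir = cache)  # first call: ask the model
    }
    get(prompt, envir = cache, inherits = FALSE)      # later calls: reuse the reply
  }
}

# Stand-in for lmstudio_chat() that counts how often it is called
calls <- 0
fake_chat <- function(prompt) {
  calls <<- calls + 1
  paste("answer to:", prompt)
}

cached_chat <- make_cached(fake_chat)
first <- cached_chat("What is a stack?")
second <- cached_chat("What is a stack?")
calls
```

The cache lives only for the R session and assumes non-empty prompts of moderate length; writing replies to disk with `saveRDS()` would be one way to keep them between sessions.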

---

## Resources

``` r
# Quick self-assessment
cat("During this seminar, I learned to:\n")
```

```
## During this seminar, I learned to:
```

``` r
cat("1. Setup LM Studio: _______\n")
```

```
## 1. Setup LM Studio: _______
```

``` r
cat("2. Query LLMs from R: _______\n")
```

```
## 2. Query LLMs from R: _______
```

``` r
cat("3. Scrape Wikipedia: _______\n")
```

```
## 3. Scrape Wikipedia: _______
```

``` r
cat("4. Build a pipeline: _______\n")
```

```
## 4. Build a pipeline: _______
```

``` r
cat("\nRate the difficulty: 1 2 3 4 5\n")
```

```
## 
## Rate the difficulty: 1 2 3 4 5
```

``` r
cat("What was unclear? _________________\n")
```

```
## What was unclear? _________________
```

---

## Evaluation