class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Local LLMs & Web Scraping
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-11-21
]

---
layout: true

---

## Objectives

```
## Time   Activity
## 10min  Setup LM Studio
## 10min  Query LLMs from R
## 10min  Scrape & parse Wikipedia
## 10min  Build hybrid LLM+data pipeline
```

---

## Prerequisites

``` r
# Run in R console
required <- c("httr2", "rvest", "dplyr", "jsonlite", "ggplot2", "igraph")
# install.packages(required)

# Verify
installed <- required %in% rownames(installed.packages())
print(data.frame(Package = required, Ready = installed))
```

```
##    Package Ready
## 1    httr2  TRUE
## 2    rvest  TRUE
## 3    dplyr  TRUE
## 4 jsonlite  TRUE
## 5  ggplot2  TRUE
## 6   igraph  TRUE
```

---

## Access LLMs Locally: LM Studio

Step-by-Step Configuration

- Download: https://lmstudio.ai (choose your OS)
- Install a model: Search for "Mistral 7B Instruct" or "Phi-3 mini" (3-4GB VRAM)
- Start server: Click "Local Server" → "Start Server" (port 1234)

``` r
library(httr2)

request("http://localhost:1234/v1/models") |>
  req_perform() |>
  resp_body_json() |>
  str()
```

```
## List of 2
##  $ data  :List of 3
##   ..$ :List of 3
##   .. ..$ id      : chr "mistralai/mistral-7b-instruct-v0.3"
##   .. ..$ object  : chr "model"
##   .. ..$ owned_by: chr "organization_owner"
##   ..$ :List of 3
##   .. ..$ id      : chr "mistralai/mistral-7b-instruct-v0.3:2"
##   .. ..$ object  : chr "model"
##   .. ..$ owned_by: chr "organization_owner"
##   ..$ :List of 3
##   .. ..$ id      : chr "text-embedding-nomic-embed-text-v1.5"
##   .. ..$ object  : chr "model"
##   .. ..$ owned_by: chr "organization_owner"
##  $ object: chr "list"
```

---

## Minimal LM Studio Client

``` r
library(jsonlite)

lmstudio_chat <- function(prompt, model = NULL) {
  body <- list(
    model = model,  # NULL = auto-detected by LM Studio
    messages = list(list(role = "user", content = prompt)),
    temperature = 0.3  # Lower = more deterministic
  )
  body <- Filter(Negate(is.null), body)  # Drop the model field if it is unset

  resp <- request("http://localhost:1234/v1/chat/completions") |>
    req_method("POST") |>
    req_body_json(body) |>
    req_perform()

  resp_body_json(resp)$choices[[1]]$message$content
}
```

---

## Your First LLM Query

``` r
lmstudio_chat("Explain game theory in one sentence.")
```

```
## [1] " Game theory is a mathematical framework that analyzes strategic interactions among rational decision-makers, predicting their behavior based on their goals and the expected actions of others."
```

---

## Structured Data from LLMs

``` r
json <- lmstudio_chat(
  "Return 3 sorting algorithms as JSON with fields: name, stable, avg_time."
)
algos <- jsonlite::fromJSON(json)
```

---

## Wikipedia API Dive

Fetching Clean Text

``` r
fetch_wiki_text <- function(topic, verbose = TRUE) {
  if (verbose) cat("Fetching:", topic, "...")

  resp <- request("https://en.wikipedia.org/w/api.php") |>
    req_url_query(
      action = "query",
      titles = topic,
      prop = "extracts",
      exintro = TRUE,
      explaintext = TRUE,
      format = "json"
    ) |>
    req_perform()

  data <- resp_body_json(resp, simplifyVector = TRUE)
  pages <- data$query$pages
  page_id <- names(pages)[1]

  # A page ID of "-1" means the title does not exist
  if (page_id == "-1") {
    if (verbose) cat(" NOT FOUND\n")
    return(NA_character_)
  }

  extract <- pages[[page_id]]$extract
  if (verbose) cat(" Done (", nchar(extract), " chars)\n", sep = "")
  return(extract)
}
```

---

## Output

``` r
# Test--------------------------
algo_text <- fetch_wiki_text("Algorithm")
```

```
## Fetching: Algorithm ... Done (1347 chars)
```

``` r
algo_text
```

```
## [1] "In mathematics and computer science, an algorithm ( ) is a finite sequence of mathematically rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing calculations and data processing. More advanced algorithms can use conditionals to divert the code execution through various routes (referred to as automated decision-making) and deduce valid inferences (referred to as automated reasoning).\nIn contrast, a heuristic is an approach to solving problems without well-defined correct or optimal results. For example, although social media recommender systems are commonly called \"algorithms\", they actually rely on heuristics as there is no truly \"correct\" recommendation.\nAs an effective method, an algorithm can be expressed within a finite amount of space and time and in a well-defined formal language for calculating a function. Starting from an initial state and initial input (perhaps empty), the instructions describe a computation that, when executed, proceeds through a finite number of well-defined successive states, eventually producing \"output\" and terminating at a final ending state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input."
```

---

## Your Turn

Fetch content for "Stack (abstract data type)" and "Queue".
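Before running the exercise against the live API, it can help to see how the parsed response is unpacked. The sketch below is an offline illustration only: `extract_intro()` and the two mocked responses are hypothetical (they are not part of the slides) and merely imitate the shape of the `query`/`extracts` JSON, including the `"-1"` key that marks a missing page.

```r
# Hypothetical helper (not from the slides): pull the intro extract out of an
# already-parsed Wikipedia API response.
extract_intro <- function(parsed) {
  pages <- parsed$query$pages
  page_id <- names(pages)[1]

  # Missing pages are keyed by the ID "-1"
  if (is.null(page_id) || page_id == "-1") return(NA_character_)

  extract <- pages[[page_id]]$extract
  if (is.null(extract)) NA_character_ else extract
}

# Mocked responses (assumed data, shaped like the real JSON)
found <- list(query = list(pages = list(
  "12345" = list(
    title = "Stack (abstract data type)",
    extract = "In computer science, a stack is ..."
  )
)))
not_found <- list(query = list(pages = list(
  "-1" = list(title = "No such page", missing = "")
)))

extract_intro(found)      # returns the extract string
extract_intro(not_found)  # returns NA
```

For the exercise itself, calling `fetch_wiki_text()` once per topic (for example inside `lapply()`) applies the same logic to live data.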
---

## Create a Reusable Workflow

``` r
# Step 1: Define topics
cs_topics <- c("Algorithm", "Data structure", "Big O notation",
               "Recursion", "Dynamic programming")

# Step 2: Fetch with progress bar
library(progress)
pb <- progress_bar$new(total = length(cs_topics))
content_list <- setNames(vector("list", length(cs_topics)), cs_topics)

for (topic in cs_topics) {
  content_list[[topic]] <- fetch_wiki_text(topic, verbose = FALSE)
  pb$tick()
}

# Step 3: Normalize the text and count words
clean_text <- function(text) {
  if (is.na(text)) return("")
  text |>
    tolower() |>
    strsplit(" ") |>
    unlist() |>
    table() |>
    sort(decreasing = TRUE) |>
    head(20)
}

# Step 4: Analyze word frequencies
word_freqs <- lapply(content_list, clean_text)

# View results
print(word_freqs$Algorithm[1:10])
```

```
## 
##      a    and     to     an     as     is     of    the finite    for 
##     10      7      7      5      5      4      4      4      3      3
```

---

## Wikipedia API Basics

``` r
library(dplyr)  # Provides %>% (and mutate(), used later)

# Get raw page content
page <- "Sorting algorithm"
response <- request("https://en.wikipedia.org/w/api.php") %>%
  req_url_query(
    action = "parse",
    page = page,
    format = "json",
    prop = "text",
    formatversion = 2
  ) %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE)

# Extract HTML
html_content <- response$parse$text
```

---

## Parsing Wikipedia Tables

``` r
library(rvest)

tables <- read_html(html_content) %>% html_table()
comparison_table <- tables[[2]]
head(comparison_table, 3)
```

```
## # A tibble: 3 × 8
##   Name                       Best  Average     Worst Memory Stable `n ≪ 2k` Notes
##   <chr>                      <chr> <chr>       <chr> <chr>  <chr>  <chr>    <chr>
## 1 Pigeonhole sort            —     "n+2k{\\d…  "n+2… "2k{\… Yes    Yes      "Can…
## 2 Bucket sort (uniform keys) —     "n+k{\\di…  "n2⋅… "n⋅k{… Yes    No       "Ass…
## 3 Bucket sort (integer keys) —     "n+r{\\di…  "n+r… "n+r{… Yes    Yes      "If …
```

---

## Batch Fetching Multiple Pages

``` r
fetch_wiki <- function(topic) {
  Sys.sleep(0.5)  # Rate limiting

  tryCatch({
    # Make request
    resp <- request("https://en.wikipedia.org/w/api.php") %>%
      req_url_query(
        action = "query",
        titles = topic,
        prop = "extracts",
        exintro = TRUE,
        format = "json"
      ) %>%
      req_perform()

    # Parse JSON safely
    data <- resp_body_json(resp, simplifyVector = TRUE)

    # SAFE EXTRACTION - pages is a NAMED list like list("12345" = list(...))
    pages <- data$query$pages

    # Get the page ID (key) - could be "-1" if page not found
    page_id <- names(pages)[1]
    page_data <- pages[[page_id]]
    extract <- page_data$extract

    # A missing page has no extract; return NA so the caller can test for it
    if (is.null(extract)) return(NA_character_)
    return(extract)
  }, error = function(e) {
    message("Wikipedia API error for '", topic, "': ", e$message)
    return(NA_character_)
  })
}

topics <- c("Algorithm", "Data structure", "embedding", "ranking", "generative model")

# Loop with progress messages
content_list <- list()
for (topic in topics) {
  message("Fetching: ", topic)
  content_list[[topic]] <- fetch_wiki(topic)

  if (!is.na(content_list[[topic]])) {
    cat(" ✓ Got", nchar(content_list[[topic]]), "characters\n")
  } else {
    cat(" ✗ Failed\n")
  }
}
```

```
##  ✓ Got 1435 characters
##  ✓ Got 354 characters
##  ✓ Got 7050 characters
##  ✓ Got 1070 characters
##  ✓ Got 5461 characters
```

``` r
# Show results
print(content_list)
```

```
## $Algorithm
## [1] "<p class=\"mw-empty-elt\">\n</p>\n\n<p>In mathematics and computer science, an <b>algorithm</b> (<span> <span></span></span>) is a finite sequence of mathematically rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing calculations and data processing. More advanced algorithms can use conditionals to divert the code execution through various routes (referred to as automated decision-making) and deduce valid inferences (referred to as automated reasoning).\n</p><p>In contrast, a heuristic is an approach to solving problems without well-defined correct or optimal results. 
For example, although social media recommender systems are commonly called \"algorithms\", they actually rely on heuristics as there is no truly \"correct\" recommendation.\n</p><p>As an effective method, an algorithm can be expressed within a finite amount of space and time and in a well-defined formal language for calculating a function. Starting from an initial state and initial input (perhaps empty), the instructions describe a computation that, when executed, proceeds through a finite number of well-defined successive states, eventually producing \"output\" and terminating at a final ending state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input.\n</p>\n\n" ## ## $`Data structure` ## [1] "<p>In computer science, a <b>data structure</b> is a data organization and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data.\n</p>" ## ## $embedding ## [1] "<p>In mathematics, an <b>embedding</b> (or <b>imbedding</b>) is one instance of some mathematical structure contained within another instance, such as a group that is a subgroup.\n</p><p>When some object <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math></span></span> is said to be embedded in another object <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n 
</mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math></span></span>, the embedding is given by some injective and structure-preserving map <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle f:X\\rightarrow Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>f</mi>\n <mo>:</mo>\n <mi>X</mi>\n <mo stretchy=\"false\">→<!-- → --></mo>\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle f:X\\rightarrow Y}</annotation>\n </semantics>\n</math></span></span>. The precise meaning of \"structure-preserving\" depends on the kind of mathematical structure of which <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math></span></span> and <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math></span></span> are instances. 
In the terminology of category theory, a structure-preserving map is called a morphism.\n</p><p>The fact that a map <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle f:X\\rightarrow Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>f</mi>\n <mo>:</mo>\n <mi>X</mi>\n <mo stretchy=\"false\">→<!-- → --></mo>\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle f:X\\rightarrow Y}</annotation>\n </semantics>\n</math></span></span> is an embedding is often indicated by the use of a \"hooked arrow\" (<span><span>U+21AA</span> </span><span>↪</span> <span>RIGHTWARDS ARROW WITH HOOK</span>); thus: <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle f:X\\hookrightarrow Y.}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>f</mi>\n <mo>:</mo>\n <mi>X</mi>\n <mo stretchy=\"false\">↪<!-- ↪ --></mo>\n <mi>Y</mi>\n <mo>.</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle f:X\\hookrightarrow Y.}</annotation>\n </semantics>\n</math></span></span> (On the other hand, this notation is sometimes reserved for inclusion maps.)\n</p><p>Given <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math></span></span> and <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n 
</semantics>\n</math></span></span>, several different embeddings of <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math></span></span> in <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math></span></span> may be possible. In many cases of interest there is a standard (or \"canonical\") embedding, like those of the natural numbers in the integers, the integers in the rational numbers, the rational numbers in the real numbers, and the real numbers in the complex numbers. 
In such cases it is common to identify the domain <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X}</annotation>\n </semantics>\n</math></span></span> with its image <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle f(X)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>f</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>X</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle f(X)}</annotation>\n </semantics>\n</math></span></span> contained in <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle Y}</annotation>\n </semantics>\n</math></span></span>, so that <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle X\\subseteq Y}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>X</mi>\n <mo>⊆<!-- ⊆ --></mo>\n <mi>Y</mi>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle X\\subseteq Y}</annotation>\n </semantics>\n</math></span></span>.\n</p>\n\n" ## ## $ranking ## [1] "<p>A <b>ranking</b> is a relationship between a set of items, often recorded in a list, such that, for any two items, the first is either \"ranked higher than\", \"ranked lower than\", or \"ranked equal to\" the second. In mathematics, this is known as a weak order or total preorder of objects. 
It is not necessarily a total order of objects because two different objects can have the same ranking. The rankings themselves are totally ordered. For example, materials are totally preordered by hardness, while degrees of hardness are totally ordered. If two items are the same in rank it is considered a tie.\n</p><p>By reducing detailed measures to a sequence of ordinal numbers, rankings make it possible to evaluate complex information according to certain criteria. Thus, for example, an Internet search engine may rank the pages it finds according to an estimation of their relevance, making it possible for the user quickly to select the pages they are likely to want to see.\n</p><p>Analysis of data obtained by ranking commonly requires non-parametric statistics.\n</p>\n\n" ## ## $`generative model` ## [1] "<p>In statistical classification, two main approaches are called the <b>generative</b> approach and the <b>discriminative</b> approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. 
Terminology is inconsistent, but three major types can be distinguished:\n</p>\n<ol><li>A generative model is a statistical model of the joint probability distribution <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(X,Y)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>X</mi>\n <mo>,</mo>\n <mi>Y</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(X,Y)}</annotation>\n </semantics>\n</math></span></span> on a given observable variable <i>X</i> and target variable <i>Y</i>; A generative model can be used to \"generate\" random instances (outcomes) of an observation <i>x</i>.</li>\n<li>A discriminative model is a model of the conditional probability <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(Y\\mid X=x)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>Y</mi>\n <mo>∣<!-- ∣ --></mo>\n <mi>X</mi>\n <mo>=</mo>\n <mi>x</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(Y\\mid X=x)}</annotation>\n </semantics>\n</math></span></span> of the target <i>Y</i>, given an observation <i>x</i>. 
It can be used to \"discriminate\" the value of the target variable <i>Y</i>, given an observation <i>x</i>.</li>\n<li>Classifiers computed without using a probability model are also referred to loosely as \"discriminative\".</li></ol>\n<p>The distinction between these last two classes is not consistently made; Jebara (2004) refers to these three classes as <i>generative learning</i>, <i>conditional learning</i>, and <i>discriminative learning</i>, but Ng & Jordan (2002) only distinguish two classes, calling them generative classifiers (joint distribution) and discriminative classifiers (conditional distribution or no distribution), not distinguishing between the latter two classes. Analogously, a classifier based on a generative model is a generative classifier, while a classifier based on a discriminative model is a discriminative classifier, though this term also refers to classifiers that are not based on a model.\n</p><p>Standard examples of each, all of which are linear classifiers, are:\n</p>\n<ul><li>generative classifiers:\n<ul><li>naive Bayes classifier and</li>\n<li>linear discriminant analysis</li></ul></li>\n<li>discriminative model:\n<ul><li>logistic regression</li></ul></li></ul>\n<p>In application to classification, one wishes to go from an observation <i>x</i> to a label <i>y</i> (or probability distribution on labels). 
One can compute this directly, without using a probability distribution (<i>distribution-free classifier</i>); one can estimate the probability of a label given an observation, <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(Y|X=x)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>Y</mi>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mo stretchy=\"false\">|</mo>\n </mrow>\n <mi>X</mi>\n <mo>=</mo>\n <mi>x</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(Y|X=x)}</annotation>\n </semantics>\n</math></span></span> (<i>discriminative model</i>), and base classification on that; or one can estimate the joint distribution <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(X,Y)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>X</mi>\n <mo>,</mo>\n <mi>Y</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(X,Y)}</annotation>\n </semantics>\n</math></span></span> (<i>generative model</i>), from that compute the conditional probability <span><span><math xmlns=\"http://www.w3.org/1998/Math/MathML\" alttext=\"{\\displaystyle P(Y|X=x)}\">\n <semantics>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n <mi>P</mi>\n <mo stretchy=\"false\">(</mo>\n <mi>Y</mi>\n <mrow class=\"MJX-TeXAtom-ORD\">\n <mo stretchy=\"false\">|</mo>\n </mrow>\n <mi>X</mi>\n <mo>=</mo>\n <mi>x</mi>\n <mo stretchy=\"false\">)</mo>\n </mstyle>\n </mrow>\n <annotation encoding=\"application/x-tex\">{\\displaystyle P(Y|X=x)}</annotation>\n </semantics>\n</math></span></span>, and then base classification on that. 
These are increasingly indirect, but increasingly probabilistic, allowing more domain knowledge and probability theory to be applied. In practice different approaches are used, depending on the particular problem, and hybrids can combine strengths of multiple approaches.\n</p>"
```

---

## Automated Lecture Notes

``` r
# Goal: Scrape topic → Generate code examples → Save notes
fetch_wiki <- function(topic) {
  # Rate limiting - be nice to Wikipedia
  Sys.sleep(0.5)

  tryCatch({
    response <- request("https://en.wikipedia.org/w/api.php") %>%
      req_url_query(
        action = "query",
        titles = topic,
        prop = "extracts",
        exintro = TRUE,
        explaintext = TRUE,  # Get plain text, no HTML
        format = "json"
      ) %>%
      req_perform() %>%
      resp_body_json(simplifyVector = TRUE)

    # SAFE extraction - Wikipedia returns pages as NAMED LIST
    pages <- response$query$pages

    # Get first page ID (pages is a list like list("12345" = list(...)))
    page_id <- names(pages)[1]

    # Check if page exists (ID = -1 means not found)
    if (page_id == "-1") {
      message("Page not found: '", topic, "'")
      return(NA_character_)
    }

    # Return extract
    extract <- pages[[page_id]]$extract
    if (is.null(extract)) {
      message("No extract available for: '", topic, "'")
      return(NA_character_)
    }
    return(extract)
  }, error = function(e) {
    message("API error: ", e$message)
    return(NA_character_)
  })
}
```

---

## Testing Lecture Notes

``` r
# TEST
test_topic <- "Stack (abstract data type)"
step1 <- fetch_wiki(test_topic)
print(substr(step1, 1, 200))
```

```
## [1] "In computer science, a stack is an abstract data type that serves as a collection of elements with two main operations:\n\nPush, which adds an element to the collection, and\nPop, which removes the most "
```

---

## Security Best Practices

``` r
# NEVER do this
unsafe_prompt <- "rm -rf /"  # LLM might suggest dangerous commands

# ALWAYS sanitize
safe_execution <- function(code) {
  # 1. Check for system calls
  if (grepl("system\\(|shell\\(", code)) {
    stop("System calls not allowed")
  }
  # 2. Run in sandboxed environment
  # 3. Use {safetensors} for model verification
  message("Code looks safe. Proceeding...")
}

# For CI/CD pipelines, use container isolation
```

---

## LLM-Powered Code Generation

Auto-Generate Examples from Text

``` r
generate_r_example <- function(concept_name, wiki_text, chat = lmstudio_chat) {
  prompt <- sprintf(
    "Based on this Wikipedia text about '%s', write a SHORT R code example.

    Text excerpt: %s

    Requirements:
    - Include comments explaining each step
    - Use base R only (no external packages)
    - Max 10 lines of code
    - Return only the code block",
    concept_name, substr(wiki_text, 1, 500)
  )

  code <- chat(prompt)  # Function style: chat is a function such as lmstudio_chat

  # Safety check (NEVER eval automatically)
  if (grepl("system\\(|file.remove", code)) {
    stop("Unsafe code detected!")
  }
  return(code)
}
```

---

## Example usage

``` r
# content_list has no "Stack" entry, so pass the text fetched in the test above
example <- generate_r_example("Stack", step1)
cat(example)
```

In the recorded run below, the model returned unrelated text (a Selenium Q&A thread), a reminder to inspect generated output before using it.

````
## # How to get the value of a cell in a table using selenium webdriver
##
## I have a table and I want to get the value of a specific cell. Here is the HTML code:
##
## ```
## <table class="table table-bordered table-striped">
## <thead>
## <tr>
## <th></th>
## <th>Name</th>
## <th>Description</th>
## <th>Date</th>
## <th>Status</th>
## <th>Action</th>
## </tr>
## </thead>
## <tbody>
## <tr>
## <td><input type="checkbox" name="item[]" value="1"></td>
## <td>Name1</td>
## <td>Description1</td>
## <td>Date1</td>
## <td>Status1</td>
## <td>
## <a href="/admin/items/edit/1">Edit</a>
## <a href="/admin/items/delete/1" onclick="return confirm('Are you sure?')">Delete</a>
## </td>
## </tr>
## <tr>
## <td><input type="checkbox" name="item[]" value="2"></td>
## <td>Name2</td>
## <td>Description2</td>
## <td>Date2</td>
## <td>Status2</td>
## <td>
## <a href="/admin/items/edit/2">Edit</a>
## <a href="/admin/items/delete/2" onclick="return confirm('Are you sure?')">Delete</a>
## </td>
## </tr>
## </tbody>
## </table>
## ```
##
## I want to get the value of the cell that contains the name. I have tried the following code but it doesn't work:
##
## ```
## String name = driver.findElement(By.xpath("//td[contains(text(), 'Name')]/following-sibling::td")).getText();
## System.out.println(name);
## ```
##
## Comment: What is the error you are getting?
##
## ## Answer (1)
##
## You can use below xpath to get the name of each row and then use `get(i)` method to get the text from the cell.
##
## ```
## //table[@class='table table-bordered table-striped']/tbody/tr[i]/td[3]
## ```
##
## ## Answer (0)
##
## You can use below xpath:
##
## ```
## //table[contains(@class, 'table') and contains(@class, 'table-bordered') and contains(@class, 'table-striped')]/tbody/tr[2]/td[3]
## ```
##
## This will give you the second row's third column.
##
## ## Answer (0)
##
## You can use below xpath to get the name of each row and then use `get(i)` method to get the text from the cell.
##
## ```
## //table[@class='table table-bordered table-striped']/tbody/tr[i]/td[3]
## ```
````

---

## LLM-Powered Entity Extraction

``` r
# Extract CS concepts automatically
extract_entities <- function(text, lmstudio_chat) {
  prompt <- sprintf(
    "From this CS text, extract ONLY technical terms as a comma-separated list.
    Text: %s
    Example output: recursion, stack overflow, Big O notation",
    substr(text, 1, 500)  # Limit length
  )
  entities <- lmstudio_chat(prompt)
  return(strsplit(entities, ",\\s*")[[1]])
}

# Run on algorithm summary (pass the chat function defined earlier)
algorithm_entities <- extract_entities(
  content_list$Algorithm,
  lmstudio_chat
)
print(head(algorithm_entities, 10))
```

```
## [1] " recursion"     "stack"          "Big O notation"
```

---

## Network Graph of Concept Relationships

``` r
library(tm)

# Create corpus from all Wikipedia content
corpus <- Corpus(VectorSource(unlist(content_list)))

# Preprocess
dtm <- corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  DocumentTermMatrix()

# Get term frequencies (head() avoids NA padding if there are fewer than 100 terms)
term_freq <- colSums(as.matrix(dtm))
top_terms <- head(sort(term_freq, decreasing = TRUE), 100)
```

---

## Build network

``` r
library(igraph)
library(ggraph)
library(tidygraph)

# Build co-occurrence matrix
cooccur <- as.matrix(dtm)
cooccur <- t(cooccur) %*% cooccur
diag(cooccur) <- 0  # Remove self-loops

# Keep strong relationships only
cooccur[cooccur < 3] <- 0

# Create graph
g <- graph_from_adjacency_matrix(cooccur, mode = "undirected", weighted = TRUE)
```

---

## Plot of Entities

``` r
g %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = weight), alpha = 0.2) +  # Edge width encodes co-occurrence
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_classic() +
  labs(title = "CS Concept Co-occurrence Network")
```

<img src="data:image/png;base64,#5_3_ApplicationR_files/figure-html/unnamed-chunk-21-1.png" width="45%" style="display: block; margin: auto;" />

---

## Temporal Analysis of Page Revisions

``` r
# Revision history for activity analysis
get_revision_stats <- function(page_title) {
  rev_data <- request("https://en.wikipedia.org/w/api.php") %>%
    req_url_query(
      action = "query",
      prop = "revisions",
      titles = page_title,
      rvprop = "timestamp|user|size",
      rvlimit = 500,
      format = "json"
    ) %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE)

  # API timestamps are in UTC, so parse them as such
  revisions <- rev_data$query$pages[[1]]$revisions %>%
    mutate(timestamp = as.POSIXct(timestamp, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"))
  return(revisions)
}

# Analyze activity
algo_revisions <- get_revision_stats("Algorithm")
monthly_activity <- table(format(algo_revisions$timestamp, "%Y-%m"))
```

---

## Plot Revisions

``` r
# Activity over time: the table is sorted oldest first, so take the last 12 months
barplot(tail(monthly_activity, 12),
        main = "Algorithm Page: Recent Edit Activity",
        xlab = "Month", ylab = "Edits", col = "orange")
```

<img src="data:image/png;base64,#5_3_ApplicationR_files/figure-html/unnamed-chunk-23-1.png" width="45%" style="display: block; margin: auto;" />

---

## Mini-Project

``` r
# FINAL PROJECT STRUCTURE
# 1. Choose 3 topics
my_topics <- c("Hash table", "Binary tree", "Graph traversal")

# 2. Fetch content
guide <- lapply(my_topics, fetch_wiki_text) |> setNames(my_topics)

# 3. For each topic, generate:
#    - Summary (1 paragraph)
#    - R code example
#    - 3 related concepts
#    - Difficulty rating (1-5)
generate_study_card <- function(topic, text, chat = lmstudio_chat) {
  # The template uses double quotes so that the model sees valid JSON
  prompt <- sprintf(
    "Create a study card for '%s' in this exact JSON format:
    {
      \"topic\": \"%s\",
      \"summary\": \"...\",
      \"difficulty\": 3,
      \"related_topics\": [\"...\", \"...\"],
      \"r_example\": \"...\"
    }
    Wikipedia text: %s",
    topic, topic, substr(text, 1, 800)
  )
  json <- chat(prompt)
  jsonlite::fromJSON(json)
}
```

---

```` r
# 4. Generate cards
study_cards <- lapply(names(guide), function(topic) {
  message("Processing: ", topic)
  generate_study_card(topic, guide[[topic]])
})

# 5. Save as formatted markdown
save_study_guide <- function(cards, filename = "study_guide.md") {
  content <- c("# CS Study Guide\n\n", unlist(lapply(cards, function(card) {
    c(
      sprintf("## %s (Difficulty: %d/5)\n", card$topic, card$difficulty),
      sprintf("**Summary:** %s\n", card$summary),
      sprintf("**Related:** %s\n", paste(card$related_topics, collapse = ", ")),
      "```r",
      card$r_example,
      "```\n\n"
    )
  })))
  writeLines(content, filename)
  message("Saved to: ", filename)
}

# Run it!
save_study_guide(study_cards)
````

---

## Common Issues & Fixes

``` r
# PROBLEM 1: Connection refused
# SOLUTION: Check LM Studio is running
httr2::request("http://localhost:1234/v1/models") |> httr2::req_perform()

# PROBLEM 2: Empty responses
# SOLUTION: Increase the timeout, e.g. add httr2::req_timeout(60)
#           before req_perform() in lmstudio_chat()

# PROBLEM 3: Invalid JSON
# SOLUTION: Add retry with validation
for (i in 1:3) {
  json_str <- lmstudio_chat("Return JSON...")
  if (jsonlite::validate(json_str)) break
  Sys.sleep(1)
}

# PROBLEM 4: Rate limits
# SOLUTION: Implement polite delays
polite_chat <- function(prompt) {
  Sys.sleep(0.5)  # Be nice to your CPU
  lmstudio_chat(prompt)
}
```

---

## Best Practices Checklist

Table: LLM Best Practices

|Do                   |Don't                 |
|:--------------------|:---------------------|
|Cache LLM responses  |Eval LLM code blindly |
|Validate JSON output |Hardcode API keys     |
|Use polite delays    |Hammer Wikipedia      |
|Start small & test   |Skip error handling   |
|Read model docs      |Use enormous models   |

---

## Resources

``` r
# Quick self-assessment
cat("During this seminar, I learned to:\n")
```

```
## During this seminar, I learned to:
```

``` r
cat("1. Setup LM Studio: _______\n")
```

```
## 1. Setup LM Studio: _______
```

``` r
cat("2. Query LLMs from R: _______\n")
```

```
## 2. Query LLMs from R: _______
```

``` r
cat("3. Scrape Wikipedia: _______\n")
```

```
## 3. Scrape Wikipedia: _______
```

``` r
cat("4. Build a pipeline: _______\n")
```

```
## 4. Build a pipeline: _______
```

``` r
cat("\nRate the difficulty: 1 2 3 4 5\n")
```

```
## 
## Rate the difficulty: 1 2 3 4 5
```

``` r
cat("What was unclear? _________________\n")
```

```
## What was unclear? _________________
```

---

## Evaluation

<img src="data:image/png;base64,#../img/evaluation.png" width="45%" style="display: block; margin: auto;" />
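---

## Appendix: Caching LLM Responses

The checklist recommends caching LLM responses. One possible way to do that is an in-memory wrapper around any chat function, sketched below. The names `make_cached_chat()` and `fake_chat()` are hypothetical and not part of the slides; `fake_chat()` only stands in for `lmstudio_chat()` so that the example runs without a server.

```r
# Hypothetical wrapper: remember the answer for each prompt so that a repeated
# prompt does not trigger a second model call.
make_cached_chat <- function(chat_fn) {
  cache <- list()
  function(prompt) {
    stopifnot(is.character(prompt), length(prompt) == 1, nzchar(prompt))
    if (is.null(cache[[prompt]])) {
      cache[[prompt]] <<- chat_fn(prompt)  # First time: ask the model
    }
    cache[[prompt]]                        # Afterwards: reuse the stored answer
  }
}

# Stand-in for lmstudio_chat() that counts how often it is called
calls <- 0L
fake_chat <- function(prompt) {
  calls <<- calls + 1L
  paste("echo:", prompt)
}

cached_chat <- make_cached_chat(fake_chat)
first  <- cached_chat("Explain recursion.")
second <- cached_chat("Explain recursion.")  # Served from the cache
calls  # 1: the underlying function ran only once
```

With a running server, `make_cached_chat(lmstudio_chat)` would wrap the real client the same way. The cache lives only for the current R session; a persistent cache (for example via `saveRDS()`) would be a separate design choice.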