class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Applications
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-06-24
]

---
layout: true

---

## What is Text Analysis?

- **Text analysis** involves structuring, processing, and interpreting unstructured textual data.
- Typical tasks: tokenization, stopword removal, sentiment analysis, summarization.
- Example: cleaning customer reviews before sentiment modeling.

---

## From Text Analysis to Language Models

- Traditional text analysis: bag-of-words, tf-idf, topic modeling.
- Modern NLP uses **pretrained models** such as BERT, GPT, and Llama for:
  - Summarization
  - Classification
  - Generation
- The R ecosystem now supports many LLM workflows.

---

## LLMs: A Conceptual Overview

- **LLMs** predict the next token in a sequence based on context.
- Trained on huge corpora, with billions of parameters.
- Common tasks:
  - Code generation
  - Data explanation
  - Cleaning and enrichment
- API interaction (e.g., OpenAI, Ollama) or **local** inference via `chatLLM`.

---

## Theoretical Guidance: Why Use LLMs?

- **Language as interface**: interact with complex systems in plain language.
- **Cognitive offloading**: LLMs break complex problems into tractable steps.
- **Structured reasoning**: tools like `LLMAgentR` implement state graphs to mimic planning and reasoning.

---

## Exercise 1: Text Preprocessing in R

``` r
library(tidytext)
library(tibble)
library(dplyr)

text_data <- c("Text analysis is fun with R!",
               "LLMs make language processing easier.")

df <- tibble(id = 1:2, text = text_data)

df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```

```
## # A tibble: 7 × 2
##      id word      
##   <int> <chr>     
## 1     1 text      
## 2     1 analysis  
## 3     1 fun       
## 4     2 llms      
## 5     2 language  
## 6     2 processing
## 7     2 easier
```

---

## Agent-Based LLMs in R: LLMAgentR

- 📦 `LLMAgentR` brings LangGraph-style state machines to R.
- Built on `chatLLM`; enables code generation, SQL agents, and data-cleaning tasks.
- Useful for iterative, multi-step workflows (e.g., build–review–revise).

``` r
install.packages("LLMAgentR")
library(LLMAgentR)
```

---

## RAG: Retrieval-Augmented Generation with ragnar

- `ragnar`: an efficient implementation of RAG in R, built on DuckDB.
- Workflow: embed documents, store vectors, then search + generate.

``` r
# Illustrative workflow (function names are simplified sketches;
# see the ragnar documentation for the actual API)
library(ragnar)

corpus <- ingest("r4ds.txt")  # your own corpus file
ask(corpus, "How do you reshape data in R?")
```

---

## Visual LLMs with kuzco

- `kuzco`: apply LLMs to image classification and OCR in R.

``` r
# Illustrative call (see the kuzco documentation for the actual API)
library(kuzco)

image <- system.file("extdata", "cat.jpg", package = "kuzco")
analyze_image(image, task = "describe")
```

---

## Calling a Llama Model via API

In this step, we interact with a Llama model via the `httr2` package in R.

Setup: get your Llama API endpoint (the URL provided by the service where the Llama model is hosted).

``` r
library(httr2)

# URL of your locally hosted Llama model
url <- "http://127.0.0.1:1234/v1/completions"  # adjust to match your actual API endpoint

# Build and send the request; req_perform() raises an error
# for non-2xx responses, so no manual status check is needed
response <- request(url) |>
  req_body_json(list(
    prompt = "What can R Studio be used for?",
    max_tokens = 100,   # maximum number of tokens to generate
    temperature = 0.7   # control randomness
  )) |>
  req_timeout(60) |>    # allow for long processing times
  req_perform()

# Parse the JSON response into an R list
parsed_response <- resp_body_json(response)
```

---

## Introduction to Reproducibility

- **Reproducibility**: ensures that experiments can be repeated under the same conditions to confirm results.
- Output from LLMs such as GPT and Llama is sensitive to settings like model version, temperature, and prompt wording.
- Best practices for ensuring reproducibility when working with LLMs include:
  - Documenting code, environment, and model parameters.
  - Using version control for models and dependencies.
  - Fixing random seeds (where the API supports it) so sampling is consistent across runs.

---

## Track Model Hyperparameters and Results

Save model configurations (e.g., temperature, max tokens) and results in a configuration file:

``` yaml
# Model configuration (config.yml)
model:
  type: Llama
  temperature: 0.7
  max_tokens: 100
```

---

## How Can We Simulate Norms?

1. **Agents with different preferences**: some agents may value tradition, others may prefer change.
2. **Role-play scenarios**: different agents interact based on their preferences and perspectives.
3. **Emergence**: the group's behavior may shift to align with one norm over time.

---

## Mini-Task: Simulating Norm Emergence with Language Models

- Imagine a community of three agents: **Alice**, **Bob**, and **Charlie**.
- Their goal is to establish a social norm for **greeting** people when they meet.
- Alice prefers formal greetings, Bob likes informal greetings, and Charlie is neutral.

- **Step 1**: Think about the following questions:
  - What would each agent say in a conversation about greetings?
  - How might they try to influence each other's behavior?
  - Would they compromise, or would one of them dominate the group behavior?
- **Step 2**: Write a **short prompt** to capture norm emergence.

---

## Prompting

- Set specific roles for agents
- Use sequential prompts (step-by-step progression)
- Specify how outcomes are evaluated
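
---

## Sketch: Sequential Role Prompts in R

The prompting ideas above (roles, sequential turns, shared context) can be sketched in R. This is a minimal, untested sketch, assuming a local OpenAI-compatible chat endpoint like the one in the earlier API slide; the personas, URL, and response structure (`choices[[1]]$message$content`) are illustrative assumptions, not a tested implementation.

``` r
library(httr2)

# Assumed local OpenAI-compatible chat endpoint (adjust to your setup)
url <- "http://127.0.0.1:1234/v1/chat/completions"

# Role prompts: one persona per agent (illustrative)
personas <- list(
  Alice   = "You are Alice. You prefer formal greetings and argue for them.",
  Bob     = "You are Bob. You prefer informal greetings and argue for them.",
  Charlie = "You are Charlie. You are neutral and look for a compromise."
)

# Shared conversation state passed to each agent in turn
history <- "The group must agree on a norm for greeting each other."

for (name in names(personas)) {
  resp <- request(url) |>
    req_body_json(list(
      messages = list(
        list(role = "system", content = personas[[name]]),
        list(role = "user",   content = history)
      ),
      max_tokens  = 150,
      temperature = 0.7
    )) |>
    req_perform()

  # Assumes an OpenAI-style response body
  reply   <- resp_body_json(resp)$choices[[1]]$message$content
  history <- paste0(history, "\n", name, ": ", reply)  # sequential context
}

cat(history)
```

Looping over the personas while appending each reply to `history` gives the step-by-step progression from the Prompting slide; inspecting the final `history` is one simple way to evaluate whether a shared greeting norm emerged.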