class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Applications
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-06-24
]

---
layout: true

---

## What is Text Analysis?

- **Text analysis** involves structuring, processing, and interpreting unstructured textual data.
- Typical tasks: tokenization, stopword removal, sentiment analysis, summarization.
- Example: cleaning customer reviews before sentiment modeling.

---

## From Text Analysis to Language Models

- Traditional text analysis: bag-of-words, tf-idf, topic modeling.
- Modern NLP uses **pretrained models** such as BERT, GPT, and Llama for:
  - Summarization
  - Classification
  - Generation
- The R ecosystem now supports many LLM workflows.

---

## LLMs: A Conceptual Overview

- **LLMs** predict the next token in a sequence based on context.
- Trained on huge corpora, with billions of parameters.
- Common tasks:
  - Code generation
  - Data explanation
  - Cleaning and enrichment
- API interaction (e.g., OpenAI, Ollama) or **local** inference via `chatLLM`.

---

## Theoretical Guidance: Why Use LLMs?

- **Language as interface**: interact with complex systems in plain language.
- **Cognitive offloading**: LLMs break complex problems into tractable steps.
- **Structured reasoning**: tools like `LLMAgentR` implement state graphs to mimic planning and reasoning.

---

## Exercise 1: Text Preprocessing in R

``` r
library(tidytext)
library(tibble)
library(dplyr)

text_data <- c("Text analysis is fun with R!",
               "LLMs make language processing easier.")

df <- tibble(id = 1:2, text = text_data)

df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```

```
## # A tibble: 7 × 2
##      id word      
##   <int> <chr>     
## 1     1 text      
## 2     1 analysis  
## 3     1 fun       
## 4     2 llms      
## 5     2 language  
## 6     2 processing
## 7     2 easier
```

---

## Agent-Based LLMs in R: LLMAgentR

- 📦 `LLMAgentR` brings LangGraph-style state machines to R.
- Built on `chatLLM`; enables code generation, SQL agents, and data-cleaning tasks.
- Useful for iterative, multi-step workflows (e.g., build–review–revise).

``` r
install.packages("LLMAgentR")
library(LLMAgentR)
```

---

## RAG: Retrieval-Augmented Generation with ragnar

- `ragnar`: an efficient implementation of RAG in R, built on DuckDB.
- Workflow: embed documents, store vectors, then search + generate.

``` r
# Illustrative workflow (function names are simplified sketches;
# see the ragnar documentation for the actual API)
library(ragnar)

corpus <- ingest("r4ds.txt")  # your own corpus file
ask(corpus, "How do you reshape data in R?")
```

---

## Visual LLMs with kuzco

- `kuzco`: apply LLMs to image classification and OCR in R.

``` r
# Illustrative call (see the kuzco documentation for the actual API)
library(kuzco)

image <- system.file("extdata", "cat.jpg", package = "kuzco")
analyze_image(image, task = "describe")
```

---

## Calling a Llama Model via API

In this step, we interact with a Llama model via the `httr2` package in R.

Setup: get your Llama API endpoint (the URL provided by the service where the Llama model is hosted).

``` r
library(httr2)

# URL of your locally hosted Llama model
url <- "http://127.0.0.1:1234/v1/completions"  # adjust to match your actual API endpoint

# Build and send the request; req_perform() raises an error
# for non-2xx responses, so no manual status check is needed
response <- request(url) |>
  req_body_json(list(
    prompt = "What can R Studio be used for?",
    max_tokens = 100,   # maximum number of tokens to generate
    temperature = 0.7   # control randomness
  )) |>
  req_timeout(60) |>    # allow for long processing times
  req_perform()

# Parse the JSON response into an R list
parsed_response <- resp_body_json(response)
```

---

## Introduction to Reproducibility

- **Reproducibility**: ensures that experiments can be repeated under the same conditions to confirm results.
- Output from LLMs such as GPT and Llama is sensitive to settings like model version, temperature, and prompt wording.
- Best practices for ensuring reproducibility when working with LLMs include:
  - Documenting code, environment, and model parameters.
  - Using version control for models and dependencies.
  - Fixing random seeds (where the API supports it) so sampling is consistent across runs.

---

## Track Model Hyperparameters and Results

Save model configurations (e.g., temperature, max tokens) and results in a configuration file:

``` yaml
# Model configuration (config.yml)
model:
  type: Llama
  temperature: 0.7
  max_tokens: 100
```

---

## How Can We Simulate Norms?

1. **Agents with different preferences**: some agents may value tradition, others may prefer change.
2. **Role-play scenarios**: different agents interact based on their preferences and perspectives.
3. **Emergence**: the group's behavior may shift to align with one norm over time.

---

## Mini-Task: Simulating Norm Emergence with Language Models

- Imagine a community of three agents: **Alice**, **Bob**, and **Charlie**.
- Their goal is to establish a social norm for **greeting** people when they meet.
- Alice prefers formal greetings, Bob likes informal greetings, and Charlie is neutral.

- **Step 1**: Think about the following questions:
  - What would each agent say in a conversation about greetings?
  - How might they try to influence each other's behavior?
  - Would they compromise, or would one of them dominate the group behavior?
- **Step 2**: Write a **short prompt** to capture norm emergence.

---

## Prompting

- Set specific roles for agents
- Use sequential prompts (step-by-step progression)
- Specify how outcomes are evaluated
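
---

## Sketch: Sequential Role Prompts in R

The prompting ideas above (roles, sequential turns, shared context) can be sketched in R. This is a minimal, untested sketch, assuming a local OpenAI-compatible chat endpoint like the one in the earlier API slide; the personas, URL, and response structure (`choices[[1]]$message$content`) are illustrative assumptions, not a tested implementation.

``` r
library(httr2)

# Assumed local OpenAI-compatible chat endpoint (adjust to your setup)
url <- "http://127.0.0.1:1234/v1/chat/completions"

# Role prompts: one persona per agent (illustrative)
personas <- list(
  Alice   = "You are Alice. You prefer formal greetings and argue for them.",
  Bob     = "You are Bob. You prefer informal greetings and argue for them.",
  Charlie = "You are Charlie. You are neutral and look for a compromise."
)

# Shared conversation state passed to each agent in turn
history <- "The group must agree on a norm for greeting each other."

for (name in names(personas)) {
  resp <- request(url) |>
    req_body_json(list(
      messages = list(
        list(role = "system", content = personas[[name]]),
        list(role = "user",   content = history)
      ),
      max_tokens  = 150,
      temperature = 0.7
    )) |>
    req_perform()

  # Assumes an OpenAI-style response body
  reply   <- resp_body_json(resp)$choices[[1]]$message$content
  history <- paste0(history, "\n", name, ": ", reply)  # sequential context
}

cat(history)
```

Looping over the personas while appending each reply to `history` gives the step-by-step progression from the Prompting slide; inspecting the final `history` is one simple way to evaluate whether a shared greeting norm emerged.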