This exercise uses the Python package gReLU to make predictions for a given DNA sequence. It uses a pre-trained Borzoi model to make the prediction.
About the grelu package:
- github repo: https://github.com/Genentech/gReLU
- citation: Lal, A. et al. Decoding sequence determinants of gene expression in diverse cellular and disease states. bioRxiv 2024.10.09.617507 (2024) doi:10.1101/2024.10.09.617507.
About Borzoi: Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. 57, 949–961 (2025).
On Discovery, set up a conda environment and install gReLU
source /optnfs/common/miniconda3/etc/profile.d/conda.sh
conda create --name py10 python=3.10
conda activate py10
pip install grelu
# run python
python
You should be in an interactive Python session now. We'll execute the rest of the code in this Python shell in class.
Navigate to your lab share and create a containers directory
# cd to your lab share. Your home directory has only 45G of quota, so it is not suitable for this installation.
mkdir containers
cd containers
Pull the image into the containers directory. This step takes ~10 min and requires at least 10 GB of storage space.
singularity pull --dir . docker://nvcr.io/nvidia/pytorch:24.08-py3
Build a sandbox from the image
singularity build --sandbox inference_pytorch_24.08-py3/ pytorch_24.08-py3.sif
Initialize the container
CONTAINER_NAME="inference_pytorch_24.08-py3/"
# Set these two paths for your own setup; I use /dartfs/rc/lab/S/Szhao/grahams.
# A home-directory alternative is PROJECT_BASE="/dartfs-hpc/rc/home/j/$netID".
PROJECT_BASE="/dartfs/rc/lab/S/Szhao/grahams"
CONTAINER_LOCATION="${PROJECT_BASE}/containers/${CONTAINER_NAME}"
export PATH=$PATH:/root/.local/bin
singularity shell --fakeroot --writable --nv --contain \
    --home "${PROJECT_BASE}:/root" \
    "${CONTAINER_LOCATION}"
# inside the container shell, install gReLU
pip install grelu
export WANDB_MODE=disabled
python
You should be in an interactive Python session now. We'll execute the rest of the code in this Python shell in class.
import numpy as np
import pandas as pd
import torch
import os
Load the pre-trained Borzoi model from the gReLU model zoo
import grelu.resources
model = grelu.resources.load_model(
project="borzoi",
model_name="human_rep0",
)
If you see:
wandb: (1) Private W&B dashboard, no account required
wandb: (2) Create a W&B account
wandb: (3) Use an existing W&B account
wandb: (4) Don't visualize my results
choose 1.
Check model metadata and trained cell contexts
model.data_params.keys()
tasks = pd.DataFrame(model.data_params['tasks'])
tasks.head(3)
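To get a feel for what the `tasks` dataframe contains, standard pandas operations work well. This is a sketch using a toy dataframe with the same column names assumed above (`assay`, `sample`, `description`); in class, run these calls on the real `tasks` dataframe instead.

```python
import pandas as pd

# Toy stand-in for the real `tasks` dataframe returned by the model
tasks = pd.DataFrame({
    "assay": ["CAGE", "RNA", "RNA", "DNASE"],
    "sample": ["brain", "brain", "liver", "brain"],
    "description": ["CAGE brain", "RNA brain", "RNA liver", "DNase brain"],
})

# Count how many tracks come from each assay type
assay_counts = tasks["assay"].value_counts()
print(assay_counts)

# Restrict to brain samples, as we will do later for plotting
brain_tasks = tasks[tasks["sample"].str.contains("brain")]
print(brain_tasks)
```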
View Borzoi hyperparameters and training intervals
model.data_params['train'].keys()
for key in model.data_params['train'].keys():
    if key != "intervals":
        print(key, model.data_params['train'][key])
pd.DataFrame(model.data_params['train']['intervals']).head()
Make the inference intervals
input_len = model.data_params["train"]["seq_len"]
chrom = "chr1"
input_start = 69993520
input_end = input_start + input_len
input_intervals = pd.DataFrame({
'chrom':[chrom], 'start':[input_start], 'end':[input_end], "strand":["+"],
})
input_intervals
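It is worth checking that the interval we built spans exactly the model's input length. A minimal sketch, hard-coding 524,288 bp (Borzoi's published input length) where the class code reads `model.data_params["train"]["seq_len"]`:

```python
import pandas as pd

# 524,288 bp is Borzoi's published input length; in class, read it from
# model.data_params["train"]["seq_len"] instead of hard-coding it
input_len = 524288
input_intervals = pd.DataFrame({
    "chrom": ["chr1"],
    "start": [69993520],
    "end": [69993520 + input_len],
    "strand": ["+"],
})

# Every interval should span exactly the model's input length
widths = input_intervals["end"] - input_intervals["start"]
assert (widths == input_len).all()
print(widths.tolist())  # [524288]
```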
Extract the sequence from the hg38 genome assembly (the genome download takes a few minutes the first time)
import grelu.sequence.format
input_seqs = grelu.sequence.format.convert_input_type(
input_intervals,
output_type="strings",
genome="hg38"
)
input_seq = input_seqs[0]
len(input_seq)
input_seq[:10]
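Beyond looking at the first few bases, a quick sanity check on the extracted sequence is its base composition (e.g. GC fraction, and whether it contains any Ns). A sketch with a short toy string standing in for `input_seq`:

```python
from collections import Counter

# Toy sequence standing in for `input_seq`; the real one is 524,288 bp of hg38
seq = "ACGTNACGTACG"

counts = Counter(seq.upper())
gc_fraction = (counts["G"] + counts["C"]) / len(seq)
print(counts)
print(gc_fraction)  # 0.5 for this toy sequence
```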
Run inference on sequence
device = torch.device('cpu')
model.to(device)
preds = model.predict_on_seqs(input_seqs, device=device)
preds.shape
Note the shape of preds: it's in the format (batch, tasks, length). So we have 1 sequence, 7611 tasks, and 6144 bins along the length axis.
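Given that layout, common operations are slicing out one task's track or averaging over the length axis. A sketch with a small synthetic array in the same (batch, tasks, bins) layout as `preds`:

```python
import numpy as np

# Synthetic stand-in with the same (batch, tasks, bins) layout as `preds`
preds = np.arange(12, dtype=float).reshape(1, 3, 4)

# Mean predicted coverage per task: average over the length axis
mean_per_task = preds.mean(axis=-1)
print(mean_per_task)  # [[1.5 5.5 9.5]]

# Full track for task 1 of the first (and only) sequence
track = preds[0, 1, :]
print(track)          # [4. 5. 6. 7.]
```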
Get output intervals:
output_intervals = model.input_intervals_to_output_intervals(input_intervals)
output_intervals
output_start = output_intervals.start[0]
output_end = output_intervals.end[0]
output_len = output_end - output_start
print(output_len)
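Each of the 6144 prediction bins covers output_len / 6144 genomic base pairs (32 bp for Borzoi's 196,608 bp output window). To map a bin index back to a genomic coordinate, you can compute bin start positions from `output_start`. A sketch with a made-up placeholder start coordinate; in class, use the `output_start`/`output_len` computed above and `preds.shape[-1]`:

```python
import numpy as np

# Placeholder values: in class, use output_start/output_len from
# model.input_intervals_to_output_intervals and n_bins = preds.shape[-1]
output_start = 70157360   # hypothetical coordinate
output_len = 196608
n_bins = 6144

bin_size = output_len // n_bins                       # 32 bp per bin
bin_starts = output_start + np.arange(n_bins) * bin_size
print(bin_size, bin_starts[0], bin_starts[-1])
```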
Save output predictions as image
cage_brain_tasks = tasks[(tasks.assay=="CAGE") & (tasks["sample"].str.contains("brain"))].head(2)
rna_brain_tasks = tasks[(tasks.assay=="RNA") & (tasks["sample"].str.contains("brain"))].head(2)
tasks_to_plot = cage_brain_tasks.index.tolist() + rna_brain_tasks.index.tolist()
task_names = tasks.description[tasks_to_plot].tolist() # Description of these tracks from the `tasks` dataframe
print(tasks_to_plot)
print(task_names)
import grelu.visualize

fig = grelu.visualize.plot_tracks(
preds[0, tasks_to_plot, :], # Outputs to plot
start_pos=output_start, # Start coordinate for the x-axis label
end_pos=output_end, # End coordinate for the x-axis label
titles=task_names, # Titles for each track
figsize=(10, 3.5), # Width, height
)
# Save the figure as a JPG
fig.savefig("predictions.jpg", format='jpg')
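Besides the figure, you may want to keep the raw prediction values for later analysis. A minimal sketch using NumPy's native format, with a small toy array standing in for `preds` (the real one has shape (1, 7611, 6144)):

```python
import numpy as np

# Toy array standing in for `preds` (the real one has shape (1, 7611, 6144))
preds = np.zeros((1, 8, 16), dtype=np.float32)

# Persist the raw predictions next to the figure for later analysis
np.save("predictions.npy", preds)

# Reload and confirm the shape round-trips
reloaded = np.load("predictions.npy")
print(reloaded.shape)  # (1, 8, 16)
```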