e4-2 Borzoi Inference Exercise¶

This exercise uses the Python package gReLU to make predictions for a given DNA sequence, using a pre-trained Borzoi model.

About the gReLU package:

- GitHub repo: https://github.com/Genentech/gReLU  
- Citation: Lal, A. et al. Decoding sequence determinants of gene expression in diverse cellular and disease states. bioRxiv 2024.10.09.617507 (2024) doi:10.1101/2024.10.09.617507.

About Borzoi: Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. 57, 949–961 (2025).

Before the class¶

On Discovery, set up a conda environment and install gReLU:

In [ ]:
source /optnfs/common/miniconda3/etc/profile.d/conda.sh
conda create --name py10 python=3.10
conda activate py10
pip install grelu
In [ ]:
# run python 
python

You should be in an interactive Python session now. We'll execute the rest of the code in this Python shell in class.

  • An alternative way to install gReLU: if you have >50 GB of free disk space, you can set it up inside a Singularity container as follows.

Navigate to your lab share and create container directory

In [ ]:
# cd to your lab share. Your home directory only has 45 GB, so it is not suitable for this installation method.
mkdir containers
cd containers

Pull the image into the container directory. This step takes ~10 min and requires at least 10 GB of storage space.

In [ ]:
singularity pull --dir ~/containers docker://nvcr.io/nvidia/pytorch:24.08-py3 

Build a sandbox from the image

In [ ]:
singularity build --sandbox inference_pytorch_24.08-py3/ pytorch_24.08-py3.sif

Initialize Container

In [ ]:
CONTAINER_NAME="inference_pytorch_24.08-py3/"
PROJECT_BASE="/dartfs-hpc/rc/home/j/$netID" # Change as needed
CONTAINER_LOCATION="/dartfs-hpc/rc/home/j/$netID/containers/$CONTAINER_NAME" # I use /dartfs/rc/lab/S/Szhao/grahams
In [ ]:
PROJECT_BASE="/dartfs/rc/lab/S/Szhao/grahams" # Alternatively, point both variables at your lab share
CONTAINER_LOCATION="/dartfs/rc/lab/S/Szhao/grahams/containers/$CONTAINER_NAME"
In [ ]:
export PATH=$PATH:/root/.local/bin
In [ ]:
# `singularity shell` opens an interactive shell directly, so no trailing command is needed
singularity shell --fakeroot --writable --nv --contain \
--home "${PROJECT_BASE}:/root" \
"${CONTAINER_LOCATION}"
In [ ]:
pip install grelu
export WANDB_MODE=disabled
python

You should be in an interactive Python session now. We'll execute the rest of the code in this Python shell in class.

Predict features based on an input sequence¶

In [ ]:
import numpy as np
import pandas as pd
import torch
import os

Load the pre-trained Borzoi model from the gReLU model zoo

In [ ]:
import grelu.resources
model = grelu.resources.load_model(
    project="borzoi",
    model_name="human_rep0",
)

If you see:

wandb: (1) Private W&B dashboard, no account required
wandb: (2) Create a W&B account
wandb: (3) Use an existing W&B account
wandb: (4) Don't visualize my results

choose 1.

Check model metadata and trained cell contexts

In [ ]:
model.data_params.keys()
In [ ]:
tasks = pd.DataFrame(model.data_params['tasks'])
tasks.head(3)

View Borzoi hyperparameters and training intervals

In [ ]:
model.data_params['train'].keys()
In [ ]:
for key in model.data_params['train'].keys():
    if key != "intervals":
        print(key, model.data_params['train'][key])
In [ ]:
pd.DataFrame(model.data_params['train']['intervals']).head()

Define the input interval for inference

In [ ]:
input_len = model.data_params["train"]["seq_len"]
chrom = "chr1"
input_start = 69993520
input_end = input_start + input_len
In [ ]:
input_intervals = pd.DataFrame({
    'chrom':[chrom], 'start':[input_start], 'end':[input_end], "strand":["+"],
})

input_intervals

Extract the sequence from the hg38 genome assembly (downloading the genome takes a few minutes the first time)

In [ ]:
import grelu.sequence.format

input_seqs = grelu.sequence.format.convert_input_type(
    input_intervals,
    output_type="strings",
    genome="hg38"
)
input_seq = input_seqs[0]

len(input_seq)
In [ ]:
input_seq[:10]

Run inference on sequence

In [ ]:
device = torch.device('cpu')

model.to(device)
preds = model.predict_on_seqs(input_seqs, device=device)
preds.shape

Note the shape of preds: the axes are (batch, tasks, length). So we have 1 sequence, 7,611 tasks, and 6,144 bins along the length axis.
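Slicing this array follows standard NumPy indexing. A minimal sketch using a synthetic stand-in array with the same shape as `preds` (the real array comes from the model, of course):

```python
import numpy as np

# Synthetic stand-in for `preds`, matching the shape reported above:
# (batch=1, tasks=7611, length=6144 bins)
preds_demo = np.zeros((1, 7611, 6144), dtype=np.float32)

# One track: all 6144 bins for the first sequence and a single task
track = preds_demo[0, 0, :]
print(track.shape)   # (6144,)

# Several tracks at once: fancy-index the task axis with a list of task indices
subset = preds_demo[0, [5, 12, 100], :]
print(subset.shape)  # (3, 6144)
```

This is exactly the indexing pattern used for plotting below, where a list of task indices selects the tracks of interest.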

Get output intervals:

In [ ]:
output_intervals = model.input_intervals_to_output_intervals(input_intervals)
output_intervals
In [ ]:
output_start = output_intervals.start[0]
output_end = output_intervals.end[0]
output_len = output_end - output_start
print(output_len)
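The output length and the number of bins together give the model's resolution: each of the 6,144 bins covers `output_len / n_bins` base pairs. A quick check, assuming the output length printed above is 196,608 bp (Borzoi's output window):

```python
# Borzoi predicts coverage in fixed-width bins over the central part of the
# input window. Bin width = output length / number of bins.
output_len_demo = 196_608  # assumed value of output_len printed above
n_bins = 6_144             # preds.shape[2], the length axis
bin_size = output_len_demo // n_bins
print(bin_size)            # 32 -> each bin summarizes 32 bp of sequence
```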

Plot selected tracks and save the figure as an image

In [ ]:
cage_brain_tasks = tasks[(tasks.assay=="CAGE") & (tasks["sample"].str.contains("brain"))].head(2)
rna_brain_tasks = tasks[(tasks.assay=="RNA") & (tasks["sample"].str.contains("brain"))].head(2)

tasks_to_plot = cage_brain_tasks.index.tolist() + rna_brain_tasks.index.tolist()
task_names = tasks.description[tasks_to_plot].tolist() # Description of these tracks from the `tasks` dataframe

print(tasks_to_plot)
print(task_names)
In [ ]:
import grelu.visualize

fig = grelu.visualize.plot_tracks(
    preds[0, tasks_to_plot, :],  # Outputs to plot
    start_pos=output_start,      # Start coordinate for the x-axis label
    end_pos=output_end,          # End coordinate for the x-axis label
    titles=task_names,           # Titles for each track
    figsize=(10, 3.5),           # Width, height
)

# Save the figure as a JPG
fig.savefig("predictions.jpg", format='jpg')
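Beyond plotting, a predicted track can be exported for a genome browser. A hedged sketch converting one track into a bedGraph-style DataFrame; the synthetic `track` array, the start coordinate, and the 32 bp bin size are illustrative stand-ins for `preds[0, task_index, :]` and the real `output_start`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one predicted track; in the notebook you would use
# preds[0, task_index, :] and the real output_start instead.
track = np.random.default_rng(0).random(6144).astype(np.float32)
output_start_demo = 70_058_608  # hypothetical output start coordinate
bin_size = 32                   # bp per output bin

# One row per bin: chrom, start, end, predicted value
starts = output_start_demo + bin_size * np.arange(track.size)
bedgraph = pd.DataFrame({
    "chrom": "chr1",
    "start": starts,
    "end": starts + bin_size,
    "value": track,
})
print(bedgraph.shape)  # (6144, 4)

# Write a browser-loadable file (bedGraph has no header):
# bedgraph.to_csv("track.bedGraph", sep="\t", header=False, index=False)
```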