Research
We believe in open science. All code and models developed in the Programmable Genomics Laboratory are made freely available to the research community.
Ledidi
Ledidi is an algorithm that turns any predictive deep learning model into a designer of edits to an initial template sequence: given a desired model output, it searches for a small set of edits that achieve it. The repository contains convenient functions for running the design procedure with one or more models of any architecture.
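The core idea can be sketched as gradient-based design: relax the one-hot sequence to continuous probabilities, ascend the model's output while penalizing drift from the template, then discretize. This is a minimal numpy sketch under a toy linear "model" and a hypothetical L1 penalty weight; Ledidi's actual PyTorch API and loss differ.

```python
import numpy as np

# Toy stand-in for a predictive model: a linear scorer over a one-hot
# sequence (L positions x 4 nucleotides). Ledidi works with any
# differentiable model; this linear scorer is illustrative only.
rng = np.random.default_rng(0)
L = 10
W = rng.normal(size=(L, 4))            # per-position, per-nucleotide weights

def predict(onehot):
    return float((W * onehot).sum())

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Start from a random template and relax it to continuous logits.
template = np.eye(4)[rng.integers(0, 4, size=L)]
logits = np.log(template * 0.9 + 0.025)   # soft version of the template

# Gradient ascent on the model output, with a penalty for drifting
# away from the template so that only a few edits are proposed.
lam, lr = 0.1, 0.5
for _ in range(200):
    probs = softmax(logits)
    # d(score)/d(probs) = W; d(penalty)/d(probs) = lam * sign(probs - template)
    grad_probs = W - lam * np.sign(probs - template)
    # chain rule through the softmax
    grad_logits = probs * (grad_probs - (probs * grad_probs).sum(-1, keepdims=True))
    logits += lr * grad_logits

designed = np.eye(4)[softmax(logits).argmax(-1)]
n_edits = int((designed.argmax(-1) != template.argmax(-1)).sum())
```

The penalty weight trades off the size of the predicted effect against the number of edits: a larger `lam` keeps the design closer to the template.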
tangermeme
A toolkit for "everything-but-the-model" genomic machine learning. It implements algorithms for calculating feature attributions, perturbing sequences to ask hypothetical "what if?" questions, and plotting the results.
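The simplest "what if?" question is a single-substitution scan: mutate each position to each alternative base and record the change in model output relative to the reference. The sketch below uses a toy motif-matching scorer as a stand-in model; tangermeme's real API differs.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8
seq = np.eye(4)[rng.integers(0, 4, size=L)]   # one-hot sequence, L x 4

def model(onehot):
    # Hypothetical stand-in scorer: rewards matches to a fixed motif.
    motif = np.eye(4)[[0, 1, 2, 3, 0, 1, 2, 3]]
    return float((onehot * motif).sum())

base = model(seq)
effects = np.zeros((L, 4))
for i in range(L):
    for b in range(4):
        mutant = seq.copy()
        mutant[i] = np.eye(4)[b]              # substitute base b at position i
        effects[i, b] = model(mutant) - base  # delta from the reference

# By construction, the entry for the reference base at each position is 0.
```

Each row of `effects` summarizes how sensitive the model is to that position, which is also the raw material for attribution-style plots.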
Avocado
A deep tensor factorization model for imputing genome-wide genomic and transcriptomic experiments at compendium scale. We used Avocado to impute the >10,000 experiments that have not yet been performed in the ENCODE Compendium, and performed cross-species modeling to impute assays that had been run in only one of the species.
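The underlying idea can be illustrated with a linear analogue: factor an (assay x cell type) matrix with missing entries into low-rank factors fit only on observed entries, then read imputed values off the full reconstruction. Avocado itself is a deep factorization over assays, cell types, and genomic positions; everything below (sizes, rank, learning rate) is a toy assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_assays, n_cells, rank = 20, 15, 2

# Simulate a complete low-rank "compendium", then hide ~30% of entries
# to stand in for experiments that have not been performed.
A = rng.normal(size=(n_assays, rank))
C = rng.normal(size=(n_cells, rank))
truth = A @ C.T
mask = rng.random(truth.shape) < 0.7          # True = observed experiment

# Fit factors by gradient descent on the masked squared error.
U = rng.normal(size=(n_assays, rank))
V = rng.normal(size=(n_cells, rank))
lr = 0.01
for _ in range(5000):
    resid = mask * (U @ V.T - truth)          # error only on observed entries
    U, V = U - lr * resid @ V, V - lr * resid.T @ U

imputed = U @ V.T                             # predictions for all entries
train_err = np.abs((imputed - truth)[mask]).mean()
test_err = np.abs((imputed - truth)[~mask]).mean()
```

The unobserved entries of `imputed` are the analogue of Avocado's imputed experiments; the deep model replaces the bilinear reconstruction with a neural network over learned factors.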
yuzu
An algorithm for accelerating saturation mutagenesis for deep learning models with convolution layers using compressed sensing. Includes a function that can be applied to any PyTorch model.
dragonnfruit
DragoNNFruit is an extension of the ChromBPNet model that makes predictions for chromatin accessibility at single-cell resolution. This enables datasets to be modeled without the need to pseudobulk cells or scrub outliers. Work is ongoing.
General Purpose ML Software
Some of the software developed by the group is general purpose in the sense that the algorithms can be applied in any domain, not just genomics.
pomegranate
Pomegranate implements general-purpose probabilistic modeling in PyTorch, including hidden Markov models, Bayesian networks, and factor graphs. Pomegranate treats all models as probability distributions, so one can, for example, build a mixture model whose components are hidden Markov models rather than being restricted to simple distributions. Because it is implemented in PyTorch, fitting and inference can run on a GPU and benefit from other PyTorch features, such as mixed precision and model compilation.
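The "everything is a distribution" design can be sketched in a few lines: if the only interface a mixture needs from its components is a log-probability method, then mixtures can nest arbitrarily. This is a minimal illustration of the design idea, not pomegranate's actual API.

```python
import numpy as np

class Normal:
    """A univariate normal exposing only log_probability()."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def log_probability(self, x):
        return (-0.5 * ((x - self.mu) / self.sigma) ** 2
                - np.log(self.sigma * np.sqrt(2 * np.pi)))

class Mixture:
    """A mixture over anything that has log_probability()."""
    def __init__(self, dists, weights):
        self.dists, self.weights = dists, np.asarray(weights)

    def log_probability(self, x):
        comps = np.stack([d.log_probability(x) for d in self.dists])
        comps += np.log(self.weights)[:, None]
        m = comps.max(axis=0)
        return m + np.log(np.exp(comps - m).sum(axis=0))  # logsumexp

# Because Mixture only requires log_probability(), a mixture of mixtures
# composes exactly like a mixture of simple distributions.
inner = Mixture([Normal(-1, 1), Normal(1, 1)], [0.5, 0.5])
outer = Mixture([inner, Normal(5, 1)], [0.7, 0.3])
lp = outer.log_probability(np.array([0.0, 5.0]))
```

Swapping `Normal` for an HMM with the same interface is what makes a mixture of HMMs possible in this style.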
apricot
A general-purpose toolkit implementing submodular optimization for the purpose of selecting representative subsets of data. Implements several optimization algorithms and objective functions and enables users to write custom versions of either.
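The classic objective here is facility location, optimized with a greedy algorithm: at each step, add the example whose inclusion most improves how well the selected set "covers" the rest of the data. A minimal numpy sketch of that loop is below; apricot's actual API and optimizers differ, and the similarity function is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))
S = X @ X.T                        # assumed similarity: inner products

def greedy_facility_location(S, k):
    """Greedily select k examples maximizing the facility-location objective."""
    n = S.shape[0]
    selected = []
    best = np.zeros(n)             # current best coverage of each example
    for _ in range(k):
        # Marginal gain of each candidate i: how much sum_j max(best_j, S_ij)
        # improves over the current coverage.
        gains = np.maximum(S, best).sum(axis=1) - best.sum()
        gains[selected] = -np.inf  # never re-select an example
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, S[j])
    return selected

subset = greedy_facility_location(S, 10)
```

Because facility location is monotone submodular, this greedy loop carries the usual (1 - 1/e) approximation guarantee relative to the optimal subset.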
Ecosystem Software
Below is ecosystem-level "miscellaneous" software developed to improve general data processing outside the context of any specific research project.
bam2bw
bam2bw is a simple utility for converting BAM-formatted reads into a bigWig file without writing any intermediate files. It can process data either locally or remotely, can use parallel processing, and frequently needs only a few minutes even for massive datasets.
Re-implementations of Popular Algorithms
In addition to developing new algorithms, models, and software, PGL has re-implemented some popular algorithms to make them more efficient or to port them to a different backend when doing so would be valuable for our mission. To give appropriate credit to the original work and to scope our contribution as solely that of a re-implementation, these packages use the original names with a "-lite" suffix.
bpnet-lite
A re-implementation of the BPNet and ChromBPNet models in PyTorch. This implementation includes a command-line interface for training, evaluating, and using these models, as well as a pipeline command that goes from unprocessed data to trained models and downstream artifacts (e.g. TF-MoDISco patterns).
memesuite-lite
A re-implementation of select algorithms from the MEME suite. Currently only FIMO and Tomtom are included, each exposed through a Python API. There is also a command-line tool that uses the new Tomtom implementation in various ways.
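At its core, a FIMO-style scan slides a log-odds position weight matrix along a one-hot sequence and scores every alignment. The sketch below shows that inner loop with a random PWM against a uniform background; memesuite-lite's actual API, statistics, and thresholding differ.

```python
import numpy as np

rng = np.random.default_rng(0)
L, w = 50, 4                                    # sequence and motif lengths
seq = np.eye(4)[rng.integers(0, 4, size=L)]     # one-hot sequence, L x 4

# Random PWM of probabilities, converted to log-odds against a uniform
# 0.25 background (both the PWM and background are assumptions).
pwm = np.log(rng.dirichlet(np.ones(4), size=w) / 0.25)

# Score every alignment of the motif against the sequence.
scores = np.array([(seq[i:i + w] * pwm).sum() for i in range(L - w + 1)])
best = int(scores.argmax())                     # strongest match position
```

A real scanner would additionally convert scores to p-values and report only hits below a significance threshold.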
tfmodisco-lite
A re-implementation of the TF-MoDISco algorithm for motif discovery from feature attributions. This version significantly decreases memory usage, runs several times faster, and reduces the size of the codebase by an order of magnitude. The code has since been merged back into the official TF-MoDISco repository and improved upon.