Training Malinois, a model of MPRA activity of CREs

To enable systematic evaluation of parameters governing data preprocessing, model architecture and training, we developed tools for limited automatic machine learning in PyTorch (https://github.com/sjgosai/boda2). We implemented support for regression based on DNA sequences using CNNs. We deployed a containerized application based on this library in conjunction with the Vertex AI platform on Google Cloud to tune all hyperparameters using Bayesian optimization.

Data preprocessing

Malinois training

To construct the train/validation/test dataset to train Malinois, we aggregated the log2[FC] output of sequences tested in K562, HepG2 and SK-N-SH cells from multiple projects (OL indexed reference files are shown in Supplementary Table 1). The majority of projects focused on testing the allelic effects of human genetic variation with the remaining projects testing only the reference sequences of the human genome. In total, 798,064 unique oligos were aggregated, originating from ten independent experiments (from three different projects: UKBB (OL27, OL28, OL29, OL30, OL31, OL32, OL33), GTEx (OL41, OL42), OL15). The majority of the sequences used in our study (783,978) were designed to evaluate common human genetic variation associated with heritable complex traits. The majority of sequences (706,054) consisted of testing the reference and alternative allele, typically a single-nucleotide substitution, centred within 200 bp of flanking sequence. Additional sequences (77,924) evaluated the four pairwise combinations of two independent variants. Variants were selected on the basis of genetic fine-mapping, with most variants being linkage disequilibrium partners of causal alleles and therefore likely to not have a meaningful impact on cellular or organismal traits. The remaining sequences (14,086) originated from OL15, from which we selected the known DHSs and H3K27ac sequences. Oligos with a plasmid count of fewer than 20 or no RNA count in any cell type were discarded. If an oligo was present in more than one UKBB library, its log2[FC] values were averaged across libraries. If an oligo in UKBB was also found in GTEx or OL15, only the UKBB readout was collected and the others were discarded. If an oligo in GTEx (but not in UKBB) was also found in OL15, only the GTEx readout was collected and the OL15 readout was discarded. Non-natural sequences from OL15 were discarded. Moreover, oligos with a log2[FC] of 6 s.d. 
below the global mean were discarded (fewer than 10 oligos). Sequences were padded on both sides with constant sequences from the reporter vector backbone to form 600 bp sequences and converted into one-hot arrays (that is, A:= [1,0,0,0], C:= [0,1,0,0], G:= [0,0,1,0], T:= [0,0,0,1], N:= [0,0,0,0]). Oligos from chromosomes 19, 21 and X were held out from the parameter training loop as a validation set to guide hyperparameter tuning. Oligos from chromosomes 7 and 13 were held out from both parameter training and hyperparameter tuning loops as a test set for reporting performance. Oligos from the remaining chromosomes were used in the training loop. Oligos containing alternative alleles were assigned to the same chromosomes as the reference allele oligos. Data augmentation was performed by adding to the training set the reverse complement of the (600 bp) sequences, and by duplicating oligos that had a log2[FC] greater than 0.5 in any cell type. We also aggregated the log2[FC] output of 318,247 and 442,482 sequences tested in A549 (OL27, OL28, OL29, OL30, OL31, OL32, OL33) and HCT116 (OL41, OL42) cells, respectively, according to the same count filtering steps as described above.
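The encoding, padding and chromosome split described above can be sketched as follows. This is a minimal illustration and not the boda2 code: the function names, the `chr`-prefixed chromosome labels, the symmetric split of the padding and the flank strings passed by the caller are all assumptions made for the example.

```python
# Illustrative sketch only; names and flank handling are assumptions.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

VALIDATION_CHROMS = {"chr19", "chr21", "chrX"}  # hyperparameter tuning
TEST_CHROMS = {"chr7", "chr13"}                 # performance reporting


def pad_oligo(oligo, left_flank, right_flank, total_len=600):
    """Pad an oligo with backbone sequence on both sides up to total_len.

    Assumes the padding is split evenly and taken from the flank bases
    immediately adjacent to the insert.
    """
    missing = total_len - len(oligo)
    if missing < 0:
        raise ValueError("oligo is longer than the target length")
    left = missing // 2
    right = missing - left
    if left > len(left_flank) or right > len(right_flank):
        raise ValueError("flanking sequences are too short")
    return left_flank[len(left_flank) - left:] + oligo + right_flank[:right]


def one_hot(seq):
    """Return a 4 x L nested list with rows ordered A, C, G, T."""
    columns = [ONE_HOT[base] for base in seq.upper()]
    return [[col[row] for col in columns] for row in range(4)]


def reverse_complement(seq):
    """Reverse complement, used here for training-set augmentation."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def assign_split(chrom):
    """Map a chromosome label to train, validation or test."""
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in VALIDATION_CHROMS:
        return "validation"
    return "train"
```

For an alternative-allele oligo, `assign_split` would be called with the chromosome of its reference-allele partner so that both alleles land in the same partition.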

Test set performance metrics and other analyses

For analyses outside Malinois’ training loop that leverage the train/validation/test sets, we aggregated the same 798,064 unique oligos mentioned above initially filtering out only oligos with an RNA count of zero before averaging the log2[FC] across UKBB libraries (no plasmid count filter). Oligos with a log2[FC] standard error greater than 1 in any cell type were then omitted from performance metrics (in the case of oligos with multiple instances across UKBB libraries, oligos of which the highest log2[FC] standard error across libraries was greater than 1 were omitted). For locus-specific benchmarking, we aggregated the log2[FC] of oligos that tile the GATA1 locus (OL43) according to the same count filtering steps as described above. We generated per-genome-base activity measurements by averaging the MPRA activity of each oligo that overlaps that base pair. We removed oligo genomic coordinates that overlap with those in the UKBB and GTEx libraries in scatterplots and correlation calculations.
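As a minimal sketch of the per-base averaging used for the tiled locus, the function below averages the activity of every oligo covering each position. The half-open `(start, end, activity)` tuple layout and the function name are assumptions for illustration.

```python
def per_base_activity(oligos):
    """Average oligo activity over every genomic position it covers.

    oligos: iterable of (start, end, activity) with half-open coordinates
    (an assumed layout for this example).
    Returns a dict mapping position -> mean activity of overlapping oligos.
    """
    sums = {}
    counts = {}
    for start, end, activity in oligos:
        for pos in range(start, end):
            sums[pos] = sums.get(pos, 0.0) + activity
            counts[pos] = counts.get(pos, 0) + 1
    return {pos: sums[pos] / counts[pos] for pos in sums}
```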

Model architecture

The final Malinois model is composed of three functional segments: (1) three convolutional layers with batch normalization and maximum value pooling; (2) a fully connected linear layer to integrate positional and feature information from the previous hidden state after flattening; and (3) a stack of branched linear layers such that each output feature is a function of four independent transformations. As the first two segments are replicated from the Basset architecture4, Malinois accepts batches of 4 × 600 arrays corresponding to one-hot encoded DNA sequences, so predictions for 200-nucleotide MPRA oligos are made by padding inputs on both sides with constant sequences from the reporter vector backbone. This strict input sizing requirement ensures that hidden states are appropriately shaped when transitioning between segments 1 and 2 of the model. Furthermore, this padding strategy enables us to use reverse complement data augmentation with awareness of the orientation of the 200-nucleotide MPRA inserts with respect to the transcription start site in the reporter backbone. Although it was not tested in this study, replacing the final strict max pooling layer with adaptive pooling or padding would allow flexibility in the input sizing requirements while maintaining all other components of the architecture. At training initiation, weights were initialized using pretrained weights from a PyTorch implementation of Basset when segments 1 and 2 were appropriately configured.

Model fitting

We trained Malinois using the Vertex AI API on the Google Cloud Platform (GCP). This enabled optimization of all tuneable parameters controlling data preprocessing, model architecture and model training. To do this, first we generated a docker container (gcr.io/sabeti-encode/boda/production:0.0.11) with an installation of CODA using a GCP VM with the following specifications: Debian-based deep learning VM for Pytorch CPU/GPU operating system, a2-highgpu-1g machine type and 1 NVIDIA Tesla A100 40G GPU. The container entrypoint was set to a Python script for model training (boda2/src/main.py). Using this container, we deployed hyperparameter tuning jobs using the default algorithm to optimize the indicated hyperparameters (Supplementary Table 9). We include a notebook for deploying a hyperparameter tuning job using the Vertex AI SDK (boda2/tutorials/vertex_sdk_launch.ipynb). We finalized model selection for Malinois by benchmarking candidates on the validation set using predictions calculated as described in the next section. All test set benchmarking was retrospective and did not impact decision making in the study. Two additional models were fitted using a subset of sequences tested in either A549 or HCT116 cells using identical hyperparameter configurations to Malinois.

Correlation of empirical and predicted MPRA activity

When comparing Malinois’ predictions to empirical MPRA, we discard any oligo with a replicate log2[FC] standard error greater than 1 in any cell type (see section Data preprocessing above for more details). Malinois’ predictions for the (padded) forward and reverse complement sequences are averaged into a single prediction.
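The strand averaging can be sketched as below. Here `predict` is a hypothetical callable standing in for the model; it is assumed to take a padded sequence string and return one value per cell type.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def predict_strand_averaged(predict, padded_seq):
    """Average predictions for a padded sequence and its reverse complement.

    predict is a placeholder for the trained model, assumed to return a
    list with one predicted log2[FC] per cell type.
    """
    forward = predict(padded_seq)
    reverse = predict(reverse_complement(padded_seq))
    return [(f + r) / 2 for f, r in zip(forward, reverse)]
```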

Optimization of cell-type specificity

The objective function to guide the sequence design with simulated annealing (minimize energy) was the MinGap (Malinois log2[FC] prediction in the target cell type minus the maximum off-target cell type log2[FC] prediction). The objective function used with the algorithms Fast SeqProp and AdaLead (minimize or maximize, respectively) was the bent-MinGap, which is defined as follows. Let y⁺ be the Malinois log2[FC] prediction on the target cell type, and y⁻ the maximum of the log2[FC] predictions on the off-target cell types of a given sequence (so MinGap = y⁺ − y⁻). We constructed a bending function g(x) = x − e^(−x) + 1 to preprocess predictions such that the objective function becomes bent-MinGap = g(y⁺) − g(y⁻). We applied g(x) to the predictions to incentivize greater MinGaps with low expression in the off-target cell types. For all three generative algorithms, to prevent pathologically extreme activity predictions that are common in deep learning methods when computing on sequences highly divergent from the training data, we constrained predictions to a limited interval (default: [−2, 6]) when generating sequences.
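A minimal sketch of the two objectives follows. Clipping the predictions before bending is an assumption about the order of operations, and the function names are illustrative rather than taken from the study's code.

```python
import math


def bend(x):
    """Bending function g(x) = x - exp(-x) + 1."""
    return x - math.exp(-x) + 1.0


def clip(x, low=-2.0, high=6.0):
    """Constrain a prediction to the interval [low, high]."""
    return max(low, min(high, x))


def min_gap(preds, target):
    """Target prediction minus the largest off-target prediction."""
    off_target = max(v for i, v in enumerate(preds) if i != target)
    return preds[target] - off_target


def bent_min_gap(preds, target, low=-2.0, high=6.0):
    """g(target) - g(max off-target), applied to clipped predictions."""
    clipped = [clip(v, low, high) for v in preds]
    off_target = max(v for i, v in enumerate(clipped) if i != target)
    return bend(clipped[target]) - bend(off_target)
```

Because g is strictly increasing, bending the maximum off-target prediction is equivalent to taking the maximum of the bent off-target predictions. For two sequences with the same MinGap of 3, predictions (3, 0, −1) score higher than (4, 1, 0), which is the intended preference for low off-target expression.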

Iterative maximization of sequence function using iterative, generative and evolutionary sequence generation algorithms

Fast SeqProp

Fast SeqProp5 was selected as a representative gradient-based local optimization method that exploits the structure of deep learning models to conduct greedy search while retaining the ability to pass true one-hot encoded inputs to the model. We implemented this algorithm as described in previous work, but we removed the learnable affine transformation in the instance normalization layer and drew many one-hot encoded samples from the categorical nucleotide probability distribution in each optimization step to more confidently estimate the gradients of the learnable reparameterized input sequence. The input parameters were randomly initialized (drawn from a normal distribution) and optimized using the PyTorch implementation of the Adam optimization algorithm with a learning rate of 0.5, along with a cosine annealing scheduler with a minimum learning rate of 10⁻⁶ over 300 training steps. In each training step, the loss function value was the negative average bent-MinGap of 20 sequence samples drawn from the categorical nucleotide probability distribution at that step. Once optimization was finalized, instance normalization was applied to the learned input, 20 sequences were sampled from the resulting distribution, and the sequence with the highest predicted bent-MinGap was collected unless its value was less than 3.6.

AdaLead

AdaLead6, another greedy search algorithm, was selected as a representative evolutionary optimization algorithm for its ease of implementation and previously reported success in DNA sequence optimization. We implemented this algorithm as written in the GitHub repository associated with the original paper. In each run, 20 randomly initialized sequences are optimized over 30 generations with mu=1, recomb_rate=0.1, threshold=0.25, rho=2, using bent-MinGap as the fitness (objective) function. Once optimization is finalized, only the sequence with the highest predicted bent-MinGap is collected unless the MinGap was less than 2. We chose to collect only one sequence per run to maximize diversity in the global batch collected from all runs.

Simulated annealing

Simulated annealing7 was selected as a representative probabilistic optimization algorithm based on a decades-long history of successful application to a wide range of domains for non-convex optimization. Simulated annealing starts by jumping between regions with different local optima by occasionally accepting proposals that deteriorate the objective when the sampling temperature is high early in the algorithm. In later stages, the algorithm shifts toward greedy hill climbing as low sampling temperatures only allow proposals that improve the objective to be accepted. We implemented simulated annealing based on the Metropolis–Hastings algorithm for Markov chain Monte Carlo simulations. Proposals were generated symmetrically at each step by mutating three random bases. We used negative MinGap (without bending) to simulate the energy landscape of the theoretical system. During optimization, the temperature term was reduced using a monotonically decreasing function with a diverging infinite sum:

$$\tau =\frac{1}{1+{s}^{0.501}}.$$

To produce sequences with high target-specific activity we used negative MinGap (without bending) to simulate energy of the system.
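The annealing loop can be sketched as below, assuming that s in the temperature schedule denotes the step index. The `energy` argument is a placeholder for the negative MinGap computed from model predictions; requiring each mutated base to differ from the original, and returning the final state of the chain, are choices made for this example only.

```python
import math
import random


def temperature(step):
    """Schedule tau = 1 / (1 + s**0.501), with s taken as the step index."""
    return 1.0 / (1.0 + step ** 0.501)


def propose(seq, rng, n_mut=3):
    """Symmetric proposal: substitute n_mut randomly chosen positions."""
    new = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        new[pos] = rng.choice([b for b in "ACGT" if b != new[pos]])
    return "".join(new)


def anneal(seq, energy, n_steps=1000, seed=0):
    """Metropolis acceptance with a decreasing temperature.

    energy is a placeholder callable (lower is better), for example the
    negative MinGap of the model predictions for a sequence.
    """
    rng = random.Random(seed)
    current, current_e = seq, energy(seq)
    for step in range(1, n_steps + 1):
        candidate = propose(current, rng)
        candidate_e = energy(candidate)
        delta = candidate_e - current_e
        # Always accept improvements; accept deteriorations with
        # probability exp(-delta / tau), which shrinks as tau decreases.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature(step)):
            current, current_e = candidate, candidate_e
    return current, current_e
```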

Motif penalization

To design a batch of sequences penalizing the enrichment of given motifs in the batch, we introduced to the loss function an additional term explained below. To penalize a single motif of length l, we construct the motif position–weight matrix (PWM; also known as position-specific scoring matrix, or log probabilities) and use it to score all possible subsequences x_j of length l in the batch. Let s_j = PWM(x_j) be the motif score of the subsequence x_j, n the number of sequences in the batch, and t a score threshold. Then, we define the motif penalty as

$$\frac{1}{n}\sum _{j:{s}_{j}\ge t}{s}_{j},$$

where j iterates over all the possible subsequences including their reverse complements. In other words, we sum all the motif scores above the score threshold and divide by the size of the batch. When penalizing m motifs, the term we introduce is very close to simply averaging the m motif penalties, except that we introduce a weighting factor for each motif penalty to emphasize the penalization of motifs with lower indices (or in our case below, to prioritize motifs based on their order of inclusion to the motif pool). If we let s_j^(i) = PWM^(i)(x_j) be the motif score of motif i of the subsequence x_j, and t^(i) the score threshold of motif i, then the total motif penalty given a motif pool {PWM^(1), …, PWM^(m)} is defined as

$$\frac{1}{mn}\sum _{i\in [m]}{(m-i+1)}^{\frac{1}{3}}\sum _{j:{s}_{j}^{(i)}\ge {t}^{(i)}}{s}_{j}^{(i)},$$

where the term (m − i + 1)^(1/3) is the weighting factor increasing the value of the motif penalties with lower index i.
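The penalty can be sketched in plain Python as follows. The PWM layout (one dictionary of base scores per position) and the function names are assumptions for the example, and sequences are assumed to contain only A, C, G and T.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def pwm_score(pwm, subseq):
    """Sum the per-position scores of a subsequence under a PWM."""
    return sum(column[base] for column, base in zip(pwm, subseq))


def thresholded_score_sum(pwm, threshold, batch):
    """Sum of motif scores >= threshold over both strands of the batch."""
    length = len(pwm)
    total = 0.0
    for seq in batch:
        for strand in (seq, reverse_complement(seq)):
            for start in range(len(strand) - length + 1):
                score = pwm_score(pwm, strand[start:start + length])
                if score >= threshold:
                    total += score
    return total


def motif_penalty(pwms, thresholds, batch):
    """Weighted penalty over a motif pool, ordered by inclusion.

    Motif i (1-based) receives the weight (m - i + 1) ** (1/3), so motifs
    added earlier to the pool are penalized more strongly.
    """
    m, n = len(pwms), len(batch)
    total = 0.0
    for i, (pwm, threshold) in enumerate(zip(pwms, thresholds), start=1):
        weight = (m - i + 1) ** (1.0 / 3.0)
        total += weight * thresholded_score_sum(pwm, threshold, batch)
    return total / (m * n)
```

With a single motif in the pool the weight is 1 and the expression reduces to the single-motif penalty.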

We used this motif penalty expression to iteratively design sequences subject to an increasing pool of motifs. We call these iterations penalization tracks. A single penalization track starts with the generation of a batch of 500 (non-penalized) sequences, which is then analysed for motif enrichment (top 10 motifs of length 8 to 15) using STREME through a Python wrapper function. We collect the top motif PWM^(1) from the analysis and design a second batch of 250 sequences (which we call round-1 penalized sequences) penalizing the motif pool {PWM^(1)}. We then extract the top motif PWM^(2) enriched in the round-1 penalized sequences and design a third batch of 250 sequences (round-2 penalized sequences) penalizing the motif pool {PWM^(1), PWM^(2)}. We continue this process until we generate 250 round-5 penalized sequences penalizing the motif pool {PWM^(1), PWM^(2), …, PWM^(5)}.

We generated four penalization tracks for each target cell type, for all three cell types. We defined the score threshold for each motif as a percentage of the motif score of its consensus sequence. The percentages used were 0 for K562-target sequences, and 0.25 for HepG2- and SK-N-SH-target sequences. The reason behind the different choice for K562 cells is that we found that the optimization process could more easily escape the penalization of GATA by still using suboptimal instances of the motif, so a more stringent penalty was of interest for us. The motivation for using a weighting factor was that we hypothesize that sequence design optimization gravitates more strongly to motifs captured in enrichment analyses of early penalization rounds, so we wanted to keep emphasizing the penalization of motifs extracted from earlier rounds.

In Supplementary Note 2, the motif-presence score (y axis) of a motif in each sequence was calculated by summing all the motif-match scores that pass the Patser score threshold (as defined in Biopython83), and then dividing by the maximum possible motif score (the match score of the motif consensus sequence).

k-mer analysis

We calculated 4-mer and 7-mer content for sequences in the CODA MPRA library as well as various other sets of reference sequences including 200-mers upstream of RefGene annotated transcription start sites, shuffled CODA sequences and random 200-mers. We calculated the average Manhattan distance from each 200-mer to its k nearest neighbours (k = 4) by splitting sequences into groups based on design method, target cell line and penalty level and using the NearestNeighbors module from scikit-learn (v.1.2.2). We embedded sequences in two-dimensional space based on 4-mer content using the uniform manifold approximation and projection implemented by the umap-learn (v.0.5.2) Python package.
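The k-mer content and nearest-neighbour distance can be illustrated with a brute-force sketch in plain Python. It does not use the scikit-learn module named above, it reads k = 4 as the number of neighbours, and it assumes distances are computed between k-mer count vectors; all three are assumptions for the example.

```python
from itertools import product


def kmer_counts(seq, k=4):
    """Count vector over all 4**k k-mers, in lexicographic ACGT order."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # skip k-mers containing N or other symbols
            counts[index[kmer]] += 1
    return counts


def manhattan(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_knn_distance(vectors, k=4):
    """Mean Manhattan distance from each vector to its k nearest others."""
    if len(vectors) <= k:
        raise ValueError("need more than k vectors")
    means = []
    for i, vec in enumerate(vectors):
        dists = sorted(
            manhattan(vec, other) for j, other in enumerate(vectors) if j != i
        )
        means.append(sum(dists[:k]) / k)
    return means
```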

Homology search using Nucleotide BLAST

We conducted a homology search using NCBI ElasticBLAST to determine whether synthetic sequences had measurable homology to any sequences in the Nucleotide Collection. We used the BLASTn algorithm, the dc-megablast task and a word size of 11 and maintained the defaults for all other settings.

Selection of naturally occurring cell-type-specific sequences by DNase and Malinois-driven GenomeScan

DHS-natural

To identify CREs broadly replicating across experimental approaches, using a uniformly processed dataset from ENCODE, we first selected DNase peaks from each of the three cell lines (K562, HepG2 and SK-N-SH). To further select for active CREs, we subsetted DHS peaks that intersect with H3K27ac peaks from the same cell type. For each cell type, we then identified cell-type-specific peaks by requiring that a DHS+H3K27ac+ peak had no overlap with a DHS peak in either of the other two cell types. For these DHS–H3K27ac peaks, in each cell type, we scored the K562, HepG2 and SK-N-SH DHS signal in the peak coordinates of the target cell type. We then selected the 4,000 peaks with the highest ratio of the on-target cell type’s DHS signal to the maximal off-target cell type’s DHS signal, mirroring our efforts to maximize MinGap of log2-space MPRA activity with other CREs.

Malinois-natural

To nominate cell-type-specific natural sequences with Malinois, we tiled the whole human genome into 200 bp windows using a 50 bp stride and generated predictions for each window sequence. The cell-type specificity of each sequence was obtained by evaluating the objective function mentioned above (bent-MinGap), and the top 4,000 best-performing sequences were selected for each cell type.
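The tiling and ranking step can be sketched as below. Dropping the partial window at the end of a chromosome and the `(window, score)` tuple layout are assumptions; the score would be the bent-MinGap of the model predictions for each window.

```python
def tile_windows(chrom_length, window=200, stride=50):
    """Yield half-open (start, end) windows covering a chromosome.

    A trailing window shorter than `window` is dropped (an assumption).
    """
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


def top_n(scored_windows, n=4000):
    """Keep the n highest-scoring (window, score) pairs."""
    return sorted(scored_windows, key=lambda item: item[1], reverse=True)[:n]
```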

Genome annotation of natural sequences

Malinois-natural sequences capture a unique component of the genome compared with DHS-natural, with 2.7% of Malinois-natural sequences overlapping sequences in our DHS-natural set, and 65.8% residing outside any previously annotated CREs. cCRE BED files for promoter-like sequences, proximal enhancer-like sequences, distal enhancer-like sequences and CTCF-only were downloaded from the ENCODE SCREEN Portal12 and concatenated into a single BED file for intersection with DHS-natural and Malinois-natural BED files using a custom script. Intersections were performed using bedtools (v.2.30.0)84 and pybedtools (v.0.9.0)85 with the following command ‘Malinois/DHS-natural_BED.intersect(ENCODE_cCRE_BED, wa=True, u=True)’ and the number of intersections was reported. To determine the genomic features overlapping DHS-natural and Malinois-natural sequences, the same BED files were used as an input for annotatePeaks.pl from the homer suite (v.4.11)86 with the following command ‘annotatePeaks.pl inputBED hg38 -annStats annStats.txt > annotatePeaksOut.txt’. Annotations for the whole genome (hg38) were generated by dividing the genome into 200 bp intervals using the bedtools makewindows command ‘bedtools makewindows -g hg38.txt -w 200 > hg38_200bp.bed’. Annotations were generated for each cell type (K562, HepG2, SK-N-SH) and sequence selection method (DHS-natural, Malinois-natural).

Sampled integrated gradients to compute contribution scores of Malinois predictions

We calculated nucleotide contribution scores for each sequence in the proposed library using an adaptation of the input attribution method Integrated Gradients58. Sampled Integrated Gradients (SIG) considers the expected gradients along the linear path in log-probability space from the background distribution to the distribution that samples the input sequence almost surely. In each point of the linear path, a sequence probability distribution (also known as a position probability matrix (PPM)) is obtained from the log-probability space parameters by applying the SoftMax function along the nucleotide axis, and a batch of sequences is sampled from that distribution to be fed into the model. We then calculate the gradients of the batch model predictions with respect to the parameters in the log-probability space, using the straight-through estimator to backpropagate through the sampling operation. The batch gradients are averaged for each point in the path and approximate the gradient integral as in the original formulation of the method. In our case, the subtraction of the baseline input from the input of interest involves the parameters in log-probability space. This adaptation of Integrated Gradients provides two useful features. First, the sequence inputs being fed to the model are always in one-hot form, avoiding evaluations of inputs off the vertices of the simplex on which the model was trained which could more easily lead to pathological predictions. Second, the original method relies on choosing an appropriate single baseline input against which to compare the input of interest, which might not always be straightforward, whereas our adaptation uses a background distribution of sequences as the baseline. Favourably, when choosing the uniform background (0.25, 0.25, 0.25, 0.25), the parameters in log-probability space where the line path is traversed become the zero matrix, which removes the need to subtract the baseline from the input of interest. 
We can then more easily extract integrated gradients for all tokens in all positions (by omitting masking the gradients with the one-hot input), which we found useful as hypothetical scores for TF-MoDISco.

Contribution block ablation

To test the value of contribution scores obtained with SIG, we conducted an in silico ablation study of the library sequences using contribution blocks (defined below) to randomize segments of the sequences. The goal of the study was to investigate the predicted log2[FC] effects of randomizing positions within the sequences corresponding to blocks of either positive or negative contribution, or random positions outside blocks. The result of the study is summarized in Supplementary Note 4. Overall, randomizing segments of the sequences associated with negative contributions resulted in an increase in the predicted activity in either the target or off-target cell type, while randomizing those associated with positive contribution completely destroyed the activity in the target cell type, and marginally decreased the (already repressed) activity in off-target cell types. To make calls of contribution blocks in any given sequence, we took the 200 contribution scores and built a smoothed contribution signal using a one-dimensional Gaussian filter (scipy.ndimage.gaussian_filter1d) with a sigma of 1.15. We defined a positive contribution block whenever the smoothed signal was above a threshold of 0.015 for 4 contiguous positions or more, and a negative contribution block whenever it was below −0.015 for 4 contiguous positions or more. Outside positions were those not assigned to a contribution block. For each target cell type group (25,000 sequences), contribution block calls and ablations were performed for all three prediction tasks. For example, taking the K562-target sequences, three different ablations and call sets were carried out: (1) block calls using contribution scores in K562 cells assessing the K562 activity effect (target cell type); (2) block calls using contribution scores in HepG2 cells assessing the HepG2 activity effect (off-target cell type); and (3) block calls using contribution scores in SK-N-SH cells assessing the SK-N-SH activity effect (off-target cell type).
This resulted in a total of nine sets of calls and ablations. When assessing the effect of disrupting positions outside contribution blocks, we subsampled the outside coverage (number of positions not in blocks) to match the upper half of the distribution of coverage sizes of positive and negative contribution blocks together, whenever possible. For the SK-N-SH-target group, for example, such a distribution match was not possible as the total number of available positions from which to sample was simply not large enough globally. The same was true for the target cell type outside ablation in K562 and HepG2 cells, which might be expected as positive contribution blocks alone have large coverages. We performed this outside subsampling to have comparable ablation sizes across categories, but also because disrupting all of the positions outside blocks that have low coverage (resulting in very high outside coverages) introduces too much noise into the sequence when most of the sequence is disrupted. We set a minimum of five positions to be disrupted by outside coverages.
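A minimal sketch of the block-calling rule is given below. It operates on an already smoothed signal (the Gaussian filtering step is not reproduced here) and assumes that the negative threshold mirrors the positive one; function names are illustrative.

```python
def call_blocks(smoothed, threshold=0.015, min_len=4):
    """Call contribution blocks from a smoothed contribution signal.

    Returns half-open (start, end, sign) runs of at least min_len
    contiguous positions above +threshold or below -threshold.
    """
    labels = [
        1 if value > threshold else -1 if value < -threshold else 0
        for value in smoothed
    ]
    blocks = []
    start = 0
    for pos in range(1, len(labels) + 1):
        # Close the current run at the end of the signal or on a label change.
        if pos == len(labels) or labels[pos] != labels[start]:
            if labels[start] != 0 and pos - start >= min_len:
                sign = "positive" if labels[start] > 0 else "negative"
                blocks.append((start, pos, sign))
            start = pos
    return blocks


def outside_positions(length, blocks):
    """Positions that are not assigned to any contribution block."""
    covered = set()
    for start, end, _ in blocks:
        covered.update(range(start, end))
    return [pos for pos in range(length) if pos not in covered]
```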

Propeller plots

A propeller dot plot (Fig. 2e (top row)) is a two-dimensional plotting scheme of our own design that seeks to elucidate the cross-dimensional non-uniformity of three-dimensional points. In this coordinate system, a point’s radial distance from the origin corresponds to the difference between the maximum and minimum values. Its angle of deviation from the axis corresponding to the maximum value quantifies the position of the median value within the range of the minimum and maximum values. Namely, the angle is proportional to the ratio between two differences: (1) the difference of the median and minimum values; and (2) the difference of the maximum and minimum values. This ratio represents the 60°-angle fraction deviating from the axis corresponding to the maximum value towards the axis corresponding to the median value. A higher angle of deviation (maximum of 60°) indicates that the median value is closer to the maximum value, while a lower angle of deviation (minimum of 0°) indicates that the median value is closer to the minimum value.

This can also be formulated in terms of the MinGap (maximum − median) and MaxGap (maximum − minimum). In our coordinate system, the MaxGap corresponds to the radial distance. The difference (1 − MinGap/MaxGap) corresponds to the 60°-angle fraction deviating from the axis corresponding to the maximum value towards the axis corresponding to the median value. The MinGap:MaxGap ratio controls how much a point gravitates toward a main axis and away from the in-between-axis areas. A ratio of 0 means that the MinGap is zero and therefore the median value is equal to the maximum, so the point will be exactly between two axes. If the ratio is 1, it means that the median and the minimum values are equal, therefore the point will fall exactly in the axis corresponding to the maximum value. Note that, for this point of view to work with target and off-target cell type activities, we assume that the maximum cell type activity is the intended target cell type. This implies that, when counting sequences that pass specificity thresholds in Fig. 2e, some sequences get their target cell type reassigned to the cell type with the maximum activity, with DHS-natural sequences being the group that most benefits from the reassignment. A total of 652 sequences pass the lenient specificity threshold of MaxGap > 1 and MinGap/MaxGap > 0.5 by getting their target cell type reassigned (DHS-natural, 565; Malinois-natural, 39; AdaLead, 12; Simulated Annealing, 5; Fast SeqProp, 0; Fast SeqProp penalized, 4). However, only 16 sequences pass the stringent specificity threshold of MaxGap > 4 and MinGap/MaxGap > 0.5 by getting their target cell type reassigned (DHS-natural, 15; Malinois-natural, 0; AdaLead, 1; Simulated Annealing, 0; Fast SeqProp, 0; Fast SeqProp penalized, 0).

As an example of coordinate calculation, take the point (5, 3, 1). This point would have a radial distance of 5 − 1 = 4 and an angle of deviation from the axis of the first dimension of (3 − 1)/(5−1) × (60°) = 30° (in the direction of the axis of the second dimension). In terms of the MinGap:MaxGap ratio, the angle of deviation from the axis of the first dimension (the dimension of the maximum value) towards the axis of the second dimension would be (1 − (5 − 3)/(5 − 1))(60°) = 30°. Observe that all the points of the form (x + 4, x + 2, x), for any real value of x, will have the same coordinates as the point (5, 3, 1).
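The coordinate calculation in this example can be written as a short function. It returns only the radial distance and the angle of deviation from the axis of the maximum value; the placement of the three axes on the page is not reproduced, and returning an angle of 0 for a point with three equal values is a convention chosen for the sketch.

```python
def propeller_polar(values):
    """Return (radial distance, angle of deviation in degrees).

    The radial distance is MaxGap = max - min. The angle is
    (1 - MinGap / MaxGap) * 60, with MinGap = max - median.
    """
    low, median, high = sorted(values)
    max_gap = high - low
    min_gap = high - median
    if max_gap == 0:
        return 0.0, 0.0  # all three values equal: the point is the origin
    return max_gap, (1 - min_gap / max_gap) * 60.0
```

Calling `propeller_polar((5, 3, 1))` gives a radial distance of 4 and an angle of 30°, matching the worked example, and any point of the form (x + 4, x + 2, x) gives the same result.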

A propeller count plot (Fig. 2e (bottom row)) shows the percentage of points that fall in each given area of a propeller dot plot. The teal, yellow and red regions capture sequences in which the median value is closer to the minimum value than to the maximum value. Teal, yellow and red areas represent sequences in which the MinGap:MaxGap ratio is greater than 0.5.

The two synthetic groups in Fig. 2e were randomly subsampled to have exactly 12,000 sequences each and avoid over-plotting compared to the plots of the two natural groups. Supplementary Fig. 9 shows the complete propeller plots broken down by design method.

Oligos with a replicate log2[FC] standard error greater than 1 in any cell type were omitted from the plots.

Motif discovery

We used TF-MoDISco Lite59,60 to extract sequence motifs predicted to be functional by Malinois, using contribution scores obtained through SIG. As described above, SIG naturally provides hypothetical contribution scores (as defined by TF-MoDISco) when the uniform background is selected, by simply carrying out the equivalent of the full process without masking by the one-hot matrix of the input sequence. The final contribution scores can then be retrieved by masking the hypothetical contributions with the input sequence one-hot matrices, as required by TF-MoDISco. We computed hypothetical contribution scores for each of the three prediction tasks and ran TF-MoDISco Lite with 100,000 seqlets and a window size of 200 (equivalent results were obtained using 1,000,000 seqlets). We aggregated the discovered patterns across prediction tasks following their provided example using modiscolite.aggregator.SimilarPatternsCollapser. TF-MoDISco Lite results are provided as positive and negative patterns.

TF-MoDISco patterns to PWMs

To convert a TF-MoDISco positive pattern living in the hypothetical-contribution-score space into a PWM, we divided the pattern scores by the maximum position score sum and multiplied by 10. To obtain the PPM, we applied the SoftMax function to each position vector. Some of our TF-MoDISco negative patterns are a combination of a negative pattern (negative contributions) and a positive one (positive contributions). Thus, to convert a TF-MoDISco negative pattern into a PWM, we first reversed the sign directionality of the negative portions (as informed by the pattern scores living in contribution-score space, not hypothetical) and compensated their magnitude by multiplying by 1.2 (because our negative contribution scores are in general smaller in magnitude than positive ones perhaps due to the nature of the training data target distribution that has a positive bias). We then proceed as for the positive patterns.
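The conversion of a positive pattern can be sketched as follows. The reading of "maximum position score sum" as the largest per-position sum of the four nucleotide scores is an assumption, as are the function names and the list-of-positions layout.

```python
import math


def pattern_to_pwm(pattern, scale=10.0):
    """Rescale pattern scores into a PWM.

    pattern: list of positions, each a list of four scores (A, C, G, T).
    Scores are divided by the largest per-position score sum (an assumed
    reading of the normalization) and multiplied by `scale`.
    """
    max_sum = max(sum(position) for position in pattern)
    if max_sum <= 0:
        raise ValueError("expected a positive maximum position score sum")
    return [[scale * value / max_sum for value in position] for position in pattern]


def pwm_to_ppm(pwm):
    """Apply the SoftMax function to each position vector."""
    ppm = []
    for position in pwm:
        peak = max(position)  # subtract the maximum for numerical stability
        exps = [math.exp(value - peak) for value in position]
        total = sum(exps)
        ppm.append([e / total for e in exps])
    return ppm
```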

Core motifs (TF-MoDISco)

As TF-MoDISco, in addition to capturing isolated ungapped motifs, is able to capture patterns that are combinations of motifs, we heuristically extracted core ungapped patterns that, to varying degrees, account for all of the combinations observed in the TF-MoDISco merged results. To manually define the starts and stops of core motifs, we relied on scoring the full pattern PWMs against themselves using TOMTOM87, information content contours and visual examination. The core motif IDs are derived from the IDs of the original patterns from which they were extracted. To convert the patterns into PWMs and PPMs, we applied the same operations as described above. Matches to human known TF-binding motifs were assigned using TOMTOM with the default parameters against the databases JASPAR CORE (2022)61 and HOCOMOCO Human (v11 FULL)62.

Core motifs (STREME)

In addition to extracting sequence motifs with TF-MoDISco, we also performed a motif enrichment analysis using STREME. First, to assess the agreement between a given STREME motif and its predicted functionality as measured by contribution scores, we computed a weighted average of the hypothetical contribution scores corresponding to all the sequence segments that were determined to be a match to the motif (as provided by FIMO with default parameters, using motif scores as weights), and compared the score averages (one set of averages per prediction task) to the motif’s information-content matrix. We refer to the weighted average hypothetical scores as the contribution-score projection. All motifs with overall positive contribution scores that had a strong agreement with their contribution-score projection had already been captured by TF-MoDISco, suggesting that the TF-MoDISco positive pattern results are very comprehensive. However, we found a small number of STREME motifs with negative contribution scores that had a strong agreement with their contribution-score projection, so we decided to include them in the list of core motifs. Note that these motifs had negative contribution scores with moderate-to-low magnitude. We speculate that TF-MoDISco might not have been able to detect them because the contribution allocated in the seqlets that would correspond to these motifs too often falls below the threshold of the distribution of negative scores, making it hard to discriminate them from noise or insignificant scores. Running TF-MoDISco with 1 million seqlets did not change the results. We retrieved 11 such STREME motifs with strong agreement with their contribution-score projection not captured by TF-MoDISco, 9 of which were clustered together into 3 groups with nearly identical contribution-score projection (up to 1 or 2 additional positions to the left or right). 
This gave us a total of five STREME negative patterns in contribution-score projection form that were included in the list of core motifs. Their conversion to PWM and PPM forms followed the same process as for the TF-MoDISco patterns. Matches to known human TF-binding motifs were assigned using TOMTOM with the default parameters against the databases JASPAR CORE (2022)61 and HOCOMOCO Human (v11 FULL)62.

Contribution score-based motif scan

To find instances of the core motifs present in the CODA sequence library, we leveraged the hypothetical contribution scores of the sequences to match sequence segments to the core motifs in hypothetical-contribution-score form. First, we zero-padded the hypothetical contribution scores of every sequence on the left and right, yielding a matrix of dimensions 3 × 75,000 × 4 × 210. Second, for a core motif of length l, we computed the Pearson correlation coefficient between the hypothetical contribution scores of every possible subsequence of length l (matrices of size 75,000 × 4 × l) and the core motif’s hypothetical contribution scores in forward and reverse complement orientations. For each cell type dimension, we randomly sampled 500,000 Pearson correlation coefficients (arising from a single core motif) to obtain the value min(0.75, μ + 4σ) to serve as a coefficient threshold, where μ and σ represent the mean and the s.d., respectively, of the subsampled distribution. All subsequences for which the hypothetical contribution scores scored above their coefficient threshold were collected as motif hits for the given core motif. We repeated this process for all core motifs across all cell types.
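The scan can be sketched in plain Python for a single sequence and a single cell type. This is an illustrative sketch under stated assumptions, not the authors' implementation: scores are stored as length × 4 lists in A, C, G, T order, windows with zero variance (for example, all-zero padding) are given a correlation of 0, the better of the two orientations is kept per offset, and the population s.d. is used for σ. All function names are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def reverse_complement(motif):
    """Reverse positions and swap A<->T, C<->G (valid for A, C, G, T channel order)."""
    return [row[::-1] for row in motif[::-1]]


def flatten(rows):
    return [value for row in rows for value in row]


def scan(scores, motif):
    """Return, per offset, the larger of the forward and reverse-complement correlations."""
    length = len(motif)
    fwd, rev = flatten(motif), flatten(reverse_complement(motif))
    coefficients = []
    for start in range(len(scores) - length + 1):
        window = flatten(scores[start:start + length])
        coefficients.append(max(pearson(window, fwd), pearson(window, rev)))
    return coefficients


def coefficient_threshold(coefficients, cap=0.75, n_sd=4.0, sample_size=500_000, rng=None):
    """Compute min(cap, mu + n_sd * sigma) on a random subsample of coefficients."""
    rng = rng or random.Random()
    if len(coefficients) > sample_size:
        coefficients = rng.sample(coefficients, sample_size)
    mu = statistics.mean(coefficients)
    sigma = statistics.pstdev(coefficients)  # assumption: population s.d.
    return min(cap, mu + n_sd * sigma)
```

Offsets whose coefficient exceeds the threshold would then be collected as motif hits for that motif and cell type.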

Motifs embedded in random background

We embedded single motifs in random sequences to measure their standalone predicted effect compared to fully random sequences. For each motif, we built a 200 × 4 PPM consisting of the motif’s PPM in the middle and random background ([0.25, 0.25, 0.25, 0.25]) everywhere else. We sampled 5,000 sequences from it and fed them to Malinois to obtain predictions in each cell type. We also sampled 5,000 sequences from a 200 × 4 PPM of uniform background everywhere (no motif in the middle), and fed them to Malinois to serve as baseline.
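A minimal sketch of this construction is given below. The function names are hypothetical, and centring the motif by integer division of the padding is an assumption; the resulting sequences would then be passed to the model, which is not shown here.

```python
import random

BACKGROUND = [0.25, 0.25, 0.25, 0.25]


def embed_motif(motif_ppm, total_len=200):
    """Place a motif PPM in the middle of a uniform-background PPM."""
    pad = total_len - len(motif_ppm)
    if pad < 0:
        raise ValueError("motif is longer than the target length")
    left = pad // 2  # assumption: any odd leftover position goes to the right flank
    right = pad - left
    return (
        [list(BACKGROUND) for _ in range(left)]
        + [list(row) for row in motif_ppm]
        + [list(BACKGROUND) for _ in range(right)]
    )


def sample_sequences(ppm, n, rng):
    """Draw n sequences, sampling each position independently from its PPM row."""
    return [
        "".join(rng.choices("ACGT", weights=row)[0] for row in ppm)
        for _ in range(n)
    ]
```

Calling `sample_sequences(embed_motif(motif_ppm), 5000, random.Random(0))` would give the motif-containing set, and `sample_sequences(embed_motif([]), 5000, random.Random(0))` the uniform baseline.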

Motif ablation

We sought to assess the predicted effect of disrupting all instances of a single motif in our sequence library. For each motif, we collected the batch of sequences that had at least one instance of that motif, replaced all of the instances with random segments (sampled from uniform background), and fed them to Malinois to obtain predictions in each cell type. We performed this step five times, averaged the five predictions of each disrupted sequence and subtracted the batch’s original predicted activities from the average to obtain the predicted disrupting effect. For example, suppose that a sequence has one instance of a given motif in positions 20–32. We inserted a random sequence segment in those positions and obtained the disrupted sequence’s predictions. We did this five times, giving five different random segments (with five different predictions) in positions 20–32, and averaged the five predictions (to mildly marginalize potential effects of replacing with random segments). The disrupting effect is this average prediction minus the sequence’s original predicted activity. We aggregated the disrupting effects by motif presence (as defined above in the last paragraph of motif penalization in this section). To find instances of core motifs, we used the contribution-score-based motif scan described above. To find instances of the original TF-MoDISco patterns, we used FIMO (with the default parameters), as our contribution score-based motif scan might not handle gapped patterns as well as FIMO. When submitting the pattern PPMs to FIMO, we trimmed the patterns at both ends such that the start/stop of the pattern is the first/last position to have an information content of at least 0.15 bits.
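The ablation procedure and the PPM trimming can be sketched as follows. This is an illustration only: `predict` stands in for a single-cell-type Malinois prediction and is a hypothetical callable, motif instances are given as 0-based half-open (start, end) intervals, and the information content is computed against a uniform background without pseudocounts, which is an assumption.

```python
import math
import random


def disrupt(seq, instances, rng):
    """Replace each half-open (start, end) motif instance with uniform random bases."""
    bases = list(seq)
    for start, end in instances:
        for i in range(start, end):
            bases[i] = rng.choice("ACGT")
    return "".join(bases)


def ablation_effect(seq, instances, predict, n_repeats=5, rng=None):
    """Mean prediction over n_repeats random disruptions minus the original prediction."""
    rng = rng or random.Random()
    original = predict(seq)
    disrupted = [predict(disrupt(seq, instances, rng)) for _ in range(n_repeats)]
    return sum(disrupted) / n_repeats - original


def information_content(row):
    """Information content in bits of one PPM row (uniform background assumed)."""
    return 2.0 + sum(p * math.log2(p) for p in row if p > 0)


def trim_ppm(ppm, min_bits=0.15):
    """Trim both ends so the first and last rows have at least min_bits of information."""
    keep = [i for i, row in enumerate(ppm) if information_content(row) >= min_bits]
    if not keep:
        return []
    return ppm[keep[0]:keep[-1] + 1]
```

Note that interior low-information positions are retained by `trim_ppm`, so gapped patterns keep their spacing.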

Motif contributions

To get a motif’s overall contribution, we performed a weighted average of the contribution score sums contained in all of the motif instances provided by our motif hit method across the three prediction tasks. The average was weighted using the motif scores corresponding to the Pearson correlation coefficients mentioned above. The overall regulatory directionality of a motif (activator or repressor) is given by the sign of the mean of the weighted averages across cell types. For all motifs, the overall regulatory directionality agrees with the original TF-MoDISco designation as a positive or negative pattern.
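The weighting can be written out as a short sketch. The data layout (a list of (contribution sum, Pearson coefficient) pairs per cell type) and the function names are assumptions made for illustration.

```python
def weighted_contribution(hits):
    """Weighted average of contribution-score sums.

    hits: list of (contribution_sum, pearson_r) pairs for one motif in one cell type,
    where pearson_r is the motif-scan score used as the weight.
    """
    total_weight = sum(weight for _, weight in hits)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(score * weight for score, weight in hits) / total_weight


def directionality(per_cell_type):
    """Classify a motif from the sign of the mean weighted average across cell types."""
    mean = sum(per_cell_type.values()) / len(per_cell_type)
    return "activator" if mean > 0 else "repressor"
```

For example, two hits with contribution sums 2.0 and 1.0 and coefficients 0.8 and 0.2 give (2.0 × 0.8 + 1.0 × 0.2) / (0.8 + 0.2) = 1.8.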

Motif co-occurrence

We say that a pair of motifs co-occur whenever a sequence has at least one instance of each motif. By co-occurrence percentage of a motif pair, we mean the percentage of sequences in a given group in which the motif pair co-occurs.
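As a small illustration of this definition, the sketch below computes the co-occurrence percentage from a hypothetical representation in which each sequence of a group is described by the set of motif IDs that have at least one hit in it.

```python
def cooccurrence_percentage(motif_sets, motif_a, motif_b):
    """Percentage of sequences in a group containing at least one hit of both motifs.

    motif_sets: one set of motif IDs per sequence in the group (hypothetical layout).
    """
    if not motif_sets:
        return 0.0
    both = sum(1 for motifs in motif_sets if motif_a in motifs and motif_b in motifs)
    return 100.0 * both / len(motif_sets)
```

The number of hits per sequence does not matter here; only the presence of each motif is considered.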

NMF analysis of motif programs

We used NMF, a parts-based representation of data74, to model semantic relationships between motifs in our sequence library (scikit-learn v.1.2.2, initialized with NNDSVDAR, Frobenius loss). First, we counted motif matches in each sequence with the contribution score-based motif scan described above88 to generate \(X \in \mathbb{N}^{n \times f}\), where rows represent sequences in the library and columns correspond to motifs. The sample matrix X can then be decomposed into the coefficients and features matrices \(W \in \mathbb{R}^{n \times k}\) and \(H \in \mathbb{R}^{k \times f}\), respectively. These k-dimensional representations are referred to as ‘topics’ in natural language processing and ‘programs’ in gene expression analysis89,90. These programs capture the frequency of TF motifs appearing in semantically similar CREs, and the CREs themselves are modelled as compositions of programs. We tested decomposing sequences into \(k \in [8, 28]\) programs using bi-cross-validation91 and identified an ‘elbow’ in the reconstruction error at k = 12 (data not shown). When plotting the coefficient matrix for comparative analysis, we normalized the coefficient matrix such that the rows sum to 1. We quantified the function of each decomposed program by calculating a weighted average of motif contributions (see the ‘Motif contributions’ section above) for each program using the motif weights in the features matrix. Motif contributions were clipped to an upper bound of 3 to mitigate the impact of extreme outliers.
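The two post-processing steps applied to the factor matrices can be sketched in plain Python. The sketch takes W and H as given (nested lists) and does not perform the factorization itself; the function names are hypothetical, and leaving all-zero rows unchanged is an assumption.

```python
def normalize_rows(W):
    """Scale each row of the coefficient matrix W so that it sums to 1."""
    normalized = []
    for row in W:
        total = sum(row)
        # Assumption: rows that sum to zero are left unchanged.
        normalized.append([value / total for value in row] if total > 0 else list(row))
    return normalized


def program_functions(H, motif_contributions, upper=3.0):
    """Weighted average of clipped motif contributions for each program (row of H)."""
    clipped = [min(c, upper) for c in motif_contributions]
    functions = []
    for weights in H:
        total = sum(weights)
        if total > 0:
            functions.append(sum(w * c for w, c in zip(weights, clipped)) / total)
        else:
            functions.append(0.0)
    return functions
```

With H = [[1, 1], [0, 2]] and motif contributions [5, -1], the first contribution is clipped to 3, giving program functions of 1.0 and -1.0.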

MPRA saturation mutagenesis plot

The saturation mutagenesis study (Supplementary Table 10) of the sequence in Fig. 4g consisted of empirically testing the activity of all 600 possible single-nucleotide variants of the sequence (3 variants per position, 200 positions). We followed an identical protocol to the previous MPRAs in SK-N-SH cells with this saturation mutagenesis library. We visualized the effect of each variant as the variant sequence’s activity minus the activity of the original sequence, resulting in the lollipops in Fig. 4h. The mean variant effect is represented in the height of the logo sequence letters but in the opposite direction.
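The variant enumeration and the effect calculation can be sketched as follows. The data layout (activities keyed by (position, alternative base)) and the function names are hypothetical and chosen only for illustration.

```python
def single_nucleotide_variants(seq):
    """List every (position, ref, alt, variant_sequence) single-base substitution."""
    variants = []
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                variants.append((pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]))
    return variants


def variant_effects(ref_activity, variant_activities):
    """Effect of each variant: variant activity minus original-sequence activity."""
    return {key: activity - ref_activity for key, activity in variant_activities.items()}


def logo_heights(effects):
    """Per-position letter height: the mean variant effect with its sign flipped."""
    by_position = {}
    for (pos, _alt), effect in effects.items():
        by_position.setdefault(pos, []).append(effect)
    return {pos: -sum(values) / len(values) for pos, values in by_position.items()}
```

A 200 bp sequence yields 200 × 3 = 600 variants, and a position at which every substitution lowers activity receives a positive letter height.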

CODA MPRA

MPRA library construction

The CODA MPRA library was constructed according to previously described protocols8. In brief, oligos were synthesized (Twist Bioscience) as 230 bp sequences containing 200 bp of genomic sequences and 15 bp of adaptor sequence on either end. The oligo library was PCR amplified with primers MPRA_v3_F and MPRA_v3_20I_R to add unique 20 bp barcodes along with arms for Gibson assembly into a backbone vector. The oligonucleotide library was assembled into pMPRAv3:∆luc:∆xbaI (Addgene plasmid, 109035) and expanded by electroporation into Escherichia coli. Seven of the ten expanded cultures were purified using the Qiagen Plasmid Plus Midi Kit to reach 200–300 colony-forming units (barcodes) per oligonucleotide. The expanded plasmid library was sequenced on the Illumina NovaSeq system using 2 × 150 bp chemistry to acquire oligo–barcode pairings. The library underwent AsiSI restriction digestion, and GFP with a minimal promoter, amplified from pMPRAv3:minP-GFP (Addgene plasmid, 109036) using primers MPRA_v3_GFP_Fusion_F and MPRA_v3_GFP_Fusion_R, was inserted by Gibson assembly, resulting in the 200 bp oligo sequence being positioned directly upstream of the promoter and the 20 bp barcode falling in the 3′ UTR of GFP. Finally, the library was expanded within E. coli and purified using the Qiagen Plasmid Plus Giga Kit.

MPRA library transfection into cells

All cell culture and transfection conditions followed previously established protocols27. For each of the three cell types, K562, SK-N-SH and HepG2, we collected two hundred million cells for transfections using the Neon Transfection System 100 μl Kit with 5 μg or 10 μg of the MPRA library per 10 million cells. Cells were collected 24 h after transfection, rinsed with PBS and collected by centrifugation. After adding RLT buffer (RNeasy Maxi kit), dithiothreitol and homogenization, cell pellets were frozen at −80 °C until further processing. For each cell type, three biological replicates were performed on different days. All cell lines were acquired from ATCC, authenticated using genotyping and gene expression signatures, and routinely tested for Mycoplasma and other common contaminants by The Jackson Laboratory’s Molecular Diagnostic Laboratory.

RNA isolation and MPRA RNA library generation

RNA was extracted from frozen cell homogenates using the Qiagen RNeasy Maxi kit. After DNase treatment, a mixture of three GFP-specific biotinylated primers was used to capture GFP transcripts using Sera Mag Beads (Thermo Fisher Scientific). After a second round of DNase treatment, cDNA was synthesized using SuperScript III (Life Technologies) and the GFP mRNA abundance was quantified using quantitative PCR (qPCR) to determine the cycle at which linear amplification begins for each replicate. Replicates were diluted to approximately the same concentration based on the qPCR results, and a first round of PCR (8 or 9 cycles) with primers MPRA_Illumina_GFP_F_v2 and Ilmn_P5_1stPCR_v2 was used to amplify barcodes associated with GFP mRNA sequences for each replicate. A second round of PCR (6 cycles) was used to add Illumina sequencing adaptors to the replicates. The resulting Illumina indexed MPRA barcode libraries were sequenced on the Illumina NovaSeq system using 1 × 20 bp chemistry.

CRE prioritization for in vivo validation

Enformer analysis of epigenetic signatures

To simulate epigenetic and gene expression signatures in silico we collected the nucleotide sequence from chromosome 11: 3101137–3493091 of the mouse reference genome (mm10). The expected insertion sequence using an H11 targeting vector with a lacZ:P2A:GFP open reading frame was added. As a control, the expected CRE insertion site was simulated as a 200 nucleotide sequence of N. We simulated all possible CRE insertions corresponding to our cell-type-specific MPRA by replacing the oligo-N sequence with 200-mers from our library. We inferred epigenetic signatures for all of these sequences using Enformer by modifying the notebook available online (https://colab.research.google.com/github/deepmind/deepmind_research/blob/master/enformer/enformer-usage.ipynb). To estimate CRE-induced transcriptional activation in various tissues, we collected 128 nucleotide resolution DHS, H3K27ac, ATAC and CAGE datasets overlapping the expected insertion (35 bins). To calculate an aggregate effect for each tissue, we calculated the maximum signal for each feature over the insertion, followed by a feature-specific Yeo–Johnson power transformation. Normalized features were then selected based on tissue correspondence (Supplementary Table 8) and averaged to estimate CRE activity in ten different tissues. We calculated MinGap values for the spleen, liver and brain using these ten measurements for each CRE.

Manual sequence prioritization

Sequences were prioritized on the basis of review of empirical MPRA measurements, contribution scores, motif matches, sequence content and predicted epigenetic signatures. We looked for sequences that displayed a high separation between the MPRA measures of the target and the off-target cell types. We also looked to capture variations of combinations of motif matches, and we used the contribution scores to visually examine the motif matches and other potentially important sequence content and motif organization. Finally, we selected sequences with at least moderate tissue specificity in predicted epigenetic signatures.

Transgenics

Transient zebrafish synthetic enhancer assay

To build the synthetic CRE eGFP reporter, double-stranded oligonucleotides corresponding to synthetic CREs (200 bp) were synthesized by IDT (GeneBlock). Synthetic CREs were amplified by PCR with primers that included homology to the plasmid vector E1b-GFP-Tol2 (Addgene plasmid, 37845)75 and were cloned upstream of the minimal promoter (E1b) to generate the synthetic enhancer eGFP plasmid reporter (pTol2-synthetic CRE-E1b-eGFP-Tol2) using HiFi DNA Assembly according to the manufacturer’s instructions (New England Biolabs). We also created ‘empty vectors’, which were identical to CODA CRE vectors except for the lack of a 200 bp insert. Reporter plasmid sequences were verified by Sanger sequencing. To transiently express the synthetic CRE reporter in zebrafish, plasmids were co-injected with tol2 transposase mRNA into one-cell-stage zebrafish embryos according to established methods92. A minimum of 15 one-cell-stage zebrafish embryos of either sex were injected per construct. Injected embryos were imaged at the indicated days (2 or 4 days after fertilization) using either a dissecting (Olympus) or a confocal fluorescence (Leica SP8) microscope. Injected embryos were not randomized, and researchers were not blinded. All zebrafish procedures were approved by the Yale University Institutional Animal Care and Use Committee (2022-20274).

Mouse transgenic reporter assay

An H11 targeting vector with a lacZ:P2A:GFP open reading frame was linearized using PCR containing 2 ng of template, 1 μl of KOD Xtreme Hot Start DNA Polymerase (Sigma-Aldrich, 71975), 25 μl of Xtreme buffer and 0.5 μM forward and reverse primers (H11_bxb_lacZ:GFP_lin_F, pGL_minP_GFP_R; Supplementary Table 11) cycled with the following conditions: 94 °C for 2 min; 20 cycles of 98 °C for 10 s, 56 °C for 30 s and 68 °C for 13 min; and then 68 °C for 5 min. Amplified fragments were treated with 0.5 μl of DpnI (NEB, R0176S) for 30 min at 37 °C, purified using 1× volume of AMPure XP (Beckman Coulter, A63881) and eluted with water. Double-stranded oligonucleotides corresponding to synthetic enhancers with Gibson arms were synthesized by IDT (GeneBlock) and assembled into the targeting vector using 5 μl of NEBuilder HiFi DNA Assembly Master Mix (NEB, E2621S), 36 ng of linearized vector and 10 ng of the synthesized fragment in a total volume of 20 μl for 45 min at 50 °C. Transgenic mice were created according to the enSERT protocol76. A mixture of 20 ng μl−1 Cas9 protein (IDT, 1074181), 50 ng μl−1 single guide RNA (sgRNA_H11lacZ; Supplementary Table 11), 25 ng μl−1 donor plasmid, 10 mM Tris, pH 7.5, and 0.1 mM EDTA was injected into the pronuclei of FVB zygotes. Each group was tested with a predetermined sample size of 3 l and all of the samples were stained regardless of their genotype and sex. Embryos were collected and stained blindly with respect to their genotype. Whole embryos at embryonic day 14.5 or isolated brains at 5 weeks postnatal were fixed at 4 °C for 1 h in PBS supplemented with 2% paraformaldehyde, 0.2% glutaraldehyde and 0.2% IGEPAL CA-630. After washing with PBS, the embryos were stained at 37 °C overnight in PBS supplemented with 0.5 mg ml−1 X-gal (Sigma-Aldrich, B4252), 5 mM potassium hexacyanoferrate(ii) trihydrate, 5 mM potassium hexacyanoferrate(iii), 2 mM MgCl2 and 0.2% IGEPAL CA-630. 
The images were taken using the Leica M165 system for embryos or the Leica M125 system for brains. All mice were housed in duplexed pens containing five or fewer mice and under a 12 h–12 h light–dark cycle at 18–23 °C with 40–60% humidity. All mouse procedures were performed in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals, and were approved by the Institutional Animal Care and Use Committees of The Jackson Laboratory (18038).

Histology and immunofluorescence staining

After LacZ staining, mouse brains were sectioned with a vibratome (Leica VT100s) and free-floating 70-µm-thick sagittal sections were collected in ice-cold PBS. The sections were then rinsed in 1× PBS for 5 min and incubated for 30 min in a blocking solution consisting of 0.3% Triton X-100, 0.3% mouse on mouse blocking reagent (Vector laboratories, MKB-2213-1), 10% normal goat serum (Abcam, ab7481) and 5% BSA in 1× PBS with gentle agitation at room temperature. Immunostaining was then performed with a mixture of primary antibodies in the blocking solution at 4 °C on a shaker overnight. The sections were rinsed in 1× PBS three times for 5 min each and then incubated with corresponding fluorescence conjugated secondary antibodies for 2 h. After treatment with secondary antibodies, the slices were then further rinsed with PBS three times, followed by staining for nuclei with DAPI (Thermo Fisher Scientific, 62248). The sections were mounted onto slides with Prolong Gold antifade reagent (Cell Signalling Technology, 9071). The following primary antibodies were used during the staining procedure: mouse anti-NeuN (Abcam, ab104224), chicken anti-GFAP (OriGene Technologies, TA309150), rabbit anti-IBA1 (Abcam, ab178846). Secondary antibodies used were as follows: goat anti-mouse Alexa Fluor 488 (Thermo Fisher Scientific, AB_2534069), goat anti-chicken Alexa Fluor 568 (Thermo Fisher Scientific, AB_2534098), goat anti-rabbit Alexa Fluor 568 (Abcam, ab175471). All primary and secondary antibodies were used at 1:500 dilutions. Image acquisition for whole-brain sagittal slice mosaic images was performed using the Thunder Imager (Leica Microsystems) system using a ×10/0.8 NA dry lens. Fluorescence imaging was combined with bright-field imaging to visualize LacZ staining. Computational tissue clearing was applied systematically to reduce background noise (Leica acquisition software). 
After obtaining mosaic scans, higher-magnification images of regions of interest (ROIs) were acquired on the Stellaris 8 (Leica Microsystems) equipped with a Diode, Ar gas and He/Ne adjustable wavelength lasers using ×40/1.2 NA and ×63/1.4 NA oil objectives for quantification and representative images, respectively. The pinhole size was set to 1 a.u. and the samples were illuminated with 405, 488, 561 and 633 nm lasers sequentially. Six-micrometre z-stack images with a 2 µm z-step size and with a 4,096 × 4,096 pixel resolution were acquired using HyD detectors with a line average of 3. Fluorescent LacZ staining was visualized using the confocal microscope using the 633 nm laser93. For the representative images shown, bright outliers were removed using the default 2-pixel radius and 20 threshold. A Gaussian blur was then applied with a sigma radius of 1.

LacZ layer intensity analysis

Acquired mosaic bright-field images underwent auto-thresholding using the default algorithm in the FIJI software (NIH). LacZ signal intensity was quantified using the plot profile tool with ROIs drawn from superficial cortical layers down to the corpus callosum. Depth information for cortical layers was acquired from the Allen Brain atlas. Multiple ROIs were taken in different cortical areas to verify the distribution of the signal. Representative images are ROIs taken from the somatosensory and visual cortices. For cell quantification and overlap analysis, maximum-intensity projections of the confocal z-stacks were generated in FIJI, and background removal was applied with a rolling ball radius of 50. The images were then auto-thresholded using the Moments algorithm. The signal-to-noise ratio was uniform across ROIs, and a single thresholding algorithm yielded reproducible results. Cells were then quantified using the Analyze Particles function. By varying the particle size, accurate quantification of neurons, astrocytes and microglia was achieved. To calculate the overlap between LacZ expression and the cell-type-specific markers, each binarized LacZ image was multiplied with the corresponding binarized neuronal, astrocytic and microglial ROIs, and the residual signals were quantified using the Analyze Particles function. In total, five sagittal slices were analysed per mouse and a total of n = 3 mice was used for both controls and LacZ-positive brains.

RNA-seq analysis

Three replicates each from transgenic mice carrying the CODA-designed SK-N-SH-specific CRE and the empty vector were collected at 5 weeks postnatally. The liver, spleen and the right half of the brain were soaked in RNAlater (Thermo Fisher Scientific) overnight at 4 °C and homogenized in QIAzol, followed by total RNA isolation using the RNeasy mini kit (Qiagen) with on-column DNase treatment. The RNA-seq library was generated from 1 µg of total RNA using the NEBNext Ultra II RNA Library Prep Kit for Illumina (NEB) and NEBNext Poly(A) mRNA Magnetic Isolation Module (NEB) according to the manufacturer’s protocol. The libraries were indexed using i7 and i5 primers with the following conditions: 98 °C for 30 s; 10 cycles of 98 °C for 10 s, 65 °C for 75 s; then 65 °C for 5 min. Indexed samples were purified using 0.9× volume of AMPure XP, eluted in 20 µl of EB, pooled equimolarly and sequenced using 2 × 150 bp chemistry on the Illumina NovaSeq X+ instrument at the Jackson Laboratory. The sequencing reads were mapped onto a modified mouse genome (GRCm38/mm10) with the lacZ-GFP sequence as an additional chromosome using STAR94 (v.2.5.2b). After removing duplicates using picard MarkDuplicates (MIT, v.3.1.1), the mapped reads were counted using featureCounts (v.2.0.6, options: -p -B -Q 20 -T 16 -s 2 --countReadPairs). DESeq2 (v.1.32.0)95 was used to normalize the read counts and calculate the log2[FC], standard error and Wald-test P values.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


