usage: trill [-h] [--nodes NODES] [--logger LOGGER] [--profiler]
             [--RNG_seed RNG_SEED] [--outdir OUTDIR] [--n_workers N_WORKERS]
             name GPUs
             {embed,finetune,inv_fold_gen,lang_gen,diff_gen,classify,fold,visualize,simulate,dock,utils}
             ...

Positional Arguments

name

Name of run

GPUs

Input total number of GPUs per node

Default: 1

command

Possible choices: embed, finetune, inv_fold_gen, lang_gen, diff_gen, classify, fold, visualize, simulate, dock, utils

Named Arguments

--nodes

Input total number of nodes. Default is 1

Default: 1

--logger

Enable the TensorBoard logger. Disabled by default

Default: False

--profiler

Utilize PyTorchProfiler

Default: False

--RNG_seed

Input RNG seed. Default is 123

Default: 123

--outdir

Input full path to directory where you want the output from TRILL

Default: “.”

--n_workers

Change the number of CPU cores (‘workers’) TRILL uses

Default: 1
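
The usage line at the top places the top-level options first, then the run name, the GPU count, and finally the sub-command with its own arguments. A minimal Python sketch of assembling a command in that order follows. The helper `build_trill_command`, the run name, and the file names are hypothetical and only illustrate the ordering; the helper builds the argument list and does not run TRILL.

```python
import shlex


def build_trill_command(name, gpus, command, command_args, **global_opts):
    """Assemble an argv list in the order shown by the usage line.

    Hypothetical helper for illustration only.
    """
    argv = ["trill"]
    # Top-level options (e.g. --outdir, --n_workers) go before the positionals.
    for opt, value in global_opts.items():
        if value is True:  # switch-style flags such as --profiler
            argv.append(f"--{opt}")
        else:
            argv.extend([f"--{opt}", str(value)])
    # Positionals: run name, GPUs per node, then the sub-command.
    argv.extend([name, str(gpus), command])
    # Everything after the sub-command belongs to that sub-command.
    argv.extend(command_args)
    return argv


cmd = build_trill_command(
    "my_run", 1, "embed",
    ["esm2_t12_35M", "query.fasta", "--avg"],
    outdir="results", n_workers=4,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems with paths that contain spaces.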

Sub-commands

embed

Embed proteins of interest

trill embed [-h] [--batch_size BATCH_SIZE] [--finetuned FINETUNED] [--per_AA]
            [--avg]
            {esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,Ankh,Ankh-Large}
            query

Positional Arguments

model

Possible choices: esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, Ankh, Ankh-Large

Choose protein language model to embed query proteins

query

Input protein fasta file

Named Arguments

--batch_size

Change batch size for embedding proteins. Default is 1; with more RAM you can use a larger value

Default: 1

--finetuned

Input path to your own finetuned ESM model

Default: False

--per_AA

Add this flag to return the per amino acid representations.

Default: False

--avg

Add this flag to return the average, whole sequence representation.

Default: False

finetune

Finetune protein language models

trill finetune [-h] [--epochs EPOCHS] [--save_on_epoch] [--lr LR]
               [--batch_size BATCH_SIZE] [--strategy STRATEGY]
               [--ctrl_tag CTRL_TAG] [--finetuned FINETUNED]
               {esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtGPT2,ZymCTRL}
               query

Positional Arguments

model

Possible choices: esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtGPT2, ZymCTRL

Choose the protein language model to finetune. Note that ESM2 is trained with the MLM objective, while ProtGPT2/ZymCTRL are trained with the CLM objective.

query

Input fasta file

Named Arguments

--epochs

Number of epochs for fine-tuning. Default is 10

Default: 10

--save_on_epoch

Save a checkpoint after every completed epoch. WARNING: this can consume storage quickly

Default: False

--lr

Learning rate for optimizer. Default is 0.0001

Default: 0.0001

--batch_size

Change batch-size number for fine-tuning. Default is 1

Default: 1

--strategy

Change training strategy. Default is None. List of strategies can be found at https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html

--ctrl_tag

ZymCTRL: Choose an Enzyme Commission (EC) control tag for finetuning ZymCTRL. Note that the tag must match all of the enzymes in the query fasta file. You can find all ECs at https://www.brenda-enzymes.org/index.php

--finetuned

Input path to your previously finetuned model to continue finetuning

Default: False

inv_fold_gen

Generate proteins using inverse folding

trill inv_fold_gen [-h] [--temp TEMP]
                   [--num_return_sequences NUM_RETURN_SEQUENCES]
                   [--max_length MAX_LENGTH] [--top_p TOP_P]
                   [--repetition_penalty REPETITION_PENALTY] [--dont_sample]
                   [--mpnn_model MPNN_MODEL] [--save_score SAVE_SCORE]
                   [--save_probs SAVE_PROBS] [--score_only SCORE_ONLY]
                   [--path_to_fasta PATH_TO_FASTA]
                   [--conditional_probs_only CONDITIONAL_PROBS_ONLY]
                   [--conditional_probs_only_backbone CONDITIONAL_PROBS_ONLY_BACKBONE]
                   [--unconditional_probs_only UNCONDITIONAL_PROBS_ONLY]
                   [--backbone_noise BACKBONE_NOISE] [--batch_size BATCH_SIZE]
                   [--pdb_path_chains PDB_PATH_CHAINS]
                   [--chain_id_jsonl CHAIN_ID_JSONL]
                   [--fixed_positions_jsonl FIXED_POSITIONS_JSONL]
                   [--omit_AAs OMIT_AAS] [--bias_AA_jsonl BIAS_AA_JSONL]
                   [--bias_by_res_jsonl BIAS_BY_RES_JSONL]
                   [--omit_AA_jsonl OMIT_AA_JSONL] [--pssm_jsonl PSSM_JSONL]
                   [--pssm_multi PSSM_MULTI] [--pssm_threshold PSSM_THRESHOLD]
                   [--pssm_log_odds_flag PSSM_LOG_ODDS_FLAG]
                   [--pssm_bias_flag PSSM_BIAS_FLAG]
                   [--tied_positions_jsonl TIED_POSITIONS_JSONL]
                   {ESM-IF1,ProteinMPNN,ProstT5} query

Positional Arguments

model

Possible choices: ESM-IF1, ProteinMPNN, ProstT5

Select which model to generate proteins using inverse folding.

query

Input pdb file for inverse folding

Named Arguments

--temp

Choose sampling temperature.

Default: “1”

--num_return_sequences

Choose number of proteins to generate.

Default: 1

--max_length

Max length of proteins generated, default is 500 AAs

Default: 500

--top_p

ProstT5: If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Default is 1

Default: 1
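
The top_p description above corresponds to what is usually called nucleus sampling. A minimal pure-Python sketch of the filtering step follows, using toy probabilities; the function name `top_p_filter` is hypothetical, and how the underlying model library implements the filter is not specified in this reference.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest set of tokens whose probabilities sum to >= top_p."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break  # the kept set now covers at least top_p of the mass
    return kept


probs = [0.5, 0.25, 0.125, 0.125]  # toy distribution over four tokens
print(top_p_filter(probs, 0.7))    # keeps tokens 0 and 1
print(top_p_filter(probs, 1.0))    # keeps every token
```

With top_p at its default of 1, no token is removed, so the setting only has an effect for values below 1.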

--repetition_penalty

ProstT5: The parameter for repetition penalty. 1.0 means no penalty, the default is 1.2

Default: 1.2

--dont_sample

ProstT5: By default, the model will sample to generate the protein. With this flag, you can enable greedy decoding, where the most probable tokens will be returned.

Default: True

--mpnn_model

ProteinMPNN: v_48_002, v_48_010, v_48_020, or v_48_030. For example, v_48_010 is the version with 48 edges and 0.10 Å noise

Default: “v_48_020”

--save_score

ProteinMPNN: 0 for False, 1 for True; save score=-log_prob to npy files

Default: 0

--save_probs

ProteinMPNN: 0 for False, 1 for True; save MPNN predicted probabilities per position

Default: 0

--score_only

ProteinMPNN: 0 for False, 1 for True; score input backbone-sequence pairs

Default: 0

--path_to_fasta

ProteinMPNN: score provided input sequence in a fasta format; e.g. GGGGGG/PPPPS/WWW for chains A, B, C sorted alphabetically and separated by /

Default: “”
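
The chain string in the example above can be produced from per-chain sequences. A short sketch, assuming the sequences are held in a dictionary keyed by chain ID (the helper name `chains_to_mpnn_string` is hypothetical):

```python
def chains_to_mpnn_string(chains):
    """Join per-chain sequences with '/', ordering chains alphabetically by ID."""
    return "/".join(chains[chain_id] for chain_id in sorted(chains))


# Chain IDs given out of order on purpose; the output is sorted A, B, C.
chains = {"C": "WWW", "A": "GGGGGG", "B": "PPPPS"}
print(chains_to_mpnn_string(chains))  # GGGGGG/PPPPS/WWW
```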

--conditional_probs_only

ProteinMPNN: 0 for False, 1 for True; output conditional probabilities p(s_i given the rest of the sequence and backbone)

Default: 0

--conditional_probs_only_backbone

ProteinMPNN: 0 for False, 1 for True; if true output conditional probabilities p(s_i given backbone)

Default: 0

--unconditional_probs_only

ProteinMPNN: 0 for False, 1 for True; output unconditional probabilities p(s_i given backbone) in one forward pass

Default: 0

--backbone_noise

ProteinMPNN: Standard deviation of Gaussian noise to add to backbone atoms

Default: 0.0

--batch_size

ProteinMPNN: Batch size; can be set higher for Titan or Quadro GPUs. Reduce this if running out of GPU memory

Default: 1

--pdb_path_chains

ProteinMPNN: Define which chains need to be designed for a single PDB

Default: “”

--chain_id_jsonl

ProteinMPNN: Path to a dictionary specifying which chains need to be designed and which ones are fixed. If not specified, all chains will be designed.

Default: “”

--fixed_positions_jsonl

ProteinMPNN: Path to a dictionary with fixed positions

Default: “”

--omit_AAs

ProteinMPNN: Specify which amino acids should be omitted in the generated sequence, e.g. ‘AC’ would omit alanine and cysteine.

Default: X

--bias_AA_jsonl

ProteinMPNN: Path to a dictionary which specifies AA composition bias if needed, e.g. {A: -1.1, F: 0.7} would make A less likely and F more likely.

Default: “”
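
The bias dictionary is passed as a file path. A minimal sketch of writing one follows, assuming the file holds a single JSON object on one line; the file name `bias_AA.jsonl` and the temporary directory are hypothetical. Note that valid JSON needs quoted keys, unlike the shorthand in the description above.

```python
import json
import tempfile
from pathlib import Path

# Negative values make a residue less likely, positive values more likely.
bias = {"A": -1.1, "F": 0.7}

# Assumption: one JSON object on a single line; written to a temp directory here.
bias_path = Path(tempfile.mkdtemp()) / "bias_AA.jsonl"
bias_path.write_text(json.dumps(bias) + "\n")

print(bias_path.read_text())
```

The path printed by `bias_path` is what would then be supplied to --bias_AA_jsonl.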

--bias_by_res_jsonl

ProteinMPNN: Path to dictionary with per position bias.

Default: “”

--omit_AA_jsonl

ProteinMPNN: Path to a dictionary which specifies which amino acids need to be omitted from design at specific chain indices

Default: “”

--pssm_jsonl

ProteinMPNN: Path to a dictionary with pssm

Default: “”

--pssm_multi

ProteinMPNN: A value in [0.0, 1.0]. 0.0 means do not use the PSSM; 1.0 means ignore MPNN predictions

Default: 0.0

--pssm_threshold

ProteinMPNN: A value between -inf and +inf used to restrict per-position AAs

Default: 0.0

--pssm_log_odds_flag

ProteinMPNN: 0 for False, 1 for True

Default: 0

--pssm_bias_flag

ProteinMPNN: 0 for False, 1 for True

Default: 0

--tied_positions_jsonl

ProteinMPNN: Path to a dictionary with tied positions

Default: “”

lang_gen

Generate proteins using large language models

trill lang_gen [-h] [--finetuned FINETUNED] [--esm2_arch ESM2_ARCH]
               [--temp TEMP] [--ctrl_tag CTRL_TAG] [--batch_size BATCH_SIZE]
               [--seed_seq SEED_SEQ] [--max_length MAX_LENGTH]
               [--do_sample DO_SAMPLE] [--top_k TOP_K]
               [--repetition_penalty REPETITION_PENALTY]
               [--num_return_sequences NUM_RETURN_SEQUENCES] [--random_fill]
               [--num_positions NUM_POSITIONS]
               {ESM2,ProtGPT2,ZymCTRL}

Positional Arguments

model

Possible choices: ESM2, ProtGPT2, ZymCTRL

Choose desired language model

Named Arguments

--finetuned

Input path to your own finetuned model

Default: False

--esm2_arch

ESM2_Gibbs: Choose which ESM2 architecture your finetuned model is

Default: “esm2_t12_35M_UR50D”

--temp

Choose sampling temperature.

Default: “1”

--ctrl_tag

ZymCTRL: Choose an Enzyme Commission (EC) control tag for conditional protein generation based on the tag. You can find all ECs at https://www.brenda-enzymes.org/index.php

--batch_size

Change batch-size number to modulate how many proteins are generated at a time. Default is 1

Default: 1

--seed_seq

Sequence to seed generation, the default is M.

Default: “M”

--max_length

Max length of proteins generated, default is 100

Default: 100

--do_sample

ProtGPT2/ZymCTRL: Whether or not to use sampling for generation; use greedy decoding otherwise

Default: True

--top_k

The number of highest probability vocabulary tokens to keep for top-k-filtering

Default: 950
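
Top-k filtering keeps a fixed number of the most probable tokens and samples only from those. A minimal pure-Python sketch with toy probabilities follows; the function name `top_k_filter` is hypothetical and the renormalisation step is an assumption about typical implementations, not something this reference states.

```python
def top_k_filter(probs, top_k):
    """Keep the top_k most probable tokens and renormalise their probabilities."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:top_k]
    total = sum(probs[i] for i in kept)
    # Map each surviving token index to its renormalised probability.
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.25, 0.125, 0.125]  # toy distribution over four tokens
filtered = top_k_filter(probs, 2)
print(filtered)
```

When top_k is at least as large as the vocabulary, every token survives and the distribution is unchanged.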

--repetition_penalty

ProtGPT2/ZymCTRL: The parameter for repetition penalty, the default is 1.2. 1.0 means no penalty

Default: 1.2

--num_return_sequences

Number of sequences to generate. Default is 1

Default: 1

--random_fill

ESM2_Gibbs: Randomly select positions to fill each iteration for Gibbs sampling with ESM2. If this flag is not passed, the positions are filled in order

Default: True

--num_positions

ESM2_Gibbs: Generate new AAs for this many positions each iteration for Gibbs sampling with ESM2. If 0, then generate for all target positions each round.

Default: 0

diff_gen

Generate proteins using RFDiffusion

trill diff_gen [-h] [--contigs CONTIGS]
               [--RFDiffusion_Override RFDIFFUSION_OVERRIDE]
               [--num_return_sequences NUM_RETURN_SEQUENCES]
               [--Inpaint INPAINT] [--query QUERY] [--partial_T PARTIAL_T]
               [--partial_diff_fix PARTIAL_DIFF_FIX] [--hotspots HOTSPOTS]

Named Arguments

--contigs

Generate proteins between these sizes in AAs for RFDiffusion. For example, --contigs 100-200 will result in proteins in this range

--RFDiffusion_Override

Change RFDiffusion model. For example, --RFDiffusion_Override ActiveSite will use ActiveSite_ckpt.pt for holding small motifs in place.

Default: False

--num_return_sequences

Number of sequences for RFDiffusion to generate. Default is 5

Default: 5

--Inpaint

Residues to inpaint.

--query

Input pdb file for motif scaffolding, partial diffusion etc.

--partial_T

Adjust partial diffusion sampling value.

--partial_diff_fix

Pass the residues that you want to keep fixed for your input pdb during partial diffusion. Note that the residues should be 0-indexed.

--hotspots

Define residues that the binder must interact with. For example, --hotspots A30,A33,A34, where A is the chain and the numbers are the residue indices.
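
A hotspot string combines a chain letter and a residue index per entry. A small sketch of splitting such a string into (chain, index) pairs, which can be useful for validating input before a run; the helper `parse_hotspots` is hypothetical and assumes single-letter chain IDs as in the example above.

```python
import re


def parse_hotspots(spec):
    """Split a string such as 'A30,A33,A34' into (chain, residue_index) pairs."""
    pairs = []
    for token in spec.split(","):
        # Assumption: one chain letter followed by an integer residue index.
        match = re.fullmatch(r"([A-Za-z])(\d+)", token.strip())
        if match is None:
            raise ValueError(f"unrecognised hotspot: {token!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs


print(parse_hotspots("A30,A33,A34"))  # [('A', 30), ('A', 33), ('A', 34)]
```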

classify

Classify proteins using either pretrained classifiers or train/test your own.

trill classify [-h] [--key KEY] [--save_emb]
               [--emb_model {esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,Ankh,Ankh-Large}]
               [--train_split TRAIN_SPLIT] [--preTrained PRETRAINED]
               [--preComputed_Embs PRECOMPUTED_EMBS] [--batch_size BATCH_SIZE]
               [--xg_gamma XG_GAMMA] [--xg_lr XG_LR]
               [--xg_max_depth XG_MAX_DEPTH] [--xg_reg_alpha XG_REG_ALPHA]
               [--xg_reg_lambda XG_REG_LAMBDA]
               [--if_contamination IF_CONTAMINATION]
               [--n_estimators N_ESTIMATORS] [--sweep] [--sweep_cv SWEEP_CV]
               [--f1_avg_method {macro,weighted,micro,None}]
               {TemStaPro,EpHod,XGBoost,iForest} query

Positional Arguments

classifier

Possible choices: TemStaPro, EpHod, XGBoost, iForest

Predict thermostability/optimal enzymatic pH using TemStaPro/EpHod, or train/use your own XGBoost or Isolation Forest classifier. Note that for training XGBoost, your query should contain roughly equal numbers of sequences from each class.

query

Fasta file of sequences to score

Named Arguments

--key

Input a CSV, with your class mappings for your embeddings where the first column is the label and the second column is the class.
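
A minimal sketch of writing such a key file with Python's `csv` module follows. The sequence labels, class values, file name, and the header names `Label` and `Class` are all assumptions for illustration; only the column order (label first, class second) comes from the description above.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sequence labels (e.g. fasta record IDs) and their classes.
rows = [("seq_1", 0), ("seq_2", 0), ("seq_3", 1)]

key_path = Path(tempfile.mkdtemp()) / "key.csv"
with key_path.open("w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Label", "Class"])  # header names are an assumption
    writer.writerows(rows)               # first column label, second column class

print(key_path.read_text())
```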

--save_emb

Save csv of ProtT5 embeddings

Default: False

--emb_model

Possible choices: esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, Ankh, Ankh-Large

Select desired protein language model for embedding your query proteins to then train your custom classifier. Default is esm2_t12_35M

Default: “esm2_t12_35M”

--train_split

Choose your train-test split for training and evaluating your custom classifier. For example, --train_split .6 would split your input sequences into two groups, one with 60% of the sequences for training and the other with 40% for evaluation
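
To make the split arithmetic concrete, here is a minimal sketch of a shuffled 60/40 split over ten sequence labels. The helper `split_sequences` and the seed value are hypothetical; how TRILL performs the split internally (for example, whether it stratifies by class) is not stated in this reference.

```python
import random


def split_sequences(labels, train_fraction, seed=123):
    """Shuffle labels and split them into train and test groups."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


labels = [f"seq_{i}" for i in range(10)]
train, test = split_sequences(labels, 0.6)
print(len(train), len(test))  # 6 4
```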

--preTrained

Enter the path to your pre-trained XGBoost binary classifier that you’ve trained with TRILL. This will be a .json file.

--preComputed_Embs

Enter the path to your pre-computed embeddings. Make sure they match the --emb_model you select.

Default: False

--batch_size

EpHod: Sets batch_size for embedding with ESM1v.

Default: 1

--xg_gamma

XGBoost: sets gamma for XGBoost, which is a hyperparameter that sets ‘Minimum loss reduction required to make a further partition on a leaf node of the tree.’

Default: 0.4

--xg_lr

XGBoost: Sets the learning rate for XGBoost

Default: 0.2

--xg_max_depth

XGBoost: Sets the maximum tree depth

Default: 8

--xg_reg_alpha

XGBoost: L1 regularization term on weights

Default: 0.8

--xg_reg_lambda

XGBoost: L2 regularization term on weights

Default: 0.1

--if_contamination

iForest: The proportion of outliers in the data. Default is automatically determined, but you can set it to a value in the range (0, 0.5]

Default: “auto”

--n_estimators

XGBoost/iForest: Number of boosting rounds

Default: 115

--sweep

XGBoost/iForest: Use this flag to perform cross-validated Bayesian optimization over the hyperparameter space.

Default: False

--sweep_cv

XGBoost: Change the number of folds used for cross-validation.

Default: 3

--f1_avg_method

Possible choices: macro, weighted, micro, None

XGBoost: Change the averaging method used for calculating F1. Default is no averaging.
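
The averaging choices differ in how per-class F1 scores are combined. The following pure-Python sketch illustrates the four options on a toy example, assuming the conventional (scikit-learn-style) definitions of macro, weighted, and micro averaging. The function names are hypothetical and this is not TRILL's internal code.

```python
def per_class_counts(y_true, y_pred, classes):
    """Count true positives, false positives and false negatives per class."""
    counts = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred, average=None):
    classes = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, classes)
    per_class = [f1(*counts[c]) for c in classes]
    if average is None:        # no averaging: one score per class
        return per_class
    if average == "macro":     # unweighted mean over classes
        return sum(per_class) / len(classes)
    if average == "weighted":  # mean weighted by class support
        support = [sum(t == c for t in y_true) for c in classes]
        return sum(f * s for f, s in zip(per_class, support)) / len(y_true)
    if average == "micro":     # pool the counts, then compute one F1
        tp, fp, fn = (sum(col) for col in zip(*counts.values()))
        return f1(tp, fp, fn)
    raise ValueError(f"unknown average: {average!r}")


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(f1_scores(y_true, y_pred))           # [0.75, 0.5]
print(f1_scores(y_true, y_pred, "macro"))  # 0.625
```

Macro averaging treats every class equally, so it is the more informative choice when the classes are imbalanced and the minority class matters.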

fold

Predict 3D protein structures using ESMFold or obtain 3Di structure for use with Foldseek to perform remote homology detection

trill fold [-h] [--strategy STRATEGY] [--batch_size BATCH_SIZE]
           {ESMFold,ProstT5} query

Positional Arguments

model

Possible choices: ESMFold, ProstT5

Choose your desired model.

query

Input fasta file

Named Arguments

--strategy

ESMFold: Choose a specific strategy if you are running out of CUDA memory. You can also pass either 64 or 32 for model.trunk.set_chunk_size(x)

--batch_size

ESMFold: Change batch-size number for folding proteins. Default is 1

Default: 1

visualize

Reduce dimensionality of embeddings to 2D

trill visualize [-h] [--method {PCA,UMAP,tSNE}] [--key KEY] embeddings

Positional Arguments

embeddings

Embeddings to be visualized

Named Arguments

--method

Possible choices: PCA, UMAP, tSNE

Method for reducing dimensions of embeddings. Default is PCA

Default: “PCA”

--key

Input a CSV, with your group mappings for your embeddings where the first column is the label and the second column is the group to be colored.

Default: False

simulate

Use MD to relax protein structures

trill simulate [-h] [--just_relax] receptor

Positional Arguments

receptor

Receptor of interest to be simulated. Must be either a pdb file or a .txt file containing the absolute path of each pdb, one per line.
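
When several structures are relaxed in one run, the .txt input can be generated rather than written by hand. A minimal sketch using `pathlib` follows; the directory, the file names, and the output name `receptors.txt` are hypothetical, and the empty .pdb files are created only so the sketch runs on its own.

```python
import tempfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())
# Hypothetical structure files; created empty here so the sketch runs.
for name in ("receptor_a.pdb", "receptor_b.pdb"):
    (work_dir / name).touch()

# Collect absolute paths and write them one per line.
pdb_paths = sorted(p.resolve() for p in work_dir.glob("*.pdb"))
list_file = work_dir / "receptors.txt"
list_file.write_text("\n".join(str(p) for p in pdb_paths) + "\n")

print(list_file.read_text())
```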

Named Arguments

--just_relax

Just relaxes the input structure(s) and outputs the fixed and relaxed structure(s). The forcefield that is used is amber14.

Default: False

dock

Perform molecular docking with proteins and ligands. Note that you should relax your protein receptor with Simulate or another method before docking.

trill dock [-h] [--save_visualisation]
           [--samples_per_complex SAMPLES_PER_COMPLEX] [--no_final_step_noise]
           [--inference_steps INFERENCE_STEPS] [--actual_steps ACTUAL_STEPS]
           [--min_radius MIN_RADIUS] [--max_radius MAX_RADIUS]
           [--min_alpha_spheres MIN_ALPHA_SPHERES]
           [--exhaustiveness EXHAUSTIVENESS] [--blind] [--anm]
           [--swarms SWARMS] [--sim_steps SIM_STEPS] [--restraints RESTRAINTS]
           {DiffDock,Vina,Smina,LightDock} protein [ligand [ligand ...]]

Positional Arguments

algorithm

Possible choices: DiffDock, Vina, Smina, LightDock

Note that while LightDock can dock protein ligands, DiffDock, Smina, and Vina can only do small-molecules.

protein

Protein of interest to be docked with ligand

ligand

Ligand to dock protein with. Note that with Autodock Vina, you can dock multiple ligands at one time. Simply provide them one after another before any other optional TRILL arguments are added. Also, if a .txt file is provided with each line providing the absolute path to different ligands, TRILL will dock each ligand one at a time.
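
The ordering requirement for multiple ligands can be illustrated by assembling the argument list in Python. The run name, the receptor and ligand file names, and their file formats are hypothetical; the sketch only builds and prints the command and does not run TRILL.

```python
import shlex

ligands = ["ligand_1.sdf", "ligand_2.sdf"]  # hypothetical ligand files

argv = [
    "trill", "dock_run", "1",      # run name and GPUs per node
    "dock", "Vina", "receptor.pdb",
    *ligands,                      # ligands come directly after the protein
    "--exhaustiveness", "16",      # optional arguments come after the ligands
]
print(shlex.join(argv))
```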

Named Arguments

--save_visualisation

DiffDock: Save a pdb file with all of the steps of the reverse diffusion.

Default: False

--samples_per_complex

DiffDock: Number of samples to generate.

Default: 10

--no_final_step_noise

DiffDock: Use no noise in the final step of the reverse diffusion

Default: False

--inference_steps

DiffDock: Number of denoising steps

Default: 20

--actual_steps

DiffDock: Number of denoising steps that are actually performed

--min_radius

Smina/Vina + Fpocket: Minimum radius of alpha spheres in a pocket. Default is 3Å.

Default: 3.0

--max_radius

Smina/Vina + Fpocket: Maximum radius of alpha spheres in a pocket. Default is 6Å.

Default: 6.0

--min_alpha_spheres

Smina/Vina + Fpocket: Minimum number of alpha spheres a pocket must contain to be considered. Default is 35.

Default: 35

--exhaustiveness

Smina/Vina: Change computational effort.

Default: 8

--blind

Smina/Vina: Perform blind docking and skip binding pocket prediction with fpocket

Default: False

--anm

LightDock: If selected, backbone flexibility is modeled using the Anisotropic Network Model (via ProDy)

Default: False

--swarms

LightDock: The number of swarms of the simulations, default is 25

Default: 25

--sim_steps

LightDock: The number of steps of the simulation. Default is 100

Default: 100

--restraints

LightDock: If restraints_file is provided, residue restraints will be considered during the setup and the simulation

utils

Misc utilities

trill utils [-h] [--dir DIR] [--fasta_paths_txt FASTA_PATHS_TXT]
            [--uniprotDB {UniProtKB,A.thaliana,C.elegans,E.coli,H.sapiens,M.musculus,R.norvegicus,SARS-CoV-2}]
            [--rep {per_AA,avg}]
            {prepare_class_key,fetch_embeddings}

Positional Arguments

tool

Possible choices: prepare_class_key, fetch_embeddings

prepare_class_key: Prepare a csv for use with the classify command. Takes a directory or a text file with a list of paths to fasta files. Each file will be a unique class, so if your directory contains 5 fasta files, there will be 5 classes in the output key csv.

Named Arguments

--dir

Directory to be used for creating a class key csv for classification.

--fasta_paths_txt

Text file with absolute paths of fasta files to be used for creating the class key. Each unique path will be treated as a unique class, and all the sequences in that file will be in the same class.

--uniprotDB

Possible choices: UniProtKB, A.thaliana, C.elegans, E.coli, H.sapiens, M.musculus, R.norvegicus, SARS-CoV-2

UniProt embedding dataset to download.

--rep

Possible choices: per_AA, avg

The representation to download.