usage: trill [-h] [--nodes NODES] [--logger LOGGER] [--profiler]
[--RNG_seed RNG_SEED] [--outdir OUTDIR] [--n_workers N_WORKERS]
name GPUs
{embed,finetune,inv_fold_gen,lang_gen,diff_gen,classify,fold,visualize,simulate,dock,utils}
...
Positional Arguments
- name
Name of run
- GPUs
Input total number of GPUs per node
Default: 1
- command
Possible choices: embed, finetune, inv_fold_gen, lang_gen, diff_gen, classify, fold, visualize, simulate, dock, utils
Named Arguments
- --nodes
Input total number of nodes. Default is 1
Default: 1
- --logger
Enable Tensorboard logger. Default is None
Default: False
- --profiler
Utilize PyTorchProfiler
Default: False
- --RNG_seed
Input RNG seed. Default is 123
Default: 123
- --outdir
Input full path to directory where you want the output from TRILL
Default: “.”
- --n_workers
Change number of CPU cores/’workers’ TRILL uses
Default: 1
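As a rough sketch of how these pieces fit together, the snippet below assembles a full command line in the order shown in the usage synopsis: global options first, then the run name and GPU count, then the sub-command with its own arguments. The run name `demo_run`, the output directory `./results`, and the input file `proteins.fasta` are placeholders chosen for illustration, not names TRILL requires.

```shell
# Placeholder values (assumptions for illustration only).
run_name="demo_run"
gpus=1
outdir="./results"

# Global options go before the run name; sub-command arguments go last.
cmd="trill --RNG_seed 42 --outdir $outdir --n_workers 4 $run_name $gpus embed esm2_t12_35M proteins.fasta --avg"

# Print the assembled command so it can be checked before running it.
echo "$cmd"
```

Paste the printed line into a shell where TRILL is installed to run it.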
Sub-commands
embed
Embed proteins of interest
trill embed [-h] [--batch_size BATCH_SIZE] [--finetuned FINETUNED] [--per_AA]
[--avg]
{esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,Ankh,Ankh-Large}
query
Positional Arguments
- model
Possible choices: esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, Ankh, Ankh-Large
Choose protein language model to embed query proteins
- query
Input protein fasta file
Named Arguments
- --batch_size
Batch size for embedding proteins. Default is 1; a larger value can be used if more memory is available
Default: 1
- --finetuned
Input path to your own finetuned ESM model
Default: False
- --per_AA
Add this flag to return the per amino acid representations.
Default: False
- --avg
Add this flag to return the average, whole sequence representation.
Default: False
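A minimal sketch of an embed call built from the options above, requesting both per-residue and averaged representations. The run name `embed_demo` and the file `proteins.fasta` are hypothetical; the batch size of 4 is an arbitrary choice that assumes enough memory is available.

```shell
# Hypothetical run name and input file.
run_name="embed_demo"
query="proteins.fasta"

# Positional order is: model, then query; flags may follow.
cmd="trill $run_name 1 embed esm2_t33_650M $query --per_AA --avg --batch_size 4"

echo "$cmd"
```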
finetune
Finetune protein language models
trill finetune [-h] [--epochs EPOCHS] [--save_on_epoch] [--lr LR]
[--batch_size BATCH_SIZE] [--mask_fraction MASK_FRACTION]
[--pre_masked_fasta] [--strategy STRATEGY]
[--ctrl_tag CTRL_TAG] [--finetuned FINETUNED]
{esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtGPT2,ZymCTRL}
query
Positional Arguments
- model
Possible choices: esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtGPT2, ZymCTRL
Choose the protein language model to finetune. Note that ESM2 is trained with the MLM objective, while ProtGPT2/ZymCTRL are trained with the CLM objective.
- query
Input fasta file
Named Arguments
- --epochs
Number of epochs for fine-tuning. Default is 10
Default: 10
- --save_on_epoch
Saves a checkpoint on every successful epoch completed. WARNING, this could lead to rapid storage consumption
Default: False
- --lr
Learning rate for optimizer. Default is 0.0001
Default: 0.0001
- --batch_size
Change batch-size number for fine-tuning. Default is 1
Default: 1
- --mask_fraction
ESM: Change fraction of amino acids masked for MLM training. Default is 0.15
Default: 0.15
- --pre_masked_fasta
ESM: Use this flag to specify that your input fasta will be pre-masked and does not need masking performed by TRILL. The sequences will still be randomly shuffled.
Default: False
- --strategy
Change training strategy. Default is None. List of strategies can be found at https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html
- --ctrl_tag
ZymCTRL: Choose an Enzyme Commission (EC) control tag for finetuning ZymCTRL. Note that the tag must match all of the enzymes in the query fasta file. You can find all ECs at https://www.brenda-enzymes.org/index.php
- --finetuned
Input path to your previously finetuned model to continue finetuning
Default: False
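The following sketch shows one way a finetuning run could be assembled from the options above. The run name `ft_demo` and the file `proteins.fasta` are placeholders, and the epoch count is an arbitrary example value rather than a recommendation.

```shell
# Hypothetical run name and input file.
run_name="ft_demo"
query="proteins.fasta"

# Finetune a small ESM2 model with the default learning rate and mask
# fraction written out explicitly, saving a checkpoint after every epoch.
cmd="trill $run_name 1 finetune esm2_t12_35M $query --epochs 5 --lr 0.0001 --mask_fraction 0.15 --save_on_epoch"

echo "$cmd"
```

Because `--save_on_epoch` writes a checkpoint per epoch, it is worth checking free disk space before using it with the larger models.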
inv_fold_gen
Generate proteins using inverse folding
trill inv_fold_gen [-h] [--temp TEMP]
[--num_return_sequences NUM_RETURN_SEQUENCES]
[--max_length MAX_LENGTH] [--top_p TOP_P]
[--repetition_penalty REPETITION_PENALTY] [--dont_sample]
[--mpnn_model MPNN_MODEL] [--save_score SAVE_SCORE]
[--save_probs SAVE_PROBS] [--score_only SCORE_ONLY]
[--path_to_fasta PATH_TO_FASTA]
[--conditional_probs_only CONDITIONAL_PROBS_ONLY]
[--conditional_probs_only_backbone CONDITIONAL_PROBS_ONLY_BACKBONE]
[--unconditional_probs_only UNCONDITIONAL_PROBS_ONLY]
[--backbone_noise BACKBONE_NOISE] [--batch_size BATCH_SIZE]
[--pdb_path_chains PDB_PATH_CHAINS]
[--chain_id_jsonl CHAIN_ID_JSONL]
[--fixed_positions_jsonl FIXED_POSITIONS_JSONL]
[--omit_AAs OMIT_AAS] [--bias_AA_jsonl BIAS_AA_JSONL]
[--bias_by_res_jsonl BIAS_BY_RES_JSONL]
[--omit_AA_jsonl OMIT_AA_JSONL] [--pssm_jsonl PSSM_JSONL]
[--pssm_multi PSSM_MULTI] [--pssm_threshold PSSM_THRESHOLD]
[--pssm_log_odds_flag PSSM_LOG_ODDS_FLAG]
[--pssm_bias_flag PSSM_BIAS_FLAG]
[--tied_positions_jsonl TIED_POSITIONS_JSONL]
{ESM-IF1,ProteinMPNN,ProstT5} query
Positional Arguments
- model
Possible choices: ESM-IF1, ProteinMPNN, ProstT5
Select which model to generate proteins using inverse folding.
- query
Input pdb file for inverse folding
Named Arguments
- --temp
Choose sampling temperature.
Default: “1”
- --num_return_sequences
Choose number of proteins to generate.
Default: 1
- --max_length
Max length of proteins generated, default is 500 AAs
Default: 500
- --top_p
ProstT5: If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Default is 1
Default: 1
- --repetition_penalty
ProstT5: The parameter for repetition penalty. 1.0 means no penalty, the default is 1.2
Default: 1.2
- --dont_sample
ProstT5: By default, the model will sample to generate the protein. With this flag, you can enable greedy decoding, where the most probable tokens will be returned.
Default: True
- --mpnn_model
ProteinMPNN: Model weights to use: v_48_002, v_48_010, v_48_020, or v_48_030. For example, v_48_010 is the version with 48 edges and 0.10 Å noise
Default: “v_48_020”
- --save_score
ProteinMPNN: 0 for False, 1 for True; save score=-log_prob to npy files
Default: 0
- --save_probs
ProteinMPNN: 0 for False, 1 for True; save MPNN predicted probabilities per position
Default: 0
- --score_only
ProteinMPNN: 0 for False, 1 for True; score input backbone-sequence pairs
Default: 0
- --path_to_fasta
ProteinMPNN: score provided input sequence in a fasta format; e.g. GGGGGG/PPPPS/WWW for chains A, B, C sorted alphabetically and separated by /
Default: “”
- --conditional_probs_only
ProteinMPNN: 0 for False, 1 for True; output conditional probabilities p(s_i given the rest of the sequence and backbone)
Default: 0
- --conditional_probs_only_backbone
ProteinMPNN: 0 for False, 1 for True; if true output conditional probabilities p(s_i given backbone)
Default: 0
- --unconditional_probs_only
ProteinMPNN: 0 for False, 1 for True; output unconditional probabilities p(s_i given backbone) in one forward pass
Default: 0
- --backbone_noise
ProteinMPNN: Standard deviation of Gaussian noise to add to backbone atoms
Default: 0.0
- --batch_size
ProteinMPNN: Batch size; can set higher for titan, quadro GPUs, reduce this if running out of GPU memory
Default: 1
- --pdb_path_chains
ProteinMPNN: Define which chains need to be designed for a single PDB
Default: “”
- --chain_id_jsonl
ProteinMPNN: Path to a dictionary specifying which chains need to be designed and which ones are fixed. If not specified, all chains will be designed.
Default: “”
- --fixed_positions_jsonl
ProteinMPNN: Path to a dictionary with fixed positions
Default: “”
- --omit_AAs
ProteinMPNN: Specify which amino acids should be omitted in the generated sequence, e.g. ‘AC’ would omit alanine and cysteine.
Default: X
- --bias_AA_jsonl
ProteinMPNN: Path to a dictionary which specifies AA composition bias if needed, e.g. {A: -1.1, F: 0.7} would make A less likely and F more likely.
Default: “”
- --bias_by_res_jsonl
ProteinMPNN: Path to dictionary with per position bias.
Default: “”
- --omit_AA_jsonl
ProteinMPNN: Path to a dictionary which specifies which amino acids need to be omitted from design at specific chain indices
Default: “”
- --pssm_jsonl
ProteinMPNN: Path to a dictionary with pssm
Default: “”
- --pssm_multi
ProteinMPNN: A value in the range [0.0, 1.0]; 0.0 means do not use the pssm, 1.0 means ignore MPNN predictions
Default: 0.0
- --pssm_threshold
ProteinMPNN: A value between -inf and +inf to restrict per-position AAs
Default: 0.0
- --pssm_log_odds_flag
ProteinMPNN: 0 for False, 1 for True
Default: 0
- --pssm_bias_flag
ProteinMPNN: 0 for False, 1 for True
Default: 0
- --tied_positions_jsonl
ProteinMPNN: Path to a dictionary with tied positions
Default: “”
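As an illustration of the ProteinMPNN-specific options above, the sketch below assembles a command that designs five sequences for one backbone at a low sampling temperature while omitting cysteine. The run name `invfold_demo` and the file `target.pdb` are hypothetical, and the temperature of 0.1 is only an example value.

```shell
# Hypothetical run name and input structure.
run_name="invfold_demo"
structure="target.pdb"

# Five designs with the default v_48_020 weights, excluding cysteine (C).
cmd="trill $run_name 1 inv_fold_gen ProteinMPNN $structure --num_return_sequences 5 --temp 0.1 --mpnn_model v_48_020 --omit_AAs C"

echo "$cmd"
```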
lang_gen
Generate proteins using large language models
trill lang_gen [-h] [--finetuned FINETUNED] [--esm2_arch ESM2_ARCH]
[--temp TEMP] [--ctrl_tag CTRL_TAG] [--batch_size BATCH_SIZE]
[--seed_seq SEED_SEQ] [--max_length MAX_LENGTH]
[--do_sample DO_SAMPLE] [--top_k TOP_K]
[--repetition_penalty REPETITION_PENALTY]
[--num_return_sequences NUM_RETURN_SEQUENCES] [--random_fill]
[--num_positions NUM_POSITIONS]
{ESM2,ProtGPT2,ZymCTRL}
Positional Arguments
- model
Possible choices: ESM2, ProtGPT2, ZymCTRL
Choose desired language model
Named Arguments
- --finetuned
Input path to your own finetuned model
Default: False
- --esm2_arch
ESM2_Gibbs: Choose which ESM2 architecture your finetuned model is
Default: “esm2_t12_35M_UR50D”
- --temp
Choose sampling temperature.
Default: “1”
- --ctrl_tag
ZymCTRL: Choose an Enzyme Commission (EC) control tag for conditional protein generation based on the tag. You can find all ECs at https://www.brenda-enzymes.org/index.php
- --batch_size
Change batch-size number to modulate how many proteins are generated at a time. Default is 1
Default: 1
- --seed_seq
Sequence to seed generation, the default is M.
Default: “M”
- --max_length
Max length of proteins generated, default is 100
Default: 100
- --do_sample
ProtGPT2/ZymCTRL: Whether or not to use sampling for generation; use greedy decoding otherwise
Default: True
- --top_k
The number of highest probability vocabulary tokens to keep for top-k-filtering
Default: 950
- --repetition_penalty
ProtGPT2/ZymCTRL: The parameter for repetition penalty, the default is 1.2. 1.0 means no penalty
Default: 1.2
- --num_return_sequences
Number of sequences to generate. Default is 1
Default: 1
- --random_fill
ESM2_Gibbs: Randomly select positions to fill each iteration for Gibbs sampling with ESM2. If not called then fill the positions in order
Default: True
- --num_positions
ESM2_Gibbs: Generate new AAs for this many positions each iteration for Gibbs sampling with ESM2. If 0, then generate for all target positions each round.
Default: 0
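A small sketch of sequence generation with ProtGPT2 using the options above. The run name `gen_demo` is a placeholder; the sampling values simply restate the documented defaults so they are easy to adjust.

```shell
# Hypothetical run name.
run_name="gen_demo"

# Generate five sequences of at most 100 residues, seeded with methionine.
cmd="trill $run_name 1 lang_gen ProtGPT2 --seed_seq M --max_length 100 --num_return_sequences 5 --top_k 950 --repetition_penalty 1.2"

echo "$cmd"
```

Note that `lang_gen` takes no query file; the only positional argument is the model.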
diff_gen
Generate proteins using RFDiffusion
trill diff_gen [-h] [--contigs CONTIGS]
[--RFDiffusion_Override RFDIFFUSION_OVERRIDE]
[--num_return_sequences NUM_RETURN_SEQUENCES]
[--Inpaint INPAINT] [--query QUERY] [--partial_T PARTIAL_T]
[--partial_diff_fix PARTIAL_DIFF_FIX] [--hotspots HOTSPOTS]
Named Arguments
- --contigs
Generate proteins between these sizes in AAs for RFDiffusion. For example, --contigs 100-200 will result in proteins in this range
- --RFDiffusion_Override
Change RFDiffusion model. For example, --RFDiffusion_Override ActiveSite will use ActiveSite_ckpt.pt for holding small motifs in place.
Default: False
- --num_return_sequences
Number of sequences for RFDiffusion to generate. Default is 5
Default: 5
- --Inpaint
Residues to inpaint.
- --query
Input pdb file for motif scaffolding, partial diffusion etc.
- --partial_T
Adjust partial diffusion sampling value.
- --partial_diff_fix
Pass the residues that you want to keep fixed for your input pdb during partial diffusion. Note that the residues should be 0-indexed.
- --hotspots
Define residues that the binder must interact with. For example, --hotspots A30,A33,A34, where A is the chain and the numbers are the residue indices.
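The unconditional case described under `--contigs` could be assembled as in the sketch below. The run name `diff_demo` is a placeholder, and the length range is the one used in the option description.

```shell
# Hypothetical run name.
run_name="diff_demo"

# Generate five backbones between 100 and 200 residues long.
cmd="trill $run_name 1 diff_gen --contigs 100-200 --num_return_sequences 5"

echo "$cmd"
```

For motif scaffolding or partial diffusion, `--query` with an input pdb file would be added to the same command.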
classify
Classify proteins using either pretrained classifiers or train/test your own.
trill classify [-h] [--key KEY] [--save_emb]
[--emb_model {esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,Ankh,Ankh-Large}]
[--train_split TRAIN_SPLIT] [--preTrained PRETRAINED]
[--preComputed_Embs PRECOMPUTED_EMBS] [--batch_size BATCH_SIZE]
[--xg_gamma XG_GAMMA] [--xg_lr XG_LR]
[--xg_max_depth XG_MAX_DEPTH] [--xg_reg_alpha XG_REG_ALPHA]
[--xg_reg_lambda XG_REG_LAMBDA]
[--if_contamination IF_CONTAMINATION]
[--n_estimators N_ESTIMATORS] [--sweep] [--sweep_cv SWEEP_CV]
[--f1_avg_method {macro,weighted,micro,None}]
{TemStaPro,EpHod,XGBoost,iForest} query
Positional Arguments
- classifier
Possible choices: TemStaPro, EpHod, XGBoost, iForest
Predict thermostability/optimal enzymatic pH using TemStaPro/EpHod or choose custom to train/use your own XGBoost or Isolation Forest classifier. Note for training XGBoost, you need to submit roughly equal amounts of each class as part of your query.
- query
Fasta file of sequences to score
Named Arguments
- --key
Input a CSV, with your class mappings for your embeddings where the first column is the label and the second column is the class.
- --save_emb
Save csv of ProtT5 embeddings
Default: False
- --emb_model
Possible choices: esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, Ankh, Ankh-Large
Select desired protein language model for embedding your query proteins to then train your custom classifier. Default is esm2_t12_35M
Default: “esm2_t12_35M”
- --train_split
Choose your train-test percentage split for training and evaluating your custom classifier. For example, --train_split 0.6 would split your input sequences into two groups, one with 60% of the sequences for training and the other with 40% for evaluation
- --preTrained
Enter the path to your pre-trained XGBoost binary classifier that you’ve trained with TRILL. This will be a .json file.
- --preComputed_Embs
Enter the path to your pre-computed embeddings. Make sure they match the --emb_model you select.
Default: False
- --batch_size
EpHod: Sets batch_size for embedding with ESM1v.
Default: 1
- --xg_gamma
XGBoost: sets gamma for XGBoost, which is a hyperparameter that sets ‘Minimum loss reduction required to make a further partition on a leaf node of the tree.’
Default: 0.4
- --xg_lr
XGBoost: Sets the learning rate for XGBoost
Default: 0.2
- --xg_max_depth
XGBoost: Sets the maximum tree depth
Default: 8
- --xg_reg_alpha
XGBoost: L1 regularization term on weights
Default: 0.8
- --xg_reg_lambda
XGBoost: L2 regularization term on weights
Default: 0.1
- --if_contamination
iForest: The proportion of outliers in the data. Default is automatically determined, but you can set it to a value in the range (0, 0.5]
Default: “auto”
- --n_estimators
XGBoost/iForest: Number of boosting rounds
Default: 115
- --sweep
XGBoost: Use this flag to perform cross-validated bayesian optimization over the hyperparameter space.
Default: False
- --sweep_cv
XGBoost: Change the number of folds used for cross-validation.
Default: 3
- --f1_avg_method
Possible choices: macro, weighted, micro, None
XGBoost: Change the averaging method used to calculate F1. Default is no averaging.
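A sketch of training a custom XGBoost classifier from the options above, using an 80/20 train-test split. The run name `clf_demo` and the files `proteins.fasta` and `key.csv` are hypothetical; the key file is assumed to follow the label/class column layout described under `--key`.

```shell
# Hypothetical run name and input files.
run_name="clf_demo"
query="proteins.fasta"
key="key.csv"

# Embed with the default model, then train on 80% and evaluate on 20%.
cmd="trill $run_name 1 classify XGBoost $query --key $key --emb_model esm2_t12_35M --train_split 0.8"

echo "$cmd"
```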
fold
Predict 3D protein structures using ESMFold or obtain 3Di structure for use with Foldseek to perform remote homology detection
trill fold [-h] [--strategy STRATEGY] [--batch_size BATCH_SIZE]
{ESMFold,ProstT5} query
Positional Arguments
- model
Possible choices: ESMFold, ProstT5
Choose your desired model.
- query
Input fasta file
Named Arguments
- --strategy
ESMFold: Choose a specific strategy if you are running out of CUDA memory. You can also pass either 64, or 32 for model.trunk.set_chunk_size(x)
- --batch_size
ESMFold: Change batch-size number for folding proteins. Default is 1
Default: 1
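A possible structure-prediction call is sketched below. The run name `fold_demo` and the file `proteins.fasta` are placeholders; passing 64 to `--strategy` follows the chunk-size option mentioned in its description and is only relevant when CUDA memory is tight.

```shell
# Hypothetical run name and input file.
run_name="fold_demo"
query="proteins.fasta"

# Predict structures one sequence at a time with a chunk size of 64.
cmd="trill $run_name 1 fold ESMFold $query --batch_size 1 --strategy 64"

echo "$cmd"
```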
visualize
Reduce dimensionality of embeddings to 2D
trill visualize [-h] [--method {PCA,UMAP,tSNE}] [--key KEY] embeddings
Positional Arguments
- embeddings
Embeddings to be visualized
Named Arguments
- --method
Possible choices: PCA, UMAP, tSNE
Method for reducing dimensions of embeddings. Default is PCA
Default: “PCA”
- --key
Input a CSV, with your group mappings for your embeddings where the first column is the label and the second column is the group to be colored.
Default: False
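The sketch below assembles a visualization command that projects embeddings with UMAP and colors points by group. The run name `viz_demo` and the file names `embeddings.csv` and `key.csv` are assumptions made for illustration; substitute the embedding file produced by your own run.

```shell
# Hypothetical run name and input files.
run_name="viz_demo"
embeddings="embeddings.csv"
key="key.csv"

# Reduce the embeddings to 2D with UMAP and color by the groups in the key.
cmd="trill $run_name 1 visualize $embeddings --method UMAP --key $key"

echo "$cmd"
```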
simulate
Use MD to relax protein structures
trill simulate [-h] [--ligand LIGAND]
[--constraints {None,HBonds,AllBonds,HAngles}] [--rigidWater]
[--forcefield FORCEFIELD] [--solvent SOLVENT] [--solvate]
[--step_size STEP_SIZE] [--num_steps NUM_STEPS]
[--reporting_interval REPORTING_INTERVAL]
[--output_traj_dcd OUTPUT_TRAJ_DCD]
[--apply-harmonic-force APPLY_HARMONIC_FORCE]
[--force-constant FORCE_CONSTANT] [--z0 Z0]
[--molecule-atom-indices MOLECULE_ATOM_INDICES]
[--equilibration_steps EQUILIBRATION_STEPS]
[--periodic_box PERIODIC_BOX] [--just_relax]
[--reporter_interval REPORTER_INTERVAL]
receptor
Positional Arguments
- receptor
Receptor of interest to be simulated. Must be either a pdb file or a .txt file with the absolute path for each pdb, separated by a new-line.
Named Arguments
- --ligand
Ligand of interest to be simulated with input receptor
- --constraints
Possible choices: None, HBonds, AllBonds, HAngles
Specifies which bonds and angles should be implemented with constraints. Allowed values are None, HBonds, AllBonds, or HAngles.
Default: “None”
- --rigidWater
If true, water molecules will be fully rigid regardless of the value passed for the constraints argument.
- --forcefield
Force field to use. Default is amber14-all.xml
Default: “amber14-all.xml”
- --solvent
Solvent model to use, the default is amber14/tip3pfb.xml
Default: “amber14/tip3pfb.xml”
- --solvate
Add this flag to solvate your simulation
Default: False
- --step_size
Step size in femtoseconds. Default is 2
Default: 2
- --num_steps
Number of simulation steps
Default: 5000
- --reporting_interval
Reporting interval for simulation
Default: 1000
- --output_traj_dcd
Output trajectory DCD file
Default: “trajectory.dcd”
- --apply-harmonic-force
Whether to apply a harmonic force to pull the molecule.
Default: False
- --force-constant
Force constant for the harmonic force in kJ/mol/nm^2.
- --z0
The z-coordinate to pull towards in nm.
- --molecule-atom-indices
Comma-separated list of atom indices to which the harmonic force will be applied.
Default: “0,1,2”
- --equilibration_steps
Steps you want to take for NVT and NPT equilibration. Each step is 0.002 picoseconds
Default: 300
- --periodic_box
Give, in nm, one of the dimensions to build the periodic boundary.
Default: 10
- --just_relax
Just relaxes the input structure(s) and outputs the fixed and relaxed structure(s). The forcefield that is used is amber14.
Default: False
- --reporter_interval
Set interval to save PDB and energy snapshot. Note that the higher the number, the bigger the output files will be and the slower the simulation. Default is 1000
Default: 1000
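To make the relationship between `--step_size` and `--num_steps` concrete, the sketch below computes the simulated time for the default values and then assembles a matching command. The run name `sim_demo` and the file `receptor.pdb` are placeholders, and the calculation assumes the simulated time is simply step size multiplied by step count, excluding equilibration.

```shell
# Defaults from the option list: 2 fs per step, 5000 steps.
step_size_fs=2
num_steps=5000

# Simulated time in femtoseconds (2 * 5000 = 10000 fs, i.e. 10 ps).
total_fs=$((step_size_fs * num_steps))
echo "simulated time: ${total_fs} fs"

# Hypothetical run name and receptor file.
cmd="trill sim_demo 1 simulate receptor.pdb --solvate --constraints HBonds --step_size $step_size_fs --num_steps $num_steps"
echo "$cmd"
```

If only a relaxed structure is needed, `--just_relax` can be used instead of a full simulation.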
dock
Perform molecular docking with proteins and ligands. Note that you should relax your protein receptor with Simulate or another method before docking.
trill dock [-h] [--save_visualisation]
[--samples_per_complex SAMPLES_PER_COMPLEX] [--no_final_step_noise]
[--inference_steps INFERENCE_STEPS] [--actual_steps ACTUAL_STEPS]
[--min_radius MIN_RADIUS] [--max_radius MAX_RADIUS]
[--min_alpha_spheres MIN_ALPHA_SPHERES]
[--exhaustiveness EXHAUSTIVENESS] [--blind] [--anm]
[--swarms SWARMS] [--sim_steps SIM_STEPS] [--restraints RESTRAINTS]
{DiffDock,Vina,Smina,LightDock,GeoDock} protein
[ligand [ligand ...]]
Positional Arguments
- algorithm
Possible choices: DiffDock, Vina, Smina, LightDock, GeoDock
Note that while LightDock can dock protein ligands, DiffDock, Smina, and Vina can only do small-molecules.
- protein
Protein of interest to be docked with ligand
- ligand
Ligand to dock protein with. Note that with Autodock Vina, you can dock multiple ligands at one time. Simply provide them one after another before any other optional TRILL arguments are added. Also, if a .txt file is provided with each line providing the absolute path to different ligands, TRILL will dock each ligand one at a time.
Named Arguments
- --save_visualisation
DiffDock: Save a pdb file with all of the steps of the reverse diffusion.
Default: False
- --samples_per_complex
DiffDock: Number of samples to generate.
Default: 10
- --no_final_step_noise
DiffDock: Use no noise in the final step of the reverse diffusion
Default: False
- --inference_steps
DiffDock: Number of denoising steps
Default: 20
- --actual_steps
DiffDock: Number of denoising steps that are actually performed
- --min_radius
Smina/Vina + Fpocket: Minimum radius of alpha spheres in a pocket. Default is 3Å.
Default: 3.0
- --max_radius
Smina/Vina + Fpocket: Maximum radius of alpha spheres in a pocket. Default is 6Å.
Default: 6.0
- --min_alpha_spheres
Smina/Vina + Fpocket: Minimum number of alpha spheres a pocket must contain to be considered. Default is 35.
Default: 35
- --exhaustiveness
Smina/Vina: Change computational effort.
Default: 8
- --blind
Smina/Vina: Perform blind docking and skip binding pocket prediction with fpocket
Default: False
- --anm
LightDock: If selected, backbone flexibility is modeled using Anisotropic Network Model (via ProDy)
Default: False
- --swarms
LightDock: The number of swarms of the simulations, default is 25
Default: 25
- --sim_steps
LightDock: The number of steps of the simulation. Default is 100
Default: 100
- --restraints
LightDock: If restraints_file is provided, residue restraints will be considered during the setup and the simulation
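The ligand ordering rule described above can be sketched as follows: with Vina, several ligands are listed directly after the protein and before any optional arguments. The run name `dock_demo` and the file names `receptor.pdb`, `ligand_a.sdf`, and `ligand_b.sdf` are hypothetical, and the ligand file format shown is an assumption rather than something stated in this reference.

```shell
# Hypothetical run name and input files.
run_name="dock_demo"
protein="receptor.pdb"
ligands="ligand_a.sdf ligand_b.sdf"

# Ligands come right after the protein; optional arguments follow them.
cmd="trill $run_name 1 dock Vina $protein $ligands --exhaustiveness 8"

echo "$cmd"
```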
utils
Misc utilities
trill utils [-h] [--dir DIR] [--fasta_paths_txt FASTA_PATHS_TXT]
[--uniprotDB {UniProtKB,A.thaliana,C.elegans,E.coli,H.sapiens,M.musculus,R.norvegicus,SARS-CoV-2}]
[--rep {per_AA,avg}]
{prepare_class_key,fetch_embeddings}
Positional Arguments
- tool
Possible choices: prepare_class_key, fetch_embeddings
prepare_class_key: Prepare a csv for use with the classify command. Takes a directory or text file with list of paths for fasta files. Each file will be a unique class, so if your directory contains 5 fasta files, there will be 5 classes in the output key csv.
Named Arguments
- --dir
Directory to be used for creating a class key csv for classification.
- --fasta_paths_txt
Text file with absolute paths of fasta files to be used for creating the class key. Each unique path will be treated as a unique class, and all the sequences in that file will be in the same class.
- --uniprotDB
Possible choices: UniProtKB, A.thaliana, C.elegans, E.coli, H.sapiens, M.musculus, R.norvegicus, SARS-CoV-2
UniProt embedding dataset to download.
- --rep
Possible choices: per_AA, avg
The representation to download.
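The two tools could be invoked roughly as sketched below. The run names `key_demo` and `fetch_demo` and the directory `./fasta_by_class` are placeholders; the directory is assumed to hold one fasta file per class, as described for `prepare_class_key`.

```shell
# Build a class key from a directory containing one fasta file per class.
key_cmd="trill key_demo 1 utils prepare_class_key --dir ./fasta_by_class"
echo "$key_cmd"

# Download averaged embeddings for the E.coli UniProt dataset.
fetch_cmd="trill fetch_demo 1 utils fetch_embeddings --uniprotDB E.coli --rep avg"
echo "$fetch_cmd"
```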