usage: trill [-h] [--nodes NODES] [--logger LOGGER] [--profiler]
[--RNG_seed RNG_SEED] [--outdir OUTDIR] [--n_workers N_WORKERS]
name GPUs
{visualize,regress,utils,simulate,finetune,fold,dock,lang_gen,inv_fold_gen,classify,diff_gen,embed,score}
...
Positional Arguments
- name
Name of run
- GPUs
Input total number of GPUs per node
Default: 1
- command
Possible choices: visualize, regress, utils, simulate, finetune, fold, dock, lang_gen, inv_fold_gen, classify, diff_gen, embed, score
Named Arguments
- --nodes
Input total number of nodes. Default is 1
Default: 1
- --logger
Enable Tensorboard logger. Default is None
Default: False
- --profiler
Utilize PyTorchProfiler
Default: False
- --RNG_seed
Input RNG seed. Default is 123
Default: 123
- --outdir
Input full path to directory where you want the output from TRILL
Default: “.”
- --n_workers
Change the number of CPU cores (‘workers’) TRILL uses
Default: 1
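The synopsis above implies an argument order: global options, then name and GPUs, then the sub-command followed by its own arguments. The following Python sketch assembles such a command line and prints it without running it. The run name demo_run, the output directory results and the file query.fasta are hypothetical, and placing the global options before name is an assumption based on standard argparse behavior.

```python
import shlex

# Hypothetical values, for illustration only.
run_name = "demo_run"
gpus_per_node = 1

# Global options documented above (assumed to go before the positional arguments).
global_opts = ["--outdir", "results", "--RNG_seed", "123", "--n_workers", "4"]

# Sub-command followed by its own positional and named arguments.
subcommand = ["embed", "esm2_t12_35M", "query.fasta", "--avg"]

cmd = ["trill", *global_opts, run_name, str(gpus_per_node), *subcommand]
command_line = shlex.join(cmd)
print(command_line)
```

Keeping the command as a list makes it easy to pass to subprocess.run(cmd) once the inputs exist.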
Sub-commands
visualize
Reduce dimensionality of embeddings to 2D
trill visualize [-h] [--method {PCA,UMAP,tSNE}] [--key KEY] embeddings
Positional Arguments
- embeddings
Embeddings to be visualized
Named Arguments
- --method
Possible choices: PCA, UMAP, tSNE
Method for reducing dimensions of embeddings. Default is PCA
Default: “PCA”
- --key
Input a CSV, with your group mappings for your embeddings where the first column is the label and the second column is the group to be colored.
Default: False
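The --key file described above is a two-column CSV: label first, group second. A minimal Python sketch that writes one is shown here. The sequence labels and group names are hypothetical, and the header row (Label, Group) is an assumption, since the help text does not say whether a header is expected.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical labels (first column) and groups used for coloring (second column).
rows = [("seq_1", "kinase"), ("seq_2", "kinase"), ("seq_3", "protease")]

# Scratch location for the demo; any writable path works.
key_path = Path(tempfile.mkdtemp()) / "groups.csv"
with key_path.open("w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Label", "Group"])  # assumed header names
    writer.writerows(rows)

print(key_path.read_text())
```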
regress
Train your own regressors on input protein sequences and an associated score.
trill regress [-h] [--key KEY] [--save_emb]
[--emb_model {Ankh,Ankh-Large,CaLM,esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,RiNALMo,mRNA-FM,RNA-FM,SaProt}]
[--train_split TRAIN_SPLIT] [--preTrained PRETRAINED]
[--preComputed_Embs PRECOMPUTED_EMBS] [--batch_size BATCH_SIZE]
[--lr LR] [--max_depth MAX_DEPTH] [--num_leaves NUM_LEAVES]
[--n_estimators N_ESTIMATORS] [--sweep] [--sweep_cv SWEEP_CV]
[--sweep_iters SWEEP_ITERS]
{Linear,LightGBM} query
Positional Arguments
- regressor
Possible choices: Linear, LightGBM
Train a custom regression model
- query
Fasta file of protein sequences
Named Arguments
- --key
Input a CSV, with your mappings for your embeddings where the first column is the label and the second column is the value.
- --save_emb
Save csv of embeddings
Default: False
- --emb_model
Possible choices: Ankh, Ankh-Large, CaLM, esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, RiNALMo, mRNA-FM, RNA-FM, SaProt
Select desired protein language model for embedding your query proteins to then train your custom regressor. Default is esm2_t12_35M
Default: “esm2_t12_35M”
- --train_split
Choose your train-test percentage split for training and evaluating your custom regressor. For example, --train_split 0.6 would split your input sequences into two groups, one with 60% of the sequences for training and the other with 40% for evaluation.
- --preTrained
Enter the path to your pre-trained regressor that you’ve trained with TRILL. This will be a .json file.
- --preComputed_Embs
Enter the path to your pre-computed embeddings. Make sure they match the --emb_model you select.
Default: False
- --batch_size
Sets batch_size for embedding.
Default: 1
- --lr
LightGBM: Sets the learning rate. Default is 0.2
Default: 0.2
- --max_depth
LightGBM: Sets the maximum tree depth. Default is -1, no max tree depth.
Default: -1
- --num_leaves
LightGBM: Sets the max number of leaves in one tree. Default is 31
Default: 31
- --n_estimators
LightGBM: Number of boosting rounds
Default: 115
- --sweep
LightGBM: Use this flag to perform cross-validated bayesian optimization over the hyperparameter space.
Default: False
- --sweep_cv
LightGBM: Change the number of folds used for cross-validation.
Default: 3
- --sweep_iters
LightGBM: Change the number of optimization iterations. Default is 10.
Default: 10
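To make the --train_split fraction concrete, the Python sketch below computes how many sequences land in each group for a given fraction. The helper name split_counts is hypothetical, and rounding to the nearest integer is an assumption; TRILL may round differently.

```python
def split_counts(n_sequences: int, train_split: float) -> tuple:
    """Return (n_train, n_test) for a fractional train split."""
    if not 0.0 < train_split < 1.0:
        raise ValueError("train_split must be strictly between 0 and 1")
    # Assumption: round to the nearest whole sequence.
    n_train = round(n_sequences * train_split)
    return n_train, n_sequences - n_train


n_train, n_test = split_counts(50, 0.6)
print(f"train: {n_train}, test: {n_test}")
```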
utils
Misc utilities
trill utils [-h] [--dir DIR] [--fasta_paths_txt FASTA_PATHS_TXT]
[--uniprotDB {UniProtKB,A.thaliana,C.elegans,E.coli,H.sapiens,M.musculus,R.norvegicus,SARS-CoV-2}]
[--rep {per_AA,avg}]
{prepare_class_key,fetch_embeddings}
Positional Arguments
- tool
Possible choices: prepare_class_key, fetch_embeddings
prepare_class_key: Prepare a CSV for use with the classify command. Takes a directory or a text file listing the paths of fasta files. Each file will be a unique class, so if your directory contains 5 fasta files, there will be 5 classes in the output key CSV.
Named Arguments
- --dir
Directory to be used for creating a class key csv for classification.
- --fasta_paths_txt
Text file with absolute paths of fasta files to be used for creating the class key. Each unique path will be treated as a unique class, and all the sequences in that file will be in the same class.
- --uniprotDB
Possible choices: UniProtKB, A.thaliana, C.elegans, E.coli, H.sapiens, M.musculus, R.norvegicus, SARS-CoV-2
UniProt embedding dataset to download.
- --rep
Possible choices: per_AA, avg
The representation to download.
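A text file for --fasta_paths_txt can be generated rather than typed by hand. The Python sketch below collects the absolute paths of the fasta files in a directory and writes them out. The file names are hypothetical, and one path per line is an assumption about the expected layout.

```python
import tempfile
from pathlib import Path

# Scratch directory holding two empty, hypothetical fasta files (one per class).
data_dir = Path(tempfile.mkdtemp())
for name in ("thermophiles.fasta", "mesophiles.fasta"):
    (data_dir / name).touch()

# Absolute paths, sorted so the output is reproducible.
fasta_paths = sorted(p.resolve() for p in data_dir.glob("*.fasta"))

paths_txt = data_dir / "fasta_paths.txt"
paths_txt.write_text("\n".join(str(p) for p in fasta_paths) + "\n")

print(paths_txt.read_text())
```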
simulate
Use OpenMM to perform molecular dynamics
trill simulate [-h] [--ligand LIGAND]
[--constraints {None,HBonds,AllBonds,HAngles}] [--rigidWater]
[--forcefield FORCEFIELD]
[--solvent {implicit/hct.xml,amber14/tip3p.xml,amber14/tip3pfb.xml}]
[--step_size STEP_SIZE] [--num_steps NUM_STEPS]
[--reporting_interval REPORTING_INTERVAL]
[--output_traj_dcd OUTPUT_TRAJ_DCD]
[--apply-harmonic-force APPLY_HARMONIC_FORCE]
[--force-constant FORCE_CONSTANT] [--z0 Z0]
[--molecule-atom-indices MOLECULE_ATOM_INDICES]
[--equilibration_steps EQUILIBRATION_STEPS]
[--periodic_box PERIODIC_BOX]
[--nonbonded_method {NoCutoff,CutoffNonPeriodic,CutoffPeriodic,Ewald,PME,LJPME}]
[--just_relax] [--reporter_interval REPORTER_INTERVAL]
receptor
Positional Arguments
- receptor
Receptor of interest to be simulated. Must be either pdb file or a .txt file with the absolute path for each pdb, separated by a new-line.
Named Arguments
- --ligand
Ligand of interest to be simulated with input receptor
- --constraints
Possible choices: None, HBonds, AllBonds, HAngles
Specifies which bonds and angles should be implemented with constraints. Allowed values are None, HBonds, AllBonds, or HAngles.
Default: “None”
- --rigidWater
If true, water molecules will be fully rigid regardless of the value passed for the constraints argument.
- --forcefield
Force field to use. Default is amber14-all.xml
Default: “amber14-all.xml”
- --solvent
Possible choices: implicit/hct.xml, amber14/tip3p.xml, amber14/tip3pfb.xml
Solvent model to use. Options are ‘implicit/hct.xml’, ‘amber14/tip3p.xml’, or ‘amber14/tip3pfb.xml’. The default is ‘implicit/hct.xml’.
Default: “implicit/hct.xml”
- --step_size
Step size in femtoseconds. Default is 2
Default: 2
- --num_steps
Number of simulation steps
Default: 5000
- --reporting_interval
Reporting interval for simulation
Default: 1000
- --output_traj_dcd
Output trajectory DCD file
Default: “trajectory.dcd”
- --apply-harmonic-force
Whether to apply a harmonic force to pull the molecule.
Default: False
- --force-constant
Force constant for the harmonic force in kJ/mol/nm^2.
- --z0
The z-coordinate to pull towards in nm.
- --molecule-atom-indices
Comma-separated list of atom indices to which the harmonic force will be applied.
Default: “0,1,2”
- --equilibration_steps
Steps you want to take for NVT and NPT equilibration. Each step is 0.002 picoseconds
Default: 300
- --periodic_box
Give, in nm, one of the dimensions to build the periodic boundary.
Default: 10
- --nonbonded_method
Possible choices: NoCutoff, CutoffNonPeriodic, CutoffPeriodic, Ewald, PME, LJPME
Specify the method for handling nonbonded interactions. Find more info in 3.6.5 of the OpenMM user guide.
Default: “CutoffPeriodic”
- --just_relax
Just relaxes the input structure(s) and outputs the fixed and relaxed structure(s). The forcefield that is used is amber14.
Default: False
- --reporter_interval
Set the interval, in steps, at which PDB and energy snapshots are saved. Note that the more frequent the snapshots (i.e. the lower the number), the bigger the output files will be and the slower the simulation. Default is 1000
Default: 1000
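The step-based arguments above translate into simulated time with simple arithmetic. The Python sketch below works this out for the default values. It assumes one report per reporting interval, and that --equilibration_steps applies to each equilibration stage separately, which the help text does not state explicitly.

```python
# Default values taken from the argument list above.
step_size_fs = 2            # --step_size, femtoseconds per step
num_steps = 5000            # --num_steps
reporting_interval = 1000   # --reporting_interval
equilibration_steps = 300   # --equilibration_steps
equilibration_step_fs = 2   # 0.002 ps per equilibration step

FS_PER_PS = 1000

production_ps = num_steps * step_size_fs / FS_PER_PS
equilibration_ps = equilibration_steps * equilibration_step_fs / FS_PER_PS
n_reports = num_steps // reporting_interval  # assumes one report per interval

print(f"production run: {production_ps} ps")
print(f"equilibration (per stage, assumed): {equilibration_ps} ps")
print(f"reports written: {n_reports}")
```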
finetune
Finetune protein language models
trill finetune [-h] [--epochs EPOCHS] [--save_on_epoch] [--lr LR]
[--batch_size BATCH_SIZE] [--mask_fraction MASK_FRACTION]
[--pre_masked_fasta PRE_MASKED_FASTA] [--strategy STRATEGY]
[--ctrl_tag CTRL_TAG] [--finetuned FINETUNED] [--eval EVAL]
[--grad_accum_steps GRAD_ACCUM_STEPS]
[--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt}]
[--warmup_steps WARMUP_STEPS]
{esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtGPT2,progen2-small,progen2-medium,progen2-large,progen2-oas,progen2-BFD90,progen2-xlarge,ZymCTRL}
query
Positional Arguments
- model
Possible choices: esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtGPT2, progen2-small, progen2-medium, progen2-large, progen2-oas, progen2-BFD90, progen2-xlarge, ZymCTRL
Choose the protein language model to finetune. Note that ESM2 is trained with the MLM objective, while ProtGPT2/ZymCTRL/ProGen2 are trained with the CLM objective. ZymCTRL must be finetuned with --ctrl_tag specifying an Enzyme Commission (EC) number.
- query
Input fasta file. For ProGen2, you can provide a .csv file where the first column contains absolute paths to fasta files and the second column contains the control tag for the fasta file on the same row.
Named Arguments
- --epochs
Number of epochs for fine-tuning. Default is 10
Default: 10
- --save_on_epoch
Saves a checkpoint on every successful epoch completed. WARNING, this could lead to rapid storage consumption
Default: False
- --lr
Learning rate for optimizer. Default is 0.0001
Default: 0.0001
- --batch_size
Change batch-size number for fine-tuning. Default is 1
Default: 1
- --mask_fraction
ESM: Change fraction of amino acids masked for MLM training. Default is 0.15
Default: 0.15
- --pre_masked_fasta
ESM: Use this flag to specify that your input fasta will be pre-masked and does not need masking performed by TRILL. The sequences will still be randomly shuffled.
Default: False
- --strategy
Change training strategy. Default is None. List of strategies can be found at https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html. For ProGen2 only, you can select one of deepspeed_stage_1, deepspeed_stage_2, deepspeed_stage_2_offload, deepspeed_stage_3 or deepspeed_stage_3_offload.
- --ctrl_tag
ZymCTRL: Choose an Enzyme Commission (EC) control tag for finetuning ZymCTRL. Note that the tag must match all of the enzymes in the query fasta file. You can find all ECs at https://www.brenda-enzymes.org/index.php. You can also provide a control tag for ProGen2, which can be any arbitrary string specifying a ‘class’ of proteins.
- --finetuned
Input path to your previously finetuned model to continue finetuning
Default: False
- --eval
ProGen2: You can choose to withhold a random proportion of the input data for evaluation to check for overfitting. Input a float, like 0.25, which would hold out 25% of the data from finetuning for evaluation after every epoch.
Default: 0
- --grad_accum_steps
ProGen2: Change the number of steps over which gradients are accumulated before the model weights are updated. This can help with GPU vRAM usage.
Default: 1
- --scheduler
Possible choices: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, inverse_sqrt
ProGen2: Choose the learning rate scheduler to use during training, default is constant.
Default: “constant”
- --warmup_steps
ProGen2: Number of steps for a warmup ramping up to the set learning rate, default is 0.
Default: 0
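For ProGen2, the query can be a CSV that pairs fasta files with control tags, as described under the query argument. The Python sketch below writes such a file. The fasta names and tags are hypothetical, and omitting a header row is an assumption.

```python
import csv
import tempfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp()).resolve()

# Hypothetical fasta files mapped to arbitrary control tags.
tagged_fastas = {"lysozymes.fasta": "lysozyme", "lipases.fasta": "lipase"}

csv_path = work_dir / "progen2_inputs.csv"
with csv_path.open("w", newline="") as handle:
    writer = csv.writer(handle)
    for fasta_name, tag in tagged_fastas.items():
        fasta_path = work_dir / fasta_name
        fasta_path.touch()  # empty placeholder so the path exists
        # First column: absolute path. Second column: control tag.
        writer.writerow([str(fasta_path), tag])

print(csv_path.read_text())
```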
fold
Predict 3D protein structures using ESMFold or obtain 3Di structure for use with Foldseek to perform remote homology detection
trill fold [-h] [--strategy STRATEGY] [--batch_size BATCH_SIZE]
{ESMFold,ProstT5} query
Positional Arguments
- model
Possible choices: ESMFold, ProstT5
Choose your desired model.
- query
Input fasta file
Named Arguments
- --strategy
ESMFold: Choose a specific strategy if you are running out of CUDA memory. You can also pass either 64 or 32 for model.trunk.set_chunk_size(x)
- --batch_size
ESMFold: Change batch-size number for folding proteins. Default is 1
Default: 1
dock
Perform molecular docking with proteins and ligands. Note that you should relax your protein receptor with Simulate or another method before docking.
trill dock [-h] [--save_visualisation]
[--samples_per_complex SAMPLES_PER_COMPLEX] [--no_final_step_noise]
[--inference_steps INFERENCE_STEPS] [--actual_steps ACTUAL_STEPS]
[--min_radius MIN_RADIUS] [--max_radius MAX_RADIUS]
[--min_alpha_spheres MIN_ALPHA_SPHERES]
[--exhaustiveness EXHAUSTIVENESS] [--blind] [--anm]
[--swarms SWARMS] [--glowworms GLOWWORMS] [--sim_steps SIM_STEPS]
[--restraints RESTRAINTS]
{DiffDock,DiffDock-L,Vina,Smina,LightDock,GeoDock} protein
[ligand [ligand ...]]
Positional Arguments
- algorithm
Possible choices: DiffDock, DiffDock-L, Vina, Smina, LightDock, GeoDock
LightDock and GeoDock currently support only protein-protein docking. Vina, Smina and DiffDock allow for docking small molecules to proteins.
- protein
Protein of interest to be docked with ligand
- ligand
Ligand to dock protein with. Note that with Autodock Vina, you can dock multiple ligands at one time. Simply provide them one after another before any other optional TRILL arguments are added. Also, if a .txt file is provided with each line providing the absolute path to different ligands, TRILL will dock each ligand one at a time.
Named Arguments
- --save_visualisation
DiffDock: Save a pdb file with all of the steps of the reverse diffusion.
Default: False
- --samples_per_complex
DiffDock: Number of samples to generate.
Default: 10
- --no_final_step_noise
DiffDock: Use no noise in the final step of the reverse diffusion
Default: True
- --inference_steps
DiffDock: Number of denoising steps
Default: 20
- --actual_steps
DiffDock: Number of denoising steps that are actually performed
Default: 0
- --min_radius
Smina/Vina + Fpocket: Minimum radius of alpha spheres in a pocket. Default is 3Å.
Default: 3.0
- --max_radius
Smina/Vina + Fpocket: Maximum radius of alpha spheres in a pocket. Default is 6Å.
Default: 6.0
- --min_alpha_spheres
Smina/Vina + Fpocket: Minimum number of alpha spheres a pocket must contain to be considered. Default is 35.
Default: 35
- --exhaustiveness
Smina/Vina: Change computational effort.
Default: 8
- --blind
Smina/Vina: Perform blind docking and skip binding pocket prediction with fpocket
Default: False
- --anm
LightDock: If selected, backbone flexibility is modeled using Anisotropic Network Model (via ProDy)
Default: False
- --swarms
LightDock: The number of swarms of the simulations, default is 25
Default: 25
- --glowworms
LightDock: The number of glowworms per swarm, default is 200
Default: 200
- --sim_steps
LightDock: The number of steps of the simulation. Default is 100
Default: 100
- --restraints
LightDock: If restraints_file is provided, residue restraints will be considered during the setup and the simulation
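The ligand argument notes that several ligands can be passed to Vina one after another, before any optional arguments. The Python sketch below assembles a command in that order and prints it. The run name, receptor and ligand file names (including their extensions) are hypothetical.

```python
import shlex

# Hypothetical input files.
receptor = "receptor_relaxed.pdb"
ligands = ["ligand_a.sdf", "ligand_b.sdf"]

cmd = [
    "trill", "dock_run", "1",   # name and GPUs
    "dock", "Vina",             # sub-command and algorithm
    receptor,
    *ligands,                   # ligands come before any optional arguments
    "--exhaustiveness", "16",
]
print(shlex.join(cmd))
```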
lang_gen
Generate proteins using large language models
trill lang_gen [-h] [--finetuned FINETUNED] [--esm2_arch ESM2_ARCH]
[--temp TEMP] [--ctrl_tag CTRL_TAG] [--batch_size BATCH_SIZE]
[--seed_seq SEED_SEQ] [--max_length MAX_LENGTH]
[--do_sample DO_SAMPLE] [--top_k TOP_K]
[--repetition_penalty REPETITION_PENALTY]
[--num_return_sequences NUM_RETURN_SEQUENCES] [--random_fill]
[--num_positions NUM_POSITIONS]
{ESM2,ProtGPT2,progen2-small,progen2-medium,progen2-large,progen2-oas,progen2-BFD90,progen2-xlarge,ZymCTRL}
Positional Arguments
- model
Possible choices: ESM2, ProtGPT2, progen2-small, progen2-medium, progen2-large, progen2-oas, progen2-BFD90, progen2-xlarge, ZymCTRL
Choose desired language model
Named Arguments
- --finetuned
Input path to your own finetuned model
Default: False
- --esm2_arch
ESM2_Gibbs: Choose which ESM2 architecture your finetuned model is
Default: “esm2_t12_35M_UR50D”
- --temp
Choose sampling temperature.
Default: “1”
- --ctrl_tag
ZymCTRL: Choose an Enzyme Commission (EC) control tag for conditional protein generation based on the tag. You can find all ECs at https://www.brenda-enzymes.org/index.php
- --batch_size
Change batch-size number to modulate how many proteins are generated at a time. Default is 1
Default: 1
- --seed_seq
Sequence to seed generation, the default is M.
Default: “M”
- --max_length
Max length of proteins generated, default is 100
Default: 100
- --do_sample
ProtGPT2/ZymCTRL: Whether or not to use sampling for generation; use greedy decoding otherwise
Default: True
- --top_k
The number of highest probability vocabulary tokens to keep for top-k-filtering
Default: 950
- --repetition_penalty
ProtGPT2/ZymCTRL: The parameter for repetition penalty, the default is 1.2. 1.0 means no penalty
Default: 1.2
- --num_return_sequences
Number of sequences to generate. Default is 1
Default: 1
- --random_fill
ESM2_Gibbs: Randomly select positions to fill each iteration for Gibbs sampling with ESM2. If not called then fill the positions in order
Default: True
- --num_positions
ESM2_Gibbs: Generate new AAs for this many positions each iteration for Gibbs sampling with ESM2. If 0, then generate for all target positions each round.
Default: 0
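The --temp and --top_k options correspond to two common sampling controls. The Python sketch below illustrates the general technique on a toy four-token vocabulary: logits are divided by the temperature, only the k largest are kept, and the survivors are renormalized before sampling. The function name and the toy logits are hypothetical, and this is not a claim about how TRILL or the underlying models implement sampling.

```python
import math
import random


def top_k_probs(logits, k, temperature=1.0):
    """Softmax over the k largest temperature-scaled logits; other tokens get 0."""
    scaled = [x / temperature for x in logits]
    # Indices of the k highest-scoring tokens.
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(scaled))]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
probs = top_k_probs(logits, k=2, temperature=1.0)

rng = random.Random(123)  # seeded for reproducibility
token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print(probs, token)
```

Lower temperatures concentrate probability on the highest-scoring tokens, while higher temperatures flatten the distribution.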
inv_fold_gen
Generate proteins using inverse folding
trill inv_fold_gen [-h] [--temp TEMP]
[--num_return_sequences NUM_RETURN_SEQUENCES]
[--max_length MAX_LENGTH] [--top_p TOP_P]
[--repetition_penalty REPETITION_PENALTY] [--dont_sample]
[--lig_mpnn_model LIG_MPNN_MODEL]
[--lig_mpnn_noise LIG_MPNN_NOISE] [--omit_AAs OMIT_AAS]
[--fasta_seq_separation FASTA_SEQ_SEPARATION]
[--verbose VERBOSE] [--pdb_path_multi PDB_PATH_MULTI]
[--fixed_residues FIXED_RESIDUES]
[--fixed_residues_multi FIXED_RESIDUES_MULTI]
[--redesigned_residues REDESIGNED_RESIDUES]
[--redesigned_residues_multi REDESIGNED_RESIDUES_MULTI]
[--bias_AA BIAS_AA]
[--bias_AA_per_residue BIAS_AA_PER_RESIDUE]
[--bias_AA_per_residue_multi BIAS_AA_PER_RESIDUE_MULTI]
[--omit_AA_per_residue OMIT_AA_PER_RESIDUE]
[--omit_AA_per_residue_multi OMIT_AA_PER_RESIDUE_MULTI]
[--symmetry_residues SYMMETRY_RESIDUES]
[--symmetry_weights SYMMETRY_WEIGHTS]
[--homo_oligomer HOMO_OLIGOMER]
[--zero_indexed ZERO_INDEXED] [--batch_size BATCH_SIZE]
[--number_of_batches NUMBER_OF_BATCHES]
[--save_stats SAVE_STATS]
[--ligand_mpnn_use_atom_context LIGAND_MPNN_USE_ATOM_CONTEXT]
[--ligand_mpnn_cutoff_for_score LIGAND_MPNN_CUTOFF_FOR_SCORE]
[--ligand_mpnn_use_side_chain_context LIGAND_MPNN_USE_SIDE_CHAIN_CONTEXT]
[--chains_to_design CHAINS_TO_DESIGN]
[--parse_these_chains_only PARSE_THESE_CHAINS_ONLY]
[--transmembrane_buried TRANSMEMBRANE_BURIED]
[--transmembrane_interface TRANSMEMBRANE_INTERFACE]
[--global_transmembrane_label GLOBAL_TRANSMEMBRANE_LABEL]
[--parse_atoms_with_zero_occupancy PARSE_ATOMS_WITH_ZERO_OCCUPANCY]
[--pack_side_chains PACK_SIDE_CHAINS]
[--number_of_packs_per_design NUMBER_OF_PACKS_PER_DESIGN]
[--sc_num_denoising_steps SC_NUM_DENOISING_STEPS]
[--sc_num_samples SC_NUM_SAMPLES]
[--repack_everything REPACK_EVERYTHING]
[--force_hetatm FORCE_HETATM]
[--packed_suffix PACKED_SUFFIX]
[--pack_with_ligand_context PACK_WITH_LIGAND_CONTEXT]
{ESM-IF1,ProstT5,LigandMPNN} query
Positional Arguments
- model
Possible choices: ESM-IF1, ProstT5, LigandMPNN
Select which model to generate proteins using inverse folding.
- query
Input pdb file for inverse folding
Named Arguments
- --temp
Choose sampling temperature.
Default: 1
- --num_return_sequences
Choose number of proteins to generate.
Default: 1
- --max_length
Max length of proteins generated, default is 500 AAs
Default: 500
- --top_p
ProstT5: If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Default is 1
Default: 1
- --repetition_penalty
ProstT5: The parameter for repetition penalty. 1.0 means no penalty, the default is 1.2
Default: 1.2
- --dont_sample
ProstT5: By default, the model will sample to generate the protein. With this flag, you can enable greedy decoding, where the most probable tokens will be returned.
Default: True
- --lig_mpnn_model
LigandMPNN: ProteinMPNN, Soluble, Global_Membrane, Local_Membrane, Side-Chain_Packing
Default: “”
- --lig_mpnn_noise
LigandMPNN Noise levels: 002, 005, 010, 020, 030; 010 = .10A noise. Note that 002 is only available for Soluble and Side-Chain_packing models
Default: “010”
- --omit_AAs
LigandMPNN: Specify which amino acids should be omitted in the generated sequence, e.g. “AC” would omit alanine and cysteine.
Default: X
- --fasta_seq_separation
LigandMPNN: Symbol to use between sequences from different chains
Default: “:”
- --verbose
LigandMPNN: Print stuff
Default: 1
- --pdb_path_multi
LigandMPNN: Path to json listing PDB paths. {‘/path/to/pdb’: ‘’} - only keys will be used.
Default: “”
- --fixed_residues
LigandMPNN: Provide fixed residues, A12 A13 A14 B2 B25
Default: “”
- --fixed_residues_multi
LigandMPNN: Path to json mapping of fixed residues for each pdb i.e., {‘/path/to/pdb’: ‘A12 A13 A14 B2 B25’}
Default: “”
- --redesigned_residues
LigandMPNN: Provide to be redesigned residues, everything else will be fixed, A12 A13 A14 B2 B25
Default: “”
- --redesigned_residues_multi
LigandMPNN: Path to json mapping of redesigned residues for each pdb i.e., {‘/path/to/pdb’: ‘A12 A13 A14 B2 B25’}
Default: “”
- --bias_AA
LigandMPNN: Bias generation of amino acids, e.g. ‘A:-1.024,P:2.34,C:-12.34’
Default: “”
- --bias_AA_per_residue
LigandMPNN: Path to json mapping of bias {‘A12’: {‘G’: -0.3, ‘C’: -2.0, ‘H’: 0.8}, ‘A13’: {‘G’: -1.3}}
Default: “”
- --bias_AA_per_residue_multi
LigandMPNN: Path to json mapping of bias {‘pdb_path’: {‘A12’: {‘G’: -0.3, ‘C’: -2.0, ‘H’: 0.8}, ‘A13’: {‘G’: -1.3}}}
Default: “”
- --omit_AA_per_residue
LigandMPNN: Path to json mapping of amino acids to omit per residue {‘A12’: ‘APQ’, ‘A13’: ‘QST’}
Default: “”
- --omit_AA_per_residue_multi
LigandMPNN: Path to json mapping of amino acids to omit per residue for each pdb {‘pdb_path’: {‘A12’: ‘QSPC’, ‘A13’: ‘AGE’}}
Default: “”
- --symmetry_residues
LigandMPNN: Add list of lists for which residues need to be symmetric, e.g. ‘A12,A13,A14|C2,C3|A5,B6’
Default: “”
- --symmetry_weights
LigandMPNN: Add weights that match symmetry_residues, e.g. ‘1.01,1.0,1.0|-1.0,2.0|2.0,2.3’
Default: “”
- --homo_oligomer
LigandMPNN: Setting this to 1 will automatically set --symmetry_residues and --symmetry_weights to do homooligomer design with equal weighting.
Default: 0
- --zero_indexed
LigandMPNN: 1 - to start output PDB numbering with 0
Default: 0
- --batch_size
LigandMPNN: Number of sequences to generate per pass.
Default: 1
- --number_of_batches
LigandMPNN: Number of times to design sequence using a chosen batch size.
Default: 1
- --save_stats
LigandMPNN: Save output statistics
Default: 0
- --ligand_mpnn_use_atom_context
LigandMPNN: 1 - use atom context, 0 - do not use atom context.
Default: 1
- --ligand_mpnn_cutoff_for_score
LigandMPNN: Cutoff in angstroms between protein and context atoms to select residues for reporting score.
Default: 8.0
- --ligand_mpnn_use_side_chain_context
LigandMPNN: Flag to use side chain atoms as ligand context for the fixed residues
Default: 0
- --chains_to_design
LigandMPNN: Specify which chains to redesign, all others will be kept fixed.
- --parse_these_chains_only
LigandMPNN: Provide chains letters for parsing backbones, ‘ABCF’
Default: “”
- --transmembrane_buried
LigandMPNN: Provide buried residues when using checkpoint_per_residue_label_membrane_mpnn model, A12 A13 A14 B2 B25
Default: “”
- --transmembrane_interface
LigandMPNN: Provide interface residues when using checkpoint_per_residue_label_membrane_mpnn model, A12 A13 A14 B2 B25
Default: “”
- --global_transmembrane_label
LigandMPNN: Provide global label for global_label_membrane_mpnn model. 1 - transmembrane, 0 - soluble
Default: 0
- --parse_atoms_with_zero_occupancy
LigandMPNN: To parse atoms with zero occupancy in the PDB input files. 0 - do not parse, 1 - parse atoms with zero occupancy
Default: 0
- --pack_side_chains
LigandMPNN: 1 - to run side chain packer, 0 - do not run it
Default: 0
- --number_of_packs_per_design
LigandMPNN: Number of independent side chain packing samples to return per design
Default: 4
- --sc_num_denoising_steps
LigandMPNN: Number of denoising/recycling steps to make for side chain packing
Default: 3
- --sc_num_samples
LigandMPNN: Number of samples to draw from a mixture distribution and then take a sample with the highest likelihood.
Default: 16
- --repack_everything
LigandMPNN: 1 - repacks side chains of all residues including the fixed ones; 0 - keeps the side chains fixed for fixed residues
Default: 0
- --force_hetatm
LigandMPNN: To force ligand atoms to be written as HETATM to PDB file after packing.
Default: 0
- --packed_suffix
LigandMPNN: Suffix for packed PDB paths
Default: “_packed”
- --pack_with_ligand_context
LigandMPNN: 1 - pack side chains using ligand context, 0 - do not use it.
Default: 1
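Several LigandMPNN options above take a path to a JSON file rather than an inline value. The Python sketch below writes the per-residue bias mapping from the --bias_AA_per_residue description to disk and assembles a matching command line. The run name and complex.pdb are hypothetical, and the command is only printed, not executed.

```python
import json
import shlex
import tempfile
from pathlib import Path

# Mapping taken from the --bias_AA_per_residue description above.
bias_per_residue = {
    "A12": {"G": -0.3, "C": -2.0, "H": 0.8},
    "A13": {"G": -1.3},
}

json_path = Path(tempfile.mkdtemp()) / "bias_per_residue.json"
json_path.write_text(json.dumps(bias_per_residue, indent=2))

cmd = [
    "trill", "design_run", "1",                    # hypothetical run name, 1 GPU
    "inv_fold_gen", "LigandMPNN", "complex.pdb",   # hypothetical input structure
    "--bias_AA_per_residue", str(json_path),
    "--omit_AAs", "C",
]
print(shlex.join(cmd))
```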
classify
Classify proteins using either pretrained classifiers or train/test your own.
trill classify [-h] [--key KEY] [--save_emb]
[--emb_model {Ankh,Ankh-Large,CaLM,esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,RiNALMo,mRNA-FM,RNA-FM,SaProt}]
[--train_split TRAIN_SPLIT] [--preTrained PRETRAINED]
[--preComputed_Embs PRECOMPUTED_EMBS]
[--batch_size_emb BATCH_SIZE_EMB]
[--batch_size_mlp BATCH_SIZE_MLP] [--xg_gamma XG_GAMMA]
[--lr LR] [--max_depth MAX_DEPTH] [--num_leaves NUM_LEAVES]
[--bagging_freq BAGGING_FREQ] [--bagging_frac BAGGING_FRAC]
[--feature_frac FEATURE_FRAC] [--xg_reg_alpha XG_REG_ALPHA]
[--xg_reg_lambda XG_REG_LAMBDA]
[--if_contamination IF_CONTAMINATION]
[--n_estimators N_ESTIMATORS] [--sweep] [--sweep_cv SWEEP_CV]
[--sweep_iters SWEEP_ITERS]
[--f1_avg_method {macro,weighted,micro,None}] [--epochs EPOCHS]
[--hidden_layers HIDDEN_LAYERS] [--dropout DROPOUT] [--db DB]
{TemStaPro,EpHod,ECPICK,PSALM,MLP,XGBoost,LightGBM,iForest,ESM2+MLP,3Di-Search}
query
Positional Arguments
- classifier
Possible choices: TemStaPro, EpHod, ECPICK, PSALM, MLP, XGBoost, LightGBM, iForest, ESM2+MLP, 3Di-Search
Predict thermostability/optimal enzymatic pH using TemStaPro/EpHod or choose custom to train/use your own XGBoost, Multilayer perceptron, LightGBM or Isolation Forest classifier. ESM2+MLP allows you to train an ESM2 model with a classification head end-to-end.
- query
Fasta file of sequences to score
Named Arguments
- --key
Input a CSV, with your class mappings for your embeddings where the first column is the label and the second column is the class.
- --save_emb
Save csv of embeddings
Default: False
- --emb_model
Possible choices: Ankh, Ankh-Large, CaLM, esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, RiNALMo, mRNA-FM, RNA-FM, SaProt
Select desired protein language model for embedding your query proteins to then train your custom classifier. Default is esm2_t12_35M
Default: “esm2_t12_35M”
- --train_split
Choose your train-test percentage split for training and evaluating your custom classifier. For example, --train_split 0.6 would split your input sequences into two groups, one with 60% of the sequences for training and the other with 40% for evaluation.
- --preTrained
Enter the path to your pre-trained classifier that you’ve trained with TRILL. This will be a .json file.
- --preComputed_Embs
Enter the path to your pre-computed embeddings. Make sure they match the --emb_model you select.
Default: False
- --batch_size_emb
EpHod: Sets batch_size for embedding with ESM1v.
Default: 1
- --batch_size_mlp
MLP: Sets batch_size for training/evaluating
Default: 1
- --xg_gamma
XGBoost: sets gamma for XGBoost, which is a hyperparameter that sets ‘Minimum loss reduction required to make a further partition on a leaf node of the tree.’
Default: 0.4
- --lr
XGBoost/LightGBM/ESM2+MLP/MLP: Sets the learning rate. Default is 0.0001 for ESM2+MLP/MLP, 0.2 for XGBoost and LightGBM
Default: 0.2
- --max_depth
XGBoost/LightGBM: Sets the maximum tree depth
Default: 8
- --num_leaves
LightGBM: Sets the max number of leaves in one tree. Default is 31
Default: 31
- --bagging_freq
LightGBM: Int that enables bagging, i.e. random sampling of the training data. For example, if it is set to 3, LightGBM will randomly sample the --bagging_frac fraction of the data every 3rd iteration. Default is 0
Default: 0
- --bagging_frac
LightGBM: Sets fraction of training data to be used when bagging. Must be 0 < --bagging_frac <= 1. Default is 1
Default: 1
- --feature_frac
LightGBM: Sets fraction of training features to be randomly sampled for use in training. Must be 0 < --feature_frac <= 1. Default is 1
Default: 1
- --xg_reg_alpha
XGBoost: L1 regularization term on weights
Default: 0.8
- --xg_reg_lambda
XGBoost: L2 regularization term on weights
Default: 0.1
- --if_contamination
iForest: The proportion of outliers in the data. Default is automatically determined, but you can set it to a value in the range (0, 0.5]
Default: “auto”
- --n_estimators
XGBoost/LightGBM: Number of boosting rounds
Default: 115
- --sweep
XGBoost/LightGBM: Use this flag to perform cross-validated bayesian optimization over the hyperparameter space.
Default: False
- --sweep_cv
XGBoost/LightGBM: Change the number of folds used for cross-validation.
Default: 3
- --sweep_iters
XGBoost/LightGBM: Change the number of optimization iterations. Default is 10.
Default: 10
- --f1_avg_method
Possible choices: macro, weighted, micro, None
XGBoost/LightGBM: Change the averaging method used when calculating F1. Default is no averaging.
- --epochs
ESM2+MLP/MLP: Set number of epochs to train ESM2+MLP classifier.
Default: 3
- --hidden_layers
MLP: Set number of hidden layers. Default is [128,64,32]
Default: [128, 64, 32]
- --dropout
MLP: Set dropout rate. Default is 0.3
Default: 0.3
- --db
3Di-Search: Specify the path of the fasta file for your database that you want to query against.
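The --f1_avg_method choices differ in how per-class F1 scores are combined. The Python sketch below computes the standard definitions of each on a toy three-class example. The function name and the labels are hypothetical, and it is an assumption that TRILL follows these conventional definitions.

```python
def f1_report(y_true, y_pred):
    """Per-class F1 plus macro, weighted and micro averages."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class, support = {}, {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
        support[label] = tp + fn  # number of true examples of this class
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    return {
        "None": per_class,  # no averaging: one score per class
        "macro": sum(per_class.values()) / len(labels),  # unweighted mean
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(pairs),
        "micro": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),  # pooled counts
    }


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
report = f1_report(y_true, y_pred)
print(report)
```

Macro averaging treats every class equally, which makes it more sensitive to rare classes than the weighted or micro variants.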
diff_gen
Generate proteins using RFDiffusion
trill diff_gen [-h] [--contigs CONTIGS]
[--RFDiffusion_Override RFDIFFUSION_OVERRIDE]
[--num_return_sequences NUM_RETURN_SEQUENCES]
[--Inpaint INPAINT] [--query QUERY] [--partial_T PARTIAL_T]
[--partial_diff_fix PARTIAL_DIFF_FIX] [--hotspots HOTSPOTS]
Named Arguments
- --contigs
Generate proteins between these sizes in AAs for RFDiffusion. For example, --contigs 100-200 will result in proteins in this range
- --RFDiffusion_Override
Change RFDiffusion model. For example, --RFDiffusion_Override ActiveSite will use ActiveSite_ckpt.pt for holding small motifs in place.
Default: False
- --num_return_sequences
Number of sequences for RFDiffusion to generate. Default is 5
Default: 5
- --Inpaint
Residues to inpaint.
- --query
Input pdb file for motif scaffolding, partial diffusion etc.
- --partial_T
Adjust partial diffusion sampling value.
- --partial_diff_fix
Pass the residues that you want to keep fixed for your input pdb during partial diffusion. Note that the residues should be 0-indexed.
- --hotspots
Define residues that the binder must interact with. For example, --hotspots A30,A33,A34, where A is the chain and the numbers are the residue indices.
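The hotspot format combines a chain letter with a residue index. The Python sketch below splits such a string into (chain, residue) pairs, which is useful for checking a value before passing it to --hotspots. The helper parse_hotspots is hypothetical and is not part of TRILL, and the single-letter chain restriction is an assumption.

```python
import re


def parse_hotspots(spec: str):
    """Split a string like 'A30,A33,A34' into (chain, residue_index) pairs."""
    hotspots = []
    for token in spec.split(","):
        # Assumption: one chain letter followed by an integer residue index.
        match = re.fullmatch(r"([A-Za-z])(\d+)", token.strip())
        if match is None:
            raise ValueError(f"Unrecognized hotspot: {token!r}")
        hotspots.append((match.group(1), int(match.group(2))))
    return hotspots


parsed = parse_hotspots("A30,A33,A34")
print(parsed)
```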
embed
Embed sequences of interest
trill embed [-h] [--batch_size BATCH_SIZE] [--finetuned FINETUNED] [--per_AA]
[--avg]
{Ankh,Ankh-Large,CaLM,esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,RiNALMo,mRNA-FM,RNA-FM,SaProt}
query
Positional Arguments
- model
Possible choices: Ankh, Ankh-Large, CaLM, esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, RiNALMo, mRNA-FM, RNA-FM, SaProt
Choose language model to embed query sequences. Note that SaProt needs protein structures as input. For RiNALMo, RNA-FM and mRNA-FM the input is RNA (sequence lengths must be multiples of 3 for mRNA-FM), while CaLM takes DNA sequences as input.
- query
Input protein fasta file. For SaProt only, you can provide a directory where every .pdb file will be embedded or a .txt file where each line is an absolute path to a pdb file.
Named Arguments
- --batch_size
Change batch-size number for embedding proteins. Default is 1, but with more RAM, you can do more
Default: 1
- --finetuned
Input path to your own finetuned ESM model
Default: False
- --per_AA
Add this flag to return the per amino acid / nucleic acid representations.
Default: False
- --avg
Add this flag to return the average, whole sequence representation.
Default: False
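The two output flags differ in shape: --per_AA keeps one vector per residue, while --avg collapses them into a single vector per sequence. The Python sketch below shows that relationship on a toy 4-residue, 3-dimensional embedding. It assumes the average is a plain mean over positions; the numbers are made up, and how TRILL treats special tokens is not covered.

```python
# Toy per-residue embedding: 4 positions x 3 dimensions (made-up values).
per_aa = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
    [7.0, 8.0, 9.0],
]

# Mean over positions gives one fixed-length vector for the whole sequence.
avg = [sum(column) / len(per_aa) for column in zip(*per_aa)]

print(f"per-residue shape: {len(per_aa)} x {len(per_aa[0])}")
print(f"averaged vector: {avg}")
```

The averaged vector has the same length regardless of sequence length, which is why it suits downstream classifiers and regressors.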
score
Use ESM-1v or ESM2 to score protein sequences or ProteinMPNN to score protein structures
trill score [-h] [--mpnn_model MPNN_MODEL] [--lig_mpnn_noise LIG_MPNN_NOISE]
[--global_transmembrane_label GLOBAL_TRANSMEMBRANE_LABEL]
[--transmembrane_buried [TRANSMEMBRANE_BURIED [TRANSMEMBRANE_BURIED ...]]]
[--transmembrane_interface [TRANSMEMBRANE_INTERFACE [TRANSMEMBRANE_INTERFACE ...]]]
[--batch_transmembrane_csv BATCH_TRANSMEMBRANE_CSV]
[--ligand_mpnn_cutoff_for_score LIGAND_MPNN_CUTOFF_FOR_SCORE]
{ESM2_150M,ESM1v,ESM2_650M,ProteinMPNN} query
Positional Arguments
- scorer
Possible choices: ESM2_150M, ESM1v, ESM2_650M, ProteinMPNN
Score protein sequences with ESM-1v, ESM2-650M or protein structures with ProteinMPNN
- query
Path to protein PDB file to score. Can also provide a .txt file with absolute paths to multiple PDBs
Named Arguments
- --mpnn_model
ProteinMPNN: ProteinMPNN, LigandMPNN, Local_Membrane, Global_Membrane and Soluble. Default is ProteinMPNN
Default: “ProteinMPNN”
- --lig_mpnn_noise
ProteinMPNN Noise levels: 002, 005, 010, 020, 030; 010 = .10A noise. Note that 002 is only available for Soluble and Side-Chain_packing models
Default: “010”
- --global_transmembrane_label
Provide global label for global_label_membrane_mpnn model. 1 - transmembrane, 0 - soluble
Default: 0
- --transmembrane_buried
ProteinMPNN: Provide buried residues when using checkpoint_per_residue_label_membrane_mpnn model, A12 A13 A14 B2 B25. If inputting a .txt file with absolute paths to .pdb’s, make sure that all of the proteins have the same residue labels, else you can provide a .csv file here where the first column is ‘Label’ and the second is ‘Residues’.
Default: “”
- --transmembrane_interface
ProteinMPNN: Provide interface residues when using checkpoint_per_residue_label_membrane_mpnn model, A12 A13 A14 B2 B25. If inputting a .txt file with absolute paths to .pdb’s, make sure that all of the proteins have the same residue labels, else you can provide a .csv file here where the first column is ‘Label’ and the second is ‘Residues’.
Default: “”
- --batch_transmembrane_csv
ProteinMPNN: You can provide a .csv file to specify multiple transmembrane buried/interface residues. The first column should be called ‘Label’, the second ‘transmembrane_buried’ and the third ‘transmembrane_interface’.
Default: “”
- --ligand_mpnn_cutoff_for_score
ProteinMPNN: Cutoff in angstroms between protein and context atoms to select residues for reporting score.
Default: 8.0
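The --batch_transmembrane_csv layout described above can be produced with a few lines of Python. The sketch below writes a file with the three named columns. The label values and residue selections are hypothetical, and it is an assumption that labels should correspond to the names of the PDB files being scored.

```python
import csv
import tempfile
from pathlib import Path

# Column names come from the --batch_transmembrane_csv description.
fieldnames = ["Label", "transmembrane_buried", "transmembrane_interface"]

# Hypothetical proteins with space-separated residue selections.
rows = [
    {"Label": "protein_1", "transmembrane_buried": "A12 A13 A14",
     "transmembrane_interface": "B2 B25"},
    {"Label": "protein_2", "transmembrane_buried": "A40 A41",
     "transmembrane_interface": "A55"},
]

csv_path = Path(tempfile.mkdtemp()) / "transmembrane.csv"
with csv_path.open("w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

print(csv_path.read_text())
```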