usage: trill [-h] [--nodes NODES] [--logger LOGGER] [--profiler]
             [--RNG_seed RNG_SEED] [--outdir OUTDIR] [--n_workers N_WORKERS]
             name GPUs
             {visualize,regress,utils,simulate,finetune,fold,dock,lang_gen,inv_fold_gen,classify,diff_gen,embed,score}
             ...

Positional Arguments

name

Name of run

GPUs

Input total number of GPUs per node

Default: 1

command

Possible choices: visualize, regress, utils, simulate, finetune, fold, dock, lang_gen, inv_fold_gen, classify, diff_gen, embed, score
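
The synopsis above implies an ordering: global options, then name, GPUs, the sub-command, and finally the sub-command's own arguments. The sketch below assembles such a command line in Python. The helper function is not part of TRILL, and the run name, output directory and fasta file name are hypothetical placeholders. Placing the global options before the positional arguments is an assumption based on the usage line.

```python
import shlex

def build_trill_command(name, gpus, command, command_args, global_opts=()):
    """Assemble a trill argv list in the order shown in the usage synopsis."""
    # Global options first (assumption), then run name, GPU count,
    # sub-command, and the sub-command's own arguments.
    return ["trill", *global_opts, name, str(gpus), command, *command_args]

argv = build_trill_command(
    name="my_run",                                          # hypothetical run name
    gpus=1,
    command="embed",
    command_args=["esm2_t12_35M", "query.fasta", "--avg"],  # hypothetical input file
    global_opts=["--outdir", "results"],                    # hypothetical directory
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting problems with paths that contain spaces.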

Named Arguments

--nodes

Input total number of nodes. Default is 1

Default: 1

--logger

Enable the TensorBoard logger. Disabled by default.

Default: False

--profiler

Utilize PyTorchProfiler

Default: False

--RNG_seed

Input RNG seed. Default is 123

Default: 123

--outdir

Input full path to directory where you want the output from TRILL

Default: “.”

--n_workers

Change number of CPU cores/’workers’ TRILL uses

Default: 1

Sub-commands

visualize

Reduce dimensionality of embeddings to 2D

trill visualize [-h] [--method {PCA,UMAP,tSNE}] [--key KEY] embeddings

Positional Arguments

embeddings

Embeddings to be visualized

Named Arguments

--method

Possible choices: PCA, UMAP, tSNE

Method for reducing dimensions of embeddings. Default is PCA

Default: “PCA”

--key

Input a CSV with the group mappings for your embeddings, where the first column is the label and the second column is the group used for coloring.

Default: False
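
A two-column key file like the one described for --key can be written with Python's csv module. This is a minimal sketch: the labels, the groups and the output location are hypothetical. Whether TRILL expects a header row, and what the header names should be, is an assumption to verify against your TRILL version.

```python
import csv
import tempfile
from pathlib import Path

def write_group_key(path, label_to_group, header=("Label", "Group")):
    """Write a two-column CSV: embedding label first, group second."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)  # header names are an assumption, not a documented requirement
        writer.writerows(label_to_group.items())

# Hypothetical labels and groups, written to a temporary location.
key_path = Path(tempfile.gettempdir()) / "groups.csv"
write_group_key(key_path, {"seq1": "kinase", "seq2": "kinase", "seq3": "protease"})
print(key_path.read_text())
```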

regress

Train your own regressors on input protein sequences and a numeric score for each sequence.

trill regress [-h] [--key KEY] [--save_emb]
              [--emb_model {Ankh,Ankh-Large,CaLM,esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,RiNALMo,mRNA-FM,RNA-FM,SaProt}]
              [--train_split TRAIN_SPLIT] [--preTrained PRETRAINED]
              [--preComputed_Embs PRECOMPUTED_EMBS] [--batch_size BATCH_SIZE]
              [--lr LR] [--max_depth MAX_DEPTH] [--num_leaves NUM_LEAVES]
              [--n_estimators N_ESTIMATORS] [--sweep] [--sweep_cv SWEEP_CV]
              [--sweep_iters SWEEP_ITERS]
              {Linear,LightGBM} query

Positional Arguments

regressor

Possible choices: Linear, LightGBM

Train a custom regression model

query

Fasta file of protein sequences

Named Arguments

--key

Input a CSV with the mappings for your embeddings, where the first column is the label and the second column is the value.

--save_emb

Save csv of embeddings

Default: False

--emb_model

Possible choices: Ankh, Ankh-Large, CaLM, esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, RiNALMo, mRNA-FM, RNA-FM, SaProt

Select desired protein language model for embedding your query proteins to then train your custom regressor. Default is esm2_t12_35M

Default: “esm2_t12_35M”

--train_split

Choose your train-test split for training and evaluating your custom regressor. For example, --train_split 0.6 would split your input sequences into two groups, one with 60% of the sequences for training and the other with 40% for evaluation
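
The arithmetic behind a fractional split can be sketched as follows. Rounding to the nearest integer is an assumption; the splitting routine TRILL actually uses may round differently, so treat the counts as approximate.

```python
def split_counts(n_sequences, train_split):
    """Return (n_train, n_test) for a fractional train split."""
    if not 0 < train_split < 1:
        raise ValueError("train_split must be strictly between 0 and 1")
    # Nearest-integer rounding is an assumption made for this illustration.
    n_train = round(n_sequences * train_split)
    return n_train, n_sequences - n_train

# 1000 input sequences with a split of 0.6
print(split_counts(1000, 0.6))
```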

--preTrained

Enter the path to your pre-trained XGBoost binary classifier that you’ve trained with TRILL. This will be a .json file.

--preComputed_Embs

Enter the path to your pre-computed embeddings. Make sure they match the --emb_model you select.

Default: False

--batch_size

Sets batch_size for embedding.

Default: 1

--lr

LightGBM: Sets the learning rate. Default is 0.2

Default: 0.2

--max_depth

LightGBM: Sets the maximum tree depth. Default is -1, no max tree depth.

Default: -1

--num_leaves

LightGBM: Sets the max number of leaves in one tree. Default is 31

Default: 31

--n_estimators

LightGBM: Number of boosting rounds

Default: 115

--sweep

LightGBM: Use this flag to perform cross-validated Bayesian optimization over the hyperparameter space.

Default: False

--sweep_cv

LightGBM: Change the number of folds used for cross-validation.

Default: 3

--sweep_iters

LightGBM: Change the number of optimization iterations. Default is 10.

Default: 10

utils

Misc utilities

trill utils [-h] [--dir DIR] [--fasta_paths_txt FASTA_PATHS_TXT]
            [--uniprotDB {UniProtKB,A.thaliana,C.elegans,E.coli,H.sapiens,M.musculus,R.norvegicus,SARS-CoV-2}]
            [--rep {per_AA,avg}]
            {prepare_class_key,fetch_embeddings}

Positional Arguments

tool

Possible choices: prepare_class_key, fetch_embeddings

prepare_class_key: Prepare a csv for use with the classify command. Takes a directory or a text file listing the paths of fasta files. Each file will be a unique class, so if your directory contains 5 fasta files, there will be 5 classes in the output key csv.

Named Arguments

--dir

Directory to be used for creating a class key csv for classification.

--fasta_paths_txt

Text file with absolute paths of fasta files to be used for creating the class key. Each unique path will be treated as a unique class, and all the sequences in that file will be in the same class.
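
A text file of absolute fasta paths, one per line, can be generated from a directory. This is a sketch under assumptions: the *.fasta pattern and the file names are hypothetical, and the helper is not part of TRILL.

```python
import tempfile
from pathlib import Path

def write_fasta_paths(fasta_dir, out_txt, pattern="*.fasta"):
    """Write the absolute path of every matching file in fasta_dir, one per line."""
    paths = sorted(p.resolve() for p in Path(fasta_dir).glob(pattern))
    Path(out_txt).write_text("".join(f"{p}\n" for p in paths))
    return paths

# Demonstration on a throwaway directory with two hypothetical class files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("class_a.fasta", "class_b.fasta"):
        (Path(tmp) / name).write_text(">seq\nMKV\n")
    listed = write_fasta_paths(tmp, Path(tmp) / "fasta_paths.txt")
    n_lines = len((Path(tmp) / "fasta_paths.txt").read_text().splitlines())

print(n_lines, "paths written, so", n_lines, "classes")
```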

--uniprotDB

Possible choices: UniProtKB, A.thaliana, C.elegans, E.coli, H.sapiens, M.musculus, R.norvegicus, SARS-CoV-2

UniProt embedding dataset to download.

--rep

Possible choices: per_AA, avg

The representation to download.

simulate

Use OpenMM to perform molecular dynamics

trill simulate [-h] [--ligand LIGAND]
               [--constraints {None,HBonds,AllBonds,HAngles}] [--rigidWater]
               [--forcefield FORCEFIELD]
               [--solvent {implicit/hct.xml,amber14/tip3p.xml,amber14/tip3pfb.xml}]
               [--step_size STEP_SIZE] [--num_steps NUM_STEPS]
               [--reporting_interval REPORTING_INTERVAL]
               [--output_traj_dcd OUTPUT_TRAJ_DCD]
               [--apply-harmonic-force APPLY_HARMONIC_FORCE]
               [--force-constant FORCE_CONSTANT] [--z0 Z0]
               [--molecule-atom-indices MOLECULE_ATOM_INDICES]
               [--equilibration_steps EQUILIBRATION_STEPS]
               [--periodic_box PERIODIC_BOX]
               [--nonbonded_method {NoCutoff,CutoffNonPeriodic,CutoffPeriodic,Ewald,PME,LJPME}]
               [--just_relax] [--reporter_interval REPORTER_INTERVAL]
               receptor

Positional Arguments

receptor

Receptor of interest to be simulated. Must be either a pdb file or a .txt file with the absolute path of each pdb, one per line.

Named Arguments

--ligand

Ligand of interest to be simulated with input receptor

--constraints

Possible choices: None, HBonds, AllBonds, HAngles

Specifies which bonds and angles should be implemented with constraints. Allowed values are None, HBonds, AllBonds, or HAngles.

Default: “None”

--rigidWater

If true, water molecules will be fully rigid regardless of the value passed for the constraints argument.

--forcefield

Force field to use. Default is amber14-all.xml

Default: “amber14-all.xml”

--solvent

Possible choices: implicit/hct.xml, amber14/tip3p.xml, amber14/tip3pfb.xml

Solvent model to use. Options are ‘implicit/hct.xml’, ‘amber14/tip3p.xml’, or ‘amber14/tip3pfb.xml’. The default is ‘implicit/hct.xml’.

Default: “implicit/hct.xml”

--step_size

Step size in femtoseconds. Default is 2

Default: 2

--num_steps

Number of simulation steps

Default: 5000

--reporting_interval

Reporting interval for simulation

Default: 1000

--output_traj_dcd

Output trajectory DCD file

Default: “trajectory.dcd”

--apply-harmonic-force

Whether to apply a harmonic force to pull the molecule.

Default: False

--force-constant

Force constant for the harmonic force in kJ/mol/nm^2.

--z0

The z-coordinate to pull towards in nm.

--molecule-atom-indices

Comma-separated list of atom indices to which the harmonic force will be applied.

Default: “0,1,2”

--equilibration_steps

Steps you want to take for NVT and NPT equilibration. Each step is 0.002 picoseconds

Default: 300
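
To relate step counts to simulated time, multiply the number of steps by the step size. The sketch below uses the defaults listed above (2 fs steps, 5000 simulation steps, 300 equilibration steps at 0.002 ps each). Whether the equilibration count applies to NVT and NPT separately or in total is not stated here, so the figure is per run of that many steps.

```python
def simulated_time_ps(num_steps, step_size_fs=2.0):
    """Convert a step count and a step size in femtoseconds to picoseconds."""
    return num_steps * step_size_fs / 1000.0  # 1 ps = 1000 fs

production_ps = simulated_time_ps(5000)     # default --num_steps with default --step_size
equilibration_ps = simulated_time_ps(300)   # default --equilibration_steps at 0.002 ps per step
print(f"production: {production_ps} ps, equilibration: {equilibration_ps} ps")
```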

--periodic_box

Give, in nm, the length of one dimension used to build the periodic box.

Default: 10

--nonbonded_method

Possible choices: NoCutoff, CutoffNonPeriodic, CutoffPeriodic, Ewald, PME, LJPME

Specify the method for handling nonbonded interactions. Find more info in 3.6.5 of the OpenMM user guide.

Default: “CutoffPeriodic”

--just_relax

Just relaxes the input structure(s) and outputs the fixed and relaxed structure(s). The forcefield that is used is amber14.

Default: False

--reporter_interval

Set interval to save PDB and energy snapshot. Note that the lower the interval, the more snapshots are saved, so the bigger the output files will be and the slower the simulation. Default is 1000

Default: 1000

finetune

Finetune protein language models

trill finetune [-h] [--epochs EPOCHS] [--save_on_epoch] [--lr LR]
               [--batch_size BATCH_SIZE] [--mask_fraction MASK_FRACTION]
               [--pre_masked_fasta PRE_MASKED_FASTA] [--strategy STRATEGY]
               [--ctrl_tag CTRL_TAG] [--finetuned FINETUNED] [--eval EVAL]
               [--grad_accum_steps GRAD_ACCUM_STEPS]
               [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt}]
               [--warmup_steps WARMUP_STEPS]
               {esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtGPT2,progen2-small,progen2-medium,progen2-large,progen2-oas,progen2-BFD90,progen2-xlarge,ZymCTRL}
               query

Positional Arguments

model

Possible choices: esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtGPT2, progen2-small, progen2-medium, progen2-large, progen2-oas, progen2-BFD90, progen2-xlarge, ZymCTRL

Choose the protein language model to finetune. Note that ESM2 is trained with the MLM objective, while ProtGPT2/ZymCTRL/ProGen2 are trained with the CLM objective. ZymCTRL must be finetuned with --ctrl_tag specifying an Enzyme Commission (EC) number.

query

Input fasta file. For ProGen2, you can provide a .csv file where the first column contains absolute paths to fasta files and the second column contains the control tag for the fasta file on the same row.

Named Arguments

--epochs

Number of epochs for fine-tuning. Default is 10

Default: 10

--save_on_epoch

Saves a checkpoint after every completed epoch. WARNING: this could lead to rapid storage consumption

Default: False

--lr

Learning rate for optimizer. Default is 0.0001

Default: 0.0001

--batch_size

Change batch-size number for fine-tuning. Default is 1

Default: 1

--mask_fraction

ESM: Change fraction of amino acids masked for MLM training. Default is 0.15

Default: 0.15
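
To illustrate what a mask fraction means, the sketch below replaces a random 15% of residues with a mask token. It is an illustration only, not TRILL's masking code. The `<mask>` token string, the rounding and the example sequence are assumptions, and real MLM training schemes often also leave some selected positions unchanged or substitute random residues.

```python
import random

def mask_sequence(seq, mask_fraction=0.15, seed=123, mask_token="<mask>"):
    """Replace a random fraction of residues with a mask token."""
    rng = random.Random(seed)
    n_mask = round(len(seq) * mask_fraction)  # rounding rule is an assumption
    positions = set(rng.sample(range(len(seq)), n_mask))
    return "".join(mask_token if i in positions else aa for i, aa in enumerate(seq))

# Hypothetical 20-residue sequence: 15% of 20 gives 3 masked positions.
masked = mask_sequence("MKTAYIAKQRQISFVKSHFS")
print(masked)
```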

--pre_masked_fasta

ESM: Use this flag to specify that your input fasta will be pre-masked and does not need masking performed by TRILL. The sequences will still be randomly shuffled.

Default: False

--strategy

Change training strategy. Default is None. List of strategies can be found at https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html. For ProGen2 only, you can select one of deepspeed_stage_1, deepspeed_stage_2, deepspeed_stage_2_offload, deepspeed_stage_3 or deepspeed_stage_3_offload.

--ctrl_tag

ZymCTRL: Choose an Enzyme Commission (EC) control tag for finetuning ZymCTRL. Note that the tag must match all of the enzymes in the query fasta file. You can find all ECs here: https://www.brenda-enzymes.org/index.php. You can also provide a control tag for ProGen2, which can be any arbitrary string specifying a ‘class’ of proteins.

--finetuned

Input path to your previously finetuned model to continue finetuning

Default: False

--eval

ProGen2: You can choose to withhold a random proportion of the input data for evaluation to check for overfitting. Input a float, like 0.25, which would hold out 25% of the data from finetuning for evaluation after every epoch.

Default: 0

--grad_accum_steps

ProGen2: Change the number of steps over which gradients are accumulated before each optimizer update; this can help with GPU vRAM usage.

Default: 1
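
Gradient accumulation is usually reasoned about through the effective batch size, that is, how many samples contribute to each optimizer update. A small sketch of that arithmetic follows. Multiplying by GPUs and nodes assumes plain data-parallel training, which may not hold for every --strategy.

```python
def effective_batch_size(batch_size, grad_accum_steps=1, gpus=1, nodes=1):
    """Samples contributing to one optimizer update under data-parallel training."""
    return batch_size * grad_accum_steps * gpus * nodes

# A per-device batch of 2 accumulated over 8 steps on a single GPU
print(effective_batch_size(2, grad_accum_steps=8))
```

Raising --grad_accum_steps while lowering --batch_size keeps the effective batch size constant and reduces peak memory per step.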

--scheduler

Possible choices: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, inverse_sqrt

ProGen2: Choose the learning rate scheduler to use during training, default is constant.

Default: “constant”

--warmup_steps

ProGen2: Number of steps for a warmup ramping up to the set learning rate, default is 0.

Default: 0

fold

Predict 3D protein structures using ESMFold or obtain 3Di structure for use with Foldseek to perform remote homology detection

trill fold [-h] [--strategy STRATEGY] [--batch_size BATCH_SIZE]
           {ESMFold,ProstT5} query

Positional Arguments

model

Possible choices: ESMFold, ProstT5

Choose your desired model.

query

Input fasta file

Named Arguments

--strategy

ESMFold: Choose a specific strategy if you are running out of CUDA memory. You can also pass either 64 or 32 for model.trunk.set_chunk_size(x)

--batch_size

ESMFold: Change batch-size number for folding proteins. Default is 1

Default: 1

dock

Perform molecular docking with proteins and ligands. Note that you should relax your protein receptor with Simulate or another method before docking.

trill dock [-h] [--save_visualisation]
           [--samples_per_complex SAMPLES_PER_COMPLEX] [--no_final_step_noise]
           [--inference_steps INFERENCE_STEPS] [--actual_steps ACTUAL_STEPS]
           [--min_radius MIN_RADIUS] [--max_radius MAX_RADIUS]
           [--min_alpha_spheres MIN_ALPHA_SPHERES]
           [--exhaustiveness EXHAUSTIVENESS] [--blind] [--anm]
           [--swarms SWARMS] [--glowworms GLOWWORMS] [--sim_steps SIM_STEPS]
           [--restraints RESTRAINTS]
           {DiffDock,DiffDock-L,Vina,Smina,LightDock,GeoDock} protein
           [ligand [ligand ...]]

Positional Arguments

algorithm

Possible choices: DiffDock, DiffDock-L, Vina, Smina, LightDock, GeoDock

LightDock and GeoDock currently only support protein-protein docking. Vina, Smina and DiffDock allow for docking small molecules to proteins.

protein

Protein of interest to be docked with ligand

ligand

Ligand to dock protein with. Note that with AutoDock Vina, you can dock multiple ligands at one time. Simply provide them one after another before any other optional TRILL arguments are added. Also, if a .txt file is provided with each line giving the absolute path to a different ligand, TRILL will dock each ligand one at a time.

Named Arguments

--save_visualisation

DiffDock: Save a pdb file with all of the steps of the reverse diffusion.

Default: False

--samples_per_complex

DiffDock: Number of samples to generate.

Default: 10

--no_final_step_noise

DiffDock: Use no noise in the final step of the reverse diffusion

Default: True

--inference_steps

DiffDock: Number of denoising steps

Default: 20

--actual_steps

DiffDock: Number of denoising steps that are actually performed

Default: 0

--min_radius

Smina/Vina + Fpocket: Minimum radius of alpha spheres in a pocket. Default is 3Å.

Default: 3.0

--max_radius

Smina/Vina + Fpocket: Maximum radius of alpha spheres in a pocket. Default is 6Å.

Default: 6.0

--min_alpha_spheres

Smina/Vina + Fpocket: Minimum number of alpha spheres a pocket must contain to be considered. Default is 35.

Default: 35

--exhaustiveness

Smina/Vina: Change computational effort.

Default: 8

--blind

Smina/Vina: Perform blind docking and skip binding pocket prediction with fpocket

Default: False

--anm

LightDock: If selected, backbone flexibility is modeled using Anisotropic Network Model (via ProDy)

Default: False

--swarms

LightDock: The number of swarms of the simulations, default is 25

Default: 25

--glowworms

LightDock: The number of glowworms per swarm, default is 200

Default: 200

--sim_steps

LightDock: The number of steps of the simulation. Default is 100

Default: 100

--restraints

LightDock: If restraints_file is provided, residue restraints will be considered during the setup and the simulation

lang_gen

Generate proteins using large language models

trill lang_gen [-h] [--finetuned FINETUNED] [--esm2_arch ESM2_ARCH]
               [--temp TEMP] [--ctrl_tag CTRL_TAG] [--batch_size BATCH_SIZE]
               [--seed_seq SEED_SEQ] [--max_length MAX_LENGTH]
               [--do_sample DO_SAMPLE] [--top_k TOP_K]
               [--repetition_penalty REPETITION_PENALTY]
               [--num_return_sequences NUM_RETURN_SEQUENCES] [--random_fill]
               [--num_positions NUM_POSITIONS]
               {ESM2,ProtGPT2,progen2-small,progen2-medium,progen2-large,progen2-oas,progen2-BFD90,progen2-xlarge,ZymCTRL}

Positional Arguments

model

Possible choices: ESM2, ProtGPT2, progen2-small, progen2-medium, progen2-large, progen2-oas, progen2-BFD90, progen2-xlarge, ZymCTRL

Choose desired language model

Named Arguments

--finetuned

Input path to your own finetuned model

Default: False

--esm2_arch

ESM2_Gibbs: Choose which ESM2 architecture your finetuned model is

Default: “esm2_t12_35M_UR50D”

--temp

Choose sampling temperature.

Default: “1”

--ctrl_tag

ZymCTRL: Choose an Enzyme Commission (EC) control tag for conditional protein generation based on the tag. You can find all ECs here: https://www.brenda-enzymes.org/index.php

--batch_size

Change batch-size number to modulate how many proteins are generated at a time. Default is 1

Default: 1

--seed_seq

Sequence to seed generation, the default is M.

Default: “M”

--max_length

Max length of proteins generated, default is 100

Default: 100

--do_sample

ProtGPT2/ZymCTRL: Whether or not to use sampling for generation; use greedy decoding otherwise

Default: True

--top_k

The number of highest probability vocabulary tokens to keep for top-k-filtering

Default: 950
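
Top-k filtering keeps only the k most probable tokens and renormalises before sampling. The sketch below works on a small, made-up probability table to show the idea. Real generation code typically applies the same filter to logits over the full vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

# Hypothetical next-token distribution over four amino acids.
probs = {"A": 0.5, "L": 0.3, "G": 0.15, "W": 0.05}
filtered = top_k_filter(probs, 2)
print(filtered)  # only "A" and "L" remain, rescaled to sum to 1
```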

--repetition_penalty

ProtGPT2/ZymCTRL: The parameter for repetition penalty, the default is 1.2. 1.0 means no penalty

Default: 1.2

--num_return_sequences

Number of sequences to generate. Default is 1

Default: 1

--random_fill

ESM2_Gibbs: Randomly select positions to fill each iteration for Gibbs sampling with ESM2. If not called then fill the positions in order

Default: True

--num_positions

ESM2_Gibbs: Generate new AAs for this many positions each iteration for Gibbs sampling with ESM2. If 0, then generate for all target positions each round.

Default: 0
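
The interaction between --random_fill and --num_positions can be pictured as a choice of which positions to resample on each iteration. The function below is a hypothetical illustration of that description, not TRILL's implementation. In particular, the wrap-around behaviour of the in-order mode is an assumption.

```python
import random

def choose_positions(targets, num_positions=0, random_fill=True, iteration=0, rng=random):
    """Pick which target positions to resample in one Gibbs iteration."""
    targets = list(targets)
    if num_positions == 0 or num_positions >= len(targets):
        return targets  # 0 means: resample every target position each round
    if random_fill:
        return sorted(rng.sample(targets, num_positions))
    # In-order fill: step through the targets in consecutive windows (assumed to wrap).
    start = (iteration * num_positions) % len(targets)
    return [targets[(start + i) % len(targets)] for i in range(num_positions)]

rng = random.Random(123)
for iteration in range(3):
    print(choose_positions(range(10), 3, random_fill=False, iteration=iteration))
print(choose_positions(range(10), 3, random_fill=True, rng=rng))
```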

inv_fold_gen

Generate proteins using inverse folding

trill inv_fold_gen [-h] [--temp TEMP]
                   [--num_return_sequences NUM_RETURN_SEQUENCES]
                   [--max_length MAX_LENGTH] [--top_p TOP_P]
                   [--repetition_penalty REPETITION_PENALTY] [--dont_sample]
                   [--lig_mpnn_model LIG_MPNN_MODEL]
                   [--lig_mpnn_noise LIG_MPNN_NOISE] [--omit_AAs OMIT_AAS]
                   [--fasta_seq_separation FASTA_SEQ_SEPARATION]
                   [--verbose VERBOSE] [--pdb_path_multi PDB_PATH_MULTI]
                   [--fixed_residues FIXED_RESIDUES]
                   [--fixed_residues_multi FIXED_RESIDUES_MULTI]
                   [--redesigned_residues REDESIGNED_RESIDUES]
                   [--redesigned_residues_multi REDESIGNED_RESIDUES_MULTI]
                   [--bias_AA BIAS_AA]
                   [--bias_AA_per_residue BIAS_AA_PER_RESIDUE]
                   [--bias_AA_per_residue_multi BIAS_AA_PER_RESIDUE_MULTI]
                   [--omit_AA_per_residue OMIT_AA_PER_RESIDUE]
                   [--omit_AA_per_residue_multi OMIT_AA_PER_RESIDUE_MULTI]
                   [--symmetry_residues SYMMETRY_RESIDUES]
                   [--symmetry_weights SYMMETRY_WEIGHTS]
                   [--homo_oligomer HOMO_OLIGOMER]
                   [--zero_indexed ZERO_INDEXED] [--batch_size BATCH_SIZE]
                   [--number_of_batches NUMBER_OF_BATCHES]
                   [--save_stats SAVE_STATS]
                   [--ligand_mpnn_use_atom_context LIGAND_MPNN_USE_ATOM_CONTEXT]
                   [--ligand_mpnn_cutoff_for_score LIGAND_MPNN_CUTOFF_FOR_SCORE]
                   [--ligand_mpnn_use_side_chain_context LIGAND_MPNN_USE_SIDE_CHAIN_CONTEXT]
                   [--chains_to_design CHAINS_TO_DESIGN]
                   [--parse_these_chains_only PARSE_THESE_CHAINS_ONLY]
                   [--transmembrane_buried TRANSMEMBRANE_BURIED]
                   [--transmembrane_interface TRANSMEMBRANE_INTERFACE]
                   [--global_transmembrane_label GLOBAL_TRANSMEMBRANE_LABEL]
                   [--parse_atoms_with_zero_occupancy PARSE_ATOMS_WITH_ZERO_OCCUPANCY]
                   [--pack_side_chains PACK_SIDE_CHAINS]
                   [--number_of_packs_per_design NUMBER_OF_PACKS_PER_DESIGN]
                   [--sc_num_denoising_steps SC_NUM_DENOISING_STEPS]
                   [--sc_num_samples SC_NUM_SAMPLES]
                   [--repack_everything REPACK_EVERYTHING]
                   [--force_hetatm FORCE_HETATM]
                   [--packed_suffix PACKED_SUFFIX]
                   [--pack_with_ligand_context PACK_WITH_LIGAND_CONTEXT]
                   {ESM-IF1,ProstT5,LigandMPNN} query

Positional Arguments

model

Possible choices: ESM-IF1, ProstT5, LigandMPNN

Select which model to generate proteins using inverse folding.

query

Input pdb file for inverse folding

Named Arguments

--temp

Choose sampling temperature.

Default: 1

--num_return_sequences

Choose number of proteins to generate.

Default: 1

--max_length

Max length of proteins generated, default is 500 AAs

Default: 500

--top_p

ProstT5: If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Default is 1

Default: 1
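
Top-p (nucleus) filtering as described above can be sketched on a small, made-up probability table: tokens are added in order of decreasing probability until their cumulative mass reaches top_p, and the kept set is renormalised. This is an illustration of the technique, not the code ProstT5 runs.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical next-token distribution over four amino acids.
probs = {"A": 0.5, "L": 0.3, "G": 0.15, "W": 0.05}
print(sorted(top_p_filter(probs, 0.9)))  # the three most probable tokens are kept
```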

--repetition_penalty

ProstT5: The parameter for repetition penalty. 1.0 means no penalty, the default is 1.2

Default: 1.2

--dont_sample

ProstT5: By default, the model will sample to generate the protein. With this flag, you can enable greedy decoding, where the most probable tokens will be returned.

Default: True

--lig_mpnn_model

LigandMPNN: ProteinMPNN, Soluble, Global_Membrane, Local_Membrane, Side-Chain_Packing

Default: “”

--lig_mpnn_noise

LigandMPNN: Noise levels: 002, 005, 010, 020, 030; 010 = 0.10 Å noise. Note that 002 is only available for the Soluble and Side-Chain_Packing models

Default: “010”

--omit_AAs

LigandMPNN: Specify which amino acids should be omitted in the generated sequence, e.g. “AC” would omit alanine and cysteine.

Default: X

--fasta_seq_separation

LigandMPNN: Symbol to use between sequences from different chains

Default: “:”

--verbose

LigandMPNN: Print stuff

Default: 1

--pdb_path_multi

LigandMPNN: Path to json listing PDB paths. {‘/path/to/pdb’: ‘’} - only keys will be used.

Default: “”

--fixed_residues

LigandMPNN: Provide fixed residues, A12 A13 A14 B2 B25

Default: “”

--fixed_residues_multi

LigandMPNN: Path to json mapping of fixed residues for each pdb i.e., {‘/path/to/pdb’: ‘A12 A13 A14 B2 B25’}

Default: “”
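
The JSON mapping for --fixed_residues_multi can be produced with Python's json module. A minimal sketch, reusing the placeholder path and residues from the description above; the helper name is hypothetical.

```python
import json

def fixed_residues_json(fixed_by_pdb):
    """Serialise {pdb_path: [residue, ...]} as {pdb_path: 'A12 A13 ...'} JSON text."""
    mapping = {pdb: " ".join(residues) for pdb, residues in fixed_by_pdb.items()}
    return json.dumps(mapping, indent=2)

text = fixed_residues_json({"/path/to/pdb": ["A12", "A13", "A14", "B2", "B25"]})
print(text)
# Save `text` to a .json file and pass that file's path to --fixed_residues_multi.
```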

--redesigned_residues

LigandMPNN: Provide the residues to be redesigned; everything else will be fixed, e.g. A12 A13 A14 B2 B25

Default: “”

--redesigned_residues_multi

LigandMPNN: Path to json mapping of redesigned residues for each pdb i.e., {‘/path/to/pdb’: ‘A12 A13 A14 B2 B25’}

Default: “”

--bias_AA

LigandMPNN: Bias generation of amino acids, e.g. ‘A:-1.024,P:2.34,C:-12.34’

Default: “”
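
The --bias_AA string can be built from a dictionary to avoid formatting slips. A small sketch; restricting keys to the 20 standard one-letter codes is an assumption made here for validation, not a documented LigandMPNN rule.

```python
def format_bias_aa(bias):
    """Format {'A': -1.024, 'P': 2.34} as 'A:-1.024,P:2.34'."""
    valid = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard amino acids only
    unknown = set(bias) - valid
    if unknown:
        raise ValueError(f"not one-letter amino acid codes: {sorted(unknown)}")
    return ",".join(f"{aa}:{value}" for aa, value in bias.items())

print(format_bias_aa({"A": -1.024, "P": 2.34, "C": -12.34}))
```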

--bias_AA_per_residue

LigandMPNN: Path to json mapping of bias {‘A12’: {‘G’: -0.3, ‘C’: -2.0, ‘H’: 0.8}, ‘A13’: {‘G’: -1.3}}

Default: “”

--bias_AA_per_residue_multi

LigandMPNN: Path to json mapping of bias {‘pdb_path’: {‘A12’: {‘G’: -0.3, ‘C’: -2.0, ‘H’: 0.8}, ‘A13’: {‘G’: -1.3}}}

Default: “”

--omit_AA_per_residue

LigandMPNN: Path to json mapping of amino acids to omit per residue, e.g. {‘A12’: ‘APQ’, ‘A13’: ‘QST’}

Default: “”

--omit_AA_per_residue_multi

LigandMPNN: Path to json mapping of amino acids to omit per residue for each pdb, e.g. {‘pdb_path’: {‘A12’: ‘QSPC’, ‘A13’: ‘AGE’}}

Default: “”

--symmetry_residues

LigandMPNN: Add list of lists for which residues need to be symmetric, e.g. ‘A12,A13,A14|C2,C3|A5,B6’

Default: “”

--symmetry_weights

LigandMPNN: Add weights that match symmetry_residues, e.g. ‘1.01,1.0,1.0|-1.0,2.0|2.0,2.3’

Default: “”
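
Because --symmetry_weights must match --symmetry_residues group by group, it can help to check the two strings against each other before a run. The following is a hypothetical pre-flight check, not part of TRILL or LigandMPNN, using the example values from the descriptions above.

```python
def parse_symmetry(residues, weights):
    """Pair each residue in every symmetry group with its weight."""
    residue_groups = [group.split(",") for group in residues.split("|")]
    weight_groups = [[float(w) for w in group.split(",")] for group in weights.split("|")]
    if len(residue_groups) != len(weight_groups):
        raise ValueError("different number of residue groups and weight groups")
    for res, wts in zip(residue_groups, weight_groups):
        if len(res) != len(wts):
            raise ValueError(f"group {res} has {len(wts)} weights")
    return [list(zip(res, wts)) for res, wts in zip(residue_groups, weight_groups)]

pairs = parse_symmetry("A12,A13,A14|C2,C3|A5,B6", "1.01,1.0,1.0|-1.0,2.0|2.0,2.3")
for group in pairs:
    print(group)
```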

--homo_oligomer

LigandMPNN: Setting this to 1 will automatically set --symmetry_residues and --symmetry_weights to do homooligomer design with equal weighting.

Default: 0

--zero_indexed

LigandMPNN: 1 - to start output PDB numbering with 0

Default: 0

--batch_size

LigandMPNN: Number of sequences to generate per pass.

Default: 1

--number_of_batches

LigandMPNN: Number of times to design sequence using a chosen batch size.

Default: 1

--save_stats

LigandMPNN: Save output statistics

Default: 0

--ligand_mpnn_use_atom_context

LigandMPNN: 1 - use atom context, 0 - do not use atom context.

Default: 1

--ligand_mpnn_cutoff_for_score

LigandMPNN: Cutoff in angstroms between protein and context atoms to select residues for reporting score.

Default: 8.0

--ligand_mpnn_use_side_chain_context

LigandMPNN: Flag to use side chain atoms as ligand context for the fixed residues

Default: 0

--chains_to_design

LigandMPNN: Specify which chains to redesign, all others will be kept fixed.

--parse_these_chains_only

LigandMPNN: Provide chains letters for parsing backbones, ‘ABCF’

Default: “”

--transmembrane_buried

LigandMPNN: Provide buried residues when using checkpoint_per_residue_label_membrane_mpnn model, A12 A13 A14 B2 B25

Default: “”

--transmembrane_interface

LigandMPNN: Provide interface residues when using checkpoint_per_residue_label_membrane_mpnn model, A12 A13 A14 B2 B25

Default: “”

--global_transmembrane_label

LigandMPNN: Provide global label for global_label_membrane_mpnn model. 1 - transmembrane, 0 - soluble

Default: 0

--parse_atoms_with_zero_occupancy

LigandMPNN: To parse atoms with zero occupancy in the PDB input files. 0 - do not parse, 1 - parse atoms with zero occupancy

Default: 0

--pack_side_chains

LigandMPNN: 1 - to run side chain packer, 0 - do not run it

Default: 0

--number_of_packs_per_design

LigandMPNN: Number of independent side chain packing samples to return per design

Default: 4

--sc_num_denoising_steps

LigandMPNN: Number of denoising/recycling steps to make for side chain packing

Default: 3

--sc_num_samples

LigandMPNN: Number of samples to draw from a mixture distribution and then take a sample with the highest likelihood.

Default: 16

--repack_everything

LigandMPNN: 1 - repacks side chains of all residues including the fixed ones; 0 - keeps the side chains fixed for fixed residues

Default: 0

--force_hetatm

LigandMPNN: To force ligand atoms to be written as HETATM to PDB file after packing.

Default: 0

--packed_suffix

LigandMPNN: Suffix for packed PDB paths

Default: “_packed”

--pack_with_ligand_context

LigandMPNN: 1 - pack side chains using ligand context, 0 - do not use it.

Default: 1

classify

Classify proteins using either pretrained classifiers or train/test your own.

trill classify [-h] [--key KEY] [--save_emb]
               [--emb_model {Ankh,Ankh-Large,CaLM,esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,RiNALMo,mRNA-FM,RNA-FM,SaProt}]
               [--train_split TRAIN_SPLIT] [--preTrained PRETRAINED]
               [--preComputed_Embs PRECOMPUTED_EMBS]
               [--batch_size_emb BATCH_SIZE_EMB]
               [--batch_size_mlp BATCH_SIZE_MLP] [--xg_gamma XG_GAMMA]
               [--lr LR] [--max_depth MAX_DEPTH] [--num_leaves NUM_LEAVES]
               [--bagging_freq BAGGING_FREQ] [--bagging_frac BAGGING_FRAC]
               [--feature_frac FEATURE_FRAC] [--xg_reg_alpha XG_REG_ALPHA]
               [--xg_reg_lambda XG_REG_LAMBDA]
               [--if_contamination IF_CONTAMINATION]
               [--n_estimators N_ESTIMATORS] [--sweep] [--sweep_cv SWEEP_CV]
               [--sweep_iters SWEEP_ITERS]
               [--f1_avg_method {macro,weighted,micro,None}] [--epochs EPOCHS]
               [--hidden_layers HIDDEN_LAYERS] [--dropout DROPOUT] [--db DB]
               {TemStaPro,EpHod,ECPICK,PSALM,MLP,XGBoost,LightGBM,iForest,ESM2+MLP,3Di-Search}
               query

Positional Arguments

classifier

Possible choices: TemStaPro, EpHod, ECPICK, PSALM, MLP, XGBoost, LightGBM, iForest, ESM2+MLP, 3Di-Search

Predict thermostability/optimal enzymatic pH using TemStaPro/EpHod, or train/use your own XGBoost, multilayer perceptron (MLP), LightGBM or Isolation Forest classifier. ESM2+MLP allows you to train an ESM2 model with a classification head end-to-end.

query

Fasta file of sequences to score

Named Arguments

--key

Input a CSV, with your class mappings for your embeddings where the first column is the label and the second column is the class.

--save_emb

Save csv of embeddings

Default: False

--emb_model

Possible choices: Ankh, Ankh-Large, CaLM, esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, RiNALMo, mRNA-FM, RNA-FM, SaProt

Select desired protein language model for embedding your query proteins to then train your custom classifier. Default is esm2_t12_35M

Default: “esm2_t12_35M”

--train_split

Choose your train-test split for training and evaluating your custom classifier. For example, --train_split 0.6 would split your input sequences into two groups, one with 60% of the sequences for training and the other with 40% for evaluation

--preTrained

Enter the path to your pre-trained classifier that you’ve trained with TRILL. This will be a .json file.

--preComputed_Embs

Enter the path to your pre-computed embeddings. Make sure they match the --emb_model you select.

Default: False

--batch_size_emb

EpHod: Sets batch_size for embedding with ESM1v.

Default: 1

--batch_size_mlp

MLP: Sets batch_size for training/evaluating

Default: 1

--xg_gamma

XGBoost: sets gamma for XGBoost, which is a hyperparameter that sets ‘Minimum loss reduction required to make a further partition on a leaf node of the tree.’

Default: 0.4

--lr

XGBoost/LightGBM/ESM2+MLP/MLP: Sets the learning rate. Default is 0.0001 for ESM2+MLP/MLP, 0.2 for XGBoost and LightGBM

Default: 0.2

--max_depth

XGBoost/LightGBM: Sets the maximum tree depth

Default: 8

--num_leaves

LightGBM: Sets the max number of leaves in one tree. Default is 31

Default: 31

--bagging_freq

LightGBM: Int that enables bagging, i.e. random sampling of the training data. For example, if it is set to 3, LightGBM will randomly sample the --bagging_frac of the data every 3rd iteration. Default is 0

Default: 0

--bagging_frac

LightGBM: Sets fraction of training data to be used when bagging. Must be 0 < --bagging_frac <= 1. Default is 1

Default: 1

--feature_frac

LightGBM: Sets fraction of training features to be randomly sampled for use in training. Must be 0 < --feature_frac <= 1. Default is 1

Default: 1

--xg_reg_alpha

XGBoost: L1 regularization term on weights

Default: 0.8

--xg_reg_lambda

XGBoost: L2 regularization term on weights

Default: 0.1

--if_contamination

iForest: The proportion of outliers in the data. Default is automatically determined, but you can set it to a value in the range (0, 0.5].

Default: “auto”

--n_estimators

XGBoost/LightGBM: Number of boosting rounds

Default: 115

--sweep

XGBoost/LightGBM: Use this flag to perform cross-validated Bayesian optimization over the hyperparameter space.

Default: False

--sweep_cv

XGBoost/LightGBM: Change the number of folds used for cross-validation.

Default: 3

--sweep_iters

XGBoost/LightGBM: Change the number of optimization iterations. Default is 10.

Default: 10

--f1_avg_method

Possible choices: macro, weighted, micro, None

XGBoost/LightGBM: Change the averaging method used when calculating F1. Default is no averaging.
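
The averaging choices differ in how per-class F1 scores are combined. The pure-Python sketch below follows the standard definitions (per-class F1 = 2TP / (2TP + FP + FN)) on made-up labels. It is meant to show what each option computes, not how TRILL computes it internally.

```python
def f1_scores(y_true, y_pred, average=None):
    """F1 per class (average=None) or combined by 'macro', 'weighted' or 'micro'."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class, supports = [], []
    tp_all = fp_all = fn_all = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)  # number of true members of class c
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    if average is None:
        return per_class                       # one score per class
    if average == "macro":
        return sum(per_class) / len(per_class) # unweighted mean over classes
    if average == "weighted":
        return sum(f * s for f, s in zip(per_class, supports)) / sum(supports)
    if average == "micro":
        return 2 * tp_all / (2 * tp_all + fp_all + fn_all)  # pooled counts
    raise ValueError(f"unknown average: {average!r}")

# Hypothetical labels: four members of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
for method in (None, "macro", "weighted", "micro"):
    print(method, f1_scores(y_true, y_pred, method))
```

With imbalanced classes, macro averaging weights every class equally, while weighted and micro averaging are dominated by the larger classes.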

--epochs

ESM2+MLP/MLP: Set the number of training epochs.

Default: 3

--hidden_layers

MLP: Set the sizes of the hidden layers. Default is [128,64,32]

Default: [128, 64, 32]

--dropout

MLP: Set dropout rate. Default is 0.3

Default: 0.3

--db

3Di-Search: Specify the path of the fasta file for your database that you want to query against.

diff_gen

Generate proteins using RFDiffusion

trill diff_gen [-h] [--contigs CONTIGS]
               [--RFDiffusion_Override RFDIFFUSION_OVERRIDE]
               [--num_return_sequences NUM_RETURN_SEQUENCES]
               [--Inpaint INPAINT] [--query QUERY] [--partial_T PARTIAL_T]
               [--partial_diff_fix PARTIAL_DIFF_FIX] [--hotspots HOTSPOTS]

Named Arguments

--contigs

Generate proteins between these sizes in AAs for RFDiffusion. For example, --contigs 100-200 will result in proteins with lengths in this range

--RFDiffusion_Override

Change RFDiffusion model. For example, --RFDiffusion_Override ActiveSite will use ActiveSite_ckpt.pt for holding small motifs in place.

Default: False

--num_return_sequences

Number of sequences for RFDiffusion to generate. Default is 5

Default: 5

--Inpaint

Residues to inpaint.

--query

Input pdb file for motif scaffolding, partial diffusion etc.

--partial_T

Adjust partial diffusion sampling value.

--partial_diff_fix

Pass the residues that you want to keep fixed for your input pdb during partial diffusion. Note that the residues should be 0-indexed.

--hotspots

Define residues that the binder must interact with. For example, --hotspots A30,A33,A34, where A is the chain and the numbers are the residue indices.
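
A hotspot string in the chain-plus-index form shown above can be checked before launching a run. The following is a minimal sketch, not part of TRILL; the helper name parse_hotspots is hypothetical and it assumes single-letter chain IDs followed by an integer residue index.

```python
import re


def parse_hotspots(spec):
    """Split 'A30,A33,A34' into [('A', 30), ('A', 33), ('A', 34)]."""
    hotspots = []
    for token in spec.split(","):
        token = token.strip()
        match = re.fullmatch(r"([A-Za-z])(\d+)", token)
        if match is None:
            raise ValueError(f"malformed hotspot: {token!r}")
        hotspots.append((match.group(1), int(match.group(2))))
    return hotspots


hotspots = parse_hotspots("A30,A33,A34")
print(hotspots)  # [('A', 30), ('A', 33), ('A', 34)]
```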

embed

Embed sequences of interest

trill embed [-h] [--batch_size BATCH_SIZE] [--finetuned FINETUNED] [--per_AA]
            [--avg]
            {Ankh,Ankh-Large,CaLM,esm2_t6_8M,esm2_t12_35M,esm2_t30_150M,esm2_t33_650M,esm2_t36_3B,esm2_t48_15B,ProtT5-XL,ProstT5,RiNALMo,mRNA-FM,RNA-FM,SaProt}
            query
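
Because the run name and GPU count are positional arguments of the top-level trill command, they come before the embed sub-command and its own arguments. The sketch below assembles such an invocation as an argument list; the helper build_embed_command, the run name example_run and the file proteins.fasta are hypothetical, and only flags listed in this reference are used.

```python
import shlex


def build_embed_command(name, gpus, model, query, batch_size=1,
                        per_aa=False, avg=False, outdir=None):
    """Assemble a 'trill ... embed ...' argument list (does not run it)."""
    cmd = ["trill"]
    if outdir is not None:
        cmd += ["--outdir", outdir]  # top-level option, goes before the run name
    cmd += [name, str(gpus), "embed", model, query]
    if batch_size != 1:
        cmd += ["--batch_size", str(batch_size)]
    if per_aa:
        cmd.append("--per_AA")
    if avg:
        cmd.append("--avg")
    return cmd


cmd = build_embed_command("example_run", 1, "esm2_t12_35M", "proteins.fasta", avg=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run if TRILL is installed in the active environment.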

Positional Arguments

model

Possible choices: Ankh, Ankh-Large, CaLM, esm2_t6_8M, esm2_t12_35M, esm2_t30_150M, esm2_t33_650M, esm2_t36_3B, esm2_t48_15B, ProtT5-XL, ProstT5, RiNALMo, mRNA-FM, RNA-FM, SaProt

Choose language model to embed query sequences. Note that SaProt requires protein structures as input. RiNALMo, RNA-FM and mRNA-FM take RNA sequences as input (lengths must be multiples of 3 for mRNA-FM), while CaLM takes DNA sequences as input.

query

Input protein fasta file. For SaProt only, you can provide a directory where every .pdb file will be embedded or a .txt file where each line is an absolute path to a pdb file.
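
Since mRNA-FM expects inputs whose length is a multiple of 3, it can be worth checking sequences before embedding. The following is a minimal sketch under the assumption that the requirement refers to the nucleotide length of each sequence; the helper split_codons is hypothetical and not part of TRILL.

```python
def split_codons(seq):
    """Split a nucleotide sequence into codons, rejecting incomplete ones."""
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


codons = split_codons("AUGGCCUAA")
print(codons)  # ['AUG', 'GCC', 'UAA']
```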

Named Arguments

--batch_size

Change the batch size for embedding proteins. Default is 1, but with more RAM you can use a larger batch size

Default: 1

--finetuned

Input path to your own finetuned ESM model

Default: False

--per_AA

Add this flag to return the per amino acid / nucleic acid representations.

Default: False

--avg

Add this flag to return the average, whole sequence representation.

Default: False
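
The relationship between the two output types can be sketched as mean pooling: the whole-sequence representation is the average of the per-position vectors. This is an illustration with toy numbers, not TRILL's code; whether special tokens are excluded before averaging is not stated here, and the helper mean_pool is hypothetical.

```python
def mean_pool(per_position):
    """Average a list of equal-length per-position vectors into one vector."""
    n = len(per_position)
    dim = len(per_position[0])
    return [sum(vec[j] for vec in per_position) / n for j in range(dim)]


# Toy per-residue embeddings: 3 positions, 2 dimensions each.
per_aa = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = mean_pool(per_aa)
print(avg)  # [3.0, 4.0]
```

The per-position output grows with sequence length, while the averaged output has a fixed size per sequence, which is usually what downstream classifiers and regressors expect.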

score

Use ESM-1v or ESM2 to score protein sequences or ProteinMPNN to score protein structures

trill score [-h] [--mpnn_model MPNN_MODEL] [--lig_mpnn_noise LIG_MPNN_NOISE]
            [--global_transmembrane_label GLOBAL_TRANSMEMBRANE_LABEL]
            [--transmembrane_buried [TRANSMEMBRANE_BURIED [TRANSMEMBRANE_BURIED ...]]]
            [--transmembrane_interface [TRANSMEMBRANE_INTERFACE [TRANSMEMBRANE_INTERFACE ...]]]
            [--batch_transmembrane_csv BATCH_TRANSMEMBRANE_CSV]
            [--ligand_mpnn_cutoff_for_score LIGAND_MPNN_CUTOFF_FOR_SCORE]
            {ESM2_150M,ESM1v,ESM2_650M,ProteinMPNN} query

Positional Arguments

scorer

Possible choices: ESM2_150M, ESM1v, ESM2_650M, ProteinMPNN

Score protein sequences with ESM-1v, ESM2-150M or ESM2-650M, or protein structures with ProteinMPNN

query

Path to protein PDB file to score. Can also provide a .txt file with absolute paths to multiple PDBs
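
A .txt file of absolute PDB paths can be generated from a directory with a few lines of Python. This is a minimal sketch; the helper write_pdb_list and the file names in the demo are hypothetical, and the demo runs against a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def write_pdb_list(pdb_dir, list_path):
    """Write one absolute .pdb path per line and return the paths written."""
    pdb_paths = sorted(p.resolve() for p in Path(pdb_dir).glob("*.pdb"))
    Path(list_path).write_text("".join(f"{p}\n" for p in pdb_paths))
    return pdb_paths


# Demo on a throwaway directory; non-.pdb files are ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdb", "a.pdb", "notes.md"):
        (Path(tmp) / name).touch()
    list_file = Path(tmp) / "pdb_paths.txt"
    written = write_pdb_list(tmp, list_file)
    lines = list_file.read_text().splitlines()

print(lines)
```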

Named Arguments

--mpnn_model

ProteinMPNN: Choose the model to use: ProteinMPNN, LigandMPNN, Local_Membrane, Global_Membrane or Soluble. Default is ProteinMPNN

Default: “ProteinMPNN”

--lig_mpnn_noise

ProteinMPNN noise levels: 002, 005, 010, 020, 030, where 010 = 0.10 Å noise. Note that 002 is only available for the Soluble and Side-Chain_packing models

Default: “010”
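
The three-digit noise codes read as hundredths of an angstrom. A small sketch of that mapping follows; the helper noise_code_to_angstrom is hypothetical, and the set of allowed codes is taken from the option description above.

```python
ALLOWED_NOISE_CODES = {"002", "005", "010", "020", "030"}


def noise_code_to_angstrom(code):
    """Convert a noise code such as '010' to its value in angstroms (0.1)."""
    if code not in ALLOWED_NOISE_CODES:
        raise ValueError(f"unsupported noise code: {code!r}")
    return int(code) / 100


print(noise_code_to_angstrom("010"))  # 0.1
```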

--global_transmembrane_label

Provide global label for global_label_membrane_mpnn model. 1 - transmembrane, 0 - soluble

Default: 0

--transmembrane_buried

ProteinMPNN: Provide buried residues when using the checkpoint_per_residue_label_membrane_mpnn model, e.g. A12 A13 A14 B2 B25. If inputting a .txt file with absolute paths to .pdb’s, make sure that all of the proteins have the same residue labels; otherwise you can provide a .csv file here where the first column is ‘Label’ and the second is ‘Residues’.

Default: “”

--transmembrane_interface

ProteinMPNN: Provide interface residues when using the checkpoint_per_residue_label_membrane_mpnn model, e.g. A12 A13 A14 B2 B25. If inputting a .txt file with absolute paths to .pdb’s, make sure that all of the proteins have the same residue labels; otherwise you can provide a .csv file here where the first column is ‘Label’ and the second is ‘Residues’.

Default: “”

--batch_transmembrane_csv

ProteinMPNN: You can provide a .csv file to specify multiple transmembrane buried/interface residues. The first column should be called ‘Label’, the second ‘transmembrane_buried’ and the third ‘transmembrane_interface’.

Default: “”
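
A file with the three required column names can be written with Python's csv module. The sketch below is illustrative only: the labels protein_1 and protein_2 are made up, and writing the residues inside each cell as a space-separated list (mirroring the A12 A13 A14 form used on the command line) is an assumption, as is the helper name write_batch_csv.

```python
import csv
import io

FIELDNAMES = ["Label", "transmembrane_buried", "transmembrane_interface"]

# Hypothetical entries; the cell format for residues is an assumption.
rows = [
    {"Label": "protein_1", "transmembrane_buried": "A12 A13 A14",
     "transmembrane_interface": "B2 B25"},
    {"Label": "protein_2", "transmembrane_buried": "A5 A6",
     "transmembrane_interface": "A40"},
]


def write_batch_csv(rows, handle):
    """Write the header and rows to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_batch_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the StringIO buffer.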

--ligand_mpnn_cutoff_for_score

ProteinMPNN: Cutoff in angstroms between protein and context atoms to select residues for reporting score.

Default: 8.0