# How to generate different combinations of MSA

## Prerequisites
- HHblits must be installed. See https://github.com/soedinglab/hh-suite/blob/master/README.md for installation instructions.
    - uniclust30_2018_08 was used as the protein database.
- Jackhmmer must be installed. It is distributed with HMMER, available from http://hmmer.org/.
    - uniref90 was used as the protein database.

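## 1. Run HHblits to generate the first four MSA sets

Four parameter sets are used. The keys appear to correspond to the HHblits options of the same name: E-value cutoff (`e`), minimum coverage in percent (`cov`), and number of iterations (`n`).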
```
[
	{'e': '0.0001', 'cov': '70', 'n': '3'},
	{'e': '0.001', 'cov': '40', 'n': '3'},
	{'e': '0.1', 'cov': '50', 'n': '3'},
	{'e': '10', 'cov': '30', 'n': '1'}
]
```
```
python3 generate_alignments.py <comma_separated_list_of_ids>
```
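
As an illustration of how the four parameter sets above could translate into HHblits calls, here is a minimal sketch. It is not the content of `generate_alignments.py`: the helper names, the `<id>.fasta` input naming, and the `index_<n>/<id>.a3m` output naming are assumptions made for the example. Only the HHblits options `-i`, `-d`, `-oa3m`, `-e`, `-cov`, and `-n` are taken from the tool itself.

```python
import shlex

# The four parameter sets listed above.
PARAM_SETS = [
    {"e": "0.0001", "cov": "70", "n": "3"},
    {"e": "0.001", "cov": "40", "n": "3"},
    {"e": "0.1", "cov": "50", "n": "3"},
    {"e": "10", "cov": "30", "n": "1"},
]


def hhblits_command(query_fasta, database, out_a3m, params):
    """Build the hhblits argument list for one parameter set (hypothetical helper)."""
    return [
        "hhblits",
        "-i", query_fasta,      # input sequence
        "-d", database,         # database prefix, e.g. the uniclust30_2018_08 path
        "-oa3m", out_a3m,       # output MSA in A3M format
        "-e", params["e"],      # E-value cutoff
        "-cov", params["cov"],  # minimum coverage (%)
        "-n", params["n"],      # number of iterations
    ]


def commands_for_target(target_id, database):
    """Return one command per parameter set, writing to index_1 .. index_4."""
    commands = []
    for index, params in enumerate(PARAM_SETS, start=1):
        # Assumed file layout: <id>.fasta in, index_<n>/<id>.a3m out.
        out_a3m = f"index_{index}/{target_id}.a3m"
        commands.append(
            hhblits_command(f"{target_id}.fasta", database, out_a3m, params)
        )
    return commands


if __name__ == "__main__":
    # "T0001" is a placeholder target ID.
    for cmd in commands_for_target("T0001", "uniclust30_2018_08"):
        print(shlex.join(cmd))
```

The sketch only prints the commands; passing each list to `subprocess.run` would execute them.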

## 2. Run Jackhmmer to generate the fifth MSA set
```
bash run_jackhmmer.sh <file_list>
```

or, if the script is executable:

```
./run_jackhmmer.sh <file_list>
```

Note: `file_list` should be a text file of target IDs.

During MSA generation, the four HHblits sets are saved in folders indexed 1 to 4 (`index_1` through `index_4`),
and the Jackhmmer set is saved in the folder indexed 5 (`index_5`).
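
Before moving on to training, it can help to confirm that every target has an MSA in all five folders. The following is a minimal sketch of such a check. The `.a3m` suffix, the `<id><suffix>` file naming, and the function names are assumptions for illustration; adjust them to whatever the generation scripts actually write.

```python
from pathlib import Path

HHBLITS_INDICES = (1, 2, 3, 4)  # sets produced by HHblits
JACKHMMER_INDEX = 5             # set produced by Jackhmmer
ALL_INDICES = HHBLITS_INDICES + (JACKHMMER_INDEX,)


def msa_source(index):
    """Name the tool that produced the MSA set with the given index."""
    if index not in ALL_INDICES:
        raise ValueError(f"index must be one of {ALL_INDICES}, got {index}")
    return "jackhmmer" if index == JACKHMMER_INDEX else "hhblits"


def msa_path(root, target_id, index, suffix=".a3m"):
    """Expected MSA location under root (file naming is an assumption)."""
    return Path(root) / f"index_{index}" / f"{target_id}{suffix}"


def missing_msas(root, target_id, suffix=".a3m"):
    """Return the indices whose folder has no MSA file for target_id."""
    return [
        index
        for index in ALL_INDICES
        if not msa_path(root, target_id, index, suffix).is_file()
    ]


if __name__ == "__main__":
    # "T0001" is a placeholder target ID; "." is the assumed MSA root.
    for index in missing_msas(".", "T0001"):
        print(f"index_{index} ({msa_source(index)}) has no MSA for T0001")
```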

# How to train

## 1. Generate trRosetta prediction

Use `generate_trRosetta_prediction.py` to generate two prediction files, one `.npz` and one `.npy`.

Command:
```
python3 generate_trRosetta_prediction.py <msa_path> <target_npz_path> <target_npy_path>
```
The first argument is the input MSA file; the remaining two arguments are the output paths for the `.npz` and `.npy` prediction files.

As with MSA generation, save the prediction files in their respective indexed folders (`index_1` to `index_5`) so they can be located during training.
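
Since the prediction script is run once per MSA, each target needs five invocations. The sketch below builds those five commands following the argument order shown above. The directory roots, the `.a3m` suffix, and the `<id>.npz` / `<id>.npy` output names are hypothetical and only serve the example.

```python
from pathlib import Path

INDICES = range(1, 6)  # index_1 .. index_5


def prediction_command(msa_root, out_root, target_id, index, msa_suffix=".a3m"):
    """Build one call to generate_trRosetta_prediction.py (paths are assumptions)."""
    msa = Path(msa_root) / f"index_{index}" / f"{target_id}{msa_suffix}"
    out_dir = Path(out_root) / f"index_{index}"
    return [
        "python3",
        "generate_trRosetta_prediction.py",
        str(msa),                           # <msa_path>
        str(out_dir / f"{target_id}.npz"),  # <target_npz_path>
        str(out_dir / f"{target_id}.npy"),  # <target_npy_path>
    ]


def commands_for_target(msa_root, out_root, target_id):
    """Return the five commands for one target, one per indexed folder."""
    return [
        prediction_command(msa_root, out_root, target_id, index)
        for index in INDICES
    ]


if __name__ == "__main__":
    # "msa", "features", and "T0001" are placeholder names.
    for cmd in commands_for_target("msa", "features", "T0001"):
        print(" ".join(cmd))
```

Keeping the output tree parallel to the MSA tree means the same index identifies both the alignment and its prediction.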

## 2. Train the model

Train the model with `train.py`:

```
python3 train.py
```

Note:
Inside `train.py`, set `features_dir` to the location of your feature files, and check every path in the script before training.

- `training_pdb.lst` contains the IDs of the proteins in the training set.
- `validation_pdb.lst` contains the IDs of the proteins in the validation set.

`training_pdb_ext.lst` and `validation_pdb_ext.lst` list the same protein IDs with each of the 5 indices attached to the target name. These extended lists are the ones used for training and validation, respectively.
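
If you need to build extended lists for your own targets, the following sketch shows one way to do it. The exact naming used in the provided `_ext.lst` files is not specified here, so the `<id>_<index>` format and the function names are assumptions; compare against the shipped files and change the separator if they differ.

```python
from pathlib import Path


def extend_ids(ids, n_indices=5, sep="_"):
    """Attach indices 1..n_indices to every ID (the naming format is an assumption)."""
    return [f"{pid}{sep}{index}" for pid in ids for index in range(1, n_indices + 1)]


def read_ids(path):
    """Read one ID per line, skipping blank lines."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def write_ids(path, ids):
    """Write one ID per line."""
    Path(path).write_text("\n".join(ids) + "\n")


if __name__ == "__main__":
    # The output name is a placeholder chosen to avoid overwriting the provided file.
    extended = extend_ids(read_ids("training_pdb.lst"))
    write_ids("my_training_pdb_ext.lst", extended)
```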