With a local BLAST we can search for sequences in a custom database.
Applications:
- Check whether a gene is present in a genome (often in a de novo assembly)
- Annotate predicted genes
Step 1: Build a database
In this case, we build a nucleotide database from the a26 assembly (the file a26_1000.fa in the command below); the -out value, a26, is the name used later with -db.
makeblastdb -in /home/biocomp/bioinfo_genomas/a26_1000.fa -parse_seqids -title "26_1000" -out a26 -dbtype nucl
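To confirm that the database was written, one option is to ask blastdbcmd (shipped with BLAST+) for a summary. This is only a minimal sketch: it assumes the a26 database files are in the current directory, and it skips the check when blastdbcmd is not installed. The status variable is a name chosen here for illustration.

```shell
#!/bin/sh
# Database prefix created by the makeblastdb call above (its -out value).
db=a26

if ! command -v blastdbcmd >/dev/null 2>&1; then
    # BLAST+ is not available on this machine, so nothing can be checked.
    status="skipped: blastdbcmd not found in PATH"
elif blastdbcmd -db "$db" -info; then
    # -info prints the database title, number of sequences and total length.
    status="ok"
else
    status="failed: database '$db' could not be opened"
fi
echo "database check: $status"
```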
Step 2: Search for a sequence in the assembly
Download the sequence of a gene from accession NC_007795.
For example, beta-galactosidase, either as a CDS or as a protein (a protein query has to be searched with tblastn instead of blastn).
Step 3: Run blastn
Here beta-gal.fa is the file with the gene we downloaded:
blastn -query beta-gal.fa -db a26 -max_target_seqs 5 -outfmt 6 -perc_identity 90 -evalue 1E-10
Result
NC_007795.1:c2274567-2273155 NODE_2_length_304732_cov_60.346849 100.000 1413 0 0 1 1413 296103 297515 0.0 2610
--> the beta-galactosidase gene lies on contig 'NODE_2', between coordinates 296103 and 297515
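The default -outfmt 6 line has 12 tab-separated columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore. As a rough sketch, the contig and coordinates can be pulled out with awk. The hit below is the result line written with tabs (query start 1 and query end 1413 as separate fields); hit_location is a helper name made up for this example. The blastdbcmd call is only printed, not run, and it assumes the a26 database was built with -parse_seqids as in Step 1.

```shell
#!/bin/sh
# The hit from the result above, as 12 tab-separated fields.
hit=$(printf '%s\t%s\t100.000\t1413\t0\t0\t1\t1413\t296103\t297515\t0.0\t2610' \
    'NC_007795.1:c2274567-2273155' 'NODE_2_length_304732_cov_60.346849')

# Print "contig start end strand" for each hit read from standard input.
# On a minus-strand hit sstart is larger than send, so the two are swapped.
hit_location() {
    awk -F '\t' '{
        if ($9 <= $10) print $2, $9, $10, "plus"
        else           print $2, $10, $9, "minus"
    }'
}

location=$(printf '%s\n' "$hit" | hit_location)
echo "$location"

# Split the location into named variables.
read -r contig start end strand <<EOF
$location
EOF

# Command that would extract the matched region from the database.
echo "blastdbcmd -db a26 -entry $contig -range $start-$end -strand $strand"
```

Running the printed blastdbcmd command should return the matched region of the contig in FASTA format, which is handy for checking the hit by eye.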
Additional BLAST parameters
-qcov_hsp_perc <number> --> minimum percentage of the query sequence that each alignment (HSP) must cover
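To make the coverage filter concrete: for a nucleotide query, the coverage of one HSP is roughly 100 * (qend - qstart + 1) / qlen. The sketch below applies that formula with awk to already tabulated hits. It is an approximation of what the option checks, not BLAST's own code; filter_by_qcov, the column layout and the two hits are made up for illustration.

```shell
#!/bin/sh
# Keep hits whose HSP covers at least $1 percent of the query.
# Input columns (tab-separated): qseqid sseqid qlen qstart qend,
# i.e. what -outfmt "6 qseqid sseqid qlen qstart qend" would produce.
filter_by_qcov() {
    awk -F '\t' -v min_cov="$1" '{
        cov = 100 * ($5 - $4 + 1) / $3
        if (cov >= min_cov) printf "%s\t%s\t%.1f\n", $1, $2, cov
    }'
}

# Two made-up hits for a 1413 bp query: one full-length, one partial.
printf 'geneA\tcontig1\t1413\t1\t1413\ngeneA\tcontig2\t1413\t1\t700\n' |
    filter_by_qcov 80
```

With a threshold of 80 only the full-length hit is printed; the partial hit covers about half of the query and is dropped.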
Customizing the BLAST output
-outfmt "6 qseqid qlen" --> ID and length of the query sequence.
Below is a list of all available columns:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accession
qaccver means Query accession.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ‘;’
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a ‘/’
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), separated by a ‘;’
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ‘;’
scomnames means unique Subject Common Name(s), separated by a ‘;’
sblastnames means unique Subject Blast Name(s), separated by a ‘;’
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ‘;’
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a ‘<>’
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)
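Tabular output has no header line, so with a custom column list it is easy to lose track of which field is which. One possible approach, sketched here, is to keep the column list in a variable and reuse it both for -outfmt and for a header. The names cols and with_header and the single hit line are made up for illustration; no blastn run is needed to try it.

```shell
#!/bin/sh
# Column list for a custom tabular output; the same string would be
# passed to blastn as -outfmt "6 $cols".
cols="qseqid sseqid pident length qlen qcovhsp sstart send"

# Print the column names as a tab-separated header, then copy the hits through.
with_header() {
    printf '%s\n' "$1" | tr ' ' '\t'
    cat
}

# Example with one made-up hit line in place of a real blastn run.
printf 'geneA\tcontig1\t100.000\t1413\t1413\t100\t296103\t297515\n' |
    with_header "$cols"
```

With BLAST+ installed and the files from the steps above, the same helper could be used as: blastn -query beta-gal.fa -db a26 -outfmt "6 $cols" | with_header "$cols" > hits.tsv (hits.tsv is an arbitrary output name).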