이번에는 BAM 파일 QC 하는 방법에 대해 알아보겠습니다.
간단한 BAMQC 는 samtools 로 진행 할 수 있지만,
이번 포스팅에서는 좀더 쉽고(?) 간단한(?) 방법을 소개 하고자 합니다.
http://qualimap.conesalab.org/
Qualimap: Evaluating next generation sequencing alignment data
Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts. S
qualimap.conesalab.org
Qualimap2 는 자바 및 R 로 작성된 프로그램으로 독립적인 GUI 버전과 명령어 인터페이스를 모두 제공합니다.
이는 맵핑된 SAM/BAM 시퀀싱 데이터 정보를 검사하고 bias 를 감지하게 됩니다.
또한 파이프라인을 따라 추가되는 분석을 위한 의사 결정을 용이하게 하는데 도움을 줍니다.
위 zip 파일을 다운 받아 줍니다.
window, linux 둘다 사용 가능합니다.
해당 working directory 에 zip 파일을 풀어줍니다.
Qualimap 은 JAVA 와 R 이 요구 되기 때문에 사용 전에 미리 깔아줍니다.
JAVA runtime version 6 or above.
R enviroment version 3.1 or above.
JAVA, R installation 은 아래 링크를 참고하여 주세요.
http://qualimap.conesalab.org/doc_html/intro.html#installation
Window 에서 사용하실꺼면 다운받으시고 install 만해주시면 (보통 게임깔듯이) GUI 환경에서 바로 사용 가능합니다.
저는 linux 에서 사용하는 방법을 실습해보겠습니다.
drwxrwxr-x 5 skkwon1048 server 16384 2023-01-20 12:20 qualimap_v2.2.1
-rw-rw-r-- 1 skkwon1048 server 28368668 2023-01-20 10:12 qualimap_v2.2.1.zip
[skkwon1048@server ~]$ unzip qualimap_v2.2.1.zip
# 현재 폴더에서 압축해제
unzip qualimap_v2.2.1.zip
해주면 현재 폴더에서 압축해제 -> qualimap_v2.2.1 생성
-rw-rw-r-- 1 skkwon1048 server 10668 2016-10-04 01:14 HISTORY
drwxrwxr-x 2 skkwon1048 server 16384 2016-10-04 01:14 lib
-rw-rw-r-- 1 skkwon1048 server 17982 2016-10-04 01:14 LICENSE
-rwxr-xr-x 1 skkwon1048 server 1923 2023-01-20 12:20 qualimap
-rw-rw-r-- 1 skkwon1048 server 148 2016-10-04 01:14 qualimap.bat
-rw-rw-r-- 1 skkwon1048 server 1021911 2016-10-04 01:14 qualimap.jar
-rw-rw-r-- 1 skkwon1048 server 411187 2016-10-04 01:14 QualimapManual.pdf
drwxrwxr-x 2 skkwon1048 server 16384 2016-10-04 01:14 scripts
drwxrwxr-x 2 skkwon1048 server 62 2016-10-04 01:14 species
qualimap_v2.2.1 폴더에 들어가보시면
qualimap 이라는 -x 실행 파일이 하나 있을겁니다.
우선 --help 으로 옵션을 살펴보겠습니다.
[skkwon1048@server ~]$./qualimap --help
Java memory size is set to 1200M
Launching application...
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
QualiMap v.2.2.1
Built on 2016-10-03 18:14
usage: qualimap <tool> [options]
To launch GUI leave <tool> empty.
Available tools:
bamqc Evaluate NGS mapping to a reference genome
rnaseq Evaluate RNA-seq alignment data
counts Counts data analysis (further RNA-seq data evaluation)
multi-bamqc Compare QC reports from multiple NGS mappings
clustering Cluster epigenomic signals
comp-counts Compute feature counts
Special arguments:
--java-mem-size Use this argument to set Java memory heap size. Example:
qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
bamqc, rnaseq, counts, multi-bamqc, clustering, comp-counts, 여러가지가 있네요.
rnaseq 옵션 보시면, qualimap 은 whole-genome sequencing data 뿐만아니고 whole-exome, RNA-seq, ChIP-seq 데이터 를 모두 다룰 수 있습니다.
저는 qualimap 을 통해서 BAM QC 를 진행하고자 하였기에 bamqc 옵션을 실행 해보겠습니다.
usage: qualimap <tool> [options]
./qualimap bamqc -bam sample.bam
Java memory size is set to 1200M
Launching application...
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
QualiMap v.2.2.1
Built on 2016-10-03 18:14
Selected tool: bamqc
Available memory (Mb): 32
Max memory (Mb): 1118
Wed Jan 25 21:38:13 KST 2023 WARNING Output folder already exists, the results will be saved there
Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 492
Chunk of reads size: 1000
Number of threads: 48
Processed 50 out of 492 windows...
Processed 100 out of 492 windows...
Processed 150 out of 492 windows...
Processed 200 out of 492 windows...
Processed 250 out of 492 windows...
Processed 300 out of 492 windows...
Processed 350 out of 492 windows...
Processed 400 out of 492 windows...
Processed 450 out of 492 windows...
Total processed windows:492
Number of reads: 7407432
Number of valid reads: 7289478
Number of correct strand reads:0
Inside of regions...
Num mapped reads: 7289478
Num mapped first of pair: 3645856
Num mapped second of pair: 3643622
Num singletons: 26933
Time taken to analyze reads: 96
Computing descriptors...
numberOfMappedBases: 701017315
referenceSize: 3137161264
numberOfSequencedBases: 700845143
numberOfAs: 179860582
Computing per chromosome statistics...
Computing histograms...
Overall analysis time: 97
end of bam qc
Computing report...
Writing HTML report...
HTML report created successfully
Finished
drwxrwxr-x 5 skkwon1048 server 149 2023-01-20 12:27 sample_stats
usage 명령어를 실행하면 bam 파일을 read 들을 읽어가면서
해당 정보들을 제공합니다.
결과정보를 포함 하고 있는 sample_stats 라는 파일이 생성 됩니다.
drwxrwxr-x 2 skkwon1048 server 16384 2023-01-20 12:25 css
-rw-rw-r-- 1 skkwon1048 server 11188 2023-01-25 21:39 genome_results.txt
drwxrwxr-x 2 skkwon1048 server 16384 2023-01-25 21:39 images_qualimapReport
-rw-rw-r-- 1 skkwon1048 server 35098 2023-01-25 21:39 qualimapReport.html
drwxrwxr-x 2 skkwon1048 server 16384 2023-01-20 12:25 raw_data_qualimapReport
이 중에서
genome_results.txt 라는 파일을 살펴보겠습니다.
꽤나 유용한 정보들을 많이 가지고 있는듯 합니다.
read 정보, insert size, mapping quality, ACTG content percent, coverage, 정보를 제공하고
reference genome 에 contig 가 있다면 coverage per contig 정보 까지 확인 할 수 있습니다.
BamQC report
-----------------------------------
>>>>>>> Input
bam file = sample.bam
outfile = sample_stats/genome_results.txt
>>>>>>> Reference
number of bases = 3,137,161,264 bp
number of contigs = 93
>>>>>>> Globals
number of windows = 492
number of reads = 7,407,432
number of mapped reads = 7,289,478 (98.41%)
number of mapped paired reads (first in pair) = 3,645,856
number of mapped paired reads (second in pair) = 3,643,622
number of mapped paired reads (both in pair) = 7,262,545
number of mapped paired reads (singletons) = 26,933
number of mapped bases = 701,017,315 bp
number of sequenced bases = 700,845,143 bp
number of aligned bases = 0 bp
number of duplicated reads (flagged) = 1,459,319
>>>>>>> Insert size
mean insert size = 204,495.709
std insert size = 4,615,630.9726
median insert size = 102
>>>>>>> Mapping quality
mean mapping quality = 45.4415
>>>>>>> ACTG content
number of A's = 179,860,582 bp (25.66%)
number of C's = 166,069,266 bp (23.7%)
number of T's = 182,329,949 bp (26.02%)
number of G's = 172,585,346 bp (24.63%)
number of N's = 177 bp (0%)
GC percentage = 48.32%
>>>>>>> Mismatches and indels
general error rate = 0.0035
number of mismatches = 2,356,408
number of insertions = 46,701
mapped reads with insertion percentage = 0.62%
number of deletions = 59,465
mapped reads with deletion percentage = 0.79%
homopolymer indels = 53.24%
>>>>>>> Coverage
mean coverageData = 0.2235X
std coverageData = 9.1142X
There is a 2.85% of reference with a coverageData >= 1X
There is a 2.54% of reference with a coverageData >= 2X
There is a 0.5% of reference with a coverageData >= 3X
There is a 0.44% of reference with a coverageData >= 4X
There is a 0.12% of reference with a coverageData >= 5X
There is a 0.11% of reference with a coverageData >= 6X
There is a 0.07% of reference with a coverageData >= 7X
There is a 0.07% of reference with a coverageData >= 8X
There is a 0.06% of reference with a coverageData >= 9X
There is a 0.06% of reference with a coverageData >= 10X
There is a 0.06% of reference with a coverageData >= 11X
There is a 0.06% of reference with a coverageData >= 12X
There is a 0.06% of reference with a coverageData >= 13X
There is a 0.06% of reference with a coverageData >= 14X
There is a 0.06% of reference with a coverageData >= 15X
There is a 0.05% of reference with a coverageData >= 16X
There is a 0.05% of reference with a coverageData >= 17X
There is a 0.05% of reference with a coverageData >= 18X
There is a 0.05% of reference with a coverageData >= 19X
There is a 0.05% of reference with a coverageData >= 20X
There is a 0.05% of reference with a coverageData >= 21X
There is a 0.05% of reference with a coverageData >= 22X
There is a 0.05% of reference with a coverageData >= 23X
There is a 0.05% of reference with a coverageData >= 24X
There is a 0.05% of reference with a coverageData >= 25X
There is a 0.05% of reference with a coverageData >= 26X
There is a 0.05% of reference with a coverageData >= 27X
There is a 0.05% of reference with a coverageData >= 28X
There is a 0.05% of reference with a coverageData >= 29X
There is a 0.05% of reference with a coverageData >= 30X
There is a 0.05% of reference with a coverageData >= 31X
There is a 0.05% of reference with a coverageData >= 32X
There is a 0.05% of reference with a coverageData >= 33X
There is a 0.05% of reference with a coverageData >= 34X
There is a 0.05% of reference with a coverageData >= 35X
There is a 0.05% of reference with a coverageData >= 36X
There is a 0.05% of reference with a coverageData >= 37X
There is a 0.05% of reference with a coverageData >= 38X
There is a 0.05% of reference with a coverageData >= 39X
There is a 0.05% of reference with a coverageData >= 40X
There is a 0.05% of reference with a coverageData >= 41X
There is a 0.05% of reference with a coverageData >= 42X
There is a 0.05% of reference with a coverageData >= 43X
There is a 0.05% of reference with a coverageData >= 44X
There is a 0.05% of reference with a coverageData >= 45X
There is a 0.05% of reference with a coverageData >= 46X
There is a 0.05% of reference with a coverageData >= 47X
There is a 0.05% of reference with a coverageData >= 48X
There is a 0.05% of reference with a coverageData >= 49X
There is a 0.04% of reference with a coverageData >= 50X
There is a 0.04% of reference with a coverageData >= 51X
>>>>>>> Coverage per contig
chrM 16571 2145237 129.45730493029993 40.857175963953175
chr1 249250621 97949201 0.39297475210703686 13.350223649874902
chr2 243199373 116274905 0.47810528277965586 13.515499682572164
chr3 198022430 44220163 0.22330885950647106 13.976964732469275
chr4 191154276 20675036 0.1081588988362468 3.881480133175523
chr5 180915260 28955520 0.16005018039937594 6.888916980041564
chr6 171115067 31129888 0.1819237110195562 6.63666421117114
chr7 159138663 56614389 0.3557550876244323 12.073993228915015
chr8 146364022 29961124 0.20470279232966146 9.618357004971429
chr9 141213431 13887776 0.09834599939718199 5.359346834674725
chr10 135534747 30573345 0.22557569683588224 8.034123405501704
chr11 135006516 27967130 0.20715392729636842 8.19313618707895
chr12 133851895 32066957 0.23957043716116233 8.663117962736726
chr13 115169878 16367590 0.14211693442967788 6.747454579397003
chr14 107349540 6630354 0.061764158467749374 0.9694881447693602
chr15 102531392 12640326 0.12328249674012033 5.842435465996139
chr16 90354753 17163693 0.18995893885073206 6.942825315050472
chr17 81195210 16528618 0.20356641728003413 7.256564862022594
chr18 78077248 6557430 0.08398643866136266 2.8881597127990206
chr19 59128983 40479673 0.684599513575263 16.568038141166596
chr20 63025520 11450366 0.18167824716083264 7.308576526437246
chr21 48129895 2621357 0.05446421605532279 0.629346603800844
chr22 51304566 8388896 0.1635116843206509 7.097840402525419
chrX 155270560 22724299 0.1463529145512195 5.276202899796574
chrY 59373566 2445061 0.041180969322273824 1.3019284816783934
chr1_gl000191_random 106433 6124 0.05753854537596422 0.3604683932294615
chr1_gl000192_random 547496 33307 0.06083514765404679 0.39637363543190957
chr4_ctg9_hap1 590426 18188 0.030804876479016845 0.25779840021378597
chr4_gl000193_random 189789 31821 0.16766514392298815 0.7133604994621406
chr4_gl000194_random 191469 19762 0.10321253048796411 0.480707870298712
chr6_apd_hap1 4622290 29502 0.006382550640483397 0.11472252379266212
chr6_cox_hap2 4795371 88361 0.018426311540858883 0.20691318356888658
chr6_dbb_hap3 4610396 73696 0.015984744043678677 0.18107990047499767
chr6_mann_hap4 4683263 65095 0.013899496996004708 0.17272413006920695
chr6_mcf_hap5 4833398 75365 0.015592550003124097 0.19681912620498618
chr6_qbl_hap6 4611984 72388 0.01569563120774053 0.18496427817367903
chr6_ssto_hap7 4928567 82766 0.016793116538742398 0.1939710243560572
chr7_gl000195_random 182896 38440 0.21017408800629866 0.728641039211291
chr8_gl000196_random 38914 1201 0.03086292850901989 0.24835277478291629
chr8_gl000197_random 37175 1496 0.040242098184263615 0.36497492021726824
chr9_gl000198_random 90085 13232 0.14688349891768884 0.5854912449716595
chr9_gl000199_random 169874 839971 4.944670755972074 10.53842223113749
chr9_gl000200_random 187035 5095 0.027240890742374423 0.22904558423403448
chr9_gl000201_random 36148 1321 0.036544207148389955 0.3218606130648401
chr11_gl000202_random 40103 1048 0.026132708276188812 0.2415885615731324
chr17_ctg5_hap1 1680828 52470 0.031216757455254197 0.25136619384170333
chr17_gl000203_random 37498 2702 0.07205717638274041 0.42004822188403856
chr17_gl000204_random 81310 2923 0.03594883778133071 0.27649086150708024
chr17_gl000205_random 174588 35888 0.20555822851513278 0.7432616568700222
chr17_gl000206_random 41001 724 0.0176581058998561 0.18709463673697777
chr18_gl000207_random 4262 404 0.09479117785077429 0.4022765939446781
chr19_gl000208_random 92689 53460 0.5767674697105374 2.179851214172754
chr19_gl000209_random 159169 3059 0.01921856642939266 0.20038426671315498
chr21_gl000210_random 27682 1088 0.03930351853189799 0.26262210421172555
chrUn_gl000211 166566 13203 0.07926587658946004 0.45231825205425474
chrUn_gl000212 186858 89481 0.4788716565520342 6.5538592174356936
chrUn_gl000213 164239 4172 0.025402005613770175 0.22081293245223313
chrUn_gl000214 137718 54360 0.39471964449091623 1.121799531010225
chrUn_gl000215 172545 5134 0.02975455678228868 0.2673761184742546
chrUn_gl000216 172294 237756 1.3799435848027208 9.553572616917938
chrUn_gl000217 172149 16351 0.0949816728531679 0.45818203908084965
chrUn_gl000218 161147 22026 0.1366826562083067 0.5563742631009525
chrUn_gl000219 179198 32407 0.18084465228406568 0.6905329970447371
chrUn_gl000220 161802 673678 4.163595011186512 8.262311110800228
chrUn_gl000221 155397 25598 0.16472647477107022 0.6136076072498677
chrUn_gl000222 186861 14248 0.07624919057481229 0.47602320775818796
chrUn_gl000223 180455 10179 0.05640741459089524 0.37277119427204186
chrUn_gl000224 179693 80053 0.4454987116916074 1.5358699739306765
chrUn_gl000225 211173 147934 0.7005346327418751 2.6126796738038927
chrUn_gl000226 15008 447859 29.841351279317696 22.81917181073404
chrUn_gl000227 128374 4031 0.031400439341299644 0.275567938162207
chrUn_gl000228 129120 29979 0.23217936802973976 1.7370586657928235
chrUn_gl000229 19913 5250 0.26364686385778136 0.8841950286885981
chrUn_gl000230 43691 1757 0.04021423176397885 0.34072840460862786
chrUn_gl000231 27386 6834 0.2495435624041481 0.7618823019988965
chrUn_gl000232 40652 12618 0.3103906326871987 0.9479686512983967
chrUn_gl000233 45941 3581 0.07794780261639929 0.4505794120285086
chrUn_gl000234 40531 10190 0.25141249907478225 0.8382868489333547
chrUn_gl000235 34474 5179 0.15022915820618438 0.6105035642883383
chrUn_gl000236 41934 2900 0.06915629322268327 0.4006554566052661
chrUn_gl000237 45867 15373 0.33516471537270803 0.8973115891103459
chrUn_gl000238 39939 796433 19.94123538396054 65.05512759962023
chrUn_gl000239 33824 5298 0.15663434247871333 0.56171174279999
chrUn_gl000240 41933 3722 0.0887606419764863 0.4342554548597429
chrUn_gl000241 42152 9552 0.22660846460428924 0.695214877286874
chrUn_gl000242 43523 2088 0.047974634101509546 0.35451537897650026
chrUn_gl000243 43341 9350 0.21573106296578298 0.7281273225492091
chrUn_gl000244 39929 2092 0.05239299757068797 0.3359133115675952
chrUn_gl000245 36651 5698 0.1554664265640774 0.5532747547047702
chrUn_gl000246 38154 1844 0.048330450280442416 0.3158744355872739
chrUn_gl000247 36422 2379 0.06531766514743836 0.35783200645194
chrUn_gl000248 39786 1006 0.025285276227818832 0.25723155813673626
chrUn_gl000249 38502 136491 3.545036621474209 27.983774577121636
각 수치들이 시각화된 figure 들까지 확인 할 수 있으니 아주 유용합니다.
genome_coverage_0to50_histogram.png
genome_coverage_across_reference.png
genome_coverage_histogram.png
genome_coverage_quotes.png
genome_gc_content_per_window.png
genome_homopolymer_indels.png
genome_insert_size_across_reference.png
genome_insert_size_histogram.png
genome_mapping_quality_across_reference.png
genome_mapping_quality_histogram.png
genome_reads_clipping_profile.png
genome_reads_content_per_read_position.png
genome_uniq_read_starts_histogram.png
지금까지 BAM QC 하는 방법을 알아 보았습니다.
Qualimap2 이 제공하는 정보를 바탕으로
추가로 진행되는 분석에 도움이 되시길 바랍니다.
reference:
http://qualimap.conesalab.org/doc_html/command_line.html
Command Line Interface — Qualimap 2.2.1 documentation
The main argument for this command is the configuration file describing the input samples (-d). This has to be a 4-column tab-delimited file. The first column should contain the name of the sample, the second - name of the biological condition (e.g treated
qualimap.conesalab.org
'Bioinformatics' 카테고리의 다른 글
NA12878 fastq download (0) | 2023.02.16 |
---|---|
bioinformatics tools installation (2023 최신버전) (0) | 2023.02.07 |
[Samtools] [Bedtools] BAM to FASTQ 파일 전환 (0) | 2023.01.19 |
[Bedtools] BAM to BED 파일 전환 (0) | 2023.01.16 |
ERBB2 bioinformatics analysis (0) | 2023.01.12 |