본문 바로가기
Bioinformatics

Qualimap2 - BAMQC

by 코딩하는 미토콘드리아 bioinformatics 2023. 1. 20.
반응형

이번에는 BAM 파일 QC 하는 방법에 대해 알아보겠습니다.

간단한 BAMQC 는 samtools 로 진행 할 수 있지만,

이번 포스팅에서는 좀더 쉽고(?) 간단한(?) 방법을 소개 하고자 합니다.

 

http://qualimap.conesalab.org/

 

Qualimap: Evaluating next generation sequencing alignment data

Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts. S

qualimap.conesalab.org

Qualimap2 는 자바 및 R 로 작성된 프로그램으로 독립적인 GUI 버전과 명령어 인터페이스를 모두 제공합니다.

이는 맵핑된 SAM/BAM 시퀀싱 데이터 정보를 검사하고 bias 를 감지하게 됩니다.

또한 파이프라인을 따라 추가되는 분석을 위한 의사 결정을 용이하게 하는데 도움을 줍니다.

 

http://qualimap.conesalab.org/

위 zip 파일을 다운 받아 줍니다.

window, linux 둘다 사용 가능합니다. 

해당 working directory 에 zip 파일을 풀어줍니다.

 

Qualimap 은 JAVA 와 R 이 요구 되기 때문에 사용 전에 미리 깔아줍니다. 

JAVA runtime version 6 or above.
R enviroment version 3.1 or above.

JAVA, R installation 은 아래 링크를 참고하여 주세요.

http://qualimap.conesalab.org/doc_html/intro.html#installation

 

Window 에서 사용하실꺼면 다운받으시고 install 만해주시면 (보통 게임깔듯이) GUI 환경에서 바로 사용 가능합니다.

 

저는 linux 에서 사용하는 방법을 실습해보겠습니다.

 

drwxrwxr-x 5 skkwon1048 server      16384 2023-01-20 12:20 qualimap_v2.2.1
-rw-rw-r-- 1 skkwon1048 server   28368668 2023-01-20 10:12 qualimap_v2.2.1.zip
[skkwon1048@server ~]$ unzip qualimap_v2.2.1.zip
# 현재 폴더에서 압축해제

unzip qualimap_v2.2.1.zip

해주면 현재 폴더에서 압축해제 -> qualimap_v2.2.1 생성

-rw-rw-r-- 1 skkwon1048 server   10668 2016-10-04 01:14 HISTORY
drwxrwxr-x 2 skkwon1048 server   16384 2016-10-04 01:14 lib
-rw-rw-r-- 1 skkwon1048 server   17982 2016-10-04 01:14 LICENSE
-rwxr-xr-x 1 skkwon1048 server    1923 2023-01-20 12:20 qualimap
-rw-rw-r-- 1 skkwon1048 server     148 2016-10-04 01:14 qualimap.bat
-rw-rw-r-- 1 skkwon1048 server 1021911 2016-10-04 01:14 qualimap.jar
-rw-rw-r-- 1 skkwon1048 server  411187 2016-10-04 01:14 QualimapManual.pdf
drwxrwxr-x 2 skkwon1048 server   16384 2016-10-04 01:14 scripts
drwxrwxr-x 2 skkwon1048 server      62 2016-10-04 01:14 species

 

qualimap_v2.2.1 폴더에 들어가보시면

qualimap 이라는 -x 실행 파일이 하나 있을겁니다.

 

우선 --help 으로 옵션을 살펴보겠습니다.

[skkwon1048@server ~]$./qualimap  --help
Java memory size is set to 1200M
Launching application...

OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
QualiMap v.2.2.1
Built on 2016-10-03 18:14

usage: qualimap <tool> [options]

To launch GUI leave <tool> empty.

Available tools:

    bamqc            Evaluate NGS mapping to a reference genome
    rnaseq           Evaluate RNA-seq alignment data
    counts           Counts data analysis (further RNA-seq data evaluation)
    multi-bamqc      Compare QC reports from multiple NGS mappings
    clustering       Cluster epigenomic signals
    comp-counts      Compute feature counts

Special arguments:

    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G

 

bamqc, rnaseq, counts, multi-bamqc, clustering, comp-counts, 여러가지가 있네요.

rnaseq 옵션 보시면, qualimap 은  whole-genome sequencing data 뿐만아니고 whole-exome, RNA-seq, ChIP-seq 데이터 를 모두 다룰 수 있습니다. 

 

저는 qualimap 을 통해서 BAM QC 를 진행하고자 하였기에 bamqc 옵션을 실행 해보겠습니다. 

 

usage: qualimap <tool> [options]

 

./qualimap bamqc -bam sample.bam

Java memory size is set to 1200M
Launching application...

OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
QualiMap v.2.2.1
Built on 2016-10-03 18:14

Selected tool: bamqc
Available memory (Mb): 32
Max memory (Mb): 1118
Wed Jan 25 21:38:13 KST 2023            WARNING Output folder already exists, the results will be saved there

Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 492
Chunk of reads size: 1000
Number of threads: 48
Processed 50 out of 492 windows...
Processed 100 out of 492 windows...
Processed 150 out of 492 windows...
Processed 200 out of 492 windows...
Processed 250 out of 492 windows...
Processed 300 out of 492 windows...
Processed 350 out of 492 windows...
Processed 400 out of 492 windows...
Processed 450 out of 492 windows...
Total processed windows:492
Number of reads: 7407432
Number of valid reads: 7289478
Number of correct strand reads:0

Inside of regions...
Num mapped reads: 7289478
Num mapped first of pair: 3645856
Num mapped second of pair: 3643622
Num singletons: 26933
Time taken to analyze reads: 96
Computing descriptors...
numberOfMappedBases: 701017315
referenceSize: 3137161264
numberOfSequencedBases: 700845143
numberOfAs: 179860582
Computing per chromosome statistics...
Computing histograms...
Overall analysis time: 97
end of bam qc
Computing report...
Writing HTML report...
HTML report created successfully

Finished

 

drwxrwxr-x 5 skkwon1048 server	149 2023-01-20 12:27 sample_stats

usage 명령어를 실행하면 bam 파일을 read 들을 읽어가면서 

해당 정보들을 제공합니다.

결과정보를 포함 하고 있는 sample_stats 라는 파일이 생성 됩니다.

drwxrwxr-x 2 skkwon1048 server 16384 2023-01-20 12:25 css
-rw-rw-r-- 1 skkwon1048 server 11188 2023-01-25 21:39 genome_results.txt
drwxrwxr-x 2 skkwon1048 server 16384 2023-01-25 21:39 images_qualimapReport
-rw-rw-r-- 1 skkwon1048 server 35098 2023-01-25 21:39 qualimapReport.html
drwxrwxr-x 2 skkwon1048 server 16384 2023-01-20 12:25 raw_data_qualimapReport

이 중에서 

genome_results.txt 라는 파일을 살펴보겠습니다. 

꽤나 유용한 정보들을 많이 가지고 있는듯 합니다.

read 정보, insert size, mapping quality, ACTG content percent, coverage, 정보를 제공하고

reference genome 에 contig 가 있다면 coverage per contig 정보 까지 확인 할 수 있습니다.

BamQC report
-----------------------------------

>>>>>>> Input

     bam file = sample.bam
     outfile = sample_stats/genome_results.txt


>>>>>>> Reference

     number of bases = 3,137,161,264 bp
     number of contigs = 93


>>>>>>> Globals

     number of windows = 492

     number of reads = 7,407,432
     number of mapped reads = 7,289,478 (98.41%)

     number of mapped paired reads (first in pair) = 3,645,856
     number of mapped paired reads (second in pair) = 3,643,622
     number of mapped paired reads (both in pair) = 7,262,545
     number of mapped paired reads (singletons) = 26,933

     number of mapped bases = 701,017,315 bp
     number of sequenced bases = 700,845,143 bp
     number of aligned bases = 0 bp
     number of duplicated reads (flagged) = 1,459,319


>>>>>>> Insert size

     mean insert size = 204,495.709
     std insert size = 4,615,630.9726
     median insert size = 102


>>>>>>> Mapping quality

     mean mapping quality = 45.4415


>>>>>>> ACTG content

     number of A's = 179,860,582 bp (25.66%)
     number of C's = 166,069,266 bp (23.7%)
     number of T's = 182,329,949 bp (26.02%)
     number of G's = 172,585,346 bp (24.63%)
     number of N's = 177 bp (0%)

     GC percentage = 48.32%


>>>>>>> Mismatches and indels

    general error rate = 0.0035
    number of mismatches = 2,356,408
    number of insertions = 46,701
    mapped reads with insertion percentage = 0.62%
    number of deletions = 59,465
    mapped reads with deletion percentage = 0.79%
    homopolymer indels = 53.24%


>>>>>>> Coverage

     mean coverageData = 0.2235X
     std coverageData = 9.1142X

     There is a 2.85% of reference with a coverageData >= 1X
     There is a 2.54% of reference with a coverageData >= 2X
     There is a 0.5% of reference with a coverageData >= 3X
     There is a 0.44% of reference with a coverageData >= 4X
     There is a 0.12% of reference with a coverageData >= 5X
     There is a 0.11% of reference with a coverageData >= 6X
     There is a 0.07% of reference with a coverageData >= 7X
     There is a 0.07% of reference with a coverageData >= 8X
     There is a 0.06% of reference with a coverageData >= 9X
     There is a 0.06% of reference with a coverageData >= 10X
     There is a 0.06% of reference with a coverageData >= 11X
     There is a 0.06% of reference with a coverageData >= 12X
     There is a 0.06% of reference with a coverageData >= 13X
     There is a 0.06% of reference with a coverageData >= 14X
     There is a 0.06% of reference with a coverageData >= 15X
     There is a 0.05% of reference with a coverageData >= 16X
     There is a 0.05% of reference with a coverageData >= 17X
     There is a 0.05% of reference with a coverageData >= 18X
     There is a 0.05% of reference with a coverageData >= 19X
     There is a 0.05% of reference with a coverageData >= 20X
     There is a 0.05% of reference with a coverageData >= 21X
     There is a 0.05% of reference with a coverageData >= 22X
     There is a 0.05% of reference with a coverageData >= 23X
     There is a 0.05% of reference with a coverageData >= 24X
     There is a 0.05% of reference with a coverageData >= 25X
     There is a 0.05% of reference with a coverageData >= 26X
     There is a 0.05% of reference with a coverageData >= 27X
     There is a 0.05% of reference with a coverageData >= 28X
     There is a 0.05% of reference with a coverageData >= 29X
     There is a 0.05% of reference with a coverageData >= 30X
     There is a 0.05% of reference with a coverageData >= 31X
     There is a 0.05% of reference with a coverageData >= 32X
     There is a 0.05% of reference with a coverageData >= 33X
     There is a 0.05% of reference with a coverageData >= 34X
     There is a 0.05% of reference with a coverageData >= 35X
     There is a 0.05% of reference with a coverageData >= 36X
     There is a 0.05% of reference with a coverageData >= 37X
     There is a 0.05% of reference with a coverageData >= 38X
     There is a 0.05% of reference with a coverageData >= 39X
     There is a 0.05% of reference with a coverageData >= 40X
     There is a 0.05% of reference with a coverageData >= 41X
     There is a 0.05% of reference with a coverageData >= 42X
     There is a 0.05% of reference with a coverageData >= 43X
     There is a 0.05% of reference with a coverageData >= 44X
     There is a 0.05% of reference with a coverageData >= 45X
     There is a 0.05% of reference with a coverageData >= 46X
     There is a 0.05% of reference with a coverageData >= 47X
     There is a 0.05% of reference with a coverageData >= 48X
     There is a 0.05% of reference with a coverageData >= 49X
     There is a 0.04% of reference with a coverageData >= 50X
     There is a 0.04% of reference with a coverageData >= 51X


>>>>>>> Coverage per contig

	chrM	16571	2145237	129.45730493029993	40.857175963953175
	chr1	249250621	97949201	0.39297475210703686	13.350223649874902
	chr2	243199373	116274905	0.47810528277965586	13.515499682572164
	chr3	198022430	44220163	0.22330885950647106	13.976964732469275
	chr4	191154276	20675036	0.1081588988362468	3.881480133175523
	chr5	180915260	28955520	0.16005018039937594	6.888916980041564
	chr6	171115067	31129888	0.1819237110195562	6.63666421117114
	chr7	159138663	56614389	0.3557550876244323	12.073993228915015
	chr8	146364022	29961124	0.20470279232966146	9.618357004971429
	chr9	141213431	13887776	0.09834599939718199	5.359346834674725
	chr10	135534747	30573345	0.22557569683588224	8.034123405501704
	chr11	135006516	27967130	0.20715392729636842	8.19313618707895
	chr12	133851895	32066957	0.23957043716116233	8.663117962736726
	chr13	115169878	16367590	0.14211693442967788	6.747454579397003
	chr14	107349540	6630354	0.061764158467749374	0.9694881447693602
	chr15	102531392	12640326	0.12328249674012033	5.842435465996139
	chr16	90354753	17163693	0.18995893885073206	6.942825315050472
	chr17	81195210	16528618	0.20356641728003413	7.256564862022594
	chr18	78077248	6557430	0.08398643866136266	2.8881597127990206
	chr19	59128983	40479673	0.684599513575263	16.568038141166596
	chr20	63025520	11450366	0.18167824716083264	7.308576526437246
	chr21	48129895	2621357	0.05446421605532279	0.629346603800844
	chr22	51304566	8388896	0.1635116843206509	7.097840402525419
	chrX	155270560	22724299	0.1463529145512195	5.276202899796574
	chrY	59373566	2445061	0.041180969322273824	1.3019284816783934
	chr1_gl000191_random	106433	6124	0.05753854537596422	0.3604683932294615
	chr1_gl000192_random	547496	33307	0.06083514765404679	0.39637363543190957
	chr4_ctg9_hap1	590426	18188	0.030804876479016845	0.25779840021378597
	chr4_gl000193_random	189789	31821	0.16766514392298815	0.7133604994621406
	chr4_gl000194_random	191469	19762	0.10321253048796411	0.480707870298712
	chr6_apd_hap1	4622290	29502	0.006382550640483397	0.11472252379266212
	chr6_cox_hap2	4795371	88361	0.018426311540858883	0.20691318356888658
	chr6_dbb_hap3	4610396	73696	0.015984744043678677	0.18107990047499767
	chr6_mann_hap4	4683263	65095	0.013899496996004708	0.17272413006920695
	chr6_mcf_hap5	4833398	75365	0.015592550003124097	0.19681912620498618
	chr6_qbl_hap6	4611984	72388	0.01569563120774053	0.18496427817367903
	chr6_ssto_hap7	4928567	82766	0.016793116538742398	0.1939710243560572
	chr7_gl000195_random	182896	38440	0.21017408800629866	0.728641039211291
	chr8_gl000196_random	38914	1201	0.03086292850901989	0.24835277478291629
	chr8_gl000197_random	37175	1496	0.040242098184263615	0.36497492021726824
	chr9_gl000198_random	90085	13232	0.14688349891768884	0.5854912449716595
	chr9_gl000199_random	169874	839971	4.944670755972074	10.53842223113749
	chr9_gl000200_random	187035	5095	0.027240890742374423	0.22904558423403448
	chr9_gl000201_random	36148	1321	0.036544207148389955	0.3218606130648401
	chr11_gl000202_random	40103	1048	0.026132708276188812	0.2415885615731324
	chr17_ctg5_hap1	1680828	52470	0.031216757455254197	0.25136619384170333
	chr17_gl000203_random	37498	2702	0.07205717638274041	0.42004822188403856
	chr17_gl000204_random	81310	2923	0.03594883778133071	0.27649086150708024
	chr17_gl000205_random	174588	35888	0.20555822851513278	0.7432616568700222
	chr17_gl000206_random	41001	724	0.0176581058998561	0.18709463673697777
	chr18_gl000207_random	4262	404	0.09479117785077429	0.4022765939446781
	chr19_gl000208_random	92689	53460	0.5767674697105374	2.179851214172754
	chr19_gl000209_random	159169	3059	0.01921856642939266	0.20038426671315498
	chr21_gl000210_random	27682	1088	0.03930351853189799	0.26262210421172555
	chrUn_gl000211	166566	13203	0.07926587658946004	0.45231825205425474
	chrUn_gl000212	186858	89481	0.4788716565520342	6.5538592174356936
	chrUn_gl000213	164239	4172	0.025402005613770175	0.22081293245223313
	chrUn_gl000214	137718	54360	0.39471964449091623	1.121799531010225
	chrUn_gl000215	172545	5134	0.02975455678228868	0.2673761184742546
	chrUn_gl000216	172294	237756	1.3799435848027208	9.553572616917938
	chrUn_gl000217	172149	16351	0.0949816728531679	0.45818203908084965
	chrUn_gl000218	161147	22026	0.1366826562083067	0.5563742631009525
	chrUn_gl000219	179198	32407	0.18084465228406568	0.6905329970447371
	chrUn_gl000220	161802	673678	4.163595011186512	8.262311110800228
	chrUn_gl000221	155397	25598	0.16472647477107022	0.6136076072498677
	chrUn_gl000222	186861	14248	0.07624919057481229	0.47602320775818796
	chrUn_gl000223	180455	10179	0.05640741459089524	0.37277119427204186
	chrUn_gl000224	179693	80053	0.4454987116916074	1.5358699739306765
	chrUn_gl000225	211173	147934	0.7005346327418751	2.6126796738038927
	chrUn_gl000226	15008	447859	29.841351279317696	22.81917181073404
	chrUn_gl000227	128374	4031	0.031400439341299644	0.275567938162207
	chrUn_gl000228	129120	29979	0.23217936802973976	1.7370586657928235
	chrUn_gl000229	19913	5250	0.26364686385778136	0.8841950286885981
	chrUn_gl000230	43691	1757	0.04021423176397885	0.34072840460862786
	chrUn_gl000231	27386	6834	0.2495435624041481	0.7618823019988965
	chrUn_gl000232	40652	12618	0.3103906326871987	0.9479686512983967
	chrUn_gl000233	45941	3581	0.07794780261639929	0.4505794120285086
	chrUn_gl000234	40531	10190	0.25141249907478225	0.8382868489333547
	chrUn_gl000235	34474	5179	0.15022915820618438	0.6105035642883383
	chrUn_gl000236	41934	2900	0.06915629322268327	0.4006554566052661
	chrUn_gl000237	45867	15373	0.33516471537270803	0.8973115891103459
	chrUn_gl000238	39939	796433	19.94123538396054	65.05512759962023
	chrUn_gl000239	33824	5298	0.15663434247871333	0.56171174279999
	chrUn_gl000240	41933	3722	0.0887606419764863	0.4342554548597429
	chrUn_gl000241	42152	9552	0.22660846460428924	0.695214877286874
	chrUn_gl000242	43523	2088	0.047974634101509546	0.35451537897650026
	chrUn_gl000243	43341	9350	0.21573106296578298	0.7281273225492091
	chrUn_gl000244	39929	2092	0.05239299757068797	0.3359133115675952
	chrUn_gl000245	36651	5698	0.1554664265640774	0.5532747547047702
	chrUn_gl000246	38154	1844	0.048330450280442416	0.3158744355872739
	chrUn_gl000247	36422	2379	0.06531766514743836	0.35783200645194
	chrUn_gl000248	39786	1006	0.025285276227818832	0.25723155813673626
	chrUn_gl000249	38502	136491	3.545036621474209	27.983774577121636

각 수치들이 시각화된 figure 들까지 확인 할 수 있으니 아주 유용합니다.

genome_coverage_0to50_histogram.png
genome_coverage_across_reference.png
genome_coverage_histogram.png
genome_coverage_quotes.png
genome_gc_content_per_window.png
genome_homopolymer_indels.png
genome_insert_size_across_reference.png
genome_insert_size_histogram.png
genome_mapping_quality_across_reference.png
genome_mapping_quality_histogram.png
genome_reads_clipping_profile.png
genome_reads_content_per_read_position.png
genome_uniq_read_starts_histogram.png

 

지금까지 BAM QC 하는 방법을 알아 보았습니다.

Qualimap2 이 제공하는 정보를 바탕으로

추가로 진행되는 분석에 도움이 되시길 바랍니다.

 

reference:

http://qualimap.conesalab.org/doc_html/command_line.html

 

Command Line Interface — Qualimap 2.2.1 documentation

The main argument for this command is the configuration file describing the input samples (-d). This has to be a 4-column tab-delimited file. The first column should contain the name of the sample, the second - name of the biological condition (e.g treated

qualimap.conesalab.org

반응형