
Counting SNPs and Indels in a VCF File

by 코딩하는 미토콘드리아 bioinformatics 2024. 3. 13.

Method 1: Python script

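#count_SNP_indel.py

SNP = 0
insertion = 0
deletion = 0

with open("sample1.vcf", "r") as fr:
    for line in fr:
        # Skip meta-information and header lines
        if line.startswith("#"):
            continue

        l = line.split()
        ref = l[3]  # REF column
        alt = l[4]  # ALT column

        # Compare allele lengths. Equal lengths are counted as SNPs, so
        # multi-base substitutions also land here; a comma-separated
        # multi-allelic ALT is compared as a single string.
        if len(ref) == len(alt):
            SNP += 1
        elif len(ref) > len(alt):
            deletion += 1
        else:
            insertion += 1

print("SNP:", SNP)
print("Insertion:", insertion)
print("Deletion:", deletion)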
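
The script above compares REF with the whole ALT string, so a multi-allelic ALT such as G,T is measured as one three-character allele. Below is a minimal sketch, assuming you want each ALT allele classified separately and only single-base substitutions counted as SNPs. The names classify_allele and count_variants, the extra "MNP" and "other" buckets, and the demo records are illustrative assumptions, not part of the original script.

```python
from collections import Counter


def classify_allele(ref, alt):
    """Classify one REF/ALT pair by allele length (simplified heuristic)."""
    # Missing (.), spanning-deletion (*), symbolic (<DEL>) and breakend alleles
    if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
        return "other"
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "deletion" if len(ref) > len(alt) else "insertion"


def count_variants(lines):
    """Count variant types over an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        # Skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3]
        # Each comma-separated ALT allele is classified on its own
        for alt in fields[4].split(","):
            counts[classify_allele(ref, alt)] += 1
    return counts


# Made-up demo records (not real data), one per variant shape
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t200\t.\tA\tG,T\t.\t.\t.",
    "chr1\t300\t.\tAT\tA\t.\t.\t.",
    "chr1\t400\t.\tA\tATT\t.\t.\t.",
    "chr1\t500\t.\tAT\tGC\t.\t.\t.",
]
counts = count_variants(demo)
for kind in ("SNP", "MNP", "insertion", "deletion", "other"):
    print(f"{kind}: {counts[kind]}")
```

To run it on a real file, pass the open file handle instead of the demo list, e.g. with open("sample1.vcf") as fr: counts = count_variants(fr). Note that this counts alleles, not records, so a site with two ALT alleles contributes two counts.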

 

 

Method 2: Linux command line

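SNPs:
awk '!/^#/' sample1.vcf | awk '{if(length($4) == 1 && length($5) == 1) print}' | wc -l

Indels:
awk '!/^#/' sample1.vcf | awk '{if(length($4) > 1 || length($5) > 1) print}' | wc -l

Note: the pattern !/^#/ skips only lines that start with #, so records that contain # elsewhere are kept. The length test treats any REF or ALT longer than one character as an indel, so multi-allelic sites (ALT such as G,T) and multi-base substitutions are counted under Indels.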

 

 

Method 3: bcftools

SNPs:
bcftools filter --include 'TYPE="snp"'  sample1.vcf  > output_sample1.snps_only.vcf

Indels:
bcftools filter --include 'TYPE="indel"'  sample1.vcf  > output_sample1.indels_only.vcf
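
The two bcftools commands redirect the matching records into new VCF files rather than printing a number, so the count is the number of non-header lines in each output file. Below is a minimal sketch in Python, assuming the output is a plain-text VCF or a gzip-compressed one. The helper name count_records is hypothetical and is not part of bcftools.

```python
import gzip


def count_records(path):
    """Return the number of non-header, non-blank lines in a VCF file."""
    # Use gzip for compressed files, plain open otherwise
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(
            1 for line in fh
            if line.strip() and not line.startswith("#")
        )
```

For example, count_records("output_sample1.snps_only.vcf") would give the SNP count and count_records("output_sample1.indels_only.vcf") the indel count, once the two files exist.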

 

 

 

Reference: https://samtools.github.io/bcftools/bcftools.html

 
