
Combining Multiple VCF Files into a Single DataFrame

by 코딩하는 미토콘드리아 Bioinformatics Lab 2024. 7. 14.


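import glob
import os

import pandas as pd

# Directory that holds the VCF files
vcf_dir = 'path/to/vcf/files/'

# Collect the VCF file paths (sorted so the row order is reproducible)
vcf_files = sorted(glob.glob(os.path.join(vcf_dir, '*.vcf')))

# Read a single VCF file into a DataFrame (all values are kept as strings)
def read_vcf(file):
    header = None
    data = []
    with open(file, 'r') as f:
        for line in f:
            line = line.rstrip('\r\n')
            if not line or line.startswith('##'):
                continue  # skip blank lines and '##' meta-information lines
            if line.startswith('#'):
                # Column header line, e.g. "#CHROM  POS  ID  REF  ALT ..."
                header = line.split('\t')
            else:
                data.append(line.split('\t'))

    if header is None:
        raise ValueError(f'No "#CHROM" header line found in {file}')
    return pd.DataFrame(data, columns=header)

# Read every VCF file and combine the results into one DataFrame.
# Columns are aligned by name, so a sample column missing from a file becomes NaN.
if not vcf_files:
    raise FileNotFoundError(f'No .vcf files found in {vcf_dir}')
df_list = [read_vcf(file) for file in vcf_files]
df_combined = pd.concat(df_list, ignore_index=True)

print(df_combined.head())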
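
After concatenation the rows no longer show which file they came from, and every column is still a string. The sketch below is one possible extension, not part of the original snippet: it records the file name in a column and converts POS and QUAL to numbers. The column name `source_file`, the helper `combine_vcfs`, and the two tiny demo files are assumptions made up for illustration.

```python
import os
import tempfile

import pandas as pd


def read_vcf(path):
    """Minimal VCF reader: returns the data lines as a DataFrame of strings."""
    header, rows = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line or line.startswith("##"):
                continue  # blank line or meta-information line
            if line.startswith("#"):
                header = line.split("\t")  # the "#CHROM ..." column line
            else:
                rows.append(line.split("\t"))
    return pd.DataFrame(rows, columns=header)


def combine_vcfs(paths):
    """Concatenate VCFs, tagging each row with its file and typing POS/QUAL."""
    frames = []
    for path in paths:
        df = read_vcf(path)
        df["source_file"] = os.path.basename(path)  # hypothetical column name
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True)
    combined["POS"] = combined["POS"].astype(int)
    # QUAL may be "." (missing) in a VCF; coerce such values to NaN
    combined["QUAL"] = pd.to_numeric(combined["QUAL"], errors="coerce")
    return combined


# Demo with two made-up VCF files written to a temporary directory
demo_dir = tempfile.mkdtemp()
cols = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
contents = {
    "a.vcf": "##fileformat=VCFv4.2\n" + cols + "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "b.vcf": "##fileformat=VCFv4.2\n" + cols + "chr2\t200\t.\tC\tT\t.\tPASS\t.\n",
}
demo_paths = []
for name, text in contents.items():
    path = os.path.join(demo_dir, name)
    with open(path, "w") as f:
        f.write(text)
    demo_paths.append(path)

demo_combined = combine_vcfs(demo_paths)
print(demo_combined[["#CHROM", "POS", "QUAL", "source_file"]])
```

With the source column in place, rows can be filtered or grouped per file, for example `demo_combined.groupby("source_file").size()`.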

 

 

Reference: Bioinformatics and other bits, "Convert a multi-sample VCF to a pandas DataFrame"
https://dmnfarrell.github.io/bioinformatics/multi-sample-vcf-dataframe
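
For the multi-sample case, a wide table with one column per sample can be hard to read, especially when the combined files carry different samples and the gaps are filled with NaN. The following is a minimal pandas-only sketch of reshaping such a table to one row per variant and sample. It is not the linked post's code; the helper name `samples_to_long`, the `sample`/`call`/`GT` column names, and the demo data are assumptions for illustration. It also assumes the genotype (GT) is the first FORMAT sub-field, which the VCF specification requires whenever GT is present.

```python
import pandas as pd

# The eight fixed VCF columns plus FORMAT; any other column is treated as a sample.
FIXED_COLS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def samples_to_long(df):
    """Reshape a wide VCF table into one row per (variant, sample)."""
    sample_cols = [c for c in df.columns if c not in FIXED_COLS]
    long_df = df.melt(
        id_vars=["#CHROM", "POS", "REF", "ALT", "FORMAT"],
        value_vars=sample_cols,
        var_name="sample",   # hypothetical column name
        value_name="call",   # raw per-sample field, e.g. "0/1:12"
    )
    # Drop variant/sample pairs that have no call (NaN after concatenation)
    long_df = long_df.dropna(subset=["call"]).reset_index(drop=True)
    # Assumption: GT is the first colon-separated sub-field of each call
    return long_df.assign(GT=long_df["call"].str.split(":").str[0])


# Made-up wide table: sampleB has no call at the second variant
demo = pd.DataFrame({
    "#CHROM": ["chr1", "chr1"],
    "POS": ["100", "200"],
    "ID": [".", "."],
    "REF": ["A", "C"],
    "ALT": ["G", "T"],
    "QUAL": ["50", "60"],
    "FILTER": ["PASS", "PASS"],
    "INFO": [".", "."],
    "FORMAT": ["GT:DP", "GT:DP"],
    "sampleA": ["0/1:12", "1/1:8"],
    "sampleB": ["0/0:20", None],
})

long_demo = samples_to_long(demo)
print(long_demo[["#CHROM", "POS", "sample", "GT"]])
```

The long layout makes per-sample filtering straightforward, for example `long_demo[long_demo["GT"] != "0/0"]` to keep only non-reference calls.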