Python

Combining Multiple VCF Files into a Single DataFrame

by 코딩하는 미토콘드리아 bioinformatics 2024. 7. 14.


import glob
import os

import pandas as pd

# Directory containing the VCF files
vcf_dir = 'path/to/vcf/files/'

# Collect the VCF files (sorted so the row order is reproducible)
vcf_files = sorted(glob.glob(os.path.join(vcf_dir, '*.vcf')))

# Read one uncompressed VCF file into a DataFrame (all values stay strings)
def read_vcf(file):
    header = None
    data = []
    with open(file, 'r') as f:
        for line in f:
            if line.startswith('##'):
                continue  # skip meta-information lines
            fields = line.rstrip('\r\n').split('\t')
            if line.startswith('#'):
                header = fields  # the '#CHROM  POS  ID ...' column line
            elif line.strip():
                data.append(fields)  # one variant record per line

    if header is None:
        raise ValueError(f'No #CHROM header line found in {file}')
    return pd.DataFrame(data, columns=header)

# Read every VCF file and concatenate the rows into one DataFrame
df_list = [read_vcf(file) for file in vcf_files]
df_combined = pd.concat(df_list, ignore_index=True)

print(df_combined.head())
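
Note that when each VCF file has its own sample column, pd.concat takes the union of the columns and fills the gaps with NaN, and the combined rows no longer say which file they came from. The following is a minimal sketch of one way to keep track of that, assuming uncompressed VCF text. The names read_vcf_text and source_file are hypothetical and not part of the code above, and parsing with pd.read_csv is a swapped-in alternative to the manual line splitting.

```python
import csv
import io

import pandas as pd


def read_vcf_text(text, source):
    """Parse VCF text into a DataFrame and record where the rows came from."""
    # Drop blank lines and '##' meta lines; the '#CHROM ...' line becomes the header.
    body = [line for line in text.splitlines() if line and not line.startswith('##')]
    df = pd.read_csv(
        io.StringIO('\n'.join(body)),
        sep='\t',
        dtype=str,               # keep genotype strings such as '0/1' untouched
        keep_default_na=False,   # do not turn values like 'NA' into NaN
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    df['POS'] = df['POS'].astype(int)    # positions become sortable integers
    df.insert(0, 'source_file', source)  # hypothetical provenance column
    return df


# Two tiny single-sample VCFs, made up purely for illustration
vcf_a = (
    '##fileformat=VCFv4.2\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\n'
    'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n'
)
vcf_b = (
    '##fileformat=VCFv4.2\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleB\n'
    'chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t1/1\n'
)

combined = pd.concat(
    [read_vcf_text(vcf_a, 'a.vcf'), read_vcf_text(vcf_b, 'b.vcf')],
    ignore_index=True,
)
print(combined[['source_file', '#CHROM', 'POS', 'sampleA', 'sampleB']])
```

Each row keeps its file name, and a sample column is NaN for rows that came from a file without that sample. If one column per sample across shared sites is what you need, merging the per-sample VCFs into a multi-sample VCF first and then reshaping it, as in the post linked below, is probably a better fit.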

 

 

https://dmnfarrell.github.io/bioinformatics/multi-sample-vcf-dataframe

 

Bioinformatics and other bits - Convert a multi-sample VCF to a pandas DataFrame


 
