Updated: 23 Jan
The reconstruction of the two distinct copies of each chromosome is an essential task to fully characterize the genome of an individual.
The advent of second-generation sequencing technologies deeply revolutionised the field of genomics, allowing for a more complete view and understanding of the genomes of different species. However, although these technologies made a great contribution to the field, the data they generate remain unsuitable for several applications, including Haplotype Assembly (HA). HA is the problem of assigning all the heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two homologous chromosomes, leveraging data generated by sequencing experiments. The distinct copies of each chromosome (two in somatic human cells) are called haplotypes, and their reconstruction is essential to fully characterize the genome of an individual. SNPs remain among the most studied genetic variations, as they play an essential role in several medical applications, including drug design, disease susceptibility studies, and the characterisation of the effects of SNPs on the expression of phenotypic traits.
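To make the problem concrete, here is a minimal illustrative sketch (not GenHap code) of the encoding commonly used for HA: each heterozygous SNP is represented as a binary value (0 for the reference allele, 1 for the alternative allele), so once one haplotype is fixed, the second is simply its bitwise complement.

```python
# Illustrative encoding of the Haplotype Assembly problem (not GenHap code).
# Each heterozygous SNP is encoded as 0 (reference allele) or 1 (alternative
# allele). Because every considered site is heterozygous, the second haplotype
# is always the bitwise complement of the first.

def complement(haplotype):
    """Return the second haplotype implied by the first one."""
    return [1 - allele for allele in haplotype]

# A hypothetical phasing over six heterozygous SNP positions.
h1 = [0, 1, 1, 0, 0, 1]
h2 = complement(h1)

print(h1)  # [0, 1, 1, 0, 0, 1]
print(h2)  # [1, 0, 0, 1, 1, 0]
```

The six positions here are made up for illustration; in practice the haplotypes span all heterozygous SNPs called on a chromosome.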
“The knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications.”
[Tangherloni et al., BMC Bioinformatics 2020]
The short reads produced by second-generation sequencing technologies generally do not span a relevant number of SNP positions. As a consequence, only short, separated haplotype blocks can be reconstructed, hindering the reconstruction of the full haplotypes. Thanks to third-generation sequencing technologies, it has become possible to produce reads covering hundreds of kilobases, capable of spanning several SNP loci at once. However, a drawback arose: the longer the reads, the lower their accuracy. To overcome this problem, the read coverage should be increased, where the coverage of a sequencing experiment is defined as the average number of times each nucleotide is expected to be covered by a read.

The HA problem can be solved using different computational strategies; among them, the Minimum Error Correction (MEC) approach is one of the most successful. MEC consists of computing the two haplotypes that partition the sequencing reads into two disjoint sets with the least number of corrections to the SNP values. Unfortunately, MEC was proven to be NP-hard. To address this problem, we designed and developed GenHap, a novel method based on Genetic Algorithms. GenHap can efficiently solve large instances of the wMEC problem, a weighted variant of MEC, without any a priori hypothesis about the sequencing error distribution in the reads.
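The wMEC objective can be sketched as follows. This is an illustrative toy, not GenHap's implementation: it assumes reads are rows of a fragment matrix with `None` marking SNP positions a read does not cover, and a per-entry confidence weight for each covered value. Each read is assigned to whichever of the two candidate haplotypes requires fewer weighted corrections, and the objective is the total correction cost.

```python
# Sketch of the weighted MEC (wMEC) objective (illustrative, not GenHap code).
# reads:   list of fragments, each a list of 0/1/None over the SNP positions
#          (None = the read does not cover that SNP).
# weights: same shape as reads; a confidence weight per covered SNP value.
# h1:      candidate haplotype (list of 0/1); the second haplotype is its
#          bitwise complement, since all sites are heterozygous.

def wmec_cost(reads, weights, h1):
    h2 = [1 - allele for allele in h1]
    total = 0
    for read, w in zip(reads, weights):
        # Weighted number of corrections if the read is assigned to h1 or h2.
        cost1 = sum(wi for ri, wi, a in zip(read, w, h1)
                    if ri is not None and ri != a)
        cost2 = sum(wi for ri, wi, a in zip(read, w, h2)
                    if ri is not None and ri != a)
        # Each read goes to the haplotype requiring fewer corrections.
        total += min(cost1, cost2)
    return total

# A made-up fragment matrix over 4 SNP positions, with uniform weights.
reads = [
    [0, 1, None, 1],
    [1, 0, 0, None],
    [None, 1, 1, 1],
]
weights = [[1, 1, 1, 1] for _ in reads]

print(wmec_cost(reads, weights, [0, 1, 1, 1]))  # 0: reads partition cleanly
```

In a Genetic Algorithm approach such as GenHap's, a cost of this kind serves as the fitness to be minimised over candidate haplotypes; the matrix and weights above are invented purely for illustration.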
More information about GenHap is available on GitHub.
GenHap exploits High-Performance Computing architectures to speed up the required calculations. We showed the scalability and efficiency of GenHap on multi-core architectures, and we also highlighted how the performance of GenHap is affected by different sequencing platforms (i.e., Illumina NovaSeq, Roche/454, PacBio RS II, and Oxford Nanopore Technologies MinION).