1st Summer School on Bioinformatics Data Structures – Bioinformatics and Information Retrieval Data Structures Analysis and Design

The University of Helsinki organizes the 1st Summer School on Bioinformatics Data Structures from 9th to 12th (the week before IWOCA) as part of BIRDS project.

Summer School Organization:

Program Chair: Travis Gagie
Local Chair: Simon J. Puglisi
Organizing Committe: Leena Salmela and Veli Mäkinen

Registration:

There is no registration fee. BIRDS will provide some travel grants and arrange some discounts on rooms, but guests are responsible for their own accommodation and meals. The organization will also arrange childcare, if needed.

If you need further assistance please do not hesitate to contact the organizer: Simon J. Puglisi (puglisi [at] cs [.] helsinki [.] fi)

Location:

It will be held at the Exactum building (Kumpula campus, University of Helsinki).

Getting to Exactum

Program:

(You can see abstracts and lecture materials here)

Tuesday (Aug 9):

10–11.30 Alexandru Tomescu (University of Helsinki, Finland), “Safe and complete genome and metagenome assembly via omnitigs”

11.30–13 lunch

13–14.30 Leena Salmela (University of Helsinki, Finland), “Error correction of sequencing reads”

14.30–16 Antti Laaksonen (Aalto University, Finland), “Maximum subarray problem: History, variations and applications”

Wednesday (Aug 10):

10–11.30 Djamal Belazzougui (Centre de Recherche sur l’Information Scientifique et Technique, Algeria) + Fabio Cunial (Max Planck Institute of Molecular Cell Biology and Genetics, Germany), “Suffix links survival kit 1”

11.30–13 lunch

13–14.30 Shunsuke Inenaga (Kyushu University, Japan), “The myriad virtues of DAWGs”

14.30–16 Jouni Sirén (Wellcome Trust Sanger Institute, UK), “FM-index and the reverse prefix trie”

Thursday (Aug 11):

10–11.30 Costas Iliopoulos (King’s College London, UK), “Popping Superbubbles”, “Finding Clumps”

11.30–13 lunch

13–14.30 (Centre de Recherche sur l’Information Scientifique et Technique, Algeria) + Fabio Cunial (Max Planck Institute of Molecular Cell Biology and Genetics, Germany), “Suffix links survival kit 2”

14.30–16 Jouni Sirén (Wellcome Trust Sanger Institute, UK), “Indexing variation graphs”

Friday (Aug 12):

10–11.30 Juha Kärkkäinen (University of Helsinki, Finland), “The longest common prefix array”

11.30–13 lunch

13–13.30 Szymon Grabowski (Łódź University of Technology, Poland), “Compressed genomic sequences with fast access”

13.30–15 Alexandru Tomescu (University of Helsinki, Finland), “Min-cost flows and applications to bioinformatics”

All lectures will be in room D122 (except Thursday, which will be in room C123) in Exactum on the Kumpula Campus of the University of Helsinki. 90-minute lectures will be split into two 40-minute sections with a 10-minute break between them.

Abstracts and Lecture Materials:

“Safe and complete genome and metagenome assembly via omnitigs” (Alexandru Tomescu)

We start by introducing the genome and metagenome assembly problems, and formulate them under the framework of a safe and complete algorithm. We present some previous work on safe algorithms, such as assembling unitigs and Y-to-V contigs. We then give safe and complete algorithms for the two problems. These are based on graph-theoretic characterizations. We conclude by showing some experimental results.

Slides “Safe and complete genome and metagenome assembly via omnitigs”

“Error correction of sequencing reads” (Leena Salmela)

We will start by discussing the error characteristics of modern sequencing technologies and by introducing the sequencing error correction problem. We will then give an overview of the techniques used for correcting sequencing errors in short read data. The second half of the lecture will discuss methods for correcting the newer long read data with high error rate.

Slides “Error correction of sequencing reads”

“Maximum subarray problem: History, variations and applications” (Antti Laaksonen)

The maximum subarray problem asks to find a contiguous sequence of numbers in an array such that the sum of the numbers is as large as possible. We discuss the history and the variations of the problem, such as allowing updates to the array and maximizing the average instead of the sum. We also present some applications for the problem.

Slides “Maximum subarray sum problem”

“Suffix links surival kit 1 and 2” (Djamal Belazzougui + Fabio Cunial)

Suffix links enable fast suffix tree construction, have a number of applications to string analysis, and their reverse can be often explored with space-efficient data structures. We describe a kit of multifunctional properties of suffix links, Weiner links, and of the suffix-link tree, which students could immediately use to manufacture the tools they might need in their own research. We also explore related concepts of maximal repeats, borders, implicit nodes, Eulerian directed graphs, and the analogies between suffix-link tree and suffix tree of the reverse text.

Slides “Suffix links survival kit” , Notes “Suffix links survival kit”

“The myriad virtues of DAWGs” (Shunsuke Inenaga)

Directed acyclic word graphs (DAWGs), proposed by Blumer et al., are a dual data structure of suffix trees. The DAWG of a string is the smallest automaton accepting the suffixes of the string, and has linear number of nodes and edges in the input string length. In this course, we study relationships between suffix tries and DAWGs, duality of DAWGs and suffix trees, some applications of DAWGs, and efficient construction algorithms for DAWGs.

Slides “The myriad virtues of DAWGs”

“FM-index and the reverse prefix trie” (Jouni Sirén)

We can consider the FM-index the trie of reverse prefixes of the text. Many algorithms using the FM-index can be understood as traversals of the trie. We will take a look at algorithms for a variety of tasks including approximate searching, k-mer counting, all-against-all comparison of sequence collections, and BWT merging. These algorithms often require only the FM-index and very little additional code. They are easy to parallelize and quite efficient in practice.

Slides “FM-index and the reverse prefix trie”

“Popping Superbubbles” “Finding Clumps” (Costas Iliopoulos)

Slides “Superbubbles and Clumps”

“Indexing variation graphs” (Jouni Sirén)

Variation graphs, which represent genetic variation within a population, are replacing sequences as reference genomes. Path indexes are one of the most important tools for working with variation graphs. They generalize text indexes for graphs, allowing one to find the paths matching the query string. We propose using pruned de Bruijn graphs as path indexes, and encode them space-efficiently with the Burrows-Wheeler transform. We also generalize ideas from text indexing literature and use them with graphs. The proposed approach has been implemented in the variation graph toolkit vg.

“Compressed genomic sequences with fast access” (Szymon Grabowski)

The similarity of individual genomes of one species poses a compression challenge, considering the highly desired random access functionalities. We will take a look at algorithms combining efficient genome compression with fast access, also in the scenario when a database of variants (e.g., in VCF format) is available.

Slides “Compressed genomic sequences with fast access”

“The longest common prefix array” (Juha Kärkkäinen)

“Min-cost flows and applications to bioinformatics” (Alexandru Tomescu)

We define the min-cost flow/circulation problems, give a pseudo-polynomial time algorithm for finding a min-cost circulation, and then present some applications to bioinformatics: assembling RNA transcripts and mapping metagenomic reads to reference genomes.

Speakers:

Djamal Belazzougui, Centre de Recherche sur l’Information Scientifique et Technique, Algeria

Fabio Cunial, Max Planck Institute of Molecular Cell Biology and Genetics, Germany

Fabio Cunial is a postdoctoral researcher at MPI-CBG (Dresden). His research focuses on string algorithms, genome and metagenome analysis and comparison, and de novo assembly. Before coming to Dresden he was a postdoctoral researcher at the University of Helsinki, and he obtained his PhD from GeorgiaTech.

Szymon Grabowski, Łódź University of Technology, Poland

Szymon Grabowski is an associate professor at the Lodz University of Technology, Poland. He obtained his doctoral degree from AGH University of Science and Technology AGH in Cracow and habilitation degree from IBS PAN in Warsaw. His interests include string matching algorithms and compressed data structures.

Costas Iliopoulos, King’s College London, UK

Shunsuke Inenaga, Kyushu University, Japan

Shunsuke Inenaga received a master’s degree of science from Kyushu University, Japan, in 2002, and received a PhD from Kyushu University, in 2003. From 2003, He worked as a postdoc researcher for Japan Science Technology Agency, University of Helsinki, Kyoto University, and Japan Society for the Promotion of Science. He became a tenure track research fellow at Kyushu University in 2005, and became an associate professor at Kyushu University in 2011. His research interests include algorithms and data structures for string processing, text compression, and combinatorics on words.

Jouni Sirén, Wellcome Trust Sanger Institute, UK

Jouni Sirén is a postdoctoral fellow at the Wellcome Trust Sanger Institute. He develops space-efficient index structures for projects dealing with massive amounts of sequence data, such as the variation graph toolkit vg and the population BWT. Before coming to Sanger, he received his PhD from the University of Helsinki and worked as a postdoctoral researcher at the University of Chile.

Juha Kärkkäinen, University of Helsinki, Finland

Antti Laaksonen, Aalto University, Finland

Antti Laaksonen is a postdoctoral researcher at the Aalto University. He obtained his doctoral degree from the University of Helsinki. At the moment his research focuses on coding theory, but he is interested in all topics related to algorithm design and discrete mathematics.

Leena Salmela, University of Helsinki, Finland

Leena Salmela is a Postdoctoral Researcher and Docent at the University of Helsinki. She is interested in algorithms for biological sequences. She has developed several methods for various subproblems of genome assembly. Before coming to the University of Helsinki, she obtained her doctoral degree from Aalto University, Finland.

Alexandru Tomescu, University of Helsinki, Finland

Alexandru Tomescu is a Postdoctoral Researcher and Docent at the University of Helsinki. He is interested in applications of graph algorithms and combinatorial optimization problems to bioinformatics. He is co-author of the textbook “Genome-scale Algorithm Design”. Before coming to the University of Helsinki, he obtained his PhD from the University of Udine, Italy, and was a grant-based researcher for 6-months at TU Berlin, Germany.