Ver Post

Inesc - ID - Talk by Klaus Berberich

13 Novembro 2014, 14:14 - Lucília Abreu

Frequent sequence mining is a fundamental building block in data mining. While the problem has been intensively studied, existing methods cannot handle datasets consisting of billions of sequences. Datasets of that scale are common in applications such as natural language processing, when computing n-gram statistics over large-scale document collections, and business intelligence, when analyzing sessions of millions of users. In this talk, I will present two methods that we developed recently to mine frequent sequences using MapReduce as a platform for distributed data processing. Suffix-Sigma, as the first method, targets the special case of contiguous sequences such as n-grams. It relies on sorting and aggregating sequence suffixes, leveraging ideas from string processing. MG-FSM, as the second method, identifies also non-contiguous frequent sequences. To this end, it partitions and prepares the input in such way that frequent sequences can be efficiently mined in isolation on each of the resulting partitions using any existing method. Experiments on two large-scale document collections demonstrate that Suffix-Sigma and MG-FSM are substantially more efficient and scalable than alternative approaches. Furthermore, I will discuss extensions of Suffix-Sigma and MG-FSM, for instance, to report only closed or maximal sequences and thus drastically reduce their output. (* Joint INESC-ID/LASIGE Seminar).
Klaus Berberich is a Senior Researcher at the Max Planck Institute for Informatics where he coordinates the research area Text + Time Search & Analytics. His research is rooted in Information Retrieval and touches the related areas of Data Management and Data Mining. Klaus has built a time machine -- to search in web archives. More recently, he has worked on frequent sequence mining algorithms for modern platforms such as MapReduce. His ongoing research focuses on (i) novelty & diversity in web archive search; (ii) temporal linking of document collections; (iii) mining document collections for insights about the past, present, and future. Klaus holds a doctoral degree (2010, summa cum laude) and a diploma (2004) in Computer Science from Saarland University. He has served on numerous program committees in his research communities of interest (IR, DB, DM).
The information are availible here