Englmeier, Tobias; Büchler, Marco; Gerdjikov, Stefan and Schulz, Klaus U. (2021): Using an Advanced Text Index Structure for Corpus Exploration in Digital Humanities. In: Digital Humanities Quarterly, Vol. 15, No. 1, 1

Full text not available on 'Open Access LMU'.

Abstract

With suitable index structures, many corpus exploration tasks can be solved efficiently without rescanning the text repository online. In this paper we show that symmetric compacted directed acyclic word graphs (SCDAWGs), a refinement of suffix trees, offer an ideal basis for corpus exploration and help to answer many of the questions raised in DH research in an elegant way. From a simplified point of view, the advantages of SCDAWGs rest on two properties. First, the index can be computed in linear time and offers a joint view of the similarities (in terms of common substrings) and differences between all texts. Second, structural regularities of the index help to mine interesting portions of texts (such as phrases and concept names) and their relationships in a language-independent way, without prior linguistic knowledge. To demonstrate the power of these principles, we look at text alignment, text reuse in distinct texts or between distinct authors, automated detection of concepts, temporal distribution of phrases in diachronic corpora, and related problems.
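To make the idea of answering common-substring queries from an index rather than by rescanning the text more concrete, here is a minimal sketch. It is not the SCDAWG described in the paper: it swaps in a plain suffix automaton (an uncompacted, one-directional DAWG), a simpler relative of that structure. The class name `SuffixAutomaton` and the method `longest_common_substring` are hypothetical names chosen for illustration.

```python
class SuffixAutomaton:
    """Plain suffix automaton (uncompacted, one-directional DAWG).

    Illustrative stand-in only; the paper's SCDAWG is compacted and symmetric.
    """

    def __init__(self, text):
        self.next = [{}]    # outgoing transitions of each state
        self.link = [-1]    # suffix link of each state
        self.length = [0]   # length of the longest string reaching each state
        self.last = 0       # state that represents the whole text read so far
        for ch in text:
            self._extend(ch)

    def _new_state(self, length, transitions, link):
        self.next.append(dict(transitions))
        self.length.append(length)
        self.link.append(link)
        return len(self.next) - 1

    def _extend(self, ch):
        # Standard online construction: add one character to the indexed text.
        cur = self._new_state(self.length[self.last] + 1, {}, -1)
        p = self.last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings.
                clone = self._new_state(self.length[p] + 1, self.next[q], self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def longest_common_substring(self, other):
        """Longest substring shared by the indexed text and `other`."""
        state, matched = 0, 0
        best_len, best_end = 0, 0
        for i, ch in enumerate(other):
            # On a mismatch, fall back along suffix links to shorter matches.
            while state != 0 and ch not in self.next[state]:
                state = self.link[state]
                matched = self.length[state]
            if ch in self.next[state]:
                state = self.next[state][ch]
                matched += 1
            else:
                matched = 0
            if matched > best_len:
                best_len, best_end = matched, i + 1
        return other[best_end - best_len:best_end]
```

Once the index has been built for one text, a call such as `SuffixAutomaton("banana").longest_common_substring("ananas")` reads the second text only once and never rescans the first. The sketch covers a single indexed text and a single query; the joint view over all texts of a corpus that the abstract describes would need additional bookkeeping not shown here.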
