Abstract
With suitable index structures many corpus exploration tasks can be solved in an efficient way without rescanning the text repository in an online manner. In this paper we show that symmetric compacted directed acyclic word graphs (SCDAWGs) - a refinement of suffix trees - offer an ideal basis for corpus exploration, helping to answer many of the questions raised in DH research in an elegant way. From a simplified point of view, the advantages of SCDAWGs rely on two properties. First, needing linear computation time, the index offers a joint view on the similarities (in terms of common substrings) and differences between all text. Second, structural regularities of the index help to mine interesting portions of texts (such as phrases and concept names) and their relationship in a language independent way without using prior linguistic knowledge. As a demonstration of the power of these principles we look at text alignment, text reuse in distinct texts or between distinct authors, automated detection of concepts, temporal distribution of phrases in diachronic corpora, and related problems.
Dokumententyp: | Zeitschriftenartikel |
---|---|
Fakultät: | Sprach- und Literaturwissenschaften > Department 2 |
Themengebiete: | 400 Sprache > 400 Sprache |
ISSN: | 1938-4122 |
Sprache: | Englisch |
Dokumenten ID: | 97944 |
Datum der Veröffentlichung auf Open Access LMU: | 05. Jun. 2023, 15:27 |
Letzte Änderungen: | 05. Jun. 2023, 15:27 |