CroALa DokuWiki


Quadrata rotundis? Lexical comparison of two sets of neo-Latin texts

Neven Jovanović, Zagreb

Suppose we have two sets of neo-Latin texts which can be manipulated computationally – that is, every word in every text can be accessed by a computer program. Suppose further that there are basic bibliographic data supplied for both sets. Finally, suppose that one of these sets is relatively familiar, while the other is relatively unknown. How do we go about comparing these sets?

It is our goal to demonstrate an orderly chain of procedures for basic computational comparison of sets of neo-Latin texts. The procedures will not involve sophisticated transformations, statistics or visualizations; rather, they are meant to relatively quickly show us how the sets are similar or dissimilar, and then to lead us toward passages, patterns, and points which could repay close reading – in other words, to textual aspects which could lead us to form interesting hypotheses about the relationship of two sets or their constituents. The demonstration will be presented as a series of questions and answers; each question can be formulated as a request for information from a computer database (a query), and each answer is actually information retrieved from the database. Queries and results are also recorded on the website of this paper (i. e. here).

The sets that we will be comparing are, on the one hand, the collection Croatiae auctores Latini (CroALa), comprising writings by Croatian authors from 976 until 1984 (Jovanović et al. 2009), and, on the other hand, the pilot collection Latinitas Tyroliensis (LatTy, Luggin et al. 2014); in this paper, CroALa will be the familiar, and LatTy the unfamiliar collection. Both are freely available for further uses and research; both are accessible for searching, and their source code is published in a distributed version control system.1) Both CroALa and LatTy comprise modern or historical editions of neo-Latin texts turned into a machine-readable format with added digital markup language elements.

2.1 How many texts are there in the two collections?

To see the answer, click on the XQuery link:

2.2 How many authors are there in CroALa and LatTy?

To see the answer, click on the XQuery:

Note that some prosopographical information can be accessed from the XQuery page.

Findings and observations: 'national' property of Croatian authors ('alienus'). Categories of 'anonymus', but also 'ignotus' (we know only the author's name, sometimes only the first name). Many authors in one document (paratexts). Level of authorship: translators?

2.3 When were the texts written?

To see an XQuery-produced table of texts by period in CroALa and in LatTy, with links to lists of documents in the respective periods, click this link:

Or study the chart below.

Findings and observations: peak frequencies of texts in the Croatian collection come before and after the Tyrolean set.

2.4 How are the texts distributed according to form (poetry, prose, drama)?

CroALa and LatTy do not distinguish prose from poetry and drama in the same way. CroALa documents have annotations, which we use in this query:

See also prose and poetry over time in CroALa:

For LatTy documents, we check whether they consist of verses or prose paragraphs, and whether they have speeches (for drama), finally doing some quick calculations to see whether prose or poetry prevails:
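The check described above can be sketched as follows. This is a Python illustration, not the XQuery used in the paper. It assumes TEI-encoded documents in which verse lines are `l` elements, prose paragraphs are `p` elements and dramatic speeches are `sp` elements. The decision rule (any speech means drama, otherwise compare words in verse with words in paragraphs) is a guess at what the "quick calculations" might look like, and `classify_form` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def _words(elements):
    """Count whitespace-separated words in the text of the given elements."""
    return sum(len("".join(el.itertext()).split()) for el in elements)


def classify_form(tei_xml: str) -> dict:
    """Classify one TEI document as 'drama', 'poetry' or 'prose' (assumed rule)."""
    root = ET.fromstring(tei_xml)
    # Look only inside <text>, so that paragraphs in the TEI header are ignored.
    text = root.find(".//" + TEI_NS + "text")
    if text is None:
        text = root
    verse_words = _words(text.iter(TEI_NS + "l"))
    prose_words = _words(text.iter(TEI_NS + "p"))
    speeches = len(list(text.iter(TEI_NS + "sp")))
    if speeches > 0:
        form = "drama"
    elif verse_words > prose_words:
        form = "poetry"
    else:
        form = "prose"
    return {"form": form, "verse_words": verse_words,
            "prose_words": prose_words, "speeches": speeches}


if __name__ == "__main__":
    # Toy document, not taken from either collection.
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
        "<l>Arma virumque cano</l><l>Troiae qui primus ab oris</l>"
        "<p>Finis.</p></div></body></text></TEI>"
    )
    print(classify_form(sample))
```

Comparing word counts rather than element counts keeps one long prose paragraph from being outweighed by a handful of short verses. Either measure remains crude.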

The solution in LatTy is actually more realistic. Here we apply it to CroALa:

And here we apply it to both collections:

LatTy: Poetry 8, 4, Madr (= 0.65) : Prose 3, 4 (= 0.35)
CroALa: Poetry 165, 48 (= 0.39) : Prose 303, 36 (= 0.61)

Findings and observations: first parallels, though very crude: which texts are similar in number of verses and in the ratio of prose to poetry.

Which genres are there in the collections?

A list of genres for the two collections will be produced by the following XQueries:

Further links lead to names of documents containing texts of respective genres.

The following XQuery reconciles the genres in CroALa and LatTy according to the list below, and finds documents belonging to respective genres:

declare variable $genrelat as map(*) := map {
 "Brief": "prosa oratio - epistula",
 "Dichtung": "poesis",
 "Epik": "poesis - epica",
 "Gelegenheitsdichtung": "poesis - sylva",
 "Lehrgedicht": "poesis - didactica",
 "Geschichtsschreibung": "prosa oratio - historia",
 "Theater": "poesis - drama"
};

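How such a lookup table reconciles the two label sets can be sketched in Python (an illustration only, not the paper's XQuery). The dictionary repeats the seven pairs listed above. The function name and the document names in the usage example are hypothetical.

```python
from collections import defaultdict

# LatTy genre label -> CroALa genre label, copied from the list above.
GENRE_LAT = {
    "Brief": "prosa oratio - epistula",
    "Dichtung": "poesis",
    "Epik": "poesis - epica",
    "Gelegenheitsdichtung": "poesis - sylva",
    "Lehrgedicht": "poesis - didactica",
    "Geschichtsschreibung": "prosa oratio - historia",
    "Theater": "poesis - drama",
}


def group_by_reconciled_genre(docs):
    """Group (document name, LatTy label) pairs under the CroALa label.

    Labels missing from GENRE_LAT are collected under None, so that
    unreconciled genres stay visible instead of being dropped silently.
    """
    groups = defaultdict(list)
    for name, latty_label in docs:
        groups[GENRE_LAT.get(latty_label)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical document names, for illustration only.
    docs = [("doc-a.xml", "Epik"), ("doc-b.xml", "Brief"), ("doc-c.xml", "Epik")]
    print(group_by_reconciled_genre(docs))
```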
Findings and observations: loose definition of 'occasional poem'; it was hard to find analogies on the same level in the Croatian set. Different profile of letter collections in the two sets. Historiography: Tyrolean travelogues, Croatian narratives.

How are the texts distributed according to genre?

See the numbers in column 2 of the previous XQueries.

Findings and observations: genres with many texts in the Croatian set. Outliers.

2.5 How are genres distributed chronologically?

Genres in periods (thirds of century):
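Period labels of the kind used in this paper (for example `16xx_3`) can be derived from a year as in the Python sketch below. The exact boundaries of the thirds are an assumption, inferred from the year ranges quoted in the conclusion (1566 opens a third, 1765 closes one): years 00-32, 33-65 and 66-99 of each century. `period_label` is a hypothetical helper, not part of the paper's queries.

```python
def period_label(year: int) -> str:
    """Return a label such as '16xx_3' for a year (assumed third boundaries)."""
    century = year // 100
    # 0-32 -> first third, 33-65 -> second, 66-99 -> third (99 is capped).
    third = min((year % 100) // 33, 2) + 1
    return f"{century}xx_{third}"


if __name__ == "__main__":
    for year in (1566, 1640, 1765, 1766, 1799):
        print(year, period_label(year))
```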

A more synoptic view – table of periods, genres, document names for both collections, with links to specific files:

2.6 How many words?

First we have to decide what to count. Here we present two approaches: a word count for each document (remember: a document can contain several texts of various levels of autonomy), and a word count for each division inside each document (e. g. a chapter, but also an individual poem in a collection).

How many words in collections, documents, texts?

Documents are ordered by word counts (descending). For easier orientation, a link to the text of each document is added to its name.

Documents in LatTy at a glance:

Documents in CroALa at a glance:

Word counts in divs

Here are, once more, word counts, this time not for whole documents but for the lowest text divisions (divs)2), i. e. those that do not contain other divisions. The following XQueries also provide links to the divisions in question. Divisions are ordered by word count, descending.
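Counting words in the lowest divisions can be sketched in Python as follows (an illustration, not the paper's XQuery). It assumes TEI documents whose divisions are unnumbered `div` elements carrying an `n` attribute. A "word" is taken to be any run of letters or digits, which is only one possible tokenisation.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
WORD = re.compile(r"\w+")  # assumed tokenisation: runs of letters or digits


def leaf_div_word_counts(tei_xml: str):
    """Return (div label, word count) for divs containing no other div."""
    root = ET.fromstring(tei_xml)
    counts = []
    for div in root.iter(TEI_NS + "div"):
        # Skip any division that has a div among its descendants.
        if div.find(".//" + TEI_NS + "div") is not None:
            continue
        text = "".join(div.itertext())
        counts.append((div.get("n", "?"), len(WORD.findall(text))))
    # Order by word count, descending, as in the listings described above.
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy document, not taken from either collection.
    sample = (
        '<text xmlns="http://www.tei-c.org/ns/1.0"><body>'
        '<div n="1"><head>Liber</head>'
        '<div n="1.1"><p>Gallia est omnis divisa</p></div>'
        '<div n="1.2"><l>Arma virumque cano</l></div></div>'
        '<div n="2"><p>Finis</p></div></body></text>'
    )
    print(leaf_div_word_counts(sample))
```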

What do type/token ratios look like?

So far we have been counting words (tokens). Now we will count word types, i. e. unique words, not their repeated occurrences. Word types are not lemmatised or normalised – we count 'terra' and 'terrae' (and 'terrę' and 'terraeque') as different types.

The type/token ratio (the number of types divided by the number of tokens, usually shortened to TTR) provides quick information on lexical variation.
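The calculation itself is simple; a minimal Python sketch (not the paper's XQuery) is given below. Word forms are compared exactly as written, without lemmatisation, so 'terra' and 'terrae' count as two types. Lowercasing before comparison is an added assumption, and the tokenisation (runs of letters or digits) is only one possible choice.

```python
import re

WORD = re.compile(r"\w+")  # assumed tokenisation: runs of letters or digits


def type_token_ratio(text: str) -> float:
    """Number of distinct word forms divided by the number of word tokens."""
    tokens = [token.lower() for token in WORD.findall(text)]  # lowercasing: assumption
    if not tokens:
        return 0.0  # convention chosen here for empty input
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Five tokens, four distinct forms.
    print(type_token_ratio("terra terrae terra terrę terraeque"))
```

Because short passages repeat few words, a very short div can easily reach a TTR of 1; values are best compared between units of similar length.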

TTR in CroALa texts:

TTR in LatTy texts:

TTR in divs, CroALa:

TTR in divs, LatTy:

2.7 Which divs in the 17th century have a TTR of 0.5?

See all divs in a certain period: (try changing the last part of the URL address, e. g. to

2.8 Verses

How many verses in all, how many verses in texts?

How many verses in periods?

collection	period	documents	verses	words
latty	15xx_3	6	6,959	43,555
croala	15xx_3	17	7,998	50,584
latty	16xx_1	3	2,127	12,190
croala	16xx_1	7	314	1,919
latty	16xx_2	1	95	397
croala	16xx_2	6	1,391	8,589
latty	16xx_3	1	573	3,055
croala	16xx_3	14	28,249	176,179
latty	17xx_1	1	380	2,396
croala	17xx_1	7	5,014	32,086
latty	17xx_2	3	2,422	15,361
croala	17xx_2	12	20,327	133,094
latty	17xx_3	2	487	2,576
croala	17xx_3	25	97,259	647,815


2.9 Frequent words - modal adverbs and discourse particles

How are scilicet, videlicet, quippe, nempe and nimirum5), and nam, igitur, enim, at, autem, ergo, vero6) distributed in both collections? How are they distributed by author, by text type, by period, by genre, by verses and paragraphs?
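A first answer to such questions can be approximated by counting, for each particle, how often it occurs and in how many text sections it is absent. The Python sketch below (not the paper's XQuery) does this for a list of plain-text sections. Matching is by exact lowercase word form, so it cannot tell the particle 'vero' from the same form of the adjective 'verus'. `particle_profile` is a hypothetical name.

```python
import re

WORD = re.compile(r"\w+")  # assumed tokenisation: runs of letters or digits

# The twelve forms named in the question above.
PARTICLES = [
    "scilicet", "videlicet", "quippe", "nempe", "nimirum",
    "nam", "igitur", "enim", "at", "autem", "ergo", "vero",
]


def particle_profile(sections, particles=PARTICLES):
    """Map each particle to (total occurrences, number of sections lacking it)."""
    tokenised = [[t.lower() for t in WORD.findall(s)] for s in sections]
    profile = {}
    for particle in particles:
        occurrences = 0
        missing = 0
        for tokens in tokenised:
            n = tokens.count(particle)
            occurrences += n
            if n == 0:
                missing += 1
        profile[particle] = (occurrences, missing)
    return profile


if __name__ == "__main__":
    # Toy sections, not taken from either collection.
    toy = ["Nam igitur venit.", "Quippe enim nam dixit, nam tacuit.", "Nihil hic."]
    for particle, (count, absent) in particle_profile(toy).items():
        print(particle, count, absent)
```

Running the same function over sections grouped by author, period or genre would give the further breakdowns asked for above.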


Literary history, and perhaps the humanities in general, work well with examples, especially with highlights and peaks. We tell the story of a literature by concentrating on the few cases which we, or the tradition, pronounce the best, or the most important, or the most characteristic. We readily accept such an approach not only because of human cognitive limitations, but also because we expect clear and conclusive results from the story of a literature: this is what is important, this is what we need to know.

The need for conclusiveness and the pull of the exemplary caused a lot of the unease I felt during this experiment, because in it the approach was just the opposite. The computer did not analyse some texts, but all texts; works of national heroes and classics received the same treatment as non-literary documents or occasional poems by schoolteachers. Moreover, both collections were very obviously constructs – they certainly represented themselves, but the reality less certainly – and the amount and completeness of results made it painfully clear how much our conclusions usually simplify and disregard (the definition of “text” as a homogeneous whole, problematic in the case of a collection of letters or poems, or in a text with many paratexts, and the definition of “author”, problematic in cases of anonymity, translation, or joint authorship, are just some examples).

But several points and dimensions of similarity and dissimilarity between the collections were definitely identified.

CroALa is larger (over five million words, 190 authors, 450 documents); LatTy is smaller (over 370,000 words, 13 authors, 20 documents). CroALa has wider coverage of authors, periods, and genres. Documents in LatTy were written between 1566 and 1799 and belong to no more than seven genres, while in CroALa there are 76 different genre labels. In CroALa there are more textual units in prose (61%); in LatTy 60% of texts are dominantly poetic. In CroALa there is no drama, of which LatTy has a significant corpus. Diachronically, for periods covered by both collections, a shift in intensity is noticeable: the years 1566-1765, when texts in LatTy were predominantly being written, are comparatively less well populated by texts of CroALa, whose best coverage belongs to periods before and after LatTy's focus (years 1500-1565 and 1766-1799). In LatTy, one document stands out by its length: the collection of letters by Benedikt Stephani from 1640-1671 (226,731 words), almost tenfold the length of the next longest document. The shortest document (Applausus ad principem Eugenium, 1669) has 273 words; excluding these two extremes, the other documents in LatTy are on average about 8,200 words long. In CroALa, excluding the three documents with more than 300,000 words, the average text length is somewhat lower than in LatTy: c. 7,600 words. But average lengths of text sections (divs), again excluding the extremes, are roughly the same in both collections: 400 words for LatTy, 390 for CroALa.

A survey of type-token ratios (lexical variation) for texts and their sections (divs) finds in LatTy a TTR range of 0.82-0.15 (for 20 texts) and a much tighter 1-0.45 (for 859 divs); in CroALa, ranges are 1-0.08 (in 518 texts) and 1-0.18 (in 13,740 divs). Our queries produced several indices of passages with similar TTR values; but a more detailed investigation is necessary to find out which differences in lexical variation are noticeable in reading or significant in stylistic analysis. A useful further step seems to be combining TTR with other criteria, such as periods, authors, genres, and concentrating on very clear cases (few results, strong contrasts). Moving from sheer numbers to actual words, a large-scale search for completely identical clausulae in 12,692 lines of verse in LatTy and 241,829 verses in CroALa turned up 25 cases of clausulae repeated in LatTy and present in CroALa as well, and 236 clausulae appearing only once in LatTy, but again present in CroALa. This is an excellent task for computers; but an inquiry into causes of these similarities – common models and conventions? reading lists and educational framework? or perhaps even a direct intertextual contact of which until now we have not been aware? – seems quite promising for humans as well.
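The clausula comparison can be approximated as in the Python sketch below. It is not the XQuery used for the paper, and it rests on a strong assumption: a clausula is taken to be simply the last two words of a verse, lowercased, with punctuation ignored. The function names and the `n_words` parameter are hypothetical.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumed tokenisation: runs of letters or digits


def clausula(verse: str, n_words: int = 2):
    """Return the last n_words of a verse as a string, or None if too short."""
    words = [w.lower() for w in WORD.findall(verse)]
    if len(words) < n_words:
        return None
    return " ".join(words[-n_words:])


def shared_clausulae(verses_a, verses_b, n_words: int = 2):
    """Clausulae of collection A that also occur in collection B.

    Returns two sorted lists: those repeated within A, and those
    occurring only once in A.
    """
    endings_a = (clausula(v, n_words) for v in verses_a)
    counts_a = Counter(c for c in endings_a if c is not None)
    endings_b = {clausula(v, n_words) for v in verses_b}
    endings_b.discard(None)
    repeated = sorted(c for c, k in counts_a.items() if k > 1 and c in endings_b)
    single = sorted(c for c, k in counts_a.items() if k == 1 and c in endings_b)
    return repeated, single


if __name__ == "__main__":
    # Toy verse endings, for illustration only.
    a = ["primus ab oris", "venit ab oris", "sub tegmine fagi"]
    b = ["longe petit ab oris", "recubans sub tegmine fagi", "aliud carmen"]
    print(shared_clausulae(a, b))
```

A stricter definition of the clausula (for instance the last two metrical feet) would need scansion, which this sketch does not attempt.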

Another large-scale search that enabled me to compare the usages of nine Latin modal and discourse particles brought out an interesting methodological moment as well. “Macroscopic” analyses make it possible to think not only about positive, but also about negative results, ones which the researchers usually do not even consider. Counts of text sections from which certain particles are absent suggest that in both collections the “most missing” particles are the same (videlicet, scilicet, quippe, and nempe), but the “least missing” rankings correspond only for nam; enim, autem, and igitur are distributed differently in the two collections.

We are moving towards the point where, even for an exotic language such as Latin, we will no longer access just a few texts after a long process of demanding manipulation; instead, all texts (or at any rate too many to read in a lifetime) will be instantly accessible. What shall we do at that point? I hesitate to propose that we all become statisticians, because this is not what we do best, this is not what we enjoy most, and, finally, this is not what these texts were written for. But I also hesitate to propose that we go on interpreting texts and narrating literary history the way we were taught to, because such partial approaches disregard much of what is within our reach. What I certainly want to do is to invite you to explore further similarities and differences of CroALa and LatTy. Nascitur hic nobis alius, non desinit orbis.

2) Cf. TEI Guidelines: 4.1 Divisions of the Body.
3) Cf. Mastandrea, Paolo and Tessarolo, Luigi. De fine versus: repertorio di clausole ricorrenti nella poesia dattilica latina dalle origini a Sidonio Apollinare. Hildesheim; Zürich [etc.]: Olms-Weidmann, 1993.
4) XQuery source here: quad-xquery-clausulae3
6) Cf. Langslow, D. R. (2000), 'Latin Discourse Particles, "Medical Latin" and "Classical Latin"', Mnemosyne 53 (5), pp. 537-560.
z/crotyr-quadrata.txt · Last modified: 2015/05/01 20:44 by njovanovic