The Da Vinci Code


T-LAB Tools for Text Analysis

Short Samples:
The Da Vinci Code

(last update: March 15th, 2005. The version of T-LAB used was 4.2)

NOTICE: The following example was produced with an old version of T-LAB (4.2).
The latest version includes new tools and a new charting system.

The idea for this short sample came from a conversation between the author of T-LAB and a reader of Dan Brown's novel.
The former, who had not read the novel, was interested in testing a clustering algorithm (see Thematic Document Classification).
The latter, a careful and passionate reader, provided valuable suggestions for the analysis.

Common objective: to verify whether, and in what way, T-LAB was a good tool for constructing a representation of the book's "contents".

- transformation of the novel into a corpus subdivided into 105 context units (i.e. primary documents), each corresponding to a chapter;
- use of T-LAB functions for linguistic pre-processing, in particular: a) grouping the proper nouns that identify the characters (e.g. "Aringarosa" and "Bishop Aringarosa", "Sophie" and "Sophie Neveu", "Collet" and "Lieutenant Collet", etc.); b) automatic lemmatization;
- selection, by means of a T-LAB function, of 1,052 lexical units (i.e. words, lemmas or lexies);
- use of a clustering algorithm (a version of bisecting K-means) to analyse a 105 x 1,052 matrix (context units x lexical units);
- measure of similarity used: the cosine coefficient.
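
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not T-LAB's implementation: the toy "chapters", the alias table and every function name are invented for the example, lemmatization is skipped, and the splitting rule (always bisect the largest cluster, seeding with its two least similar rows) is just one simple variant of bisecting K-means.

```python
import math
from collections import Counter

# Hypothetical alias table: variants of a character's name become one token.
ALIASES = {"bishop aringarosa": "aringarosa", "sophie neveu": "sophie",
           "lieutenant collet": "collet"}

def tokenize(text):
    text = text.lower()
    for alias, name in ALIASES.items():
        text = text.replace(alias, name)   # naive substring replacement
    return text.split()

def build_matrix(chapters):
    """Return (vocabulary, rows); rows[i][j] counts word j in chapter i."""
    counts = [Counter(tokenize(ch)) for ch in chapters]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def bisect(rows, members, iterations=10):
    """Split one cluster (a list of row indices) in two, using cosine similarity."""
    # Deterministic seeds: the two least similar rows of the cluster.
    _, a, b = min((cosine(rows[i], rows[j]), i, j)
                  for i in members for j in members if i < j)
    centres = [rows[a], rows[b]]
    # Fallback split, kept only if an assignment step leaves one side empty.
    halves = ([i for i in members if i != b], [b])
    for _ in range(iterations):
        new = ([], [])
        for i in members:
            closer_to_first = cosine(rows[i], centres[0]) >= cosine(rows[i], centres[1])
            new[0 if closer_to_first else 1].append(i)
        if not new[0] or not new[1]:
            break
        halves = new
        centres = [centroid([rows[i] for i in half]) for half in halves]
    return halves

def bisecting_kmeans(rows, k):
    """Partition row indices into k clusters by repeatedly bisecting the largest."""
    if not 1 <= k <= len(rows):
        raise ValueError("k must be between 1 and the number of rows")
    clusters = [list(range(len(rows)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        clusters.extend(bisect(rows, clusters.pop()))
    return clusters

# Toy data standing in for chapters (not text from the novel).
chapters = ["Sophie Neveu and Langdon study the cryptex",
            "Langdon and Sophie open the cryptex",
            "Bishop Aringarosa phones Silas",
            "Silas reports to Aringarosa"]
vocab, rows = build_matrix(chapters)
clusters = bisecting_kmeans(rows, 2)
```

With the real corpus, `rows` would be the 105 x 1,052 matrix and `k` would range over the candidate numbers of clusters.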

- after a number of checks, a solution with 4 clusters was chosen (NB: the experimental version of this T-LAB function makes it easy to explore subdivisions from 3 to 10 clusters). The following tables summarize the clusters' characteristics: the first 35 typical words of each cluster, selected by chi-square test.
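
One way to rank "typical words" by chi-square is sketched below. The source does not describe how T-LAB builds its test, so the 2 x 2 layout (this word vs. all other words, this cluster vs. all other clusters), the function names and the toy counts are assumptions made for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square (1 d.f., no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def typical_words(word_counts, cluster, top=35):
    """Rank the words over-represented in `cluster`.

    word_counts maps word -> {cluster_label: occurrences in that cluster}.
    """
    cluster_total = sum(c.get(cluster, 0) for c in word_counts.values())
    grand_total = sum(sum(c.values()) for c in word_counts.values())
    scored = []
    for word, counts in word_counts.items():
        a = counts.get(cluster, 0)         # this word, inside the cluster
        b = sum(counts.values()) - a       # this word, in the other clusters
        c = cluster_total - a              # other words, inside the cluster
        d = grand_total - a - b - c        # other words, in the other clusters
        if a * d > b * c:                  # keep over-represented words only
            scored.append((chi_square_2x2(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top]]

# Toy counts (hypothetical, not taken from the analysis).
toy_counts = {"cryptex": {1: 30, 2: 2},
              "bishop": {1: 1, 2: 25},
              "the": {1: 100, 2: 100}}
typical_in_1 = typical_words(toy_counts, 1)
typical_in_2 = typical_words(toy_counts, 2)
```

Because the score is computed per word and per cluster, the same word can appear in the list of more than one cluster.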

As we can observe, the clusters allow us to identify four different "themes", i.e. four subsets of words co-occurring within the same context units.
NB: Whereas the same "word" can appear in more than one cluster, each chapter (i.e. each row of the analysed matrix) belongs to one cluster only. The distribution of chapters across clusters is as follows:

The same T-LAB function allows us to analyse a table of words x clusters (in this case 1,052 rows x 4 columns) and to represent it by means of Correspondence Analysis. The following is one of the charts produced.
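
For readers unfamiliar with Correspondence Analysis, the following sketch computes the total inertia and the first axis of a small contingency table in plain Python. It is a generic textbook formulation (standardized residuals plus power iteration), not T-LAB's code; the function name and the toy table are assumptions, and the table is assumed to have no empty rows or columns.

```python
import math

def first_ca_axis(table, iterations=200):
    """Return (total_inertia, first_eigenvalue, row_coords, col_coords)."""
    n = sum(sum(row) for row in table)
    r = [sum(row) / n for row in table]              # row masses
    c = [sum(col) / n for col in zip(*table)]        # column masses
    # Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j)
    s = [[(x / n - ri * cj) / math.sqrt(ri * cj) for x, cj in zip(row, c)]
         for row, ri in zip(table, r)]
    inertia = sum(x * x for row in s for x in row)   # equals chi-square / n
    # Power iteration on S^T S for the leading right singular vector.
    v = [1.0 / (j + 1) for j in range(len(c))]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    eigenvalue = 0.0
    for _ in range(iterations):
        u = [sum(a * b for a, b in zip(row, v)) for row in s]           # S v
        w = [sum(a * b for a, b in zip(col, u)) for col in zip(*s)]     # S^T (S v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                              # no association at all
            eigenvalue = 0.0
            break
        eigenvalue, v = norm, [x / norm for x in w]
    sigma = math.sqrt(eigenvalue)
    # Principal coordinates on the first axis.
    row_coords = [sum(a * b for a, b in zip(row, v)) / math.sqrt(ri)
                  for row, ri in zip(s, r)]
    col_coords = [sigma * vj / math.sqrt(cj) for vj, cj in zip(v, c)]
    return inertia, eigenvalue, row_coords, col_coords

# Toy 2 x 2 table (two words x two clusters), invented for the example.
inertia, eigenvalue, row_coords, col_coords = first_ca_axis([[20, 5], [5, 20]])
```

For the table in the text, the input would have 1,052 rows and 4 columns, and a chart would plot the coordinates on the first two axes; only the first axis is computed here.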


- The same procedure has been applied to the Italian version of Dan Brown's novel. For the most part, the results match;
- The T-LAB function under test will allow the clustering of two kinds of context units: primary documents defined by the user (e.g. newspaper articles, web pages, responses to open-ended questions, etc.) and elementary contexts corresponding to sentences. In the first case the rows of the matrix will contain frequency values; in the second, presence/absence values (1/0).
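
The difference between the two kinds of rows can be shown with a short sketch; the function names and the sample tokens are hypothetical.

```python
from collections import Counter

def frequency_row(tokens, vocabulary):
    """Row for a primary document: how many times each word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def presence_row(tokens, vocabulary):
    """Row for an elementary context (sentence): 1 if the word occurs, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = ["cryptex", "grail", "langdon"]
tokens = ["langdon", "opens", "the", "cryptex", "langdon", "smiles"]
freq = frequency_row(tokens, vocabulary)      # [1, 0, 2]
binary = presence_row(tokens, vocabulary)     # [1, 0, 1]
```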

About The Da Vinci Code: a further T-LAB tool (Word Associations) allows us to have fun in a different way (see below).
