Short Samples:
The Da Vinci Code (last update: March 15th, 2005; T-LAB version used: 4.2)
NOTICE:
The following example was produced with an old version of T-LAB (4.2). The latest version includes new tools and a new charting system. Click here to find out more.
The idea for this short sample came from a conversation between the author of T-LAB and a reader of Dan Brown's novel. The former, without having read the novel, was interested in testing a clustering algorithm (see Thematic Document Classification). The latter, as a careful and passionate reader, provided valuable suggestions for the analysis.
Common objective: to verify whether and how T-LAB could serve as a good tool for constructing a representation of the book's "contents".
Methods:
- transformation of the novel into a corpus subdivided into 105 context units (i.e. primary documents), each corresponding to a chapter;
- use of T-LAB functions for linguistic pre-processing, in particular: a) grouping the proper nouns used to identify the characters (e.g. "Aringarosa" and "Bishop Aringarosa", "Sophie" and "Sophie Neveu", "Collet" and "Lieutenant Collet", etc.); b) automatic lemmatization;
- selection, by means of a T-LAB function, of 1,052 lexical units (i.e. words, lemmas or lexies);
- use of a clustering algorithm (a version of bisecting K-means) to analyse a 105 x 1,052 matrix (context units x lexical units), as sketched in the code example after this list;
- measure of similarity used: the cosine coefficient.
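For readers who want to try something similar outside T-LAB, the following is a minimal sketch in Python (using scikit-learn) of a cosine-based bisecting K-means applied to a chapters x words matrix. It is only an approximation of the procedure described above, not the T-LAB implementation, and the chapter file names are hypothetical.

from pathlib import Path

import numpy as np
from sklearn.cluster import BisectingKMeans            # requires scikit-learn >= 1.1
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# 105 chapters = 105 primary documents (context units); file names are hypothetical
chapters = [p.read_text(encoding="utf-8") for p in sorted(Path("chapters").glob("*.txt"))]

# Build the context units x lexical units matrix (plain word counts here;
# T-LAB works on lemmas and grouped proper nouns after its pre-processing).
vectorizer = CountVectorizer(max_features=1052, stop_words="english")
X = vectorizer.fit_transform(chapters)                  # shape: (105, 1052)

# L2-normalising the rows makes Euclidean K-means behave like a clustering
# based on the cosine coefficient mentioned above.
X_cos = normalize(X, norm="l2")

model = BisectingKMeans(n_clusters=4, random_state=0)
labels = model.fit_predict(X_cos)                       # one cluster label per chapter

for k in range(4):
    print(f"Cluster {k}: chapters {np.where(labels == k)[0] + 1}")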
Results:
- after a number of checks, a solution with 4 clusters was chosen (NB: the experimental version of this T-LAB function allows us to easily explore subdivisions from 3 to 10 clusters). The following tables summarize their characteristics: the first 35 typical words of each cluster, selected by the chi-square test.
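A chi-square ranking of typical words could be approximated as follows. This is only a sketch, reusing the X, vectorizer and labels objects from the previous example; it does not reproduce T-LAB's exact selection procedure.

import numpy as np
from scipy.stats import chi2_contingency

words = vectorizer.get_feature_names_out()
totals = np.asarray(X.sum(axis=0)).ravel()              # total occurrences of each word

def typical_words(cluster_id, top_n=35):
    rows = np.flatnonzero(labels == cluster_id)
    inside = np.asarray(X[rows].sum(axis=0)).ravel()     # occurrences inside the cluster
    outside = totals - inside
    in_total, out_total = inside.sum(), outside.sum()
    scored = []
    for word, a, b in zip(words, inside, outside):
        # 2 x 2 table: this word vs all other words, inside vs outside the cluster
        chi2, _, _, _ = chi2_contingency([[a, in_total - a], [b, out_total - b]])
        if a / in_total > b / out_total:                 # keep over-represented words only
            scored.append((chi2, word))
    return [word for _, word in sorted(scored, reverse=True)[:top_n]]

print(typical_words(0))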
As we can observe, the clusters allow us to identify four different "themes", i.e. four subsets of words co-occurring within the same context units. NB: whereas the same "word" can appear in more than one cluster, each chapter (i.e. each row of the analysed matrix) belongs to one cluster only. Their distribution is as follows:
The same T-LAB function allows us to analyse a table of words x clusters (in this case 1,052 rows x 4 columns) and to represent it by means of Correspondence Analysis. The following is one of the charts produced.
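The Correspondence Analysis of the words x clusters table can also be sketched from scratch with standard linear algebra. The code below assumes the objects from the previous examples and reproduces the textbook CA computation, not T-LAB's own charting.

import numpy as np

# Build the 1,052 x 4 table: total occurrences of each word in each cluster.
N = np.column_stack([
    np.asarray(X[np.flatnonzero(labels == k)].sum(axis=0)).ravel()
    for k in range(4)
]).astype(float)                                        # shape: (n_words, 4)

P = N / N.sum()                                          # correspondence matrix
r = P.sum(axis=1)                                        # row masses (words)
c = P.sum(axis=0)                                        # column masses (clusters)

# Standardised residuals and their SVD give the CA factors.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * sing) / np.sqrt(r)[:, None]            # word coordinates
col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]         # cluster coordinates

print("Cluster coordinates on the first two factors:")
print(col_coords[:, :2])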
NB:
- The same procedure was applied to the Italian version of Dan Brown's novel. For the most part, the results match (see);
- The T-LAB function that we are testing allows the clustering of two kinds of context units: primary documents defined by the user (e.g. newspaper articles, web pages, responses to open-ended questions, etc.) and elementary contexts corresponding to sentences. In the first case, the rows of the matrix contain frequency values, in the second presence/absence values (1/0), as sketched below.
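As a rough illustration of the difference between the two kinds of matrices, the following hypothetical snippet builds a frequency matrix for user-defined documents and a presence/absence matrix for sentences (the example texts are placeholders).

from sklearn.feature_extraction.text import CountVectorizer

documents = ["text of a first newspaper article ...",       # primary documents (user-defined)
             "text of a second newspaper article ..."]
sentences = ["A first sentence.", "A second sentence."]      # elementary contexts

vec = CountVectorizer()
doc_matrix = vec.fit_transform(documents)                    # rows contain frequency values

# binary=True records only the presence/absence (1/0) of each word in a sentence
sent_matrix = CountVectorizer(binary=True, vocabulary=vec.vocabulary_).fit_transform(sentences)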
About The Da Vinci Code: a further T-LAB tool (Word Associations) allows us to have fun in a different way (see below).
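For instance, a toy analogue of a word-association query can be obtained by ranking words according to the cosine similarity between their occurrence profiles across chapters and that of a chosen target word. The snippet reuses the objects from the earlier sketches, and the target word is purely illustrative, not taken from T-LAB's output.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

words = list(vectorizer.get_feature_names_out())
target = "grail"                                      # hypothetical target word
i = words.index(target)                               # raises ValueError if the word is not among the 1,052 selected

profiles = X.T.tocsr()                                # each row: one word's occurrence profile over the 105 chapters
sims = cosine_similarity(profiles[i], profiles).ravel()

top = np.argsort(sims)[::-1][1:11]                    # the best match is the target itself, so skip it
print([(words[j], round(float(sims[j]), 2)) for j in top])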