Short Samples:
The Da Vinci Code (last update: March 15th, 2005; T-LAB version used: 4.2)
NOTICE:
The following example was produced with an old version of T-LAB (4.2). The latest version (T-LAB 7.0) includes new tools and a new charting system.
The idea for this short sample came from a conversation between the author of T-LAB and a reader of Dan Brown's novel.
The former, without having read the novel, was interested in testing a clustering algorithm (see Thematic Document Classification).
The latter, a careful and passionate reader, provided valuable suggestions for the analysis.
Common objective: to verify whether, and how, T-LAB is a good tool for constructing a representation of the book's "contents".
Methods:
- transformation of the novel into a corpus subdivided into 105 context units (i.e. primary documents), each corresponding to a chapter;
- use of T-LAB functions for linguistic pre-processing. In particular: a) grouping of the proper nouns used for identifying the characters (e.g. "Aringarosa" and "Bishop Aringarosa", "Sophie" and "Sophie Neveu", "Collet" and "Lieutenant Collet", etc.); b) automatic lemmatization;
- selection, by means of a T-LAB function, of 1,052 lexical units (i.e. words, lemmas or lexies);
- use of a clustering algorithm (a version of bisecting K-means) for analysing a 105 x 1,052 matrix (context units x lexical units);
- measure of similarity used: the cosine coefficient.
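The clustering step listed above can be illustrated with a minimal sketch of bisecting K-means using the cosine coefficient. This is not T-LAB's implementation: the function names, the rule of always splitting the largest cluster, the fixed number of iterations and the toy 6 x 4 matrix are all assumptions made for illustration.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length, so that dot products equal cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine coefficient between two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def centroid(rows):
    """Unit-length mean vector of a group of rows."""
    n_cols = len(rows[0])
    return normalize([sum(r[j] for r in rows) / len(rows) for j in range(n_cols)])

def two_means(rows, rng, n_iter=20):
    """Split rows into two groups with a cosine-based 2-means."""
    first, second = rng.sample(range(len(rows)), 2)
    centers = [rows[first], rows[second]]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [0 if cosine(r, centers[0]) >= cosine(r, centers[1]) else 1
                  for r in rows]
        for c in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels

def bisecting_kmeans(matrix, k, seed=0):
    """Bisect the largest cluster (an assumed rule) until k clusters exist."""
    rng = random.Random(seed)
    rows = [normalize(r) for r in matrix]
    clusters = [list(range(len(rows)))]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        if len(target) < 2:            # nothing left to split
            clusters.append(target)
            break
        labels = two_means([rows[i] for i in target], rng)
        left = [i for i, lab in zip(target, labels) if lab == 0]
        right = [i for i, lab in zip(target, labels) if lab == 1]
        if not left or not right:      # degenerate split, stop early
            clusters.append(target)
            break
        clusters += [left, right]
    return clusters

# Toy matrix (invented): 6 "chapters" x 4 "lexical units".
toy = [
    [5, 4, 0, 0], [4, 5, 1, 0], [6, 3, 0, 1],
    [0, 1, 5, 4], [1, 0, 4, 5], [0, 0, 6, 3],
]
clusters = bisecting_kmeans(toy, k=2)
print(sorted(sorted(c) for c in clusters))
```

Each row index ends up in exactly one cluster, which mirrors the fact that every chapter belongs to one cluster only. For the real analysis the input would be the 105 x 1,052 matrix and k would range from 3 to 10.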
Results:
- after a number of checks, a solution with 4 clusters was chosen (NB the experimental version of this T-LAB function allows us to easily explore subdivisions from 3 to 10 clusters). The following tables summarize their characteristics: the first 35 typical words of each cluster, selected by means of the chi-square test.
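As an illustration of how typical words can be selected with a chi-square test, here is a minimal sketch based on the standard 2 x 2 contingency table (word vs. other words, cluster vs. other clusters). The exact computation used by T-LAB is not described here, so the table layout, the function names and the toy counts are assumptions.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic of the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

def typical_words(counts, cluster, top=35):
    """Rank the words over-represented in `cluster` by chi-square value.

    counts maps each word to its list of occurrences per cluster.
    """
    cluster_total = sum(row[cluster] for row in counts.values())
    grand_total = sum(sum(row) for row in counts.values())
    scored = []
    for word, row in counts.items():
        a = row[cluster]                      # word, inside the cluster
        b = sum(row) - a                      # word, outside the cluster
        c = cluster_total - a                 # other words, inside
        d = grand_total - cluster_total - b   # other words, outside
        if a * d > b * c:                     # keep over-represented words only
            scored.append((chi_square_2x2(a, b, c, d), word))
    scored.sort(reverse=True)
    return [(word, round(score, 2)) for score, word in scored[:top]]

# Toy counts (invented) for two clusters.
counts = {"grail": [30, 5], "police": [4, 25], "the": [50, 50]}
print(typical_words(counts, cluster=0))
print(typical_words(counts, cluster=1))
```

The `a * d > b * c` check keeps only words that occur in the cluster more often than expected, since the chi-square value alone does not distinguish over-representation from under-representation.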
As we can observe, the clusters allow us to identify four different "themes", i.e. four subsets of words co-occurring within the same context units.
NB: Whereas the same "word" can be in more than one cluster, each chapter (i.e. each row of the matrix analysed) belongs to one cluster only. Their distribution is as follows:

The same T-LAB function allows us to analyse a table of words x clusters (in this case 1,052 rows x 4 columns) and to represent it by means of Correspondence Analysis. The following is one of the charts produced.
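For reference, the standard textbook formulation of Correspondence Analysis for a words x clusters table can be summarized as follows. The notation is generic and is not taken from the T-LAB documentation. Let N be the I x J table (here I = 1,052 and J = 4) and n its grand total:

```latex
\[
P = \tfrac{1}{n} N, \qquad
r = P\,\mathbf{1}, \qquad
c = P^{\top}\mathbf{1}, \qquad
D_r = \operatorname{diag}(r), \qquad
D_c = \operatorname{diag}(c)
\]
\[
S = D_r^{-1/2}\,\bigl(P - r\,c^{\top}\bigr)\,D_c^{-1/2} = U\,\Sigma\,V^{\top}
\]
\[
F = D_r^{-1/2}\,U\,\Sigma \quad \text{(word coordinates)}, \qquad
G = D_c^{-1/2}\,V\,\Sigma \quad \text{(cluster coordinates)}
\]
\[
\sum_{k} \sigma_k^{2} = \frac{\chi^{2}}{n}
\]
```

With 4 columns, at most 3 dimensions carry non-zero inertia, and a chart typically plots the first two columns of F and G.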

NB:
- The same procedure was applied to the Italian version of Dan Brown's novel. For the most part, the results match (see);
- The T-LAB function being tested will allow the clustering of two kinds of context units: primary documents defined by the user (e.g. newspaper articles, web pages, responses to open-ended questions, etc.) and elementary contexts corresponding to sentences. In the first case, the rows will contain frequency values; in the second, presence/absence values (1/0).
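The difference between the two kinds of rows can be shown with a short sketch. The function name, the `binary` flag and the toy tokens are hypothetical and serve only to illustrate frequency values versus presence/absence values.

```python
def build_matrix(units, vocabulary, binary=False):
    """Build a context units x lexical units matrix.

    units: list of token lists, one per context unit.
    binary=False gives frequency values (primary documents);
    binary=True gives presence/absence values (1/0), as for sentences.
    """
    matrix = []
    for tokens in units:
        row = [tokens.count(word) for word in vocabulary]
        if binary:
            row = [1 if value > 0 else 0 for value in row]
        matrix.append(row)
    return matrix

# Toy context units (invented), already tokenized and lemmatized.
vocabulary = ["grail", "police", "secret"]
units = [
    ["grail", "secret", "grail"],
    ["police", "police", "secret"],
]
frequencies = build_matrix(units, vocabulary)
presence = build_matrix(units, vocabulary, binary=True)
print(frequencies)
print(presence)
```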
About The Da Vinci Code: a further T-LAB tool (Word Associations) allows us to have fun in a different way (see below).
