www.tlab.it
Word Associations
N.B.: The pictures shown in this section were obtained using a
previous version of TLAB and look slightly different in TLAB Plus.
In addition: a) a new option allows the user to plot an MDS Map with
the most relevant words; b) a new button (Graph Maker) allows the
user to create several dynamic charts in HTML format; c) right-clicking
on the keyword tables makes additional options available; d) a quick
access gallery of pictures, which works as an additional menu, allows
one to switch between various outputs with a single click. Some of
these new features are highlighted in the image below.
This TLAB tool
allows us to pick out cooccurrence and
similarity relationships which, within any corpus or its subset,
determine the local meaning of selected keyterms.
Such a tool can be used with the default options
(A) or through options selected by the user (B).
In the first case (A: default) word cooccurrences
are computed within the elementary
contexts (e.g. sentences, fragments, paragraphs). In the second
case (B: options selected by the user) word cooccurrences can also
be computed within n-grams (i.e.
sequences of two or more words), and the user can also
choose the minimum threshold of cooccurrences to be
considered.
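The counting step described above can be sketched as follows. This is only an illustration, not TLAB's own implementation: the function names, the `min_cooc` parameter and the use of overlapping sliding windows for n-grams are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(contexts, keyterms, min_cooc=1):
    """Count, for each keyterm pair, the contexts in which both terms appear.

    contexts : list of token lists (elementary contexts or n-grams)
    keyterms : words to be considered
    min_cooc : hypothetical name for the minimum cooccurrence threshold
    """
    keyterms = set(keyterms)
    occurrences = Counter()    # keyterm -> number of contexts containing it
    cooccurrences = Counter()  # (term_a, term_b) -> number of shared contexts
    for tokens in contexts:
        # A term is counted once per context, however often it is repeated.
        present = sorted(keyterms.intersection(tokens))
        occurrences.update(present)
        cooccurrences.update(combinations(present, 2))
    # Drop the pairs that fall below the chosen threshold.
    kept = {pair: n for pair, n in cooccurrences.items() if n >= min_cooc}
    return occurrences, kept


def ngram_windows(tokens, n):
    """Split a token list into overlapping n-grams (assumed sliding windows)."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


# Toy data, invented for the example.
contexts = [
    ["nuclear", "weapon", "treaty"],
    ["nuclear", "weapon", "option"],
    ["nuclear", "arms", "treaty"],
]
occ, cooc = count_cooccurrences(
    contexts, ["nuclear", "weapon", "treaty", "arms"], min_cooc=2
)
print(occ["nuclear"], cooc)
```

To count within n-grams instead of elementary contexts, the same function can be fed the windows returned by `ngram_windows`.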
The working window (see below) becomes available
as soon as the cooccurrences between all the
words included in the list selected by the user have been
computed.
On the left of the above window there is a table
with the keyterm list and numerical values indicating the number
of elementary contexts (EC) or n-grams where each keyterm is
present.
Either by clicking an item in the table (see '1' above) or by
clicking on a point of the charts (see '2' above) it is possible to
check the associations of each target word. Moreover, by clicking
any item of the table (see '3' above) it is possible to check
which words are included in the corresponding lemma or semantic
class.
Each time, the associated words are selected either
by computing an Association Index (see the corresponding item of
the glossary) or by computing second order similarities
(see the note at the end of this page). In the first case
six indexes are available (Cosine, Dice, Jaccard, Equivalence,
Inclusion and Mutual Information) and their computation is quite
fast. In the second case (i.e. second order similarities), as the
computation requires a large number of comparisons, it can take
several minutes. Moreover, the user has to take into account that the
greater the number of words included in the list, the more reliable
the similarity values become.
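As a rough illustration of the six indexes named above, the sketch below uses their common textbook definitions. These formulas are an assumption (the exact definitions used by TLAB are those given in its glossary), and the function and parameter names are invented for the example.

```python
import math


def association_indexes(n_a, n_b, n_ab, n_total):
    """Return six association indexes for a word pair (textbook forms).

    n_a, n_b : number of contexts containing word A / word B
    n_ab     : number of contexts containing both words
    n_total  : total number of contexts
    """
    names = ("cosine", "dice", "jaccard", "equivalence",
             "inclusion", "mutual_information")
    if n_ab == 0:
        # Convention assumed here: words that never cooccur score 0
        # on every index (mutual information would otherwise be -inf).
        return {name: 0.0 for name in names}
    return {
        "cosine": n_ab / math.sqrt(n_a * n_b),
        "dice": 2 * n_ab / (n_a + n_b),
        "jaccard": n_ab / (n_a + n_b - n_ab),
        "equivalence": n_ab ** 2 / (n_a * n_b),
        "inclusion": n_ab / min(n_a, n_b),
        "mutual_information": math.log2(n_ab * n_total / (n_a * n_b)),
    }


# Invented counts: A in 40 contexts, B in 25, both in 10, out of 1000.
for name, value in association_indexes(40, 25, 10, 1000).items():
    print(f"{name:20s} {value:.4f}")
```

All six values are derived from the same four counts, which is why these indexes are quick to compute.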
For each query, TLAB produces graphs and tables. Both graphs
and tables can be saved using the appropriate buttons.
In the radial diagrams
the selected lemma is placed in the center. The others are
distributed around it, each at a distance proportional to its degree
of association. The significant relationships are therefore
one-to-one, between the central lemma and each of the others.
Each click on an item produces a new chart and, by right-clicking,
it is possible to open a dialog box which
allows several customizations (see below).
Tables reporting
various measures allow us to check the relationships between
occurrences and cooccurrences concerning the words (up to 50) that
are most strongly associated with the target word.
The reading keys are as follows:
· LEMMA (A) = selected lemma;
· LEMMA (B) = lemmas associated with LEMMA (A);
· COEFF = value of the selected index;
· TOT EC = total number of elementary contexts (EC) or n-grams in
the corpus or in the analysed subset;
· EC_A = total number of EC that contain the selected lemma (A);
· EC_B = total number of EC that contain each associated lemma (B);
· EC_AB = total number of EC where lemmas "A" and "B" are
associated (cooccurrences);
· CHI2 = chi square value concerning the cooccurrence
significance;
· (p) = probability associated with the chi square value
(degrees of freedom = 1).
In the case of the chi square test,
for each pair of lemmas ("A" and "B") the structure of the
analysed table is the following:
Where: nij = EC_AB; Nj = EC_A; Ni = EC_B; N = TOT EC.
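The test can be illustrated with the sketch below, which builds the 2 x 2 table from the four counts and applies the standard Pearson formula with one degree of freedom. This is an assumption about the computation: whether TLAB applies a continuity correction is not stated here, and none is used in the example. The function name and the sample counts are invented.

```python
import math


def chi_square_2x2(ec_ab, ec_a, ec_b, tot_ec):
    """Pearson chi square (1 degree of freedom) for a lemma pair.

    ec_ab  : contexts containing both A and B (nij)
    ec_a   : contexts containing A (Nj)
    ec_b   : contexts containing B (Ni)
    tot_ec : total number of contexts (N)
    """
    # Observed cells of the 2 x 2 table.
    both = ec_ab
    b_only = ec_b - ec_ab
    a_only = ec_a - ec_ab
    neither = tot_ec - ec_a - ec_b + ec_ab
    # Product of the four marginal totals.
    denominator = ec_a * (tot_ec - ec_a) * ec_b * (tot_ec - ec_b)
    if denominator == 0:
        raise ValueError("a marginal total is zero; the test is undefined")
    chi2 = tot_ec * (both * neither - b_only * a_only) ** 2 / denominator
    # Upper-tail probability of a chi square variable with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Invented counts: A in 40 contexts, B in 25, both in 10, out of 1000.
chi2, p = chi_square_2x2(ec_ab=10, ec_a=40, ec_b=25, tot_ec=1000)
print(f"CHI2 = {chi2:.3f}, p = {p:.3g}")
```

When the observed cooccurrences equal the value expected under independence (EC_A * EC_B / TOT EC), the statistic is 0 and p is 1.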
A click on each table item (e.g. 'financial') allows us to save an
HTML file with all the elementary contexts (i.e. sentences or
paragraphs) where the selected lemma cooccurs with the central
word (e.g. 'financial' and 'terrorist').
Further graphs (bar charts) allow us to appreciate the
values of the coefficient used and the
percentage of cooccurrence contexts
(see below).
By clicking the button at the bottom left, the user
can export various types of tables (see the picture below).
A specific TLAB
window (see the picture below) allows us to create
various files which can be edited by software for network analysis (e.g. Gephi, Pajek, Ucinet, yEd
and others). In this case the nodes
are words associated with the target keyterm; so each time it is
possible to map the local network of such a term. The available
options are the following: select the words (i.e. the 'nodes') to
be inserted into the graph (see steps 2 and 3 below), export the
corresponding adjacency matrix (see step 4 below), export the
selected graphical file (see step 5 below).
N.B.: In TLAB Plus the
following window has been replaced by the Graph Maker tool.
For example, .gml files exported by TLAB
allow us to create graphs like the following.
N.B.: The first of the above graphs was
created with Gephi (https://gephi.org/), the second with
yEd (http://www.yworks.com/en/products_yed_download.html/),
both of which are available as free downloads.
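To give an idea of what such a file contains, the sketch below writes a minimal GML description of the local network of a target keyterm. It is not the layout produced by TLAB: the function name, the `value` edge attribute and the sample index values are assumptions, and only the links between the target and its associated words are written.

```python
def to_gml(target, associations):
    """Build a minimal GML string for the local network around `target`.

    associations : dict mapping each associated word to its index value
    """
    words = [target] + sorted(associations)
    ids = {word: i for i, word in enumerate(words)}
    lines = ["graph [", "  directed 0"]
    # One node per word; the target keyterm always gets id 0.
    for word in words:
        lines.append(f'  node [ id {ids[word]} label "{word}" ]')
    # One undirected edge from the target to each associated word.
    for word, value in sorted(associations.items()):
        lines.append(
            f"  edge [ source {ids[target]} target {ids[word]} value {value} ]"
        )
    lines.append("]")
    return "\n".join(lines)


# Invented index values for two words associated with 'terrorist'.
gml_text = to_gml("terrorist", {"financial": 0.21, "attack": 0.35})
print(gml_text)
```

The resulting text can be saved with a .gml extension and opened with a network analysis program.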
The way TLAB
computes the association (or proximity) indexes is
illustrated in the corresponding section of the Manual/Help (see
the glossary). All these 'first order'
indexes are obtained through a normalization of the cooccurrence
values concerning word pairs; so, in such computations, two words
which never cooccur have an association index equal to '0' (zero).
By contrast, the 'second order' indexes highlight similarity
phenomena which are not directly related to cooccurrences between
word pairs; in fact, in this case, two words which never cooccur
can nevertheless have a high similarity index.
With reference to structural linguistics, we could say that
'first order' indexes point out phenomena concerning the
syntagmatic axis ('in praesentia' combination and proximity, i.e.
each word 'near to' the other), whereas 'second order' indexes
point out phenomena concerning the paradigmatic axis ('in absentia'
association and similarity, i.e. quasi-synonymy between keyterms
used within the same corpus).
In order to understand how TLAB
computes 'second order' similarities it is useful to
recall that all 'first order' indexes can be gathered in proximity
matrices like the following (Matrix 'A').
Matrix 'A' - First Order Similarities
In the above symmetric matrix 'A' the values in
yellow (0.373) correspond to the highest 'first order' similarity
between the selected words and indicate the association between
words 'w_03' and 'w_10'. More specifically, 0.373 is an equivalence
index obtained by dividing their squared cooccurrences by the
product of their occurrences, i.e. 360^2 / (627 * 553).
Starting from the above 'A' matrix, TLAB builds a second matrix (see 'B' below)
obtained by computing all cosine coefficients between all 'A'
columns. For example, in matrix 'B' below the highest similarity
index (the one in green: 0.905) has been obtained by computing the
cosine coefficient between the corresponding columns of the 'A'
matrix (i.e. w_06 and w_10), the 'first order' similarity of which
is quite low (0.063).
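The step from matrix 'A' to matrix 'B' can be sketched as follows. The 4 x 4 matrix is invented for the example (it is not the one shown in the pictures), the function names are hypothetical, and setting the diagonal of 'A' to zero is an assumption.

```python
import math


def cosine(u, v):
    """Cosine coefficient between two equally long numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def second_order(matrix_a):
    """Build matrix 'B': cosine coefficients between all columns of 'A'."""
    n = len(matrix_a)
    columns = [[matrix_a[r][c] for r in range(n)] for c in range(n)]
    return [[cosine(columns[i], columns[j]) for j in range(n)]
            for i in range(n)]


# Toy first order matrix: words 0 and 1 never cooccur (index 0.0),
# but both are associated with words 2 and 3.
matrix_a = [
    [0.0, 0.0, 0.5, 0.4],
    [0.0, 0.0, 0.4, 0.5],
    [0.5, 0.4, 0.0, 0.1],
    [0.4, 0.5, 0.1, 0.0],
]
matrix_b = second_order(matrix_a)
print(f"first order: {matrix_a[0][1]:.3f}, second order: {matrix_b[0][1]:.3f}")
```

In this toy case the first two words have a first order index of zero and a second order similarity close to 1, because their columns in 'A' are almost identical.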
Matrix 'B' - Second Order Similarities
In other words, a 'first order' index is obtained
by a formula which includes cooccurrence and occurrence values,
whereas a 'second order' index is obtained as the dot product of two
normalized feature vectors (i.e. their cosine).
Beyond any computational issue, we have to recall that in the above
two cases ('A' and 'B' matrices) we are dealing with two very
different phenomena. In fact, in the case of 'A' we are focusing on
the cooccurrences between word pairs, whereas in the case of 'B'
(and without any reference to their direct cooccurrences) we are
focusing on the 'similarity' between feature vectors (see the
matrix 'A' columns) which refers to the use (and so to the meaning)
of the corresponding words.
For example, by analysing 'The Audacity of Hope' (i.e. a book
written by B. Obama) it is possible to point out that, when using
'first order' measures, the word 'nuclear' is strongly associated
with cooccurrent words like 'weapon', 'option', 'arms' etc.;
whereas, when using 'second order' measures, 'nuclear' turns out to be
strongly associated with (i.e. similar to) 'destruction', even though the
cooccurrence value of this word pair (i.e. 'nuclear' and
'destruction') is just '1' (one).
The tables shown by TLAB allow the user to check both the second
order similarities (see column SIMII below) and the first order
indexes (see column EQUI, i.e. Equivalence Index). Moreover, by
clicking any item of such a table, it is possible to generate HTML
files which allow the user to check which features determine the
similarity between each word pair. For example, the following table
shows that the second order similarity between 'war' and 'fight' is,
above all, determined by shared words like 'win', 'terrorism',
etc.
