www.tlab.it

Sequence Analysis


This T-LAB tool allows a Markovian analysis of two kinds of sequences:

A) those concerning the lexical units (words, lemmas or categories) in the network defined by the corpus or by its subsets (see the CORPUS button in the following image);

B) those recorded in an external file made by the user (see the FILE button and the explanation at the end of this section).

In case (A), the sequences are syntagmatic relationships between the lexical units under analysis. For each of its occurrences within the corpus chain, a lexical unit has a predecessor and a successor: respectively, the lexical unit that comes immediately before it and the one that comes immediately after it.

Starting from a matrix in which all the predecessors and all the successors of each lexical unit are recorded, T-LAB calculates the transition probabilities (Markov chains) between the lexical units analysed (max 1,500).
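The calculation described above can be illustrated with a minimal Python sketch. It is not T-LAB's own code: the function name and the toy chain are hypothetical, and it assumes that each probability is simply the count of an adjacent pair divided by the number of transitions leaving (or entering) the unit.

```python
from collections import Counter

def transition_probabilities(sequence):
    """Return (successors, predecessors) probability tables for a sequence."""
    pair_counts = Counter(zip(sequence, sequence[1:]))  # adjacent pairs (a, b)
    out_totals = Counter()  # number of transitions leaving each unit
    in_totals = Counter()   # number of transitions entering each unit
    for (a, b), n in pair_counts.items():
        out_totals[a] += n
        in_totals[b] += n

    successors, predecessors = {}, {}
    for (a, b), n in pair_counts.items():
        # P(next unit is b | current unit is a)
        successors.setdefault(a, {})[b] = n / out_totals[a]
        # P(previous unit is a | current unit is b)
        predecessors.setdefault(b, {})[a] = n / in_totals[b]
    return successors, predecessors

# Hypothetical toy chain, not taken from any T-LAB corpus.
chain = ["a", "b", "a", "c", "a", "b", "d"]
successors, predecessors = transition_probabilities(chain)
print(successors["a"])    # successor probabilities of "a"
print(predecessors["b"])  # predecessor probabilities of "b"
```

In this sketch every row of the successor table sums to 1, because all the units in the chain are included in the analysis.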

The outputs available - all clickable - are graphs and tables.

In the graphs, the lexical units closest to the selected one are those with the highest probability of coming before it (predecessors) or after it (successors).

Two tables show the sorted list of predecessors (the first) and successors (the second) of each selected lexical unit.

Each list is sorted in descending order of probability ("PROB"). For example, in the following table, the probability that "cost" will follow "healt_care" is equal to 0.105, that is 10.5%.

The Triads option displays tables of three-element sequences in which, according to the user's choice, the selected word is in the first, second or third position. For each triad, T-LAB shows the corresponding number of occurrences.

N.B. Empty words are not included in the triads.
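As a rough illustration of the triad tables, the following Python sketch counts three-element sequences with a selected word in a chosen position. The function name, the word list and the stop-word set are hypothetical, and the sketch assumes that empty words are removed before the triads are formed, which the text above does not state explicitly.

```python
from collections import Counter

def triads(sequence, selected, position, stop_words=frozenset()):
    """Count triads with `selected` at `position` (0, 1 or 2)."""
    # Assumption: empty words are dropped before triads are built.
    content = [w for w in sequence if w not in stop_words]
    counts = Counter(
        triple
        for triple in zip(content, content[1:], content[2:])
        if triple[position] == selected
    )
    return counts.most_common()  # most frequent triads first

# Hypothetical example text and stop-word list.
words = ["the", "cost", "of", "health", "care", "and",
         "the", "cost", "of", "health", "care"]
stop = {"the", "of", "and"}

first = triads(words, "cost", 0, stop)   # "cost" in first position
third = triads(words, "cost", 2, stop)   # "cost" in third position
print(first)
print(third)
```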


In graph theory terms, the predecessors and the successors of each node (in this case, each lexical unit) can be represented by arrows (arcs) coming in (in-degree = types of predecessors) or going out (out-degree = types of successors).

As an example, in the following table "people" has 167 types of successors and 187 types of predecessors, so its ratio is 167/187, i.e. about 0.89.
This ratio (successors/predecessors) indicates the semantic variety engendered by each node:
- if the ratio is greater than 1, the node is defined as a "source";
- if the ratio is equal to 1, the node is defined as a "relay";
- if the ratio is lower than 1, the node is defined as a "well".
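The classification above can be sketched in Python as follows. The function names are hypothetical, and the sketch only counts distinct neighbours in a single sequence; it is not meant to reproduce T-LAB's internal procedure.

```python
def classify(out_degree, in_degree):
    """Label a node from its numbers of successor and predecessor types."""
    ratio = out_degree / in_degree  # in_degree must be greater than 0
    if ratio > 1:
        return "source"
    if ratio < 1:
        return "well"
    return "relay"

def degrees(sequence):
    """Return {unit: (out_degree, in_degree)} counting distinct neighbours."""
    succ_types, pred_types = {}, {}
    for a, b in zip(sequence, sequence[1:]):
        succ_types.setdefault(a, set()).add(b)
        pred_types.setdefault(b, set()).add(a)
    units = set(succ_types) | set(pred_types)
    return {u: (len(succ_types.get(u, ())), len(pred_types.get(u, ())))
            for u in units}

# Values quoted in the text for "people": 167 successors, 187 predecessors.
print(classify(167, 187))

# Hypothetical toy chain.
toy_degrees = degrees(["a", "b", "a", "c", "a"])
print(toy_degrees["a"], classify(*toy_degrees["a"]))
```

A unit that only appears at the very start of a sequence has no predecessors, so its ratio is undefined and it should be skipped before calling classify.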

In the same table, for each lexical unit, the column "cover" (coverage) indicates the percentage of its occurrences preceded or followed by lexical units included in the user list.


When the analysed units cover the totality of those present within the corpus (e.g. use of categories for content analysis and/or use of external files), the cover value is equal to 1; otherwise, it is a lower value.
Moreover, when the cover value is equal to 1, the probability values of the predecessors sum to 1, and so do those of the successors; otherwise, both sums are lower than 1.
In both cases, the residual percentage is determined by the fact that there are predecessors and successors not included in the analysis.
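The effect on the probability sums can be illustrated with a small Python sketch. It does not reproduce T-LAB's "cover" formula, which is not given here; it only assumes that the denominators count every transition leaving a unit, including those towards units outside the user list. The function name and the toy sequence are hypothetical.

```python
from collections import Counter

def successor_rows(sequence, selected):
    """Successor probabilities restricted to the `selected` units.

    Denominators count every transition leaving a unit, so a row sums
    to less than 1 when some successors fall outside `selected`.
    """
    pairs = Counter(zip(sequence, sequence[1:]))
    totals = Counter(a for a, _ in zip(sequence, sequence[1:]))
    rows = {}
    for (a, b), n in pairs.items():
        if a in selected and b in selected:
            rows.setdefault(a, {})[b] = n / totals[a]
    return rows

# Hypothetical sequence: "A" and "B" are analysed, "x" and "y" are not.
seq = ["A", "x", "A", "B", "y", "B", "A", "B"]

partial = successor_rows(seq, {"A", "B"})
full = successor_rows(seq, set(seq))

print(sum(partial["A"].values()))  # below 1: one successor of "A" is excluded
print(sum(full["A"].values()))     # all units analysed
```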


For example, the sequence represented in the following image consists of 39 events, of which only 16 (the hypothetical units under analysis) are "covered" (gray boxes). This is because some of them, e.g. those corresponding to the occurrences of the lexical unit "A", have predecessors and successors not included in the analysis (white boxes).

 

By contrast, when the user analyses an external file, all the events are covered.

N.B. To analyse an external file, the user must place a Sequence.dat file in the work folder; then, after opening an existing project, they must select Sequence Analysis ("user" option).

The calculation method, the graphs and the tables are analogous to those already described (see above).

The Sequence.dat file, which can contain many kinds of tags (e.g. names of speakers in a conversation, categories obtained by content analysis, kinds of events, etc.), must be made up of "N" lines (min 50, max 10,000), each containing a single tag of at most 50 characters, without punctuation marks or blank spaces.

There can be at most 250 different tag types.

Here are some lines of Sequence.dat files in the correct format:

Hamlet
King
Hamlet
Queen
Hamlet
Queen
Hamlet
King
Queen
Hamlet
King
Hamlet
Horatio
Hamlet
Horatio
... ... ...


activist
food
genetic
conservative
activist
genetic
conservative
activist
commerce
conservative
activist
conservative
biology
society
activist
... ... ...


event_01
event_03
event_02
event_03
event_03
event_01
event_05
event_02
event_05
event_01
event_02
event_04
event_03
event_01
event_01
... ... ...
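Before placing a file in the work folder, the format rules above can be checked with a short Python sketch like the one below. The function name is hypothetical. Since the examples contain tags such as event_01, the sketch assumes that letters, digits and underscores are allowed; this is an inference from the examples, not a documented rule.

```python
import re

# Assumption: a tag is 1 to 50 letters, digits or underscores.
TAG_PATTERN = re.compile(r"\w{1,50}")

def check_sequence_lines(lines):
    """Return a list of problems found in Sequence.dat-style lines."""
    problems = []
    if not 50 <= len(lines) <= 10_000:
        problems.append(f"expected 50 to 10,000 lines, found {len(lines)}")
    for number, tag in enumerate(lines, start=1):
        if not TAG_PATTERN.fullmatch(tag):
            problems.append(f"line {number}: invalid tag {tag!r}")
    tag_types = len(set(lines))
    if tag_types > 250:
        problems.append(f"more than 250 tag types ({tag_types})")
    return problems

# Hypothetical in-memory data: 50 valid lines, then one invalid line added.
good_lines = ["Hamlet", "King"] * 25
bad_lines = good_lines + ["two words"]

print(check_sequence_lines(good_lines))
print(check_sequence_lines(bad_lines))
```

To check a real file, its lines can be read first, for example with pathlib.Path("Sequence.dat").read_text().splitlines(), using whatever text encoding the file was saved in.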

Both for sequences of corpus lexical units and for those included in an external file (Sequence.dat), T-LAB produces four tables in the MY-OUTPUT folder:
- T_Successors.xls, with the transition probabilities of the successors;
- T_Predecessors.xls, with the transition probabilities of the predecessors;
- Frequency_Average_Order.xls, only provided when the corpus consists of short texts such as responses to open-ended questions, with the frequency and the average order of appearance (or evocation) of each term;
- Adjacency_Matrix.xls, only provided when the list of lexical units includes up to 250 items, which can be used to generate other measures and graphs typical of network analysis.
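To give an idea of what an adjacency matrix of a sequence looks like, here is a minimal Python sketch. The layout of the Adjacency_Matrix.xls file is not described here, so the sketch assumes that rows are the preceding units, columns are the following units, and cells hold transition counts; the function name and the toy chain are hypothetical.

```python
def adjacency_matrix(sequence):
    """Return (units, matrix) where matrix[i][j] counts unit i followed by unit j."""
    units = sorted(set(sequence))
    index = {unit: i for i, unit in enumerate(units)}
    matrix = [[0] * len(units) for _ in units]
    for a, b in zip(sequence, sequence[1:]):
        matrix[index[a]][index[b]] += 1
    return units, matrix

# Hypothetical toy chain.
units, matrix = adjacency_matrix(["a", "b", "a", "c", "a", "b"])
for unit, row in zip(units, matrix):
    print(unit, row)
```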

Moreover, T-LAB can export GraphML files, which can be edited with the yEd software (see below).
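For readers unfamiliar with GraphML, the following Python sketch writes the transitions of a sequence as a plain directed GraphML graph using only the standard library. It is not the file that T-LAB itself exports: the function name, the "weight" attribute and the toy chain are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(sequence):
    """Serialise the transitions of a sequence as a directed GraphML graph."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    # Declare an integer edge attribute holding the transition count.
    ET.SubElement(root, "key", {"id": "w", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "directed"})
    for unit in sorted(set(sequence)):
        ET.SubElement(graph, "node", {"id": unit})
    transitions = Counter(zip(sequence, sequence[1:]))
    for (a, b), n in sorted(transitions.items()):
        edge = ET.SubElement(graph, "edge", {"source": a, "target": b})
        ET.SubElement(edge, "data", {"key": "w"}).text = str(n)
    return ET.tostring(root, encoding="unicode")

# Hypothetical toy chain.
xml_text = to_graphml(["a", "b", "a", "c"])
print(xml_text)
```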