www.tlab.it

What T-LAB does and what it enables us to do


T-LAB software is an all-in-one set of linguistic and statistical tools for text analysis which can be used in research fields such as Semantic Analysis, Content Analysis, Perceptual Mapping, Text Mining and Discourse Analysis.

In fact, being a text laboratory, T-LAB allows the integrated use of three kinds of tools for text analysis:

A - tools for word co-occurrence analysis: computation of word associations, comparisons between word pairs, co-word analysis and concept mapping, sequence analysis, concordances;
B - tools for thematic analysis of the context units: modeling of emerging themes, thematic analysis of elementary contexts (i.e. chunks, sentences or paragraphs), sequences of themes, key contexts of thematic words, thematic classification of documents;
C - tools for comparative analysis of two or more corpus sub-sets: specificity analysis, correspondence analysis, multiple correspondence analysis, cluster analysis.

The user interface is easy to use and various types of texts can be analysed:
- a single text (e.g. an interview, a book, etc.);
- a set of texts (e.g. a set of interviews, web pages, newspaper articles, responses to open-ended questions, etc.).

All texts can be encoded with categorical variables and/or with ID numbers that correspond to context units or cases (e.g. responses to open-ended questions).

Each corpus (one or more texts) must be in plain text (.txt) and cannot exceed 30 MB (about 18,000 pages in ASCII format).

Six steps are all that is required to perform a quick check of the software's functionality:

1 - Select the language of the interface and that of the corpus to be analysed

2 - Select any corpus to analyse

3 - Click "GO" in the first Setup window

During the pre-processing phase, T-LAB carries out the following treatments:

4 - Select a tool from one of the "Analysis" sub-menus

5 - Verify the results

6 - Use the contextual help function to interpret the various graphs and tables


The following information is provided to help the user to better understand what T-LAB does and how to make full use of it.

From the user's point of view, the software is operated through the interface, that is through the main menu, its sub-menus and the options they contain.

Apart from the user interface, the T-LAB system is organized into two main components:

To understand how T-LAB works and how it can be used, it is essential to have a clear idea as to which analysis units are filed in its database and what statistical algorithms are used in the various analyses. In fact, the analysed data tables always consist of rows and columns the headings of which correspond to the analysis units filed in the database, while the algorithms regulate the processes that make it possible to detect significant relationships between the data and to extract useful information.

The analysis units used in T-LAB are of two types: lexical units and context units.

A - the lexical units are words and multi-words, filed and classified on the basis of a criterion. More precisely, in the T-LAB database each lexical unit consists of a classified record with two fields: word and lemma. In the first field ("word"), the words are listed as they appear in the corpus, while in the second ("lemma") the labels attributed to groups of lexical units are listed; these groups are classified according to linguistic criteria (e.g. lemmatization) or by dictionaries and semantic grids defined by the user.

B - the context units are portions of text that the corpus can be divided into. More precisely, according to T-LAB logic, there can be three types of context units:

B.1 primary documents, which correspond to the "natural" subdivision of the corpus (e.g. interviews, articles, answers to open-ended questions, etc.), that is the initial context defined by the user;
B.2 elementary contexts, which correspond to syntagmatic units (i.e. chunks, sentences, paragraphs) in which each primary document can be subdivided;
B.3 corpus subsets, which correspond to groups of primary documents which belong to the same category (e.g. interviews with "men" or "women", articles in a specific year or a particular magazine and so on), including thematic clusters of documents or elementary contexts obtained by using the corresponding T-LAB tools (see section 5 C below).
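To make the two types of analysis units more concrete, here is a minimal Python sketch. It is only an illustration: the names `LexicalUnit`, `elementary_contexts` and `corpus_subsets`, the sample texts and the naive sentence splitter are all assumptions introduced here, and none of them reflects how the T-LAB database is actually implemented.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """Hypothetical two-field record for a lexical unit."""
    word: str   # the form as it appears in the corpus
    lemma: str  # the label shared by a group of lexical units


def elementary_contexts(text):
    """Naive sentence split; an assumption, not T-LAB's own segmentation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def corpus_subsets(documents, variable):
    """Group primary document ids by the value of a categorical variable."""
    subsets = defaultdict(list)
    for document in documents:
        subsets[document[variable]].append(document["id"])
    return dict(subsets)


# Illustrative lexical units: three words filed under one lemma.
lexical_units = [
    LexicalUnit("like", "like"),
    LexicalUnit("liked", "like"),
    LexicalUnit("likes", "like"),
]

# Illustrative primary documents, each coded with one variable.
documents = [
    {"id": 1, "gender": "women", "text": "I like my job. The hours are long."},
    {"id": 2, "gender": "men", "text": "Work is fine. I would like more time off."},
    {"id": 3, "gender": "women", "text": "My team is great."},
]

contexts = {d["id"]: elementary_contexts(d["text"]) for d in documents}
subsets = corpus_subsets(documents, "gender")
lemmas = {unit.lemma for unit in lexical_units}
```

In this sketch each primary document yields one or more elementary contexts, and the "gender" variable defines two corpus subsets.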


Starting from this database organization, T-LAB makes it possible - in automatic mode - to explore and to analyse the relationships between the analysis units of the whole corpus or its subsets.

In T-LAB, the selection of any analysis tool (click of the mouse) always activates a semi-automatic process that, with a few simple operations, generates an input table, applies some statistical algorithms and produces some outputs.
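The "input table" step can be pictured with a short Python sketch. It assumes, purely for illustration, a table whose rows are elementary contexts and whose columns are key-words; the function name `build_table` and the sample data are hypothetical, and the column totals stand in for whatever algorithm a real tool would apply.

```python
def build_table(contexts, key_words):
    """Rows = context units, columns = key-words, cells = occurrence counts."""
    table = []
    for context in contexts:
        tokens = context.lower().split()
        table.append([tokens.count(word) for word in key_words])
    return table


# Hypothetical elementary contexts and key-word list.
contexts = [
    "the market is open",
    "the market closed early and the market reopened",
    "prices rise when demand is high",
]
key_words = ["market", "prices", "demand"]

table = build_table(contexts, key_words)
# Column totals: how often each key-word occurs in the whole corpus.
totals = [sum(column) for column in zip(*table)]
```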

Let's consider how a typical work project which uses T-LAB can be managed.
Hypothetically, each project consists of a set of analytical activities (operations) which have the same corpus as their subject and are organized according to the user's strategy and plan. It begins with gathering the texts to be analysed and concludes with a report.

The succession of the various phases is illustrated in the following diagram:


N.B.
- The six numbered phases, from the corpus preparation to the interpretation of the outputs, are supported by T-LAB tools and are always reversible;
- By using T-LAB automatic settings it is possible to avoid two phases (3 and 4); however, in order to achieve high quality results, their use is, nevertheless, advisable.


1 - CORPUS PREPARATION: transformation of the texts to be analysed into a file (corpus) that can be processed by the software.

Each corpus which is to be analysed, in order to be imported into T-LAB, must be in the ASCII/ANSI format with the ".txt" extension.
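Before importing, the two formal requirements stated in this document (the ".txt" extension and the 30-megabyte limit) can be checked with a few lines of Python. This is only a sketch: the function name `check_corpus_file` is hypothetical, the exact byte value of the limit is an assumption, and the character encoding is not checked at all.

```python
import os

MAX_BYTES = 30 * 1024 * 1024  # assumption: the limit read as 30 * 1024 * 1024 bytes


def check_corpus_file(path):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if not path.lower().endswith(".txt"):
        problems.append("file extension is not .txt")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than the size limit")
    return problems
```

Calling `check_corpus_file("corpus.txt")` on an existing file returns an empty list when both conditions are met.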

In the case of a single text (or a corpus considered as a single text) T-LAB needs no further work.


Otherwise, if there are coding marks referring to some variables in the corpus, in the preparation phase some criteria must be observed (see the Corpus Preparation section).


At the end of the corpus preparation phase it is recommended that a new folder be created which contains only the corpus to be imported.


2 - CORPUS IMPORTATION: a series of automatic processes that transform the corpus into a set of tables integrated in the T-LAB database.

Starting from the selection of the New Corpus option, the intervention of the user (advanced options) is required in order to define certain choices (see below):

N.B.:
- The language selection (obligatory) defines the lemmatization to be applied. Currently automatic lemmatization is available in five languages: Italian, French, English, Spanish and Portuguese. In any case, without automatic lemmatization and/or using customized dictionaries, texts in all the languages (or dialects) that support ASCII characters can be analysed (see above the "other" option);
- Inexperienced users are advised to accept the preselected options;
- As the pre-processing options determine both the kind and the number of analysis units (i.e. context units and lexical units), different choices determine different analysis results. For this reason, all T-LAB outputs (i.e. charts and tables) shown in the user's manual and in the on-line help are just indicative.


3 - THE USE OF LEXICAL TOOLS allows us to verify the correct recognition of the lexical units and to customize their classification, that is to verify and to modify the automatic choices made by T-LAB.

The procedures of the various interventions are illustrated in the corresponding help items (and in the manual).

In particular the user is requested to refer to the corresponding help item (and to the manual) for a detailed description of the Dictionary Building process (see below).

4 - THE KEY-WORD SELECTION consists of the arrangement of one or more lists of lexical units (words, lemmas or categories) to be used for producing the data tables to be analysed.

The automatic settings option provides the lists of the key-words selected by T-LAB; nevertheless, since the choice of the analysis units is extremely relevant in relation to subsequent elaborations, the use of customized settings (see below) is highly recommended. In this way the user can choose to modify the list suggested by T-LAB and/or to arrange lists that better correspond to the objectives of his research.

 

In any case, while creating these lists, the user can refer to the following criteria:

- check the quantitative (total of the occurrences) and qualitative importance of the various items;
- check the limitations of the analytical tools that you intend to use (see at the end of this chapter);
- check whether the set of items is compatible with your own research strategies (see item 5 below).
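The quantitative criterion (total of the occurrences) can be sketched in Python as a simple frequency filter. The function name `select_key_words`, the threshold value and the sample lemmas are hypothetical; the thresholds T-LAB itself applies in its automatic settings are not described here.

```python
from collections import Counter


def select_key_words(lemmas, min_occurrences=2, max_items=None):
    """Keep lemmas occurring at least min_occurrences times, most frequent first."""
    counts = Counter(lemmas)
    selected = [(lemma, n) for lemma, n in counts.most_common() if n >= min_occurrences]
    return selected[:max_items] if max_items is not None else selected


# Hypothetical lemmatized corpus, flattened into one list of lemmas.
lemmas = ["work", "time", "work", "family", "time", "work", "holiday"]
key_words = select_key_words(lemmas, min_occurrences=2)
```

A list produced this way would still need the qualitative check mentioned above, since frequency alone says nothing about the relevance of an item to the research goals.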

5 - THE USE OF ANALYSIS TOOLS allows the user to obtain outputs (tables and graphs) that represent significant relationships between the analysis units and enables the user to make inferences.

At the moment (7.1 version), T-LAB includes fifteen different analysis tools each of them having its own specific logic; that is, each one generates specific tables, uses specific algorithms and produces specific outputs.
Consequently, depending on the structure of texts to be analysed and on the goals to be achieved, each time the user has to decide which tools are more appropriate for his analysis strategy.


For this purpose, besides the distinction between tools for co-occurrence, comparative and thematic analysis (see below), it can be useful to consider that some of the latter allow us to obtain new corpus subsets which can be included in further analysis steps.


In particular, the Modeling of Emerging Themes, Thematic Analysis of Elementary Contexts and Thematic Document Classification tools allow us to find clusters of context units characterized by similarity in meaning. These clusters, like categories obtained by a content analysis, can be used in co-occurrence analysis or in comparative analysis of corpus subsets.


Even though the various T-LAB tools can be used in any order, there are nevertheless three ideal starting points in the system which correspond to the three ANALYSIS sub-menus:


A : TOOLS FOR CO-OCCURRENCE ANALYSIS

These tools enable us to analyse different kinds of relationships between lexical units (i.e. words).



According to the types of relationships to be analysed, the T-LAB options indicated in this diagram use one or more of the following statistical tools: Association Indexes, Chi Square Tests, Cluster Analysis, Multidimensional Scaling and Markov chains.
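As an illustration of one of the statistical tools just listed, the following Python sketch estimates first-order Markov transition probabilities from a sequence of lexical units. The function name `transition_probabilities` and the sample sequence are hypothetical; this is a generic textbook estimate, not a description of T-LAB's own Sequence Analysis procedure.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """First-order transition probabilities estimated from consecutive pairs."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        pair_counts[current][following] += 1
    return {
        current: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for current, counts in pair_counts.items()
    }


# Hypothetical sequence of lexical units (e.g. lemmas in order of appearance).
sequence = ["market", "price", "market", "demand", "market", "price"]
probabilities = transition_probabilities(sequence)
```

In this toy sequence "market" is followed by "price" twice and by "demand" once, so the estimated probabilities of those two transitions are 2/3 and 1/3.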

Here are some output examples:

- Word Associations
- Comparison between Word Pairs
- Co-Word Analysis and Concept Mapping
- Sequence Analysis


B : TOOLS FOR COMPARATIVE ANALYSIS

These tools enable us to analyse different kinds of relationships between context units.

Specificity Analysis enables us to check which words are typical or exclusive of a specific corpus subset, either comparing it with the rest of the corpus or with another subset.
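One common way to test whether a word is typical of a subset is a chi-square test on a 2x2 table that compares the subset with the rest of the corpus. The Python sketch below assumes this approach for illustration only; the function names and the counts are hypothetical, and the measure that T-LAB actually applies in Specificity Analysis is not stated in this section.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def specificity(word_in_subset, subset_size, word_in_rest, rest_size):
    """Compare a word's occurrences in a subset with its occurrences in the rest."""
    a = word_in_subset                # the word, inside the subset
    b = word_in_rest                  # the word, in the rest of the corpus
    c = subset_size - word_in_subset  # other tokens, inside the subset
    d = rest_size - word_in_rest      # other tokens, in the rest of the corpus
    return chi_square_2x2(a, b, c, d)


# Hypothetical counts: the word occurs 30 times among the 1,000 tokens of the
# subset and 20 times among the 4,000 tokens of the rest of the corpus.
value = specificity(30, 1000, 20, 4000)
```

A large value indicates that the word's frequency in the subset differs markedly from its frequency elsewhere; the direction of the difference has to be read from the counts themselves.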

Correspondence Analysis allows us to explore similarities and differences between (and within) groups of context units.

Cluster Analysis, which requires a previous Correspondence Analysis, can be carried out using various techniques.

 

C : TOOLS FOR THEMATIC ANALYSIS

These tools enable us to discover, examine and map "themes" emerging from texts. In the two clustering tools described below, "themes" are clusters of context units characterized by the same patterns of key-words.
As theme is a polysemous word, when using software tools for thematic analysis we have to refer to operational definitions. More precisely, in these T-LAB tools, "theme" is a label used to indicate three different entities:
1 - a specific ("thematic") key term used for extracting a set of elementary contexts in which it is associated with a specific group of words pre-selected by the user (see the Key Contexts of Thematic Words tool);
2 - a "thematic" cluster of context units characterized by the same patterns of key-words (see the Thematic Analysis of Elementary Contexts and Thematic Document Classification tools);
3 - a mixture component of a probabilistic model which represents each context unit (i.e. elementary context or document) as generated from a fixed number of topics or "themes" (see the Modeling of Emerging Themes tool).
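The third definition can be made concrete with a small numerical sketch. In a mixture model of this kind, the probability of a word in a context unit is the sum, over the themes, of the weight of the theme in that unit multiplied by the probability of the word within the theme. The Python code below only illustrates that arithmetic: the function name, the two themes and all the probabilities are made up, and no model is estimated.

```python
def word_probabilities(theme_weights, theme_vocabularies):
    """Mix theme-level word distributions: p(w|d) = sum_k p(k|d) * p(w|k)."""
    mixed = {}
    for theme, weight in theme_weights.items():
        for word, prob in theme_vocabularies[theme].items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


# Hypothetical model with two themes and made-up probabilities.
theme_vocabularies = {
    "economy": {"market": 0.6, "price": 0.4},
    "health": {"doctor": 0.7, "price": 0.3},
}
# Theme proportions for one context unit; they sum to 1.
theme_weights = {"economy": 0.75, "health": 0.25}

mixed = word_probabilities(theme_weights, theme_vocabularies)
```

Here the word "price" receives probability from both themes, which is what allows the same word to contribute to more than one "theme".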


In detail:

- through the Key Contexts of Thematic Words tool (see below), which uses the cosine coefficient as similarity measure, we can extract lists of meaningful elementary contexts which allow us to deepen the thematic value of specific key terms.

- through the Modeling of Emerging Themes tool (see below), which uses a Bayesian method, the mixture components - described through their characteristic vocabulary - can be used as categories in qualitative analyses or for the automatic classification of the context units (i.e. documents or elementary contexts).

- both the Thematic Analysis of Elementary Contexts and the Thematic Document Classification tools work in the following way:

a - perform co-occurrence analysis to identify thematic clusters;
b - perform comparative analysis of the profiles of the various clusters;
c - generate various types of graphs and tables (see below);
d - allow you to file the new variables (thematic clusters) for further analysis.
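The first item in the list above mentions the cosine coefficient as the similarity measure used by the Key Contexts of Thematic Words tool. The following Python sketch shows how such a coefficient can rank elementary contexts against a group of thematic words when both are treated as binary (presence/absence) vectors. The function names, the binary weighting and the sample data are assumptions made for illustration; the weighting T-LAB actually uses is not described here.

```python
import math


def cosine(context_words, theme_words):
    """Cosine coefficient between two sets treated as binary vectors."""
    if not context_words or not theme_words:
        return 0.0
    shared = len(context_words & theme_words)
    return shared / math.sqrt(len(context_words) * len(theme_words))


def rank_contexts(contexts, theme_words):
    """Sort (index, score) pairs by decreasing similarity to the thematic word group."""
    scores = [(i, cosine(set(c), theme_words)) for i, c in enumerate(contexts)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical elementary contexts (as word lists) and thematic word group.
contexts = [
    ["hospital", "doctor", "waiting", "list"],
    ["market", "price", "demand"],
    ["doctor", "nurse"],
]
theme_words = {"doctor", "nurse", "hospital"}
ranking = rank_contexts(contexts, theme_words)
```

The shortest context ranks first here because the cosine coefficient normalizes the number of shared words by the size of both sets.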

 

6 - INTERPRETATION OF THE OUTPUTS consists of consulting the tables and the graphs produced by T-LAB, customizing their format where needed, and making inferences on the meaning of the relationships they represent.

Depending on the case, T-LAB allows the user to export tables to files with the following extensions: .DAT, .TXT, .XLS, .HTML. This means that, by using any text editor program and/or any Microsoft Office application, the user can easily import and rework them.

All graphs and charts can be zoomed, maximized, customized and exported in different formats (right click to show popup menu).

 

Some general criteria for the interpretation of the T-LAB outputs are illustrated in a paper quoted in the Bibliography (Lancia F.: 2005) and are available from the www.tlab.it website. This document presents the hypothesis that the statistical elaboration outputs (tables and graphs) are particular types of texts, that is they are multi-semiotic objects characterized by the fact that the relationships between the signs and the symbols are ordered by measures that refer to specific codes.

In other words, both in the case of texts written in "natural language" and those written in the "statistical language", the possibility of making inferences on the relationships that organize the content forms is guaranteed by the fact that the relationships between the expression forms are not random; in fact, in the first case (natural language) the significant units follow on and are ordered in a linear manner (one after the other in the chain of the discourse), while in the second case (tables and graphs) the organization of the multidimensional semantic spaces comes from statistical measures.

Even if the semantic spaces represented in the T-LAB maps are extremely varied, and each of them requires specific interpretative procedures, we can theorize that - in general - the logic of the inferential process is the following:

A - to detect some significant relationships between the units "present" on the expression plane (e.g. between table and/or graph labels);
B - to explore and compare the semantic traits of the same units and the contexts to which they are mentally and culturally associated (content plane);
C - to generate some hypotheses or some analysis categories that, in the context defined by the corpus, account for the relationships between expression and content forms.