T-LAB Plus 2017

T-LAB Plus 2017 was released on January 20th, 2017.

The most important improvements concern: (A) the preprocessing steps (e.g. word segmentation, automatic lemmatization and stemming) for many languages; (B) the functionalities of some co-occurrence tools; (C) the performance of the Modeling of Emerging Themes tool.

A - Regarding the preprocessing steps, three new features have been implemented:

A.1 - Word segmentation (see https://en.wikipedia.org/wiki/Text_segmentation) for Chinese and Japanese texts, which automatically delimits single words with white spaces (see below).


N.B.: For the segmentation of Chinese texts, the 'Pan Gu Segment' library is used (http://pangusegment.codeplex.com/).
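To give an idea of what word segmentation involves, here is a minimal sketch of a dictionary-based 'forward maximum matching' segmenter. It is not the algorithm used by T-LAB or by the 'Pan Gu Segment' library; the function name `segment`, the `max_len` parameter and the tiny vocabulary are hypothetical and serve only as an illustration.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest vocabulary entry (up to max_len
    characters) that matches; if none matches, emit a single character.
    Returns the words joined by white spaces.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return " ".join(words)


# Toy vocabulary (hypothetical, for illustration only).
VOCAB = {"我们", "喜欢", "文本", "分析"}

segmented = segment("我们喜欢文本分析", VOCAB)
print(segmented)
```

Real segmenters rely on much larger dictionaries and on statistical models to resolve ambiguous cases, which a greedy match like this one cannot handle.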

A.2 - Dictionary-based lemmatization for nine (9) further languages;

A.3 - Stemming algorithms for fifteen (15) languages.

N.B.: The main difference between (a) lemmatization and (b) stemming lies in how the inflectional forms of each word are normalized: (a) lemmatization (see https://en.wikipedia.org/wiki/Lemmatisation) groups together the inflected forms of a word so that they can be analysed as a single item, identified by the word's lemma, or dictionary form (e.g. 'arguing' -> 'argue'); (b) stemming (see https://en.wikipedia.org/wiki/Stemming) usually just removes inflectional endings, so the stem need not be identical to the morphological root of the word (e.g. 'arguing' -> 'argu').
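The difference can be sketched in a few lines of code. The example below is only an illustration under simple assumptions: the dictionary `LEMMA_DICT` and the suffix list `SUFFIXES` are hypothetical toy data, and the stemmer is a naive suffix stripper, not one of the stemming algorithms included in T-LAB.

```python
# Toy lemma dictionary (hypothetical): inflected form -> dictionary form.
LEMMA_DICT = {"arguing": "argue", "argued": "argue", "argues": "argue"}

# Inflectional endings removed by the naive stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def lemmatize(word):
    """Dictionary-based lemmatization: look the form up, else keep it."""
    w = word.lower()
    return LEMMA_DICT.get(w, w)


def stem(word):
    """Naive stemming: strip the first matching ending, keeping >= 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w


print(lemmatize("arguing"))  # dictionary form
print(stem("arguing"))       # truncated stem, not a real word
```

Both functions map 'arguing', 'argued' and 'argues' to a single item, but only the lemmatizer returns an actual dictionary entry.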

Here is the list of the new languages for which automatic lemmatization or stemming is supported by T-LAB Plus 2017:

LEMMATIZATION: Catalan, Croatian, Polish, Romanian, Russian, Serbian, Slovak, Swedish, Ukrainian.

STEMMING: Arabic, Bengali, Bulgarian, Czech, Danish, Dutch, Finnish, Greek, Hindi, Hungarian, Indonesian, Marathi, Norwegian, Persian, Turkish.

When selecting a language in the setup form, the six languages (*) for which T-LAB already supported automatic lemmatization can be selected through the button on the left (see 'A' below), while the new ones can be selected through the button on the right (see 'B' below).

(*) English, French, German, Italian, Portuguese and Spanish.

In any case, without automatic lemmatization and/or by using customized dictionaries, the user can analyse texts in all languages, provided that words are separated by spaces and/or punctuation.
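The requirement that words be separated by spaces and/or punctuation can be illustrated with a very simple tokenizer. This is a sketch only: the function name `tokenize` is hypothetical, and the rule used here (maximal runs of letters, digits and underscores) is an assumption that may differ from the way T-LAB itself splits a text.

```python
import re


def tokenize(text):
    """Split a text into lower-case words at spaces and punctuation.

    In Python 3, '\\w' matches Unicode letters and digits, so the same
    rule works for any language written with delimited words.
    """
    return re.findall(r"\w+", text.lower())


tokens = tokenize("Texts, words; and spaces.")
print(tokens)
```

A text without such delimiters (e.g. unsegmented Chinese) would come out of this function as one long token, which is why a segmentation step is needed first.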

B - The new functionalities of the co-occurrence tools are listed below.

B.1 - More options are available in the setup form of the Co-Word Analysis tool.

When the 'automatic selection of key terms' is selected, different colours are used for different groups of items in the MDS map (see below);


Moreover, by right-clicking the chart area, a new option allows plotting the strongest links (i.e. those with an association index > 0.15).
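As an illustration of how such a threshold works, the sketch below keeps only the word pairs whose association index exceeds 0.15. It assumes, purely for the sake of the example, that the index is a cosine coefficient computed from occurrence and co-occurrence counts; the text above does not say which index the option uses. The function names and the counts are hypothetical.

```python
from math import sqrt


def cosine_index(n_ab, n_a, n_b):
    """Cosine coefficient: co-occurrences over the geometric mean of occurrences."""
    return n_ab / sqrt(n_a * n_b)


def strongest_links(occurrences, cooccurrences, threshold=0.15):
    """Return (word_a, word_b, index) for pairs above the threshold, strongest first."""
    links = []
    for (a, b), n_ab in cooccurrences.items():
        index = cosine_index(n_ab, occurrences[a], occurrences[b])
        if index > threshold:
            links.append((a, b, round(index, 3)))
    return sorted(links, key=lambda link: link[2], reverse=True)


# Toy counts (hypothetical): in how many contexts each word / pair occurs.
occurrences = {"text": 100, "analysis": 64, "map": 25}
cooccurrences = {
    ("text", "analysis"): 40,
    ("text", "map"): 5,
    ("analysis", "map"): 8,
}

links = strongest_links(occurrences, cooccurrences)
print(links)
```

With these counts the pair ('text', 'map') has an index of 0.1 and is therefore left out, while the other two pairs are kept.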


Finally, when the 'Hierarchical clustering of key-terms' is selected, it is possible to create dendrograms including the elements of each thematic nucleus (see below).


B.2 - When using the Word Associations tool, a new option is available which automatically analyses any co-occurrence matrix with up to 3,000 rows and plots an MDS map with the most relevant key-words. This way the user can easily move from the analysis of 'one-to-one' relations to an 'all together' view (and vice versa), either within the entire corpus or within a part of it.


C - The performance of the Modeling of Emerging Themes tool, which uses a topic model algorithm, has been improved: it can now analyse a collection of up to 30,000 documents, provided that the total number of word occurrences (i.e. tokens) does not exceed 3,000,000.
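Before importing a large collection, the two limits can be checked with a short script such as the following sketch. The function name `fits_limits` is hypothetical, and tokens are counted here simply by splitting on white spaces, which is an assumption: the figures computed by T-LAB after its own preprocessing may differ.

```python
MAX_DOCUMENTS = 30_000   # maximum number of documents stated above
MAX_TOKENS = 3_000_000   # maximum total number of word occurrences


def fits_limits(documents, max_documents=MAX_DOCUMENTS, max_tokens=MAX_TOKENS):
    """Check a list of document strings against both limits.

    Tokens are approximated by white-space splitting (an assumption).
    """
    total_tokens = sum(len(doc.split()) for doc in documents)
    return len(documents) <= max_documents and total_tokens <= max_tokens


small_corpus = ["a short document", "another short document"]
print(fits_limits(small_corpus))
```

Note that both conditions must hold: a collection of fewer than 30,000 documents is still too large if its documents are long enough to exceed the token limit.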

