T-LAB 10.2 - ON-LINE HELP

Formal Criteria


When the corpus consists of a single text and the user does not use variables, no further operations are required: it is possible to proceed directly to the import phase.

When, on the other hand, the corpus is made up of several text documents and/or categorical variables are used, the corpus must be prepared with the Corpus Builder tool (see above), which automatically applies the following criteria:

Each text or subset of it (the "parts" defined by variables and/or IDnumber) is preceded by a coding line.

Each coding line has this format:

- It begins with a string of four asterisks (****) followed by a blank space. T-LAB reads this string as: "here begins a user-defined text or a context unit".

- It continues with one or more strings, each made up of a single asterisk and a label, which define cases (IDnumber), variables and their respective categories.

- It ends with a line break (Return key).
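The format above can be sketched as a small Python helper. This is only an illustration: build_coding_line is a hypothetical name, not part of T-LAB, and in practice the Corpus Builder tool writes these lines automatically.

```python
def build_coding_line(variables, idnumber=None):
    """Assemble a coding line (hypothetical helper, not a T-LAB function).

    variables: dict mapping each variable label to its category.
    idnumber:  optional case identifier, written as *IDnumber_<value>.
    """
    parts = ["****"]  # marks the start of a text or context unit
    if idnumber is not None:
        parts.append(f"*IDnumber_{idnumber}")
    for label, category in variables.items():
        # One asterisk, the variable label, an underscore, the category.
        parts.append(f"*{label}_{category}")
    # Single blank spaces separate the strings; a line break ends the line.
    return " ".join(parts) + "\n"


line = build_coding_line({"AGE": "ADUL", "SEX": "FEM", "OCC": "PROF"})
```

Passing idnumber="0001" would place an *IDnumber_0001 string immediately after the four asterisks.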

Here are some examples.

The following line introduces a text (or a corpus subset) coded with three variables - AGE, SEX and OCC (occupation) - and their respective categories (ADUL, FEM, PROF).

**** *AGE_ADUL *SEX_FEM *OCC_PROF

 

The following line introduces a text (or a corpus subset) coded with the same variables plus an IDnumber label:

**** *IDnumber_0001 *AGE_ADUL *SEX_FEM *OCC_PROF

The following line introduces a text (or a corpus subset) coded with two variables: YEAR and NEWSP.

**** *YEAR_98 *NEWSP_TIMES
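To make the structure of these examples concrete, here is a minimal Python sketch that reads coding lines back into label/category pairs and splits a corpus into its coded parts. The function names are hypothetical and not part of T-LAB, and the sketch assumes that variable labels themselves contain no underscore, so that the first underscore separates label from category.

```python
def parse_coding_line(line):
    """Return a dict mapping each label (IDnumber or variable) to its value."""
    tokens = line.split()
    if not tokens or tokens[0] != "****":
        raise ValueError("a coding line must begin with '****'")
    fields = {}
    for token in tokens[1:]:
        if not token.startswith("*") or "_" not in token:
            raise ValueError(f"malformed string: {token!r}")
        # Assumption: the first underscore separates label and category.
        label, category = token[1:].split("_", 1)
        fields[label] = category
    return fields


def split_corpus(text):
    """Pair every coding line with the text that follows it."""
    sections = []
    for raw in text.splitlines():
        if raw.startswith("**** "):
            sections.append((parse_coding_line(raw), []))
        elif sections:
            # Text found before the first coding line is ignored here.
            sections[-1][1].append(raw)
    return [(fields, "\n".join(body).strip()) for fields, body in sections]


sample = (
    "**** *YEAR_98 *NEWSP_TIMES\n"
    "First article text.\n"
    "**** *YEAR_99 *NEWSP_TIMES\n"
    "Second article text.\n"
)
parts = split_corpus(sample)
```

Each element of parts holds the variables of one corpus subset together with its text, which is the pairing that the coding lines establish.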

Each coding line observes the following T-LAB rules:

1. Labels (IDnumber, variables and variable categories) cannot contain blank spaces;
2. Each label - both for variables and variable categories - must be between 2 and 25 characters long;
3. Each variable label must be linked to its respective category by an underscore ("_");
4. A blank space must be inserted between two different variables, that is, before the next asterisk;
5. Every variable, with its respective category, must be assigned for each corpus subset;
6. A maximum of 50 variables can be used, each allowing a maximum of 150 categories which can be compared;
7. The maximum number of IDnumbers is fixed at 99,999 for short texts (max. 2,000 characters each, e.g. responses to open-ended questions, Twitter messages, etc.) and at 30,000 for all other cases.
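As an illustration of how rules 1 to 6 could be checked before import, here is a hedged Python sketch. check_coding_lines and its messages are hypothetical, not a T-LAB feature. The sketch assumes that the length limits apply to variable and category labels but not to the IDnumber value, and it does not cover rule 7, which depends on the length of the texts themselves.

```python
import re

# One asterisk, a label without "_" or "*", an underscore, then a category.
# Excluding "*" from the category catches a missing blank space (rule 4).
TOKEN = re.compile(r"\*([^_*\s]+)_([^*\s]+)")


def check_coding_lines(lines):
    """Return a list of problems found in a list of coding lines."""
    problems = []
    categories = {}  # variable label -> set of categories seen
    per_line = []    # variables found on each line (None if not parsed)
    for n, line in enumerate(lines, start=1):
        if not line.startswith("**** "):
            problems.append(f"line {n}: does not begin with '**** '")
            per_line.append(None)
            continue
        found = set()
        for token in line.split()[1:]:
            match = TOKEN.fullmatch(token)
            if match is None:
                problems.append(f"line {n}: malformed string {token!r}")
                continue
            label, category = match.groups()
            if label == "IDnumber":
                continue  # assumption: length limits do not apply here
            for name in (label, category):
                if not 2 <= len(name) <= 25:  # rule 2
                    problems.append(f"line {n}: {name!r} is not 2-25 characters long")
            found.add(label)
            categories.setdefault(label, set()).add(category)
        per_line.append(found)
    for n, found in enumerate(per_line, start=1):
        if found is None:
            continue
        missing = sorted(set(categories) - found)  # rule 5
        if missing:
            problems.append(f"line {n}: missing variable(s) {', '.join(missing)}")
    if len(categories) > 50:  # rule 6
        problems.append("more than 50 variables")
    for label, cats in categories.items():
        if len(cats) > 150:  # rule 6
            problems.append(f"variable {label!r} has more than 150 categories")
    return problems


good = ["**** *AGE_ADUL *SEX_FEM", "**** *AGE_YOUN *SEX_MAL"]
bad = ["**** *AGE_ADUL *SEX_FEM", "**** *AGE_A", "**** *AGE_ADUL*SEX_FEM"]
good_report = check_coding_lines(good)
bad_report = check_coding_lines(bad)
```

Here good_report is empty, while bad_report lists a category that is too short, a string with a missing blank space, and the variables missing from the second and third lines.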