Tables
Currently, the dlexDB database contains 21 tables:
Tables based on the annotated version of the corpus:
These tables are based on the notion of an annotated type, i.e., an orthographic type together with its part-of-speech tag and corresponding lemma. Bigrams are sequences of two such annotated types, and trigrams are sequences of three.
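As a rough illustration of the annotated-type idea, the following Python sketch counts annotated types, bigrams and trigrams over a toy token sequence. The class name `AnnotatedType`, the helper `ngrams`, the sample sentence and its tags and lemmata are all hypothetical and chosen only for the example; they do not reflect the actual dlexDB schema or tagger output.

```python
from collections import Counter
from typing import NamedTuple


class AnnotatedType(NamedTuple):
    """Hypothetical stand-in for an annotated type: form + POS tag + lemma."""
    form: str   # orthographic type
    pos: str    # part-of-speech tag (illustrative tag names)
    lemma: str  # associated lemma


# Toy corpus; tags and lemmata are made up for illustration.
tokens = [
    AnnotatedType("die", "ART", "die"),
    AnnotatedType("Katze", "NN", "Katze"),
    AnnotatedType("sieht", "VVFIN", "sehen"),
    AnnotatedType("die", "ART", "die"),
    AnnotatedType("Maus", "NN", "Maus"),
]


def ngrams(seq, n):
    """Return all contiguous length-n subsequences of seq as tuples."""
    return list(zip(*(seq[i:] for i in range(n))))


type_freq = Counter(tokens)                # frequency of each annotated type
bigram_freq = Counter(ngrams(tokens, 2))   # sequences of two annotated types
trigram_freq = Counter(ngrams(tokens, 3))  # sequences of three annotated types
```

Because the tag and the lemma are part of the key, the same orthographic form with a different tag would be counted as a different annotated type.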
Tables based on the same corpus without annotation:
These tables are based on the notion of a type, which is a purely orthographically defined entity. In dlexDB, types are case-sensitive. Bigrams and trigrams of types are also available here.
Tables based on a downcased version of the corpus:
These tables are based on the notion of a downcased type. A downcased type can be seen as a type taken from a downcased version of the corpus. This gives access to case-insensitive frequencies and other measures. Bigrams and trigrams are also available here.
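The difference between case-sensitive types and downcased types can be sketched in a few lines of Python. The token list is invented for the example, and the variable names `type_freq` and `downcased_freq` are hypothetical, not dlexDB column or table names.

```python
from collections import Counter

# Toy token list (made up); "Essen" (noun) and "essen" (verb) differ only in case.
tokens = ["Essen", "essen", "Essen", "wir", "Wir"]

# Case-sensitive types: "Essen" and "essen" are counted separately.
type_freq = Counter(tokens)

# Downcased types: every token is lowercased before counting,
# which yields case-insensitive frequencies.
downcased_freq = Counter(t.lower() for t in tokens)
```

Here the downcased count for "essen" is the sum of the case-sensitive counts for "Essen" and "essen".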
Tables listing all characters, character bigrams and character trigrams occurring in the corpus:
- Characters
- Character bigrams
- Character trigrams
- Characters DC (downcased)
- Character bigrams DC (downcased)
- Character trigrams DC (downcased)
These tables hold sublexical information: single characters and sequences of two or three characters. Each table is also available in downcased form.
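A minimal Python sketch of how character n-grams could be extracted and counted is given below. It assumes that n-grams are taken within individual words and counted once per token; how dlexDB actually treats word boundaries and weighting is not stated here. The helper `char_ngrams` and the word list are hypothetical.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return all contiguous character sequences of length n within word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Toy token list (made up for illustration).
tokens = ["Haus", "Maus", "aus"]

char_freq = Counter(c for w in tokens for c in w)
bigram_freq = Counter(g for w in tokens for g in char_ngrams(w, 2))
trigram_freq = Counter(g for w in tokens for g in char_ngrams(w, 3))

# Downcased variant: lowercase each word before extracting n-grams.
bigram_freq_dc = Counter(g for w in tokens for g in char_ngrams(w.lower(), 2))
```

In the case-sensitive counts "Ha" and "ha" are distinct bigrams, while the downcased variant merges them.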
Tables listing linguistic entities that have been constructed on the basis of the annotated corpus:
Lemmata are uninflected forms or headwords that have been associated with the actual types from the corpus. The syllables detected are the result of our automated syllabification process.
Tables listing all types together with their orthographic neighbors:
- Neighbors Coltheart
- Neighbors Levenshtein
- Neighbors Coltheart DC (downcased)
- Neighbors Levenshtein DC (downcased)
While the number of orthographic neighbors of each type (and their cumulative frequency) is given in the Types table, these very large tables list the neighboring types themselves.
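The two neighborhood notions can be sketched in Python as follows. The sketch assumes the usual definitions: a Coltheart neighbor has the same length and differs in exactly one character, and a Levenshtein neighbor is any type at edit distance 1 (one substitution, insertion or deletion). Whether dlexDB uses exactly these criteria is an assumption here, and the function names and the small lexicon are hypothetical.

```python
def is_coltheart_neighbor(a, b):
    """Same length, exactly one differing character (one substitution)."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def is_levenshtein_neighbor(a, b):
    """Edit distance exactly 1: one substitution, insertion or deletion."""
    if len(a) == len(b):
        return is_coltheart_neighbor(a, b)
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = sorted((a, b), key=len)
    # Deleting one character from the longer string must yield the shorter one.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighbors(word, lexicon, is_neighbor):
    """List all entries of lexicon that are neighbors of word, sorted."""
    return sorted(w for w in lexicon if is_neighbor(word, w))


# Toy lexicon (made up for illustration).
lexicon = ["Haus", "Maus", "Laus", "aus", "Hauses", "Hans"]

coltheart = neighbors("Haus", lexicon, is_coltheart_neighbor)
levenshtein = neighbors("Haus", lexicon, is_levenshtein_neighbor)
```

Under these definitions every Coltheart neighbor is also a Levenshtein neighbor, so the Levenshtein list is a superset; for "Haus" it additionally contains "aus". This pairwise comparison is quadratic in the lexicon size and is meant only to illustrate the definitions, not as a way to build tables of this size.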
Current version
- 0.3
- New tables: all measures are now also available in a case-insensitive variant.