2016-10-10 End of project
With the end of the research project 'dlexDB', the data and website are maintained by the Berlin-Brandenburg Academy of Sciences and Humanities. At the moment, we are working hard on transferring the data to the infrastructure of the Digital Dictionary of German (www.dwds.de), where future upgrades and extensions of the lexical database will be accessible. Check our website for news and updates.
2013-05-03 New: dlexDB JSON API
As of today, dlexDB offers access to its tables via a RESTful web service that delivers results in JSON format.
Example: Retrieve low-frequency types with a high-frequency prefix (first three characters), ordered by frequency descending, first 20 results:
More information on our API overview page.
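The selection described in the example can be illustrated offline with a minimal Python sketch over a small in-memory list. The field names (`type`, `freq`, `prefix_freq`), the thresholds and the data are hypothetical; they are not dlexDB's actual column names or API parameters, and the sketch does not call the web service.

```python
# Hypothetical records; field names and values are made up for illustration.
rows = [
    {"type": "Abendrot", "freq": 3, "prefix_freq": 5000},
    {"type": "Abend", "freq": 900, "prefix_freq": 5000},
    {"type": "Xylophon", "freq": 2, "prefix_freq": 40},
    {"type": "Abendmahl", "freq": 7, "prefix_freq": 5000},
]


def low_freq_high_prefix(rows, max_freq=10, min_prefix_freq=1000, limit=20):
    """Low-frequency types whose three-character prefix is frequent,
    ordered by frequency descending and truncated to `limit` results."""
    hits = [
        r for r in rows
        if r["freq"] <= max_freq and r["prefix_freq"] >= min_prefix_freq
    ]
    hits.sort(key=lambda r: r["freq"], reverse=True)
    return hits[:limit]


result = [r["type"] for r in low_freq_high_prefix(rows)]
print(result)  # ['Abendmahl', 'Abendrot']
```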
2012-09-06 Workshop 2011: Proceedings published
The proceedings of our 2011 workshop have been published in the Potsdam Cognitive Science Series (Vol. 3). Included are the following contributions:
- R. Harald Baayen: Resource requirements for neo-generative modeling in (psycho)linguistics
- Lara Kresse, Stefan Kirschner, Stefanie Dipper, Eva Belke: Towards exploring the specific influences of wordform frequency, lemma frequency and OLD20 on visual word recognition and reading aloud
- Emmanuel Keuleers, Marc Brysbaert, Boris New: An evaluation of the Google Books ngrams for psycholinguistic research
- Julian Heister, Reinhold Kliegl: Comparing word frequencies from different German text corpora
- Heike Zinsmeister, Eva Smolka: Corpus-based evidence for approximating semantic transparency of complex verbs
- Benny B. Briesemeister, Markus J. Hofmann, Lars Kuchinke, Arthur M. Jacobs: The BAWL databases in research on emotional word processing
2012-06-22 New tables, all measures in case-insensitive variant
dlexDB's new version 0.3 brings ten new tables. Six of them are downcased variants of tables already present:
In the new downcased tables, each row aggregates several case variants of a type (as occurring in the corpus) to a single, downcased representation. This representation is artificial and may or may not occur in the corpus verbatim. Together with these downcased representations we provide all the numerical measures you are already familiar with, but calculated case-insensitively. For example, you would get case-insensitive frequencies or neighborhood measures from the downcased tables.
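As a rough illustration of this aggregation, the following Python sketch sums the frequencies of all case variants under one lowercased key. The counts are made up, and the use of `str.lower()` is an assumption; the normalization actually applied in dlexDB may differ.

```python
from collections import defaultdict


def downcase_frequencies(type_freqs):
    """Sum the frequencies of all case variants under one lowercased key."""
    merged = defaultdict(int)
    for type_, freq in type_freqs.items():
        merged[type_.lower()] += freq
    return dict(merged)


# Made-up counts for illustration only.
counts = {"Essen": 120, "essen": 80, "ESSEN": 2, "Haus": 300}
dc = downcase_frequencies(counts)
print(dc)  # {'essen': 202, 'haus': 300}
```

Note that `haus` appears in the result although only `Haus` is in the input, which mirrors the point that a downcased representation need not occur verbatim.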
Additionally, there are new bigrams and trigrams tables. We've had bigrams and trigrams of annotated types before, but in order to make it easier for users interested in purely orthographic frequency, there are now separate tables for bigrams and trigrams of types and downcased types, too:
| | Annotated types | Types (original) | Types (downcased) |
|---|---|---|---|
| Bigrams | Annotated type bigrams | Type bigrams | Type bigrams DC |
| Trigrams | Annotated type trigrams | Type trigrams | Type trigrams DC |
There is also a functional improvement on the website: list search is now possible with all tables. You can even upload a list of type bigrams or trigrams and run a query against dlexDB based on it.
The user documentation has also been extended and improved. Please visit the following pages to learn about the new tables and about the list search facility:
2011-11-13 Continued Funding of dlexDB
We are happy to announce that dlexDB received DFG funding to continue its work. Please expect new data and updates soon!
2011-09-18 dlexDB usage on the rise
We are happy to announce that dlexDB sees more and more users in the third year of its development. Users are executing about 3000 database queries per month now. Please continue sending your questions, suggestions and feedback via the contact form.
2011-09-13 Download/export: Full column names
Registered users may download up to 10,000 rows from any result set. The resulting CSV file has the column names in its first line. Until recently, these column names were given in a short form that stems from our internal database abstraction layer. From now on, the full natural-language column descriptions are given instead, as in the results table on the website. Please find full documentation of all of dlexDB's variables here.
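Because the first line carries the column names, such an export can be read by name rather than by position. The sketch below uses Python's standard `csv` module on an in-memory stand-in for a downloaded file; the column names, the comma delimiter and the values are assumptions for illustration, not taken from an actual dlexDB export.

```python
import csv
import io

# Stand-in for a downloaded export; column names and values are illustrative only.
export = io.StringIO(
    "Type,Absolute type frequency\n"
    "Haus,300\n"
    "Maus,12\n"
)

reader = csv.DictReader(export)  # takes the column names from the first line
rows = list(reader)
columns = reader.fieldnames

print(columns)
print(rows[0]["Type"])
```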
2011-07-01 New variables: word bigrams and word trigrams: conditional probabilities; orthographic frequency
In version 0.2.3 we introduced the Avg. cond. prob., in bigrams column for types in dlexDB. Today we present the underlying measures, i.e. the conditional probabilities of type bigrams, given the first component, and of type trigrams, given the initial pair.
Also for Annotated type bigrams and Annotated type trigrams we have now included purely orthographically based frequencies. Update: Purely orthographically based frequencies are now available in the new Type bigrams and Type trigrams tables. In previous versions, only annotated frequencies were given, i.e., the frequencies for a word bigram or trigram were only given separately for each of its (possibly multiple) morphosyntactic analyses. Finally, we have made access to specific bigrams or trigrams easier by implementing a field on the respective query forms where the bigram or trigram can be entered as a whole (separated by spaces).
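The two conditional probabilities can be sketched as plain relative frequencies: the count of an n-gram divided by the total count of all n-grams that share its context (the first word of a bigram, or the first two words of a trigram). This is a minimal Python sketch with made-up counts; how dlexDB estimates these values in detail is not described here, so the estimator is an assumption.

```python
from collections import Counter


def conditional_probabilities(ngram_counts):
    """Map each n-gram to P(last word | preceding words), estimated as
    count(n-gram) / total count of n-grams sharing the same context."""
    context_totals = Counter()
    for ngram, count in ngram_counts.items():
        context_totals[ngram[:-1]] += count
    return {
        ngram: count / context_totals[ngram[:-1]]
        for ngram, count in ngram_counts.items()
    }


# Made-up counts for illustration only.
bigrams = {("das", "Haus"): 6, ("das", "Auto"): 2, ("ein", "Haus"): 4}
trigrams = {("in", "das", "Haus"): 3, ("in", "das", "Auto"): 1}

bigram_probs = conditional_probabilities(bigrams)
trigram_probs = conditional_probabilities(trigrams)
print(bigram_probs[("das", "Haus")])         # 0.75
print(trigram_probs[("in", "das", "Auto")])  # 0.25
```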
Current version of dlexDB: 0.2.5
2011-06-01 Update: Familiarity, regularity and frequencies of word beginnings fixed
In the Types table the following columns have been updated: Familiarity, Regularity, Initial letter, Initial bigram and Initial trigram. In the previous version these columns contained incorrect values for certain types, especially for types with diacritical marks (Umlaute).
These new values are still expected to change slightly as more errors and corpus artifacts are eliminated from the database.
Current version of dlexDB: 0.2.4
2011-04-04 Conditional probability and information content of types
The Types table has been extended with new columns Avg. cond. prob., in bigrams and Avg. inf. cont., in bigrams. The average information content of a type is a measure that has only recently been discussed by Piantadosi et al., 2011. In their paper, Piantadosi et al. show that average information content of a type is a better predictor of type length than type frequency is. The authors consider their result a refinement of the theory by Zipf, 1936, according to which there is a correlation between type length and type frequency.
Piantadosi et al. used the so-called Google 5-grams (Web) as their corpus base. The Google 5-grams is a large corpus of Web content in ten European languages, including German. dlexDB now provides Piantadosi's measures calculated on the basis of our reference corpus DWDS. This should make it possible to reproduce the authors' results for German on the basis of a well-balanced corpus of printed sources of the 20th century.
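As a sketch of the idea, the average information content of a word can be computed from bigram counts as the mean of -log2 P(word | previous word) over all occurrences of the word. The Python example below uses made-up counts and assumes the preceding word as the context and a plain relative-frequency estimate; the exact definition used for the dlexDB column may differ.

```python
import math
from collections import Counter, defaultdict


def average_information_content(bigram_counts):
    """For each word w: -(1/N) * sum over its N occurrences of
    log2 P(w | previous word), with P estimated by relative frequency."""
    context_totals = Counter()
    word_totals = Counter()
    for (prev, word), count in bigram_counts.items():
        context_totals[prev] += count
        word_totals[word] += count

    info = defaultdict(float)
    for (prev, word), count in bigram_counts.items():
        p = count / context_totals[prev]
        info[word] -= count * math.log2(p)
    return {word: info[word] / word_totals[word] for word in word_totals}


# Made-up counts for illustration only.
bigrams = {("das", "Haus"): 6, ("das", "Auto"): 2, ("ein", "Haus"): 4}
aic = average_information_content(bigrams)
print(aic["Auto"])  # 2.0
```

In this toy data, `Auto` is less predictable from its context than `Haus`, so its average information content is higher.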
Current version of dlexDB: 0.2.3
2011-03-31 dlexDB-Workshop archived; dlexDB overview article (in German)
On March 28, 2011, the dlexDB project organized a workshop on “Lexical Resources in Psycholinguistic Research” as a satellite event to the QITL-4 conference on “Quantitative Investigations in Theoretical Linguistics”. The workshop program with (extended) abstracts has been archived on this page, where the full proceedings will also be made available soon.
The publications section on this website has been updated. In January 2011, an overview article on dlexDB was published (in German) in “Psychologische Rundschau”, the journal of the “Deutsche Gesellschaft für Psychologie”. Please find this article and more on the publications page.
2011-01-18 dlexDB-Workshop: Deadline extended
Please note that the deadline for submissions regarding our workshop in March has been extended until Monday, Jan 31, 2011. For further details, please see the announcement below.
2010-12-28 New tables: Neighbors Coltheart, Neighbors Levenshtein
With the current version of dlexDB, v0.2.2, two new tables, Neighbors Coltheart and Neighbors Levenshtein, have been made available. In these tables each type from Types is listed together with its orthographic neighbors (edit distance 1) according to the definitions by Coltheart et al. (1977) and Levenshtein (1966). The number of orthographic neighbors for each type, and their cumulative frequency, is still available from the Types table.
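The difference between the two neighborhood definitions can be sketched in Python as follows: a Coltheart neighbor has the same length and differs by exactly one substituted character, while a Levenshtein neighbor at edit distance 1 may also differ by one inserted or deleted character. The toy lexicon is made up, and details such as case handling in dlexDB are not specified here, so this is an illustration under those assumptions.

```python
def is_coltheart_neighbor(a, b):
    """Same length, differing in exactly one position (one substitution)."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def is_levenshtein_neighbor(a, b):
    """Edit distance exactly 1: one substitution, insertion, or deletion."""
    if is_coltheart_neighbor(a, b):
        return True
    if abs(len(a) - len(b)) != 1:
        return False
    shorter, longer = sorted((a, b), key=len)
    # Deleting one character from the longer word must yield the shorter one.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


# Toy lexicon for illustration only.
lexicon = ["Maus", "Hans", "aus", "Hauses", "Hausa"]
coltheart = [w for w in lexicon if is_coltheart_neighbor("Haus", w)]
levenshtein = [w for w in lexicon if is_levenshtein_neighbor("Haus", w)]
print(coltheart)    # ['Maus', 'Hans']
print(levenshtein)  # ['Maus', 'Hans', 'aus', 'Hausa']
```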
Also as of v0.2.2, for any frequency or count given in dlexDB, two variants of frequency rank are given in addition to the absolute, normalized, and logarithmized values.
2010-11-23 Workshop Lexical Resources in Psycholinguistic Research 28-Mar-2011
Experimental and quantitative research in the field of human language processing and production strongly depends on the quality of the underlying language material: besides its size, its representativeness, variety and balance have been discussed as important factors that influence the design, analysis and interpretation of experiments and their results. The workshop aims to bring together creators and users of both general-purpose and specialized lexical resources which are used in psychology, psycholinguistics, neurolinguistics and cognitive research. It will be a forum to report experiences and results, review problems and discuss perspectives of any linguistic data used in the field.
Invited speaker: R. Harald Baayen, University of Alberta, Canada
Call for Participation
We invite researchers to contribute by submitting abstracts for 30-minute talks with an additional 10 minutes for discussion. The abstracts should reflect your work on or with lexical resources in the aforementioned research areas and should not exceed 1000 words (excluding figures, tables and references). Please send your submissions electronically to info@dlexdb.de by Jan 31, 2011 (extended from Jan 15, 2011). Each abstract will be reviewed by at least two members of the review board.
QITL-4
The workshop “Lexical Resources in Psycholinguistic Research” takes place on March 28, 2011, at the 4th Conference on Quantitative Investigations in Theoretical Linguistics (QITL-4). Details will soon be available on the website.
Workshop organizers: Reinhold Kliegl (University of Potsdam) · Alexander Geyken (BBAW) · Julian Heister (University of Potsdam) · Edmund Pohl (BBAW) · Kay-Michael Würzner (University of Potsdam)
Review Board: Olaf Dimigen (Humboldt University Berlin) · Bryan Jurish (BBAW) · Emmanuel Keuleers (Ghent University) · Wolfgang Klein (MPI for Psycholinguistics Nijmegen) · Astrid Schröder (University of Potsdam) · Shravan Vasishth (University of Potsdam) · Christiane Wotschack (Free University Berlin)
2010-08-02 Saving queries for later reference
New: Save your queries for later reference in your user profile on dlexDB. Each saved query has a unique URL which you can bookmark for direct access, publish, or use, e.g., in email communication. (Saved queries may be declared private or public.)
This is an example of a public saved query: http://dlexdb.de/Qgmilhc
Please find more information here: Save query
Direct link to the query form: dlexDB query
2010-06-12 New tables: characters, character bigrams, character trigrams
The new version 0.2.1 of dlexDB provides three new tables: Characters, Character bigrams and Character trigrams hold all the characters, character bigrams and character trigrams, respectively, from all types in dlexDB. For each item, both token frequency (number of occurrences in the underlying corpus) and type frequency (number of occurrences in the list of types) are given.
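The distinction between the two frequencies can be sketched in Python: each character n-gram found in a type adds the type's corpus frequency to the n-gram's token frequency, and adds one to its type frequency. The counts are made up, and whether dlexDB counts repeated occurrences within a single type in exactly this way is an assumption.

```python
from collections import Counter


def char_ngram_frequencies(type_freqs, n):
    """Return (token_freq, type_freq) counters for character n-grams."""
    token_freq = Counter()
    type_freq = Counter()
    for type_, freq in type_freqs.items():
        for i in range(len(type_) - n + 1):
            ngram = type_[i:i + n]
            token_freq[ngram] += freq  # weighted by corpus occurrences of the type
            type_freq[ngram] += 1      # each occurrence in the type list counts once
    return token_freq, type_freq


# Made-up counts for illustration only.
counts = {"Haus": 300, "Maus": 12, "aus": 50}
token_bi, type_bi = char_ngram_frequencies(counts, 2)
print(token_bi["au"], type_bi["au"])  # 362 3
```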
Based on the new sublexical measures, the Types table has been extended with six additional columns: for each type, the cumulative frequencies of its constituent characters, character bigrams and character trigrams are given.
The Types table now also shows the cumulative syllable frequency for each type.
Another new variable in the Types table is the point of orthographical uniqueness. Additionally, the point of lemma uniqueness is given as an experimental measure. Lemma uniqueness is defined as the point where it is possible to single out the underlying lemma.
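One common reading of the point of orthographic uniqueness is the length of the shortest prefix that a type shares with no other type. The Python sketch below implements that reading on a toy lexicon; the exact definition and indexing used in dlexDB are not given here, so both the function and its treatment of types that are prefixes of other types (returning `None`) are assumptions.

```python
def uniqueness_point(word, lexicon):
    """1-based length of the shortest prefix of `word` shared with no other
    entry in `lexicon`; None if `word` is itself a prefix of another entry."""
    others = [w for w in lexicon if w != word]
    for length in range(1, len(word) + 1):
        prefix = word[:length]
        if not any(other.startswith(prefix) for other in others):
            return length
    return None


# Toy lexicon for illustration only.
lexicon = ["Haus", "Hausen", "Hut", "Maus"]
points = {w: uniqueness_point(w, lexicon) for w in lexicon}
print(points)  # {'Haus': None, 'Hausen': 5, 'Hut': 2, 'Maus': 1}
```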
2010-05-08 New tables: word level bigrams, word level trigrams, and syllables
The new version 0.2 of dlexDB provides an extension to the supralexical level (multiword level): The new Annotated type bigrams and Annotated type trigrams tables hold all the word bigrams and word trigrams, and their respective frequencies, from the underlying corpus.
At the same time, dlexDB is being extended to the sublexical level: The Syllables table lists all the syllables from all the types in dlexDB. For each syllable, both its token frequency (number of occurrences in the corpus) and its type frequency (number of occurrences in the list of types) are given.
The query interface has been extended and redesigned: The table selection panel has been moved to the upper right corner of the working space. For each table, all its filters and output options are now aggregated in a hierarchical tree-like structure on the right hand side of the screen.
2009-08-05 Public beta release
Public beta release of dlexDB version 0.1. Three tables are provided: The Types table contains all types occurring in the underlying corpus (approx. 2.3 million). The Annotated types table contains all annotated types (approx. 2.7 million): two annotated types may be orthographically identical, differing only in the part-of-speech tags associated with them (based on morphosyntactic analysis). The Lemmata table contains the 1.8 million lemmata associated with the types.
In addition to frequency, a first selection of measures relevant to psycholinguistic research is provided: familiarity, regularity, frequency of word-initial character bigram and neighborhood measures (edit distance). Please refer to the Documentation for a full list.
Current version
- 0.3: New tables, all measures in case-insensitive variant.
Workshop 2011
The proceedings of our workshop on Lexical Resources in Psycholinguistic Research (March 2011) have been published in the Potsdam Cognitive Science Series. Workshop page