ticclat.ticclat_schema module

SQLAlchemy schema of the TICCLAT database.

Contains all the tables of the database and their connections, defined as SQLAlchemy declarative_base subclasses.

Many of the tables here defined are based on an INT lexicon database created in the IMPACT project (https://ivdnt.org/images/stories/onderzoek_en_onderwijs/publicaties/impact/impact_lexicon_structure.pdf). See https://github.com/TICCLAT/docs/blob/master/database_design.md for more information about the database design.

Based on this, in TICCLAT, we added tables for:

- links between wordforms
- morphological paradigm groups of wordforms
- anagram hashes from TICCL
- spelling variants from TICCL
- identifiers linking wordforms to external sources like the WNT, MNW, INT.

class ticclat.ticclat_schema.Anahash(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing anahashes.

The anahashes in this table have no direct relation to the wordforms; those links are tracked in the wordforms table. This was done so that the anahashes table can be searched efficiently, e.g. for ranges in anahash “space”.
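A minimal sketch of such a range search follows. It is not the TICCLAT implementation: it uses plain SQL through the standard-library sqlite3 module instead of SQLAlchemy, the table layout is reduced to the columns named on this page, and `toy_anahash`, `add_wordform` and `wordforms_in_anahash_range` are hypothetical helpers introduced only for illustration.

```python
import sqlite3


def make_connection():
    """Create an in-memory database with two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE anahashes (
            anahash_id INTEGER PRIMARY KEY,
            anahash    INTEGER NOT NULL UNIQUE  -- UNIQUE also creates an index
        );
        CREATE TABLE wordforms (
            wordform_id INTEGER PRIMARY KEY,
            wordform    TEXT NOT NULL UNIQUE,
            anahash_id  INTEGER REFERENCES anahashes (anahash_id)
        );
        """
    )
    return conn


def toy_anahash(word):
    """Illustrative anagram key: anagrams of a word get the same value."""
    return sum(ord(ch) ** 5 for ch in word)


def add_wordform(conn, word):
    """Store a wordform; the link to its anahash lives in the wordforms table."""
    value = toy_anahash(word)
    conn.execute("INSERT OR IGNORE INTO anahashes (anahash) VALUES (?)", (value,))
    (anahash_id,) = conn.execute(
        "SELECT anahash_id FROM anahashes WHERE anahash = ?", (value,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO wordforms (wordform, anahash_id) VALUES (?, ?)",
        (word, anahash_id),
    )


def wordforms_in_anahash_range(conn, low, high):
    """Select a range on the anahashes table first, then join the wordforms."""
    rows = conn.execute(
        """
        SELECT w.wordform
        FROM anahashes AS a
        JOIN wordforms AS w ON w.anahash_id = a.anahash_id
        WHERE a.anahash BETWEEN ? AND ?
        ORDER BY w.wordform
        """,
        (low, high),
    )
    return [row[0] for row in rows]


conn = make_connection()
for word in ["dog", "god", "cat", "act", "bird"]:
    add_wordform(conn, word)
key = toy_anahash("dog")
print(wordforms_in_anahash_range(conn, key, key))  # ['dog', 'god']
```

Because each distinct anahash value is stored once, the range condition only touches the small, indexed anahashes table before the join to the wordforms.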

anahash
anahash_id
class ticclat.ticclat_schema.Corpus(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing corpus metadata.

corpus_documents
corpus_id
name
class ticclat.ticclat_schema.Document(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing document metadata.

author
document_corpora
document_id
document_wordforms
editor
encoding
language
other_languages
parent_document
persistent_id
pub_year
publisher
publishing_location
region
spelling
text_type
title
word_count
year_from
year_to

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing ids from external sources of wordforms.

Used for linking wordforms to external sources, such as the WNT, MNW, INT.

source_id
source_name
wordform_id
class ticclat.ticclat_schema.Lexicon(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing lexicon metadata.

vocabulary (bool): if True, all words in this lexicon are (supposed to be) valid words; if False, some are misspelled.
lexicon_id
lexicon_name
lexicon_wordforms
vocabulary
class ticclat.ticclat_schema.MorphologicalParadigm(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing information about morphological paradigms of wordforms.

The paradigms are determined according to Reynaert’s method (to be published).

V
W
X
Y
Z
paradigm_id
word_type_code
word_type_number
wordform_id
class ticclat.ticclat_schema.TextAttestation(document, wordform, frequency)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing text attestations.

A text attestation entry is defined in the INT schema as the occurrence and frequency of wordforms in documents.
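As a rough illustration of what one attestation row holds, the sketch below counts wordforms per document and returns (document_id, wordform, frequency) tuples. The function name `text_attestations` and the whitespace tokenization are assumptions made for this example; the real table refers to wordforms by wordform_id rather than by their text.

```python
from collections import Counter


def text_attestations(documents):
    """Return (document_id, wordform, frequency) rows for a dict of texts.

    `documents` maps a document id to its text. Tokens are split on
    whitespace, which is a simplification chosen for this sketch.
    """
    rows = []
    for document_id, text in documents.items():
        counts = Counter(text.split())
        for wordform, frequency in sorted(counts.items()):
            rows.append((document_id, wordform, frequency))
    return rows


rows = text_attestations({1: "de kat en de hond", 2: "de hond"})
print(rows[0])  # (1, 'de', 2)
```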

attestation_id
document_id
frequency
ta_document
ta_wordform
wordform_id
class ticclat.ticclat_schema.TicclatVariant(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Contains spelling variants of words, ingested from TICCL.

frequency
levenshtein_distance
ticclat_variant_id
wordform
wordform_source
wordform_source_id
class ticclat.ticclat_schema.Wordform(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing wordforms and associated anahashes.

anahash
anahash_id

Add WordformLinks between self and another wordform and vice versa.

The WordformLinks are added only if the link does not yet exist.

Inputs:
wordform (Wordform): Wordform that is related to Wordform self.

Add a spelling correction WordformLink.

This method sets the booleans that indicate which Wordforms are correct (according to the lexicon).

Inputs:
corr (Wordform): A correction candidate of Wordform self
lexicon (Lexicon): The Lexicon that contains the WordformLink

Add WordformLinks with metadata.

Adds a WordformLink between self and another wordform, and vice versa, if these links are not yet in the database. Also adds a WordformLinkSource, with the Lexicon and information about which Wordforms are correct according to the Lexicon. No duplicate WordformLinkSources are added.

TODO: add UniqueConstraint on (wf_from (self), wf_to, lexicon)?

Inputs:
wf_to (Wordform): Wordform self will be linked to (and vice versa)
wf_from_correct (boolean): True if Wordform self is correct according to the lexicon, False otherwise.
wf_to_correct (boolean): True if Wordform wf_to is correct according to the lexicon, False otherwise.
lexicon (Lexicon): The Lexicon that contains the WordformLink
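The linking behaviour described above can be sketched in memory as follows. This is an illustration under stated assumptions, not the TICCLAT code: `LinkStore` and its method names are hypothetical, wordforms and lexica are represented as plain strings, and recording a source entry for both directions of a link is an assumption of this sketch.

```python
class LinkStore:
    """In-memory stand-in for the link and link-source tables (hypothetical)."""

    def __init__(self):
        # Directed links, stored as (wf_from, wf_to) pairs.
        self.links = set()
        # (wf_from, wf_to, lexicon) -> (wf_from_correct, wf_to_correct)
        self.sources = {}

    def link(self, wf_from, wf_to):
        """Add the link in both directions; existing links are left alone."""
        self.links.add((wf_from, wf_to))
        self.links.add((wf_to, wf_from))

    def link_with_metadata(self, wf_from, wf_to, wf_from_correct,
                           wf_to_correct, lexicon):
        """Add both links plus one source entry per direction, without duplicates."""
        self.link(wf_from, wf_to)
        # setdefault keeps an existing entry, so repeated calls add nothing.
        self.sources.setdefault(
            (wf_from, wf_to, lexicon), (wf_from_correct, wf_to_correct)
        )
        self.sources.setdefault(
            (wf_to, wf_from, lexicon), (wf_to_correct, wf_from_correct)
        )

    def link_spelling_correction(self, wf, corr, lexicon):
        """Assumes the original form is incorrect and the correction is correct."""
        self.link_with_metadata(wf, corr, False, True, lexicon)


store = LinkStore()
store.link_spelling_correction("teh", "the", "demo lexicon")
store.link_spelling_correction("teh", "the", "demo lexicon")  # no duplicates
print(len(store.links), len(store.sources))  # 2 2
```

Keying the source entries on (wf_from, wf_to, lexicon) plays the role that the unique constraint mentioned in the TODO would play in the database.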

wf_lexica
wordform
wordform_documents
wordform_id
wordform_lowercase
class ticclat.ticclat_schema.WordformFrequencies(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Materialized view containing overall frequencies of wordforms.

The data in this table can be used to filter wordforms on frequency. This is necessary, because there is a lot of noise in the wordforms table, and this makes aggregating over all wordforms expensive.
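The idea of precomputing the totals once and filtering on them afterwards can be sketched as below. The example uses the standard-library sqlite3 module, which has no materialized views, so an ordinary table built from an aggregate query stands in for one. The table layouts and the helpers `make_demo_db`, `build_frequency_table` and `frequent_wordforms` are assumptions made for illustration.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory database with wordforms and attestations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wordforms (
            wordform_id INTEGER PRIMARY KEY,
            wordform    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE text_attestations (
            attestation_id INTEGER PRIMARY KEY,
            document_id    INTEGER NOT NULL,
            wordform_id    INTEGER NOT NULL REFERENCES wordforms (wordform_id),
            frequency      INTEGER NOT NULL
        );
        INSERT INTO wordforms VALUES (1, 'de'), (2, 'hond'), (3, 'hnod');
        INSERT INTO text_attestations VALUES
            (1, 1, 1, 2), (2, 2, 1, 1), (3, 1, 2, 1), (4, 2, 2, 1), (5, 2, 3, 1);
        """
    )
    return conn


def build_frequency_table(conn):
    """(Re)build the precomputed totals; rerun after new data is ingested."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS wordform_frequencies;
        CREATE TABLE wordform_frequencies AS
            SELECT w.wordform_id, w.wordform, SUM(ta.frequency) AS frequency
            FROM wordforms AS w
            JOIN text_attestations AS ta ON ta.wordform_id = w.wordform_id
            GROUP BY w.wordform_id, w.wordform;
        CREATE INDEX idx_wordform_frequencies_frequency
            ON wordform_frequencies (frequency);
        """
    )


def frequent_wordforms(conn, min_frequency):
    """Filter on the precomputed totals instead of aggregating every time."""
    rows = conn.execute(
        "SELECT wordform FROM wordform_frequencies "
        "WHERE frequency >= ? ORDER BY wordform",
        (min_frequency,),
    )
    return [row[0] for row in rows]


conn = make_demo_db()
build_frequency_table(conn)
print(frequent_wordforms(conn, 2))  # ['de', 'hond']
```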

frequency
wordform
wordform_id

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing links between wordforms.

linked_from
linked_to
wordform_from
wordform_to
class ticclat.ticclat_schema.WordformLinkSource(wflink, wf_from_correct, wf_to_correct, lexicon)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table for storing the sources of links between wordforms.

Wordform links are given by lexica (dictionaries, spelling correction lists, etc.). This table records which lexicon a given link between wordforms was originally ingested from.

anahash_difference
ld
lexicon_id
wfls_lexicon
wordform_from
wordform_from_correct
wordform_to
wordform_to_correct