ticclat.dbutils module

Collection of database access functions.

ticclat.dbutils.add_lexicon(session, lexicon_name, vocabulary, wfs, preprocess_wfs=True)[source]

wfs is a pandas DataFrame with the same column names as the database table, in this case just “wordform”.

Add wordforms from a lexicon with links to the database.

Lexica with links contain wordform pairs that are linked. The wfs dataframe must contain two columns, the from_column and the to_column, which hold the two words of each pair (one pair per row). Using the boolean arguments from_correct and to_correct, you can indicate whether each of these columns contains correct words. Typically, there are two types of linked lexica: True + True, meaning it links correct wordforms (e.g. morphological variants), or True + False, meaning it links correct wordforms to incorrect ones (e.g. a spelling correction list).

ticclat.dbutils.add_morphological_paradigms(session, in_file)[source]

Add morphological paradigms to database from CSV file.

ticclat.dbutils.add_ticcl_variants(session, name, df, **kwargs)[source]

Add TICCL variants as a linked lexicon.

ticclat.dbutils.bulk_add_anahashes(session, anahashes, tqdm_factory=None, batch_size=10000)[source]

anahashes is a pandas DataFrame with wordform as the index and one column, anahash.

ticclat.dbutils.bulk_add_wordforms(session, wfs, preprocess_wfs=True)[source]

wfs is a pandas DataFrame with the same column names as the database table, in this case just “wordform”.

ticclat.dbutils.connect_anahashes_to_wordforms(session, anahashes, df, batch_size=50000)[source]

Create the relation between wordforms and anahashes in the database.

Given anahashes, a dataframe with wordforms and corresponding anahashes, create the relations between the two in the wordforms and anahashes tables by setting the anahash_id foreign key in the wordforms table.
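The foreign-key update described above can be illustrated with a minimal sketch that uses the standard-library sqlite3 module instead of the TICCLAT database. Only the table names and the anahash_id column come from the description; the other column names, the example values, and the single-statement update are assumptions made for illustration, not the library's actual implementation.

```python
import sqlite3

# In-memory stand-in for the database; the schema is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE anahashes (anahash_id INTEGER PRIMARY KEY, anahash INTEGER UNIQUE);
    CREATE TABLE wordforms (
        wordform_id INTEGER PRIMARY KEY,
        wordform TEXT UNIQUE,
        anahash_id INTEGER REFERENCES anahashes (anahash_id)
    );
    INSERT INTO anahashes (anahash_id, anahash) VALUES (1, 1001), (2, 2002);
    INSERT INTO wordforms (wordform_id, wordform) VALUES (1, 'abc'), (2, 'cab'), (3, 'xyz');
""")

# Hypothetical wordform -> anahash value pairs (made-up numbers, not real anahashes).
pairs = {"abc": 1001, "cab": 1001, "xyz": 2002}

# Set the anahash_id foreign key of each wordform to the ID of its anahash row.
conn.executemany(
    "UPDATE wordforms "
    "SET anahash_id = (SELECT anahash_id FROM anahashes WHERE anahash = ?) "
    "WHERE wordform = ?",
    [(anahash, wordform) for wordform, anahash in pairs.items()],
)
conn.commit()

linked = dict(conn.execute("SELECT wordform, anahash_id FROM wordforms ORDER BY wordform_id"))
```

After the update, wordforms that share an anahash value point to the same row in the anahashes table.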

ticclat.dbutils.create_ticclat_database(delete_existing=False)[source]

Create the TICCLAT database.

Sets the proper encoding settings and uses the schema to create tables.

ticclat.dbutils.create_wf_frequencies_table(session)[source]

Create wordform_frequencies table in the database.

The text_attestations frequencies are summed and stored in this table. This can be used to save time when needing total-database frequencies.
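As a rough illustration of the summing step, the sketch below builds a totals table with the standard-library sqlite3 module. The table names follow the description above; the column names (wordform_id, document_id, frequency) and the CREATE TABLE ... AS SELECT approach are assumptions, and the real function may build the table differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed, simplified layout: one row per (wordform, document) pair.
    CREATE TABLE text_attestations (wordform_id INTEGER, document_id INTEGER, frequency INTEGER);
    INSERT INTO text_attestations VALUES (1, 10, 3), (1, 11, 4), (2, 10, 5);

    -- Precompute the total frequency of every wordform across all documents.
    CREATE TABLE wordform_frequencies AS
        SELECT wordform_id, SUM(frequency) AS frequency
        FROM text_attestations
        GROUP BY wordform_id;
""")

totals = dict(conn.execute("SELECT wordform_id, frequency FROM wordform_frequencies"))
```

Queries that need database-wide totals can then read from the small precomputed table instead of aggregating text_attestations every time.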

ticclat.dbutils.empty_table(session, table_class)[source]

Empty a database table.

  • table_class: the ticclat_schema class corresponding to the table

ticclat.dbutils.get_anahashes(session, anahashes, wf_mapping, batch_size=50000)[source]

Generator of dictionaries with anahash ID and wordform ID pairs.

Given anahashes, a dataframe with wordforms and corresponding anahashes, yield dictionaries containing two entries each: key ‘a_id’ has the value of the anahash ID in the database, key ‘wf_id’ has the value of the wordform ID in the database.
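The shape of the yielded dictionaries can be sketched without a database. In the example below, plain Python tuples and dicts stand in for the dataframe and the database lookups; the function name iter_anahash_links and the anahash_ids lookup are hypothetical, and batching is left out.

```python
def iter_anahash_links(rows, anahash_ids, wf_mapping):
    """Yield one {'a_id': ..., 'wf_id': ...} dictionary per (wordform, anahash) row.

    rows:        iterable of (wordform, anahash) pairs (stand-in for the dataframe)
    anahash_ids: dict mapping anahash value -> anahash ID (hypothetical lookup)
    wf_mapping:  dict mapping wordform -> wordform ID
    """
    for wordform, anahash in rows:
        yield {"a_id": anahash_ids[anahash], "wf_id": wf_mapping[wordform]}


# Made-up example data.
rows = [("abc", 1001), ("cab", 1001), ("xyz", 2002)]
anahash_ids = {1001: 1, 2002: 2}
wf_mapping = {"abc": 11, "cab": 12, "xyz": 13}

links = list(iter_anahash_links(rows, anahash_ids, wf_mapping))
```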

ticclat.dbutils.get_db_name()[source]

Get the database name from the DATABASE_URL environment variable.
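One plausible way to derive a database name from such a URL is sketched below with the standard library. The helper name db_name_from_env and the example URL are hypothetical, and whether ticclat parses the variable this way is an assumption.

```python
import os
from urllib.parse import urlparse


def db_name_from_env(env=os.environ):
    """Return the database name, i.e. the path component of DATABASE_URL."""
    url = env["DATABASE_URL"]
    # For 'dialect://user:password@host:port/name' the path is '/name'.
    return urlparse(url).path.lstrip("/")


# A plain dict stands in for os.environ so the example has no side effects.
example_env = {"DATABASE_URL": "mysql://user:secret@localhost:3306/ticclat_example"}
name = db_name_from_env(example_env)
```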

ticclat.dbutils.get_engine(without_database=False)[source]

Create an sqlalchemy engine using the DATABASE_URL environment variable.

ticclat.dbutils.get_or_create_wordform(session, wordform, has_analysis=False, wordform_id=None)[source]

Get a Wordform object for the given wordform.

The Wordform object is an sqlalchemy field defined in the ticclat schema. It is coupled to the entry of the given wordform in the wordforms database table.

ticclat.dbutils.get_session()[source]

Return an sqlalchemy session object using a sessionmaker from get_session_maker().

ticclat.dbutils.get_session_maker()[source]

Return an sqlalchemy sessionmaker object using an engine from get_engine().

ticclat.dbutils.get_wf_mapping(session, lexicon=None, lexicon_id=None)[source]

Create a dictionary with a mapping of wordforms to wordform_id.

The keys of the dictionary are wordforms, the values are the IDs of those wordforms in the database wordforms table.

ticclat.dbutils.get_word_frequency_df(session, add_ids=False)[source]

Can be used as input for ticcl-anahash.

Returns:
Pandas DataFrame containing wordforms as index and a frequency value as column, or None if all wordforms in the database are already connected to an anahash value.

ticclat.dbutils.session_scope(session_maker)[source]

Provide a transactional scope around a series of operations.
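A transactional scope of this kind is commonly written as a context manager that commits on success, rolls back on error, and always closes the session. The sketch below shows that general pattern with a stub session class; FakeSession and session_scope_sketch are hypothetical names, and the actual ticclat implementation may differ in detail.

```python
from contextlib import contextmanager


class FakeSession:
    """Stub that records which session methods were called (illustration only)."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


@contextmanager
def session_scope_sketch(session_maker):
    """Commit if the block succeeds, roll back if it raises, always close."""
    session = session_maker()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


# Successful block: the session is committed, then closed.
with session_scope_sketch(FakeSession) as ok_session:
    pass

# Failing block: the session is rolled back, then closed, and the error propagates.
try:
    with session_scope_sketch(FakeSession) as failed_session:
        raise ValueError("simulated failure")
except ValueError:
    pass
```

With a real SQLAlchemy sessionmaker in place of FakeSession, the same structure keeps each group of operations in one transaction.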

ticclat.dbutils.update_anahashes(session, alphabet_file, tqdm_factory=None, batch_size=50000)[source]

Add anahashes for all wordforms that do not have an anahash value yet.

Requires ticcl to be installed!

Inputs:
session: SQLAlchemy session object.
alphabet_file (str): the path to the alphabet file for ticcl.

ticclat.dbutils.update_anahashes_new(session, alphabet_file)[source]

Add anahashes for all wordforms that do not have an anahash value yet.

Requires ticcl to be installed!

Inputs:
session: SQLAlchemy session object.
alphabet_file (str): the path to the alphabet file for ticcl.

Write wordform links (obtained from lexica) to JSON files for later processing.

Two JSON files will be written to: links_file and sources_file. The links file contains only the wordform links and corresponds to the wordform_links database table. The sources file contains the source lexicon of each link and also whether either wordform is considered a “correct” form or not, which is defined by the lexicon (whether it is a “dictionary” with only correct words or a correction list with correct words in one column and incorrect ones in the other).