\def\title{UTFpatgen} \def\utfpatgen{{\tt utfpatgen}} \def\patgen{{\tt patgen}} \def\lefthyphenminpar{{\tt lefthyphenmin}} \def\righthyphenminpar{{\tt righthyphenmin}} \def\hyphstartpar{{\tt hyph\_start}} \def\hyphfinishpar{{\tt hyph\_finish}} \def\patstartpar{{\tt pat\_start}} \def\patfinishpar{{\tt pat\_finish}} \def\goodwtpar{{\tt good\_wt}} \def\badwtpar{{\tt bad\_wt}} \def\threshpar{{\tt thresh}} @** Introduction. This is \utfpatgen{} -- reimplementation of the classic \patgen{} program for pattern generation. With \utfpatgen{}, we intend to overcome several limitations of the original, such as the number of hyphenation levels possible, inability to use some reserved characters in dictionary, and most importantly, we enable native usage of the UTF-8 encoding in dictionaries that are no longer limited by the fixed number of lowercase characters permitted by \patgen{}. We provide \utfpatgen{} open-source and free of charge under MIT license. Please note that there is no warranty and despite our greatest effort, the program may contain bugs. @** Terminology. Before diving into the implementation part of the program, it is useful to mention several terms occurring frequently throughout the text. It may come useful especially to those who are not thoroughly familiar with \patgen{}. The definitions here are mostly informal and intended to ease reader's understanding of the topic. {\bf Hyphenation} is a process of splitting words so that they fit better to the paragraphs when typeset. It follows the linguistic rules set for the language in which the text is written. {\bf Pattern} is a sequence of characters containing hyphenation information tied to some position ({\bf dot position}) on the edges of the pattern or in between the characters. Whether this information tells that the position should or should not be hyphenated, we distinguish between {\bf hyphenating} and {\bf inhibiting patterns}. {\bf Hyphen} (in context of \patgen{} and \utfpatgen{}) is a mark between two characters holding the information whether that position should be split during hyphenation and whether this split was detected by current patterns. There are therefore 4 types of hyphens: {\bf NO\_HYF} (not in the data, not marked), {\bf MISS\_HYF} (in the data, not marked), {\bf BAD\_HYF} (not in the data, marked), and {\bf GOOD\_HYF} (in the data, marked). Strictly speaking, there is a hyphen between each pair of neighboring characters in the data, but NO\_HYF marking is omitted implicitly. {\bf Supporting occurrences} of a pattern in the data are those that are in favor of the hyphenation information it holds. For hyphenating patterns, these are the cases where the word has a MISS\_HYF on the dot position pointed by the pattern. For inhibiting patterns, dot position with a BAD\_HYF is a supporting one. {\bf Contradicting occurrences} are those that would break a correct hyphen, thus NO\_HYF for hyphenating, and GOOD\_HYF for inhibiting patterns. Note that an occurrence of a pattern can be neither supporting nor contradicting. {\bf Hyphenation level} marks the strength of a hyphenation. It is represented by a non-negative integer and creates a hierarchy of patterns. Pattern with higher level always takes precedence over any pattern with lower level. At odd levels, the patterns are hyphenating, at even levels inhibiting. {\bf Trie} is a data structure for storing n-ary trees. It can implemented using an associated array with highly effective insertion, deletion and lookup. Furthermore, we can condense it via {\it packing} to save space. @** Algorithm overview. In general, we tried to adhere as tightly as possible to the original ideas of \patgen{}. Therefore, you may find the names and functions similar to those in the \patgen{} technical report. We have nevertheless decided to rewrite several parts of the algorithm in more "modern" way to improve its readability and testability. The main points to mention are: \item{$\bullet$} the algorithm is implemented in CWEB (C being the language of the program), not WEB (with Pascal), \item{$\bullet$} instead of statically defining the sizes of structures (tries, buffers, etc.), these are allocated and reallocated dynamically, allowing for greater flexibility and possibly space savings, \item{$\bullet$} global variables are localized as much as possible, \item{$\bullet$} {\tt goto} statements are eliminated. You may also find a few unit tests appended to the code. These are by no means exhaustive, but feel free to run them and add your own. We decided to present \utfpatgen{} in top-down fashion -- starting with the full overview and moving to details in later sections. Same goes for the structures which have their own dedicated sections. @* Dependencies. All external libraries used in \utfpatgen{} come from the standard C package, so we hope it to be widely portable without greater trouble. All in all, we use fixed-size types from {\tt } and {\tt }, IO support from {\tt }, string manipulation methods from {\tt }, and memory management provided by {\tt }. @= #include "utfpatgen.h" #include @* Main method. The general flow of the program is rather simple: initialize structures, get the parameters, generate patterns, and optionally hyphenate the dictionary. For sure, it gets more complicated the deeper we dive. @c @@; # ifndef TEST int main(int argc, char *argv[]) { for (int i = 0; i < argc; i++) { if (strcmp(argv[i], "--help") == 0) { print_help(); return EXIT_SUCCESS; } else if (strcmp(argv[i], "--version") == 0) { print_version(); return EXIT_SUCCESS; } } print_version(); @; @; @; @; destroy_params(params); destroy_tr_table(tt); destroy_pattern_trie(pt); return EXIT_SUCCESS; } # endif @* Initialization sequence. First, the input parameters provided to the program call are read, validated and processed. Unless asking for help ({\tt --help}) or version printout ({\tt --version}), \utfpatgen{} takes exactly 4 inputs, representing 4 files: \item{$\bullet$} {\bf Dictionary file}: contains set of hyphenated words. \item{$\bullet$} {\bf Patterns file}: stores patterns generated in previous runs. \item{$\bullet$} {\bf Output file}: where the patterns will be stored after the run. \item{$\bullet$} {\bf Translate file}: contains the mapping of characters from the dictionary and dictionary-specific parameters. The required formats of these files are the same as for the \patgen{} program and are discussed in dedicated sections. @= struct params *params = init_params(); if (params == NULL){ return EXIT_FAILURE; } if (!parse_input(argv, argc, params)){ destroy_params(params); return EXIT_FAILURE; } struct translate_table *tt = init_tr_table(256, 128); if (tt == NULL){ destroy_params(params); return EXIT_FAILURE; } if (!read_translate(params, tt)){ destroy_params(params); destroy_tr_table(tt); return EXIT_FAILURE; } struct pattern_trie *pt = init_pattern_trie(256, 128); if (pt == NULL){ destroy_params(params); destroy_tr_table(tt); return EXIT_FAILURE; } struct pass_stats ps; if (!read_patterns(params, pt, tt, &ps)){ destroy_params(params); destroy_tr_table(tt); destroy_pattern_trie(pt); return EXIT_FAILURE; } @* Level range specification. Besides the command line parameters, the program prompts for the hyperparameters of the algorithm. The \hyphstartpar{} and \hyphfinishpar{} specify the range of hyphenation levels covered during the run. The program can generate patterns up to level 255. @= printf("hyph_start (lowest -), hyph_finish (highest hyphenation level): "); size_t hyph_start, hyph_finish; int result; while (true){ result = scanf("%zu %zu", &hyph_start, &hyph_finish); if (result == 2 && hyph_start >= 1 && hyph_start <= 255 && hyph_finish >= 1 && hyph_finish <= 254){ break; } else { printf("Error: Specify 1 <= hyph_start, hyph_finish <= 255! Insert again: "); while(getchar() != '\n'); } } params->hyph_start = (uint8_t) hyph_start; if (hyph_start > hyph_finish){ params->hyph_finish = ps.max_level; printf("Warning: hyph_start > hyph_finish, using hyph_finish = %u\n", ps.max_level); } else { params->hyph_finish = (uint8_t) hyph_finish; } @* Pattern generation loop. The algorithm runs for each level within the specified range, going from \hyphstartpar{} up. The program prompts for level-specific hyperparameters, generates and prunes the patterns. @= size_t pat_start, pat_finish, good_wt, bad_wt, thresh; for (size_t i = params->hyph_start; i <= params->hyph_finish; i++){ params->hyph_level = i; ps.level_pattern_cnt = 0; if (params->hyph_level > params->hyph_start) { printf("\n"); } if (params->hyph_start <= ps.max_level) { printf("Warning: Largest hyphenation value %u in patterns should be less than hyph_start\n", ps.max_level); } @; @; if (!delete_bad_patterns(pt)){ destroy_params(params); destroy_tr_table(tt); destroy_pattern_trie(pt); return EXIT_FAILURE; } printf("total of %zu patterns at hyph_level %u\n", ps.level_pattern_cnt, params->hyph_level); } @* Level hyperparameters input. The program again prompts for hyperparameter input. Firstly, it asks for \patstartpar{} and \patfinishpar{} that define the length range of patterns for respective level. The maximum length of a pattern in \utfpatgen{} is set to 255. Subsequently, the user is prompted to insert the three weights \goodwtpar{}, \badwtpar{} and \threshpar{}. These define the acceptance criteria for candidate patterns -- in order to accept the pattern during the ongoing iteration, its number of {\it good} (supporting) and {\it bad} (contradicting) occurrences must make the following inequality to hold: $ good * good\_wt - bad * bad\_wt \geq thresh $ The maximum value for the weights and threshold in \utfpatgen{} is set to 255. @= printf("pat_start (shortest -), pat_finish (longest pattern explored): "); while (true){ result = scanf("%zu %zu", &pat_start, &pat_finish); if (result == 2 && pat_start >= 1 && pat_finish >= 1 && pat_start <= pat_finish && pat_start <= 255 && pat_finish <= 255){ break; } else { printf("Error: Specify 1 <= pat_start <= pat_finish <= 255! Insert again: "); while (getchar() != '\n'); } } params->pat_start = (uint8_t) pat_start; params->pat_finish = (uint8_t) pat_finish; printf("good_wt (good -), bad_wt (bad pattern weight), threshold: "); while (true){ result = scanf("%zu %zu %zu", &good_wt, &bad_wt, &thresh); if (result == 3 && good_wt >= 1 && bad_wt >= 1 && thresh >= 1 && good_wt <= 255 && bad_wt <= 255 && thresh <= 255){ break; } else { printf("Error: Specify 1 <= good_wt, bad_wt, threshold <= 255! Insert again: "); while (getchar() != '\n'); } } params->good_wt = (uint8_t) good_wt; params->bad_wt = (uint8_t) bad_wt; params->thresh = (uint8_t) thresh; @* Level generation. The single pass of \utfpatgen{} at given hyphenation level comprises iterating through pattern lengths ({\tt pat\_len} parameter, in ascending order) and dot positions ({\tt pat\_dot}, from the middle toward the edges) and processing the dictionary. The algorithm collects supporting and contradicting occurrences and eventually adds new patterns to the set. It can happen that an iteration is skipped if it is known in advance that it will not yield any new patterns. @= uint8_t aux_dot; bool more_this_level[256]; for (size_t i = 0; i < 256; i++){ more_this_level[i] = true; } for (size_t j = params->pat_start; j <= params->pat_finish; j++) { params->pat_len = j; params->pat_dot = params->pat_len / 2; aux_dot = params->pat_dot * 2; while (params->pat_dot != params->pat_len) { params->pat_dot = aux_dot - params->pat_dot; aux_dot = params->pat_len * 2 - aux_dot - 1; if (more_this_level[params->pat_dot]){ if (!process_dictionary(params, tt, pt, &ps)){ destroy_params(params); destroy_tr_table(tt); destroy_pattern_trie(pt); return EXIT_FAILURE; } more_this_level[params->pat_dot] = ps.more_to_come; } } for (size_t i = 255; i > 0; i--){ if (!more_this_level[i-1]){ more_this_level[i] = false; } } } @* Final pass. If the user wishes, the dictionary is traversed one last time and hyphenated according to found patterns. The output is stored in file 'pattmp.X'. @= if (!output_patterns(pt, tt, params->output_file)){ destroy_params(params); destroy_tr_table(tt); destroy_pattern_trie(pt); return EXIT_FAILURE; } char c; bool output = false; printf("hyphenate word list? (y/n): "); if (scanf(" %c", &c) < 1){ destroy_params(params); destroy_tr_table(tt); destroy_pattern_trie(pt); return EXIT_FAILURE; } if (c == 'y' || c == 'Y'){ output = true; } if (!hyphenate_dictionary(params, tt, pt, output, &ps)){ destroy_params(params); destroy_tr_table(tt); destroy_pattern_trie(pt); return EXIT_FAILURE; } @** Implementation details. In this section we focus on the building blocks, each subsection devoted to one particular aspect of the algorithm. @* IO procedures. Methods in this subsection stand on the interface between \utfpatgen{} and the user, together with the {\tt main} method. {\tt parse\_inputs} attempts to open the 4 files provided as inputs into streams and save them for later use. {\tt read\_line} is a utility to simplify reading from these streams that stores the information read into provided buffer. The {\tt print\_help} and {\tt print\_version} methods print out the desired information if the user does not wish to proceed to pattern generation. @ parse\_input. Attempts to open the 4 files provided as inputs into streams and save them for later use. Returns a boolean indicating whether all files have been opened successfully. @c bool parse_input(char *argv[], int argc, struct params *params){ if (argc != 5){ fprintf(stderr, "utfpatgen need exactly 4 arguments.\nTry `utfpatgen --help` for more information.\n"); return false; } FILE *dictionary_file = fopen(argv[1], "rb"); if (dictionary_file == NULL){ fprintf(stderr, "Could not open dictionary file '%s'.\n", argv[1]); return false; } FILE *pattern_file = fopen(argv[2], "rb"); if (pattern_file == NULL){ fprintf(stderr, "Could not open pattern file '%s'.\n", argv[2]); fclose(dictionary_file); return false; } FILE *output_file = fopen(argv[3], "wb"); if (output_file == NULL){ fprintf(stderr, "Could not open output file '%s'.\n", argv[3]); fclose(dictionary_file); fclose(pattern_file); return false; } FILE *translate_file = fopen(argv[4], "rb"); if (translate_file == NULL){ fprintf(stderr, "Could not open translate file '%s'.\n", argv[4]); fclose(dictionary_file); fclose(pattern_file); fclose(output_file); return false; } params->dictionary_file = dictionary_file; params->pattern_file = pattern_file; params->output_file = output_file; params->translate_file = translate_file; return true; } @ read\_line. Reads a line from given stream into a string buffer. If end of file is reached, it sets {\tt eof} flag of the buffer. Returns a boolean indicating whether the line has been read successfully. @c bool read_line(FILE *stream, struct string_buffer *buf){ reset_buffer(buf); char c; while ((c = fgetc(stream)) != EOF) { if (buf->size >= buf->capacity - 1) { if (resize_buffer(buf, 2*buf->capacity) == NULL) { return false; } } if (c == '\n'){ if ((buf->size > 0) && (buf->data[buf->size-1] == '\r')){ // Windows /r/n end of line buf->size -= 1; } break; } buf->data[buf->size] = c; buf->size++; } buf->data[buf->size] = '\0'; if (c == EOF) { buf->eof = true; } return true; } @ print\_help. Prints out a help message to the standard output. @c void print_help(){ printf("Usage: utfpatgen [OPTION]... DICTIONARY PATTERNS OUTPUT TRANSLATE\n"); printf("\tGenerate the OUTPUT hyphenation file for use with TeX\n"); printf("\tfrom the DICTIONARY, PATTERNS, and TRANSLATE files.\n"); printf("\n--help print this help and exit\n"); printf("--version output version information and exit\n"); printf("\nutfpatgen home page: https://ctan.org/pkg/utfpatgen\n"); } @ print\_version. Prints out the version number of \utfpatgen{} to the standard output. @c void print_version(){ printf("This is utfpatgen version %s\n", UTFPATGEN_VERSION); printf("\tCopyright 2026 Ondřej Metelka\n"); printf("\tLicense MIT: https://mit-license.org/\n"); printf("\nThis is free software: you are free to change and redistribute it.\n"); printf("\tThere is NO WARRANTY, to the extent permitted by law.\n"); } @* Translate file processing. Once the translate file has been successfully opened and its stream pointer stored, it is read line by line into translate table. The purpose of the translate file is to list all the characters that occur in the dictionary, together with their uppercase variants. Later during the pattern generation, all of these variants are treated as though they were the same character. Besides, the first line of the file may redefine the language-specific parameters \lefthyphenminpar, \righthyphenminpar, and the symbols used for marking hyphens in the dictionary. The format of the translate file follows the same pattern as required by \patgen: \item{$\bullet$} first line (optional): 'LLRR BMG', where 'LL' is the value of \lefthyphenminpar, 'RR' the value of \righthyphenminpar, 'B' the symbol for {\tt BAD\_HYF}, 'M' the symbol for {\tt MISS\_HYF}, and 'G' the symbol for {\tt GOOD\_HYF}. Should any of these parameters stay blank, the default is used: \lefthyphenminpar$=2$, \righthyphenminpar$=3$, {\tt BAD\_HYF} '.', {\tt MISS\_HYF} '-', {\tt GOOD\_HYF} '*'. \item{$\bullet$} consequent lines: '$\_X\_Y_1\_...Y_n\_\_$', where '$X$' is a lowercase letter, '$Y_k$' an arbitrary (even 0) number of upper-case variants of '$X$', and '$\_$' a delimiter, usually space. Note that this format allows two-digit values of \lefthyphenminpar{} and \righthyphenminpar{} at most and one-byte characters for hyphen symbols. Practically, this is not a problem. The translate file must be provided as an input to \utfpatgen{}, but may be left empty. In that case, the default values for parameters ASCII character mapping are used. @ read\_translate. Reads the contents of the translate file and parses its lines into a translate table. It includes parsing the optional header line and storing the parameters obtained. If the file is empty, the method fetches a default ASCII mapping. Returns a boolean indicating whether the file has been read successfully. @c bool read_translate(struct params *params, struct translate_table *tt){ rewind(params->translate_file); struct string_buffer *buf = init_buffer(64); if (buf == NULL) { return false; } struct trie *helper_trie = init_trie(256); if (helper_trie == NULL) { destroy_buffer(buf); return false; } if (!read_line(params->translate_file, buf)) { destroy_trie(helper_trie); destroy_buffer(buf); return false; } if (buf->eof) { bool default_mapping = default_ascii_mapping(tt, helper_trie); destroy_trie(helper_trie); destroy_buffer(buf); return default_mapping; } size_t out_index, letter_index; letter_index = tt->letter_count + 1; if (letter_index >= tt->letter_capacity) { size_t new_capacity = tt->letter_capacity * 2; size_t *new_array = realloc(tt->index_to_alphabet, new_capacity * sizeof(size_t)); if (new_array == NULL) { return false; } tt->index_to_alphabet = new_array; tt->letter_capacity = new_capacity; } tt->index_to_alphabet[letter_index] = tt->alphabet->size; if (!insert_pattern(tt->mapping, (const char[]){EDGE_OF_WORD, '\0'}, &out_index, helper_trie) || !append_string(tt->alphabet, (const char[]){EDGE_OF_WORD, '\0'})) { return false; } tt->mapping->aux[out_index] = letter_index; tt->letter_count++; bool first_line = true; while (!buf->eof) { if (first_line && parse_header(buf, params)) { // header parsed successfully } else if (!parse_letters(buf, tt, helper_trie)) { destroy_trie(helper_trie); destroy_buffer(buf); return false; } first_line = false; reset_buffer(buf); if (!read_line(params->translate_file, buf)) { destroy_trie(helper_trie); destroy_buffer(buf); return false; } } destroy_trie(helper_trie); destroy_buffer(buf); printf("left_hyphen_min = %u, right_hyphen_min = %u, %zu letters\n", params->left_hyphen_min, params->right_hyphen_min, tt->letter_count); return true; } @ is\_integer. Returns a boolean indicating whether the given character is an ASCII numeric literal. @c bool is_integer(char c){ return (c >= '0' && c <= '9'); } @ is\_space. Returns a boolean indicating whether the given character is an ASCII whitespace (0x20). @c bool is_space(char c){ return (c == ' '); } @ parse\_two\_digit. Attempts to parse 2-character sequence from a buffer into 2-digit number. Returns a boolean indicating whether the sequence is representing a number and parsing was finished successfully. @c bool parse_two_digit(struct string_buffer *buf, size_t pos, int8_t *out){ if (pos + 1 >= buf->size) { return false; } char c1 = buf->data[pos]; char c2 = buf->data[pos + 1]; if (is_space(c1) && is_space(c2)) { return true; } if (is_space(c1)) { c1 = '0'; } if (is_space(c2)) { c2 = '0'; } if (!is_integer(c1) || !is_integer(c2)) { return false; } *out = (c1 - '0') * 10 + (c2 - '0'); return true; } @ parse\_header. Attempts to parse the text stored in a buffer as though it was the header of a translate file. If successful, the values of parameters are stored in {\tt params}. Return value indicates whether the parsing succeeded. @c bool parse_header(struct string_buffer *buf, struct params *params){ int8_t val = -1; if (!parse_two_digit(buf, 0, &val)) { return false; } else if (val != -1) { params->left_hyphen_min = val; } val = -1; if (!parse_two_digit(buf, 2, &val)) { return false; } else if (val != -1){ params->right_hyphen_min = val; } if (buf->size >= 6 && !is_space(buf->data[5])) { params->bad_hyphen = buf->data[5]; } if (buf->size >= 7 && !is_space(buf->data[6])) { params->missed_hyphen = buf->data[6]; } if (buf->size >= 8 && !is_space(buf->data[7])) { params->good_hyphen = buf->data[7]; } return true; } @ parse\_letters. Attempts to parse the text stored in a buffer as though it was a non-header line of a translate file. If successful, the respective letter and its uppercase variants are stored in the translation table. If any of the letters was already in the table, the method fails. Return values indicates the success of parsing. @c bool parse_letters(struct string_buffer *buf, struct translate_table *tt, struct trie *helper_trie){ if (buf->size == 0){ fprintf(stderr, "Empty line in translate file\n"); return false; } char separator = buf->data[0]; if (buf->size > 1 && buf->data[1] == separator){ // a comment return true; } size_t letter_index = tt->letter_count + 1; size_t out_index; struct string_buffer *letter = init_buffer(4); if (letter == NULL) { return false; } bool lower = true; if (!append_char(buf, separator)){ destroy_buffer(letter); return false; } for (size_t i = 1; i < buf->size; i++){ char c = buf->data[i]; if (c == separator){ if (letter->size == 0){ break; } if (!append_char(letter, '\0') || !insert_pattern(tt->mapping, letter->data, &out_index, helper_trie)) { destroy_buffer(letter); return false; } tt->mapping->aux[out_index] = letter_index; if (lower) { if (letter_index >= tt->letter_capacity) { size_t new_capacity = tt->letter_capacity * 2; size_t *new_array = realloc(tt->index_to_alphabet, new_capacity * sizeof(size_t)); if (new_array == NULL) { destroy_buffer(letter); return false; } tt->index_to_alphabet = new_array; tt->letter_capacity = new_capacity; } tt->index_to_alphabet[letter_index] = tt->alphabet->size; if (!append_string(tt->alphabet, letter->data)) { destroy_buffer(letter); return false; } tt->letter_count++; lower = false; } reset_buffer(letter); } else { if (!append_char(letter, c)) { destroy_buffer(letter); return false; } } } destroy_buffer(letter); return true; } @ default\_ascii\_mapping. Fetches default ASCII character mapping into the translate table. Return value indicate the success of fetching. @c bool default_ascii_mapping(struct translate_table *tt, struct trie *helper_trie){ size_t out_index; size_t letter_index; char upper; for (char c = 'a'; c <= 'z'; c++){ letter_index = tt->letter_count + 1; upper = c - ('a' - 'A'); if (letter_index >= tt->letter_capacity) { size_t new_capacity = tt->letter_capacity * 2; size_t *new_array = realloc(tt->index_to_alphabet, new_capacity * sizeof(size_t)); if (new_array == NULL) { return false; } tt->index_to_alphabet = new_array; tt->letter_capacity = new_capacity; } tt->index_to_alphabet[letter_index] = tt->alphabet->size; if (!insert_pattern(tt->mapping, (const char[]){c, '\0'}, &out_index, helper_trie) || !append_string(tt->alphabet, (const char[]){c, '\0'})) { return false; } tt->mapping->aux[out_index] = letter_index; if (!insert_pattern(tt->mapping, (const char[]){upper, '\0'}, &out_index, helper_trie)) { return false; } tt->mapping->aux[out_index] = letter_index; tt->letter_count++; } letter_index = tt->letter_count + 1; if (letter_index >= tt->letter_capacity) { size_t new_capacity = tt->letter_capacity * 2; size_t *new_array = realloc(tt->index_to_alphabet, new_capacity * sizeof(size_t)); if (new_array == NULL) { return false; } tt->index_to_alphabet = new_array; tt->letter_capacity = new_capacity; } tt->index_to_alphabet[letter_index] = tt->alphabet->size; if (!insert_pattern(tt->mapping, (const char[]){EDGE_OF_WORD, '\0'}, &out_index, helper_trie) || !append_string(tt->alphabet, (const char[]){EDGE_OF_WORD, '\0'})) { return false; } tt->mapping->aux[out_index] = letter_index; tt->letter_count++; return true; } @* Pattern file processing. The user can provide \utfpatgen{} with initial set of patterns to work with. Similarly to the translate file, the pattern file must be given as input, but may be left empty. Each line of the file represents one pattern, with following format required: {\narrower $\dots$ \par} Zero levels are omitted implicitly. A special character ('.' in \patgen{}, {\tt EDGE\_OF\_WORD} '0xff' in \utfpatgen{}) denotes edges of the word (it can be only used as the first or last character of a pattern). The original \patgen program that supports only levels up to 9 represents them simply as ASCII numeric literals '0' to '9'. On the other hand, \utfpatgen{} expects each byte representing a level preceded with a special {\tt HYPHEN\_FLAG} byte of hexadecimal value '0xfe', so the range is extended up to 253. The '0xfe' and '0xff' bytes are not used by any UTF-8 character by design. @ read\_patterns. Iterates over the pattern file and reads its entries into a pattern trie. Return value indicates whether the whole file has been read and parsed successfully. @c bool read_patterns(struct params *params, struct pattern_trie *pt, struct translate_table *tt, struct pass_stats *ps){ ps->level_pattern_cnt = 0; ps->max_level = 0; struct string_buffer *buf = init_buffer(16); if (buf == NULL){ return false; } buf->eof = false; struct pattern *pat = init_pattern(16); if (pat == NULL){ destroy_buffer(buf); return false; } struct trie *helper_trie = init_trie(256); if (helper_trie == NULL){ destroy_pattern(pat); destroy_buffer(buf); return false; } while (!buf->eof){ if (!read_line(params->pattern_file, buf) || !parse_pattern(buf, pat, tt) || !insert_new_pattern(pat, pt, ps, helper_trie)){ destroy_trie(helper_trie); destroy_pattern(pat); destroy_buffer(buf); return false; } } printf("%zu patterns read in\n", ps->level_pattern_cnt); printf("pattern trie has %zu nodes, trie_max = %zu, %zu outputs\n", pt->t->occupied, pt->t->node_max, pt->ops->count); destroy_trie(helper_trie); destroy_pattern(pat); destroy_buffer(buf); return true; } @ parse\_pattern. Processes the pattern text from buffer, translates it to lowercase and inserts it into a pattern structure. Return value indicates the success of translation and parsing. @c bool parse_pattern(struct string_buffer *buf, struct pattern *out_pattern, struct translate_table *tt){ reset_pattern(out_pattern); char c; bool next_hyphen = false; struct string_buffer *letter = init_buffer(4); if (letter == NULL){ return false; } for (size_t i = 0; i < buf->size; i++) { c = buf->data[i]; if (c == EDGE_OF_WORD && i != 0 && i != buf->size - 1){ fprintf(stderr, "Edge of word found inside a pattern.\n"); destroy_buffer(letter); return false; } if (next_hyphen){ if (!set_hyphen(out_pattern, out_pattern->size, (uint8_t) c)){ destroy_buffer(letter); return false; } next_hyphen = false; continue; } if (is_utf_start_byte(c) && letter->size > 0){ if (!append_char(letter, '\0')){ destroy_buffer(letter); return false; } size_t letter_index = get_letter_index(tt, letter->data); if (letter_index == 0){ fprintf(stderr, "Unknown letter %s found in a pattern.\n", letter->data); destroy_buffer(letter); return false; } if (!convert_index_to_pattern(letter_index, out_pattern)){ destroy_buffer(letter); return false; } reset_buffer(letter); } if (c == HYPHEN_FLAG){ next_hyphen = true; } else if (!append_char(letter, c)){ destroy_buffer(letter); return false; } } if (next_hyphen){ if (!set_hyphen(out_pattern, out_pattern->size, (uint8_t) c)){ destroy_buffer(letter); return false; } } else if (letter->size > 0){ if (!append_char(letter, '\0')){ destroy_buffer(letter); return false; } size_t letter_index = get_letter_index(tt, letter->data); if (letter_index == 0){ fprintf(stderr, "Unknown letter %s found in a pattern.\n", letter->data); destroy_buffer(letter); return false; } if (!convert_index_to_pattern(letter_index, out_pattern)){ destroy_buffer(letter); return false; } } destroy_buffer(letter); return true; } @ insert\_new\_pattern. Inserts the pattern into pattern trie and collects statistics along the way. Return value indicates the success of insertion. @c bool insert_new_pattern(struct pattern *pat, struct pattern_trie *pt, struct pass_stats *ps, struct trie *helper_trie){ size_t hyphenation_value, node; size_t current_len = 0; if (!insert_pattern(pt->t, pat->text, &node, helper_trie)){ return false; } for (size_t i = 0; i < pat->size; i++){ hyphenation_value = get_hyphen(pat, i); if (hyphenation_value > 0){ ps->level_pattern_cnt++; if (!set_output(pt, node, hyphenation_value, current_len)){ return false; } if (hyphenation_value > ps->max_level){ ps->max_level = hyphenation_value; } } if ((uint8_t) pat->text[i] != 0xff){ current_len++; } } hyphenation_value = get_hyphen(pat, pat->size); if (hyphenation_value > 0){ ps->level_pattern_cnt++; if (!set_output(pt, node, hyphenation_value, current_len)){ return false; } } return true; } @* Dictionary file processing. The dictionary is the main source of data for pattern generation. The file indeed has to be provided as input, and though no error is raised when it is empty, such case does not make much sense. Each line in the file represents single hyphenated word, with hyphens marked using {\tt GOOD\_HYF}, {\tt MISS\_HYF}, and {\tt BAD\_HYF}. Furthermore, both the whole word and separate hyphens can be weighted by preceding them with {\tt HYPHEN\_FLAG} and a value. Similarly to the pattern file, the possible range of weights is increased to 253. If the hyphen weight is omitted, word weight is used. If the word weight is omitted, \utfpatgen{} uses the default value 1. The lines of the dictionary file thus look like this: {\narrower $\dots$ \par} Since some patterns may be tied to the edges of the word, special byte symbol {\tt EDGE\_OF\_WORD} was introduced that marks the edges. Hexadecimal value of the symbol is '0xff' that is not used by the UTF-8 encoding. @ process\_dictionary. Reads and parses the dictionary, and generates new patterns afterwards. Statistics of the pass are printed out at the end of the method. Return value indicates the success of reading, parsing, and pattern generation. @c bool process_dictionary(struct params *params, struct translate_table *tt, struct pattern_trie *pt, struct pass_stats *ps){ ps->good_cnt = 0; ps->bad_cnt = 0; ps->miss_cnt = 0; params->word_weight = 1; if (params->hyph_level % 2 == 1){ params->good_dot = MISS_HYF; params->bad_dot = NO_HYF; } else { params->good_dot = BAD_HYF; params->bad_dot = GOOD_HYF; } struct count_trie *ct = init_count_trie(256, 256); if (ct == NULL){ return false; } printf("processing dictionary with pat_len = %u, pat_dot = %u\n", params->pat_len, params->pat_dot); if (!process_all_words(params, tt, pt, ps, ct)){ destroy_count_trie(ct); return false; } printf("\n%zu good, %zu bad, %zu missed\n", ps->good_cnt, ps->bad_cnt, ps->miss_cnt); if (ps->good_cnt + ps->miss_cnt > 0){ printf("%.2f %%, %.2f %%, %.2f %%\n", (100* (float) ps->good_cnt / (float) (ps->good_cnt + ps->miss_cnt)), (100* (float) ps->bad_cnt/ (float) (ps->good_cnt + ps->miss_cnt)), (100* (float) ps->miss_cnt/ (float) (ps->good_cnt + ps->miss_cnt))); } printf("%zu patterns, %zu nodes in count trie, triec_max = %zu\n", ct->t->pattern_count, ct->t->occupied, ct->t->node_max); if (!collect_count_trie(ct, pt, params, ps)){ destroy_count_trie(ct); return false; } destroy_count_trie(ct); return true; } @ process\_all\_words. Iterates over the lines of the dictionary files and parses them to the count trie. Return value indicates whether the whole file has been processed successfully. @c bool process_all_words(struct params *params, struct translate_table *tt, struct pattern_trie *pt, struct pass_stats *ps, struct count_trie *ct){ rewind(params->dictionary_file); uint8_t dot_min = params->pat_dot; uint8_t dot_max = params->pat_len - params->pat_dot; if (dot_min < params->left_hyphen_min + 1){ dot_min = params->left_hyphen_min + 1; } if (dot_max < params->right_hyphen_min + 1){ dot_max = params->right_hyphen_min + 1; } size_t dot_len = dot_min + dot_max; struct string_buffer *buf = init_buffer(64); if (buf == NULL){ return false; } buf->eof = false; struct word *word = init_word(16); if (word == NULL){ destroy_buffer(buf); return false; } struct trie *helper_trie = init_trie(256); if (helper_trie == NULL){ destroy_word(word); destroy_buffer(buf); return false; } while (!buf->eof){ if (!read_line(params->dictionary_file, buf) || !parse_word(buf, tt, params, word) || !hyphenate_word(word, pt, params)){ destroy_trie(helper_trie); destroy_buffer(buf); destroy_word(word); return false; } count_dots(word, params, ps); if (word->length >= dot_len){ if (!process_word(word, ct, params, helper_trie)){ destroy_trie(helper_trie); destroy_buffer(buf); destroy_word(word); return false; } } } destroy_trie(helper_trie); destroy_buffer(buf); destroy_word(word); return true; } @ parse\_word. Parses the text from buffer to {\tt word} structure. The text is translated to lowercase and hyphens are processed. Return value indicates successful parsing. @c bool parse_word(struct string_buffer *buf, struct translate_table *tt, struct params *params, struct word *out_word){ reset_word(out_word); char edge_of_word[] = {EDGE_OF_WORD, '\0'}; struct string_buffer *letter = init_buffer(4); if (letter == NULL) { return false; } uint8_t weight = params->word_weight; enum hyphen_type hyf = NO_HYF; size_t letter_index = get_letter_index(tt, edge_of_word); if (!convert_index(letter_index, out_word)){ destroy_buffer(letter); return false; } char c; bool has_weight = false; for (size_t i = 0; i < buf->size; i++){ c = buf->data[i]; if (has_weight){ weight = (uint8_t) c; if (i == 1){ params->word_weight = weight; } has_weight = false; continue; } else if (c == HYPHEN_FLAG){ has_weight = true; continue; } else if (c == params->good_hyphen){ hyf = GOOD_HYF; continue; } else if (c == params->missed_hyphen){ hyf = MISS_HYF; continue; } else if (c == params->bad_hyphen) { hyf = BAD_HYF; continue; } else if (is_utf_start_byte(c) && letter->size > 0){ if (!append_char(letter, '\0')){ destroy_buffer(letter); return false; } letter_index = get_letter_index(tt, letter->data); if (letter_index == 0) { fprintf(stderr, "Character '%s' in word '%s' not known\n", letter->data, buf->data); destroy_buffer(letter); return false; } if (!convert_index(letter_index, out_word)){ destroy_buffer(letter); return false; } if (!set_true_hyphen(out_word, out_word->size-1, 4 * weight + hyf)){ destroy_buffer(letter); return false; } hyf = NO_HYF; reset_buffer(letter); } weight = params->word_weight; if (!append_char(letter, c)){ destroy_buffer(letter); return false; } } if (letter->size > 0){ if (!append_char(letter, '\0')){ destroy_buffer(letter); return false; } letter_index = get_letter_index(tt, letter->data); if (letter_index == 0) { fprintf(stderr, "Character '%s' in word '%s' not known\n", letter->data, buf->data); destroy_buffer(letter); return false; } if (!convert_index(letter_index, out_word)){ destroy_buffer(letter); return false; } } letter_index = get_letter_index(tt, edge_of_word); if (!convert_index(letter_index, out_word) || !set_true_hyphen(out_word, 0, 0)){ destroy_buffer(letter); return false; } destroy_buffer(letter); return true; } @ count\_dots. Collects statistics about the hyphens in parsed word. @c void count_dots(struct word *word, struct params *params, struct pass_stats *ps){ if (word->length < (uint8_t) (params->right_hyphen_min + 1)){ return; } size_t current_index = 0; size_t current_pos = 0; bool odd_level; size_t dot_index, hyphenation_value, weight; enum hyphen_type hyf; for (size_t dot_pos = params->left_hyphen_min + 1; dot_pos < (uint8_t) (word->length - params->right_hyphen_min); dot_pos++){ while (current_pos < dot_pos){ if ((uint8_t) word->translated[current_index] != 0xff) { current_pos++; } current_index++; } dot_index = current_index - 1; odd_level = (get_found_hyphen(word, dot_index) % 2 == 1); hyphenation_value = get_true_hyphen(word, dot_index); weight = hyphenation_value / 4; hyf = hyphenation_value % 4; if (hyphenation_value == 0){ fprintf(stderr, "Code I hoped unreachable was reached\n"); continue; } else if (hyf % 2 == 0) { if (odd_level) { ps->bad_cnt += weight; } } else { if (odd_level) { ps->good_cnt += weight; } else { ps->miss_cnt += weight; } } } } @ process\_word. Generates all candidate patterns from the parsed word and inserts them to count trie. Return value indicates the success of generating and insertion. @c bool process_word(struct word *word, struct count_trie *ct, struct params *params, struct trie *helper_trie){ uint8_t dot_min = params->pat_dot; uint8_t dot_max = params->pat_len - params->pat_dot; if (dot_min < params->left_hyphen_min + 1){ dot_min = params->left_hyphen_min + 1; } if (dot_max < params->right_hyphen_min + 1){ dot_max = params->right_hyphen_min + 1; } size_t start_pos, end_pos, start_index, dot_index, end_index, node, weight, cnt_index; size_t current_pos = 0; size_t current_index = 0; bool good_pattern; enum hyphen_type hyf; for (size_t dot_pos = dot_min; dot_pos + dot_max <= word->length; dot_pos++) { while (current_pos < dot_pos){ if ((uint8_t) word->translated[current_index] != 0xff){ current_pos++; } current_index++; } dot_index = current_index - 1; if (get_no_more(word, dot_index)){ continue; } hyf = get_true_hyphen(word, dot_index) % 4; if (get_found_hyphen(word, dot_index) % 2 == 1){ hyf += 2; } if (hyf == params->good_dot){ good_pattern = true; } else if (hyf == params->bad_dot){ good_pattern = false; } else { continue; } start_pos = dot_pos; start_index = current_index; while (start_pos + params->pat_dot >= dot_pos){ if (start_index == 0 || (uint8_t) word->translated[start_index-1] != 0xff){ start_pos--; } start_index--; } start_pos++; start_index++; end_pos = dot_pos; end_index = current_index; while (end_index < word->size && end_pos < start_pos + params->pat_len){ if ((uint8_t) word->translated[end_index] != 0xff){ end_pos++; } end_index++; } if (!insert_substring(ct->t, word->translated, end_index, end_index - start_index, &node, helper_trie)){ return false; } if (ct->cnts->size >= ct->cnts->capacity) { size_t new_capacity = 2 * ct->cnts->capacity; if (resize_pattern_counts(ct->cnts, new_capacity) == NULL) { return false; } } cnt_index = ct->t->aux[node]; if (cnt_index == 0){ cnt_index = ct->cnts->size; ct->t->aux[node] = cnt_index; ct->cnts->size++; } weight = get_true_hyphen(word, dot_index) / 4; if (weight == 0) weight = params->word_weight; if (good_pattern){ ct->cnts->good[cnt_index] += weight; } else { ct->cnts->bad[cnt_index] += weight; } } return true; } @* Pattern collection. After the dictionary has been processed, the count trie contains the number of good (supporting) and bad (contradicting) occurrences of each candidate pattern found. There are 3 types of patterns based on these counts and the \goodwtpar{}, \badwtpar{}, and \threshpar{} parameters: \item{$\bullet$} {\bf good patterns} for which inequality $ good * good\_wt - bad * bad\_wt \geq thresh $ holds. Good patterns are inserted into the pattern trie with corresponding hyphenation level. \item{$\bullet$} {\bf bad patterns}: for which inequality $ good * good\_wt < thresh $ holds. Bad patterns are inserted into the pattern trie with a special {\tt BAD\_OP\_VALUE} level of value 255. Neither them nor their superstrings can become good patterns and are deleted from the pattern trie in the end of the hyphenation level iteration. \item{$\bullet$} {\bf undecided patterns}: for which none of the inequalities above holds. Some of their superstrings may become good or bad patterns later, so need for further investigation is raised by setting the {\tt more\_to\_come} flag. @ collect\_count\_trie. Generates new patterns from the count trie and prints out statistics. Return value indicates whether the generation ended successfully. @c bool collect_count_trie(struct count_trie *ct, struct pattern_trie *pt, struct params *params, struct pass_stats *ps){ double bad_eff = (double) params->thresh / (double) params->good_wt; ps->good_pat_cnt = 0; ps->bad_pat_cnt = 0; ps->good_cnt = 0; ps->bad_cnt = 0; ps->more_to_come = false; if (!traverse_count_trie(ct, pt, params, ps)){ return false; } printf("%zu good and %zu bad patterns added", ps->good_pat_cnt, ps->bad_pat_cnt); ps->level_pattern_cnt += ps->good_pat_cnt; if (ps->more_to_come) { printf(" (more to come)\n"); } else { printf("\n"); } printf("finding %zu good and %zu bad hyphens", ps->good_cnt, ps->bad_cnt); if (ps->good_pat_cnt > 0) { printf(", efficiency = %.2lf\n", (double) ps->good_cnt / (ps->good_pat_cnt + ((double) ps->bad_cnt / bad_eff))); } else { printf("\n"); } printf("pattern trie has %zu nodes, trie_max = %zu, %zu outputs\n", pt->t->occupied, pt->t->node_max, pt->ops->count); return true; } @ traverse\_count\_trie. Searches through the count trie and inserts accepted patterns to the pattern trie. Statistics are collected along the way. Return value indicates whether the whole trie was explored. @c bool traverse_count_trie(struct count_trie *ct, struct pattern_trie *pt, struct params *params, struct pass_stats *ps) { size_t root = 1; size_t current_len = 0; uint8_t c; struct string_buffer *pattern = init_buffer(4 * params->pat_len); if (pattern == NULL){ return false; } struct stack *s_base = init_stack(4 * params->pat_len); if (s_base == NULL) { destroy_buffer(pattern); return false; } struct trie *helper_trie = init_trie(256); if (helper_trie == NULL){ destroy_buffer(pattern); destroy_stack(s_base); return false; } if (!append_char(pattern, '\0') || !put_on_stack(s_base, root)){ destroy_trie(helper_trie); destroy_buffer(pattern); destroy_stack(s_base); return false; } size_t node, op_index, good, bad, cnt_index; while (s_base->top > 0){ root = get_top_value(s_base); pattern->data[pattern->size - 1] += 1; c = (uint8_t) pattern->data[pattern->size - 1]; if (c == 0){ // overflow pattern->size--; s_base->top--; if (pattern->size < 1 || (uint8_t) pattern->data[pattern->size - 1] != 0xff){ current_len--; } continue; } node = root + c; if ((uint8_t) ct->t->nodes[node] != c){ continue; } if ((uint8_t) c != 0xff) { current_len++; } if (current_len == params->pat_len){ cnt_index = ct->t->aux[node]; good = ct->cnts->good[cnt_index]; bad = ct->cnts->bad[cnt_index]; if (good > 0 || bad > 0){ if (params->good_wt * good < params->thresh){ if (!insert_substring(pt->t, pattern->data, pattern->size, pattern->size, &op_index, helper_trie) || !set_output(pt, op_index, BAD_OP_VALUE, params->pat_dot)){ destroy_trie(helper_trie); destroy_buffer(pattern); destroy_stack(s_base); return false; } ps->bad_pat_cnt++; } else if (params->good_wt * good >= params->thresh + params->bad_wt * bad) { if (!insert_substring(pt->t, pattern->data, pattern->size, pattern->size, &op_index, helper_trie) || !set_output(pt, op_index, params->hyph_level, params->pat_dot)){ destroy_trie(helper_trie); destroy_buffer(pattern); destroy_stack(s_base); return false; } ps->good_pat_cnt++; ps->good_cnt += good; ps->bad_cnt += bad; } else { ps->more_to_come = true; } } if ((uint8_t) c != 0xff) { current_len--; } continue; } root = ct->t->links[node]; if (root == 0){ if ((uint8_t) c != 0xff) { current_len--; } continue; } if (!append_char(pattern, '\0') || !put_on_stack(s_base, root)){ destroy_trie(helper_trie); destroy_buffer(pattern); destroy_stack(s_base); return false; } } destroy_trie(helper_trie); destroy_buffer(pattern); destroy_stack(s_base); return true; } @* Pattern pruning. Once all the required pattern lengths and dot positions were explored for given hyphenation level, the pattern trie is pruned of the patterns that were marked as bad. @ delete\_bad\_patterns. Removes bad patterns from the pattern trie and prints out statistics. Return value indicates success of the operation. @c bool delete_bad_patterns(struct pattern_trie *pt){ size_t old_op_cnt = pt->ops->count; size_t old_trie_cnt = pt->t->occupied; if (!delete_patterns(pt)){ return false; } for (size_t h = 1; h <= pt->ops->capacity; h++){ if (pt->ops->data[h].value == BAD_OP_VALUE){ pt->ops->data[h].value = EMPTY_OP_VALUE; pt->ops->count--; pt->ops->data[h].next_op_index = pt->ops->data[0].next_op_index; pt->ops->data[0].next_op_index = h; } } printf("%zu nodes and %zu outputs deleted\n", old_trie_cnt - pt->t->occupied, old_op_cnt - pt->ops->count); return true; } @ delete\_patterns. Searches through the pattern trie and removes outputs pointing to the bad patterns. Nodes that are not used after the deletion are deleted as well. Return value indicates whether the whole trie was explored. @c bool delete_patterns(struct pattern_trie *pt){ size_t root = 1; struct stack *s_base = init_stack(16); if (s_base == NULL){ return false; } struct stack *s_offset = init_stack(16); if (s_offset == NULL){ destroy_stack(s_base); return false; } struct stack *s_freed = init_stack(16); if (s_freed == NULL){ destroy_stack(s_base); destroy_stack(s_offset); return false; } if (!put_on_stack(s_base, root) || !put_on_stack(s_offset, 0) || !put_on_stack(s_freed, (size_t) true)){ destroy_stack(s_base); destroy_stack(s_offset); destroy_stack(s_freed); return false; } size_t node; uint8_t c; while (s_base->top > 0){ root = get_top_value(s_base); set_top_value(s_offset, (uint8_t) get_top_value(s_offset) + 1); c = (uint8_t) get_top_value(s_offset); if (c == 0){ bool child_freed = (get_top_value(s_freed) == (size_t) true); if (child_freed){ if (!set_base_used(pt->t, root, false)){ destroy_stack(s_base); destroy_stack(s_offset); destroy_stack(s_freed); return false; } } s_offset->top--; s_base->top--; s_freed->top--; if (s_base->top > 0) { size_t parent_root = get_top_value(s_base); uint8_t parent_c = (uint8_t) get_top_value(s_offset); size_t parent_node = parent_root + parent_c; if (child_freed) { pt->t->links[parent_node] = 0; if (pt->t->aux[parent_node] == 0 && parent_root != 1) { deallocate_node(pt->t, parent_node); } else { set_top_value(s_freed, (size_t) false); } } else { set_top_value(s_freed, (size_t) false); } } continue; } node = root + c; if ((uint8_t) pt->t->nodes[node] != c){ continue; } if (!link_around_bad_outputs(pt, node)){ destroy_stack(s_base); destroy_stack(s_offset); destroy_stack(s_freed); return false; } if (pt->t->aux[node] > 0 || root == 1){ set_top_value(s_freed, (size_t) false); } else { if (pt->t->links[node] == 0){ deallocate_node(pt->t, node); continue; } } root = pt->t->links[node]; if (root == 0){ continue; } if (!put_on_stack(s_base, root) || !put_on_stack(s_offset, 0) || !put_on_stack(s_freed, (size_t) true)){ destroy_stack(s_base); destroy_stack(s_offset); destroy_stack(s_freed); return false; } } destroy_stack(s_base); destroy_stack(s_offset); destroy_stack(s_freed); return true; } @ deallocate\_node. Removes unused node from a trie and links it to the beginning of free space chain. @c void deallocate_node(struct trie *t, size_t t_index){ size_t old_head = t->links[0]; t->nodes[t_index] = 0; t->links[0] = t_index; t->aux[t_index] = 0; t->links[t_index] = old_head; t->aux[old_head] = t_index; t->occupied--; } @ link\_around\_bad\_outputs. Fixes the output linking to not contain the outputs that will be deleted. Returns true upon success. @c bool link_around_bad_outputs(struct pattern_trie *pt, size_t t_index){ size_t lookup_index = pt->t->aux[t_index]; if (lookup_index == 0){ return true; } size_t op_index = pt->ops->lookup[lookup_index]; size_t free_list_head = pt->ops->data[0].next_op_index; size_t h = 0; pt->ops->data[0].next_op_index = op_index; size_t n = pt->ops->data[0].next_op_index; while (n > 0){ if (pt->ops->data[n].value == BAD_OP_VALUE){ pt->ops->data[h].next_op_index = pt->ops->data[n].next_op_index; } else { h = n; } n = pt->ops->data[h].next_op_index; } if (h == 0){ if (pt->ops->lookup[lookup_index] > 0){ pt->ops->lookup[lookup_index] = 0; pt->ops->lookup_cnt--; } pt->t->aux[t_index] = 0; } else { pt->ops->lookup[lookup_index] = pt->ops->data[0].next_op_index; } pt->ops->data[0].next_op_index = free_list_head; return true; } @* Pattern output. At the end of the run, \utfpatgen{} prints out the final set of patterns to the output file. The format is the same as the required format of the input pattern file -- each value of a hyphenation level is preceded by the {\tt HYPHEN\_FLAG} symbol, the '.' character marking an edge of word. @ output\_patterns. Traverses through the pattern trie and writes all the patterns present there to the output file. Returns true if the whole trie has been explored and patterns successfully written. @c bool output_patterns(struct pattern_trie *pt, struct translate_table *tt, FILE *output_file){ size_t root = 1; uint8_t c; struct string_buffer *pattern = init_buffer(16); if (pattern == NULL){ return false; } struct stack *s_base = init_stack(16); if (s_base == NULL) { destroy_buffer(pattern); return false; } if (!append_char(pattern, '\0') || !put_on_stack(s_base, root)){ destroy_buffer(pattern); destroy_stack(s_base); return false; } size_t node; while (s_base->top > 0){ root = get_top_value(s_base); pattern->data[pattern->size - 1] += 1; c = (uint8_t) pattern->data[pattern->size - 1]; if (c == 0){ pattern->data[pattern->size - 1] = '\0'; pattern->size--; s_base->top--; continue; } node = root + c; if ((uint8_t) pt->t->nodes[node] != c){ continue; } if (pt->t->aux[node] > 0){ output_pattern(pattern, tt, pt->ops, pt->t->aux[node], output_file); } root = pt->t->links[node]; if (root == 0){ continue; } if (!append_char(pattern, '\0') || !put_on_stack(s_base, root)){ destroy_buffer(pattern); destroy_stack(s_base); return false; } } destroy_buffer(pattern); destroy_stack(s_base); return true; } @ output\_pattern. Writes single pattern to the output file. @c void output_pattern(struct string_buffer *pattern, struct translate_table *tt, struct outputs *ops, size_t op_index, FILE *output_file){ if (op_index == 0){ return; } size_t pattern_position = 0; size_t level, letter_index; char *word_index = pattern->data; size_t edge_of_word_idx = get_letter_index(tt, (char[2]){EDGE_OF_WORD, '\0'}); while (word_index < pattern->data + pattern->size){ level = get_highest_level(ops, op_index, pattern_position); if (level > 0){ fputc('\xfe', output_file); fputc((uint8_t) level, output_file); } letter_index = convert_byte_sequence(&word_index); if (letter_index == edge_of_word_idx){ fputc('.', output_file); } else { fprintf(output_file, "%s", tt->alphabet->data + tt->index_to_alphabet[letter_index]); } pattern_position++; } level = get_highest_level(ops, op_index, pattern_position); if (level > 0){ fputc('\xfe', output_file); fputc((uint8_t) level, output_file); } fputc('\n', output_file); } @ get\_highest\_level. Selects the hyphenation level to be used in given dot position. Return the level value. @c size_t get_highest_level(struct outputs *ops, size_t start_index, size_t position){ size_t highest = 0; size_t op_index = ops->lookup[start_index]; struct output op; while (op_index > 0){ op = ops->data[op_index]; if (op.position == position && op.value != BAD_OP_VALUE && op.value > highest){ highest = op.value; } op_index = op.next_op_index; } return highest; } @* Hyphenation. Hyphenation appears on several places in the algorithm. Already in the dictionary processing, the words are hyphenated with the current set of patterns. The other case occurs at the absolute end of \utfpatgen{} when the user wishes to see the hyphenated dictionary -- the final set of patterns is used to hyphenate the dictionary entries into 'pattmp.X' file. The 'X' marks the last hyphenation level explored. @ hyphenate\_word. Hyphenates the given word according to the actual set of patterns from pattern trie. Found hyphens are stored in word's {\tt found\_hyphens} field. Returns true if hyphenation ended without errors. @c bool hyphenate_word(struct word *word, struct pattern_trie *pt, struct params *params){ size_t current_index = 0; size_t current_pos = 0; size_t node, base, start_index, dot_index, end_index, op_index, dot_pos, end_pos; struct output op; if (word->length < (uint8_t) (params->right_hyphen_min + 1)){ return true; } size_t start_pos = 0; for (size_t i = 0; i < word->length - params->right_hyphen_min; i++) { while (current_pos < start_pos) { if ((uint8_t) word->translated[current_index] != 0xff) { current_pos++; } current_index++; } start_index = current_index; end_index = current_index; end_pos = current_pos; node = 1 + (uint8_t) word->translated[start_index]; while (pt->t->nodes[node] == word->translated[end_index]){ if ((uint8_t) word->translated[end_index] != 0xff){ end_pos++; } end_index++; op_index = pt->ops->lookup[pt->t->aux[node]]; while (op_index > 0){ op = pt->ops->data[op_index]; dot_pos = start_pos; dot_index = start_index; while (dot_pos < start_pos + op.position){ if ((uint8_t) word->translated[dot_index] != 0xff){ dot_pos++; } dot_index++; } if (dot_index > 0){ dot_index--; if (op.value < BAD_OP_VALUE && get_found_hyphen(word, dot_index) < op.value){ if (!set_found_hyphen(word, dot_index, op.value)){ return false; } } if (op.value >= params->hyph_level){ if ((end_pos + params->pat_dot <= dot_pos + params->pat_len) && (dot_pos <= start_pos + params->pat_dot)){ if (!set_no_more(word, dot_index, true)){ return false; } } } } op_index = op.next_op_index; } base = pt->t->links[node]; if (base == 0){ break; } node = base + (uint8_t) word->translated[end_index]; } start_pos++; } return true; } @ hyphenate\_dictionary. Hyphenates dictionary entries and writes them to newly created {\tt pattmp} file if required. Return value indicates whether the hyphenation and file printout finished successfully. @c bool hyphenate_dictionary(struct params *params, struct translate_table *tt, struct pattern_trie *pt, bool output, struct pass_stats *ps){ ps->good_cnt = 0; ps->bad_cnt = 0; ps->miss_cnt = 0; params->word_weight = 1; FILE *pattmp = NULL; if (output){ char *filename = malloc(11 * sizeof(char)); if (filename == NULL){ return false; } sprintf(filename, "pattmp.%u", params->hyph_level); pattmp = fopen(filename, "w"); if (pattmp == NULL){ free(filename); return false; } printf("writing %s\n", filename); free(filename); } if (!hyphenate_all_words(params, tt, pt, pattmp, ps)){ if (pattmp != NULL){ fclose(pattmp); } return false; } printf("\n%zu good, %zu bad, %zu missed\n", ps->good_cnt, ps->bad_cnt, ps->miss_cnt); if (ps->good_cnt + ps->miss_cnt > 0){ printf("%.2f %%, %.2f %%, %.2f %%\n", (100* (float) ps->good_cnt / (float) (ps->good_cnt + ps->miss_cnt)), (100* (float) ps->bad_cnt/ (float) (ps->good_cnt + ps->miss_cnt)), (100* (float) ps->miss_cnt/ (float) (ps->good_cnt + ps->miss_cnt))); } if (pattmp != NULL){ fclose(pattmp); } return true; } @ hyphenate\_all\_words. Iterates over the words in the dictionary, parses, hyphenates, and writes them to the {\tt pattmp} file. Returns true if no error occurs. @c bool hyphenate_all_words(struct params *params, struct translate_table *tt, struct pattern_trie *pt, FILE *pattmp, struct pass_stats *ps){ rewind(params->dictionary_file); struct string_buffer *buf = init_buffer(64); if (buf == NULL){ return false; } struct word *word = init_word(16); if (word == NULL){ destroy_buffer(buf); return false; } buf->eof = false; while (!buf->eof){ if (!read_line(params->dictionary_file, buf) || !parse_word(buf, tt, params, word) || !hyphenate_word(word, pt, params)){ destroy_buffer(buf); destroy_word(word); return false; } count_dots(word, params, ps); if (pattmp != NULL && word->length > 2){ output_hyphenated_word(pattmp, word, tt, params); } } destroy_buffer(buf); destroy_word(word); return true; } @ output\_hyphenated\_word. Writes the parsed and hyphenated word to the {\tt pattmp} file. @c void output_hyphenated_word(FILE *pattmp, struct word *word, struct translate_table *tt, struct params *params){ if (params->word_weight > 1){ fprintf(pattmp, "%d", params->word_weight); } size_t weight, letter_index, dot_pos = 0; size_t edge_of_word_idx = get_letter_index(tt, (char[2]){EDGE_OF_WORD, '\0'}); char *word_index = word->translated; bool has_hyphen, found_hyphen; while (word_index < word->translated + word->size){ found_hyphen = (get_found_hyphen(word, word_index - word->translated) % 2 == 1 ); weight = get_true_hyphen(word, word_index - word->translated); has_hyphen = (weight % 2 == 1); letter_index = convert_byte_sequence(&word_index); if (letter_index == edge_of_word_idx){ continue; } fprintf(pattmp, "%s", tt->alphabet->data + tt->index_to_alphabet[letter_index]); dot_pos++; if (weight == 0 || dot_pos < params->left_hyphen_min || dot_pos >= word->length - params->right_hyphen_min - 1){ continue; } weight /= 4; if (weight != params->word_weight){ fprintf(pattmp, "%zu", weight); } if (found_hyphen && has_hyphen){ fputc((char) params->good_hyphen, pattmp); } else if (found_hyphen && !has_hyphen){ fputc((char) params->bad_hyphen, pattmp); } else if (!found_hyphen && has_hyphen){ fputc((char) params->missed_hyphen, pattmp); } } fputc('\n', pattmp); } @* UTF-8 specifics. Following methods greatly simplify dealing with UTF-8 encoding. We took advantage of the design that allows to easily determine whether a byte is the first one in UTF-8 character -- its two highest bits are not '10'. The first byte also encodes the number of bytes that form the UTF-8 character. If the highest bit is '0', the character comprises a single byte and its meaning is the same as it would be in ASCII encoding. If the highest bit is '1', the number of '1' bits on the highest positions equals the number of bytes of the character (with the exception of '10' which is not allowed in the leading byte), e.g., the byte of binary value '1110XXXX' is a beginning of 3-byte UTF-8 character. Although bytes '0xfe' and '0xff' are not formally allowed in UTF-8, we treat them as one-byte characters with special meaning ({\tt HYPHEN\_FLAG}, {\tt EDGE\_OF\_WORD} respectively). @ is\_utf\_start\_byte. Returns true if the given byte is the start byte of a UTF-8 character or a \utfpatgen{} special symbol. @c inline bool is_utf_start_byte(uint8_t byte){ return byte < 0x80 || byte > 0xbf ; } @** Structures. This section focuses on the data structures we used in \utfpatgen{} and their components. Here you can find information about the basic structures such as tries, buffers, and stacks, as well as composite structures like the pattern and count trie, translate table, or word. We tried to name the methods for basic manipulation with the structures consistently across the whole implementation. Thanks to that, you will find several methods with common prefixes: {\tt init\_XX} performs initial allocation of resources for structure {\it XX}, {\tt resize\_XX} reallocates them if the capacity threshold is reached, {\tt reset\_XX} reverts the structure to initial state, and {\tt destroy\_XX} frees all the resources to avoid memory leaks. Furthermore, we use {\tt get\_YY} and {\tt set\_YY} methods to read and write into the fields of given structure instead of direct access. This allows us to perform additional checks on the input values. @* Trie. The packed trie structure forms the backbone of both \patgen{} and \utfpatgen{} algorithms. The implementation is very similar in both cases, a set of arrays representing a n-ary tree, condensed for better space effectiveness. Single node of a trie comprises a {\it value}, {\it link} and {\it aux} pointers, and {\it base} indicator. Every admissible index of the arrays is either occupied by a node, or it is an empty space (its value is 0). Index 0 is left empty and serves as the head of empty space chain, index 1 is the root node of the trie itself. If we want to traverse the trie, that means finding the node corresponding to given sequence of values $x_1 \dots x_n$ (if such node exists), we start in the root node $r_0$. Our next destination is node $n_1 = r_0 + x_1$. Then we check whether $value(n_1) = x_1$ and if the equation holds, we set $r_1 = link(n_1)$ as the new root. If the equation does not hold or $r_1 = 0$, we can end the search and conclude that the sequence is not present in the trie. Otherwise, we repeat the steps for $n_2 \dots n_n$ and return the $n_n$ node as desired result. The internal representation of the trie consists a structure with following fields: \item{$\bullet$} {\bf capacity}: maximal node index that can be inserted, \item{$\bullet$} {\bf occupied}: number of nodes currently in the trie, \item{$\bullet$} {\bf node\_max}: the highest occupied node index, \item{$\bullet$} {\bf base\_max}: the highest base index, \item{$\bullet$} {\bf pattern\_count}: number of patterns (sequences) in the trie, \item{$\bullet$} {\bf nodes}: array of values, \item{$\bullet$} {\bf links}: array of links, pointing either to the base of occupied node's subtree, or to the next empty index in case of an empty space, \item{$\bullet$} {\bf aux}: array of links, pointing either to occupied node's output, or to the previous empty index in case of an empty space, \item{$\bullet$} {\bf taken}: bit array storing the information about whether indices are used as bases. @c struct trie *init_trie(size_t capacity){ struct trie *t = malloc(sizeof(struct trie)); if (t == NULL) { fputs("Allocation error\n", stderr); return NULL; } t->capacity = capacity; t->nodes = calloc(capacity, sizeof(char)); t->links = calloc(capacity, sizeof(size_t)); t->aux = calloc(capacity, sizeof(size_t)); t->taken = calloc((capacity / 8 ) + 1, sizeof(char)); // bit array if (t->nodes == NULL || t->links == NULL || t->aux == NULL || t->taken == NULL || !set_base_used(t, 1, true)) { fputs("Allocation error\n", stderr); free(t->nodes); free(t->links); free(t->aux); free(t->taken); free(t); return NULL; } t->node_max = 0; t->base_max = 0; t->occupied = 0; t->pattern_count = 0; relink_trie(t); return t; } @ put\_first\_level. Inserts the set of one-byte sequences '0x01' to '0xff' into the trie. It can be called during trie initialization, since these nodes will never be moved elsewhere. Return value indicates whether all the insertions succeeded. @c bool put_first_level(struct trie *t){ size_t root = 1; size_t n_bytes = 255; if (t->capacity < n_bytes + 2){ size_t new_capacity = (((n_bytes + 2) / t->capacity) + 1)* t->capacity; if (resize_trie(t, new_capacity) == NULL) { return false; } } for (size_t i = 1; i <= n_bytes; i++) { t->nodes[root+i] = (uint8_t) i; t->links[root + i] = 0; t->aux[root + i] = 0; } t->node_max = root + n_bytes; t->base_max = root; t->occupied = n_bytes; if (!set_base_used(t, root, true)) { return false; } set_links(t, 0, t->node_max + 1); return true; } struct trie *resize_trie(struct trie *t, size_t new_capacity){ void *new_nodes = realloc(t->nodes, new_capacity * sizeof(char)); if (new_nodes == NULL) { fputs("Allocation error\n", stderr); return NULL; } t->nodes = new_nodes; size_t *new_links = realloc(t->links, new_capacity * sizeof(size_t)); if (new_links == NULL) { fputs("Allocation error\n", stderr); return NULL; } t->links = new_links; size_t *new_aux = realloc(t->aux, new_capacity * sizeof(size_t)); if (new_aux == NULL) { fputs("Allocation error\n", stderr); return NULL; } t->aux = new_aux; size_t old_taken_bytes = (t->capacity / 8) + 1; size_t new_taken_bytes = (new_capacity / 8) + 1; char *new_taken = realloc(t->taken, new_taken_bytes * sizeof(char)); if (new_taken == NULL) { fputs("Allocation error\n", stderr); return NULL; } t->taken = new_taken; if (new_taken_bytes > old_taken_bytes) { memset(t->taken + old_taken_bytes, 0, (new_taken_bytes - old_taken_bytes)); } memset(t->nodes + t->capacity, 0, (new_capacity - t->capacity) * sizeof(char)); t->capacity = new_capacity; relink_trie(t); return t; } void destroy_trie(struct trie *t){ free(t->nodes); free(t->links); free(t->aux); free(t->taken); free(t); } @ relink\_trie. Connects all empty spaces in the trie. @c void relink_trie(struct trie *t){ size_t last_free = 0; for (size_t node = 2; node < t->capacity; node++){ if (!is_node_occupied(t, node)) { set_links(t, last_free, node); last_free = node; } } set_links(t, last_free, 0); } @ copy\_node. Replicates a node between two tries. Return value indicates the success of the operation. @c bool copy_node(struct trie *from, size_t from_index, struct trie *to, size_t to_index){ if (to_index >= to->capacity){ size_t new_capacity = ((to_index / to->capacity) + 1) * to->capacity; if (resize_trie(to, new_capacity) == NULL) { return false; } } bool from_free = (from->nodes[from_index] == 0); bool to_free = (to->nodes[to_index] == 0); to->nodes[to_index] = from->nodes[from_index]; to->links[to_index] = from->links[from_index]; to->aux[to_index] = from->aux[from_index]; if (!from_free && to_free) { to->occupied++; } else if (from_free && !to_free) { to->occupied--; } return true; } bool get_base_used(struct trie *t, size_t index){ if (index >= t->capacity) { return false; } size_t byte_index = index / 8; size_t bit_index = index % 8; return (t->taken[byte_index] & (1 << bit_index)) != 0; } bool set_base_used(struct trie *t, size_t index, bool used){ if (index >= t->capacity) { if (resize_trie(t, index + 1) == NULL) { return false; } } size_t byte_index = index / 8; size_t bit_index = index % 8; if (used) { t->taken[byte_index] |= (1 << bit_index); } else { t->taken[byte_index] &= ~(1 << bit_index); } return true; } @ set\_links. Sets the relevant ({\tt link} and {\tt aux}) pointers to link two empty spaces next to each other. @c void set_links(struct trie *t, size_t from, size_t to){ t->links[from] = to; t->aux[to] = from; } @ is\_node\_occupied. Returns true if the given trie index is occupied by a node. @c bool is_node_occupied(struct trie *t, size_t index){ return t->nodes[index] != 0; } @ insert\_pattern. Attempts to put the given pattern into the trie, possibly creating nodes along the way. If the insertion is successful, true is returned and the resulting index is stored in {\tt out\_op\_index}. @c bool insert_pattern(struct trie *t, const char *pattern, size_t *out_op_index, struct trie *helper_trie){ size_t length = strlen(pattern); return insert_substring(t, pattern, length, length, out_op_index, helper_trie); } @ insert\_substring. Attempts to put substring $[end-length, end)$ of the given pattern into the trie, possibly creating nodes along the way. If the insertion is successful, true is returned and the resulting index is stored in {\tt out\_op\_index}. @c bool insert_substring(struct trie *t, const char *pattern, size_t end, size_t length, size_t *out_op_index, struct trie *helper_trie){ size_t index = end - length; size_t base = 1; size_t node = base + (uint8_t) pattern[index]; size_t fit; size_t node_prev = 0; bool new_pattern = false; while (index < end && base > 0) { node = base + (uint8_t) pattern[index]; if (node >= t->capacity) { size_t new_capacity = ((node / t->capacity) + 1) * t->capacity; if (resize_trie(t, new_capacity) == NULL) { return false; } } if (t->nodes[node] != pattern[index]) { new_pattern = true; if (t->nodes[node] == 0) { t->nodes[node] = pattern[index]; set_links(t, t->aux[node], t->links[node]); t->aux[node] = 0; t->links[node] = 0; t->occupied++; if (node > t->node_max) { t->node_max = node; } } else { if (!repack(t, helper_trie, &node_prev, &node, pattern[index])) { return false; } } } index++; node_prev = node; base = t->links[node]; } helper_trie->links[1] = 0; helper_trie->aux[1] = 0; helper_trie->node_max = 1; while (index < end) { helper_trie->nodes[1] = pattern[index]; if (!first_fit(t, helper_trie, &fit)) { return false; } t->links[node] = fit; base = fit; node = base + (uint8_t) pattern[index]; index++; new_pattern = true; } *out_op_index = node; if (new_pattern){ t->pattern_count++; } return true; } @ repack. Attempts to move a subtree of the trie, so that it fits new node with given value. If the repacking finishes successfully, the index for new node is stored in {\tt base} and true is returned. @c bool repack(struct trie *t, struct trie *q, size_t *node, size_t *base, char value){ if (!unpack(t, *base - (uint8_t) value, q)) { return false; } if (q->node_max >= q->capacity){ size_t new_capacity = ((q->node_max / q->capacity) + 1) * q->capacity; if (resize_trie(q, new_capacity) == NULL) { return false; } } q->nodes[q->node_max] = value; q->links[q->node_max] = 0; q->aux[q->node_max] = 0; size_t fit; if (!first_fit(t, q, &fit)) { return false; } *base = fit; t->links[*node] = *base; *base += (uint8_t) value; return true; } @ unpack. Moves all nodes with the given base to auxiliary trie. @c bool unpack(struct trie *from, size_t base, struct trie *to){ to->node_max = 1; for (size_t i = 1; i < 256; i++){ size_t from_index = base + i; if ((uint8_t) from->nodes[from_index] == i) { if (!copy_node(from, from_index, to, to->node_max)) { return false; } deallocate_node(from, from_index); to->node_max++; } } if (!set_base_used(from, base, false)) { return false; } return true; } @ first\_fit. Finds the base of trie {\tt t} that fits the nodes from auxiliary trie {\tt q} and copies them into their new indices. Return true if the search and copying finishes successfully. @c bool first_fit(struct trie *t, struct trie *q, size_t *out_base){ size_t base; if (!find_base_for_first_fit(t, q, &base)) { return false; } for (size_t q_index = 1; q_index <= q->node_max; q_index++) { size_t t_index = base + (uint8_t) q->nodes[q_index]; set_links(t, t->aux[t_index], t->links[t_index]); if (!copy_node(q, q_index, t, t_index)) { return false; } if (t_index > t->node_max) { t->node_max = t_index; } } if (!set_base_used(t, base, true)){ return false; } *out_base = base; return true; } @ find\_base\_for\_first\_fit. Searches through the trie {\tt t} to find base index that fits the nodes from the {\tt q} trie. If the search is successful, resulting index is stored in {\tt out\_base} and true is returned. @c bool find_base_for_first_fit(struct trie *t, struct trie *q, size_t *out_base){ size_t t_index; uint8_t offset; if (q->node_max > 5 && t->capacity > t->node_max + 1){ t_index = t->node_max + 1; } else { t_index = t->links[0]; } for (size_t i = 0; i <= t->capacity; i++) { if (t_index == 0){ t_index = t->capacity + 1; // free for sure after resize size_t new_capacity = 2*t->capacity; if (!resize_trie(t, new_capacity)){ return false; } } offset = (uint8_t) q->nodes[1]; if (t_index <= offset) { t_index = t->links[t_index]; continue; } *out_base = t_index - offset; size_t max_target_index = *(out_base) + 255; if (max_target_index >= t->capacity) { size_t new_capacity = ((max_target_index / t->capacity) + 1) * t->capacity; if (resize_trie(t, new_capacity) == NULL) { return false; } } if (get_base_used(t, *out_base)) { t_index = t->links[t_index]; continue; } bool conflict = false; for (size_t q_index = q->node_max; q_index >= 2; q_index--) { if(is_node_occupied(t, *out_base + (uint8_t) q->nodes[q_index])){ conflict = true; break; } } if (!conflict) { return true; } t_index = t->links[t_index]; } printf("Loop detected!\n"); return false; } @ traverse\_trie. Returns the index of the node corresponding to the given pattern, or 0 if such node does not exist. @c size_t traverse_trie(struct trie *t, const char *pattern){ size_t index = 1; size_t node = (uint8_t) pattern[0] + 1; size_t base = t->links[node]; while (index < strlen(pattern) && base > 0) { base += (uint8_t) pattern[index]; if (t->nodes[base] != pattern[index]) { return 0; } node = base; base = t->links[node]; index++; } if (index < strlen(pattern)) { return 0; } return node; } @* Outputs. The hyphenation information associated to a pattern is stored as a triplet in {\tt output} structure: \item{$\bullet$} {\bf value}: the hyphenation level, \item{$\bullet$} {\bf position}: the dot position, \item{$\bullet$} {\bf next\_op\_index}: the pointer to next output of the same pattern. For space-saving purposes, several patterns may share a single output. This means that, for example, for all patterns having single hyphen of level 1 at position 3 we store only one such output to which all the patterns point. It is thus necessary to access the outputs as effectively as possible. \utfpatgen{} uses a "hash table" with indirect addressing -- simple function $(next\_op\_index + 313*position + 361*value) \% table\_capacity) + 1$ computes the index of an output in the {\tt lookup} array that contains pointers to the real outputs. The whole structure has these fields: \item{$\bullet$} {\bf capacity}: the capacity of the array of outputs, \item{$\bullet$} {\bf count}: the number of outputs currently in the array, \item{$\bullet$} {\bf data}: the array of outputs, \item{$\bullet$} {\bf lookup\_cap}: the capacity of the lookup table, \item{$\bullet$} {\bf lookup\_cnt}: the number of entries currently in the lookup, \item{$\bullet$} {\bf lookup}: the array of lookup entries. The sizes of data and lookup are independent to some extent, so the corresponding arrays are resized independently. Since we do not want to fill the lookup fully at any time due to efficiency reasons, we trigger its resize already at 75 \% capacity. Lookup index 0 does not point to any output and the program interprets it as "no output present". @c struct outputs *init_outputs(size_t capacity){ struct outputs *ops = malloc(sizeof(struct outputs)); if (ops == NULL) { fputs("Allocation error\n", stderr); return NULL; } ops->capacity = capacity; ops->count = 0; ops->data = calloc(capacity + 1, sizeof(struct output)); if (ops->data == NULL) { fputs("Allocation error\n", stderr); free(ops); return NULL; } ops->data[0].next_op_index = 1; ops->lookup_cap = 2*capacity; ops->lookup_cnt = 0; ops->lookup = calloc(2*capacity + 1, sizeof(size_t)); if (ops->lookup == NULL) { fputs("Allocation error\n", stderr); free(ops->data); free(ops); return NULL; } return ops; } struct outputs *resize_outputs(struct outputs *ops, size_t capacity){ struct output *new_data = realloc(ops->data, (capacity + 1) * sizeof(struct output)); if (new_data == NULL) { fputs("Allocation error\n", stderr); return NULL; } ops->data = new_data; size_t diff = capacity - ops->capacity; memset(ops->data + ops->capacity + 1, 0, diff * sizeof(struct output)); ops->capacity = capacity; return ops; } void destroy_outputs(struct outputs *ops){ free(ops->data); free(ops->lookup); free(ops); } bool resize_lookup(struct outputs *ops, size_t new_cap, struct trie *t) { size_t *new_lookup = calloc(new_cap + 1, sizeof(size_t)); size_t *old_lookup = ops->lookup; if (new_lookup == NULL){ return false; } ops->lookup = new_lookup; ops->lookup_cap = new_cap; size_t old_hash, new_hash, op_index; struct output op; for (size_t node = 0; node < t->capacity; node++){ if (!is_node_occupied(t, node) || t->aux[node] == 0){ continue; } old_hash = t->aux[node]; op_index = old_lookup[old_hash]; op = ops->data[op_index]; new_hash = hash_trie_output(ops, op.value, op.position, op.next_op_index); ops->lookup[new_hash] = op_index; t->aux[node] = new_hash; } free(old_lookup); return true; } @ hash\_trie\_output. Computes the value of the hash function for given output and finds a free index in the lookup that will store it. If successful, returns the hash value. @c size_t hash_trie_output(struct outputs *ops, size_t value, size_t position, size_t next_op_index){ size_t hash = ((next_op_index + 313*position + 361*value) % ops->lookup_cap) + 1; size_t op_index; while (true) { op_index = ops->lookup[hash]; if (op_index == 0) { return hash; } else if (ops->data[op_index].value == value && ops->data[op_index].position == position && ops->data[op_index].next_op_index == next_op_index) { return hash; } else if (hash > 1) { hash -= 1; } else { hash = ops->lookup_cap; } } return 0; } @* Pattern trie. The structure that holds already generated set of patterns. It is simply an aggregation of two other structures: \item{$\bullet$} {\bf t}: a trie storing the patterns, \item{$\bullet$} {\bf ops}: outputs array storing the hyphenation information. Technically speaking, the {\tt aux} pointer of each node that corresponds to a pattern contains the lookup index of its output. @c struct pattern_trie *init_pattern_trie(size_t trie_capacity, size_t outputs_capacity){ struct pattern_trie *pt = malloc(sizeof(struct pattern_trie)); if (pt == NULL){ return NULL; } pt->t = init_trie(trie_capacity); if (pt->t == NULL){ free(pt); return NULL; } if (!put_first_level(pt->t)){ free(pt->t); free(pt); return NULL; } pt->ops = init_outputs(outputs_capacity); if (pt->ops == NULL){ free(pt->t); free(pt); return NULL; } return pt; } void destroy_pattern_trie(struct pattern_trie *pt){ destroy_trie(pt->t); destroy_outputs(pt->ops); free(pt); } @ new\_trie\_output. Creates an entry for the given output. If successful, returns true and stores output's lookup index to {\tt op\_index}. @c bool new_trie_output(struct pattern_trie *pt, size_t value, size_t position, size_t next_op_index, size_t *op_index){ if (pt->ops->count >= pt->ops->capacity - 1) { if (resize_outputs(pt->ops, pt->ops->capacity * 2) == NULL) { return false; } } if (pt->ops->lookup_cnt * 4 > pt->ops->lookup_cap * 3) { if (!resize_lookup(pt->ops, pt->ops->lookup_cap * 2, pt->t)){ return false; } } size_t hash = hash_trie_output(pt->ops, value, position, next_op_index); if (pt->ops->lookup[hash] == 0) { pt->ops->count++; struct output new_op = {.value = value, .position = position, .next_op_index = next_op_index}; size_t free_list_head = pt->ops->data[0].next_op_index; if (pt->ops->data[free_list_head].next_op_index == 0){ pt->ops->data[0].next_op_index = pt->ops->count + 1; } else { pt->ops->data[0].next_op_index = pt->ops->data[free_list_head].next_op_index; } pt->ops->data[free_list_head] = new_op; pt->ops->lookup[hash] = free_list_head; pt->ops->lookup_cnt++; } *op_index = hash; return true; } @ set\_output. Creates a new output and links it to given node in the trie. Returns true if the creation and assignment finished successfully. @c bool set_output(struct pattern_trie *pt, size_t node, size_t value, size_t position){ size_t op_index; if (!new_trie_output(pt, value, position, pt->ops->lookup[pt->t->aux[node]], &op_index)) { return false; } pt->t->aux[node] = op_index; return true; } @* Pattern counts. The structure that stores the numbers of supporting and contradicting occurrences for patterns. It comprises 4 fields: \item{$\bullet$} {\bf capacity}: the maximum index that can be used, \item{$\bullet$} {\bf size}: the highest index currently in use, \item{$\bullet$} {\bf good}: array of supporting occurrence counts, \item{$\bullet$} {\bf bad}: array of contradicting occurrence counts. The counts on given index correspond to the same pattern. @c struct pattern_counts *init_pattern_counts(size_t capacity){ struct pattern_counts *pc = malloc(sizeof(struct pattern_counts)); if (pc == NULL) { fprintf(stderr, "Allocation error\n"); return NULL; } pc->good = calloc(capacity, sizeof(size_t)); pc->bad = calloc(capacity, sizeof(size_t)); if (pc->good == NULL || pc->bad == NULL){ fprintf(stderr, "Allocation error\n"); free(pc->good); free(pc->bad); free(pc); return NULL; } pc->capacity = capacity; pc->size = 1; return pc; } struct pattern_counts *resize_pattern_counts(struct pattern_counts *pc, size_t new_capacity){ size_t *new_good = realloc(pc->good, new_capacity * sizeof(size_t)); if (new_good == NULL){ fprintf(stderr, "Allocation error\n"); return NULL; } pc->good = new_good; size_t *new_bad = realloc(pc->bad, new_capacity * sizeof(size_t)); if (new_bad == NULL){ fprintf(stderr, "Allocation error\n"); return NULL; } pc->bad = new_bad; size_t diff = new_capacity - pc->capacity; memset(pc->good + pc->capacity, 0, diff * sizeof(size_t)); memset(pc->bad + pc->capacity, 0, diff * sizeof(size_t)); pc->capacity = new_capacity; return pc; } void destroy_pattern_counts(struct pattern_counts *pc){ free(pc->good); free(pc->bad); free(pc); } @* Count trie. This structure stores candidate patterns and their occurrence counts. During dictionary file processing, every word contributes to count trie with its substrings as candidate patterns. In implementation, the count trie is composed from 2 distinct substructures: \item{$\bullet$} {\bf t}: a trie storing the patterns, \item{$\bullet$} {\bf cnts}: pattern counts with the numbers of occurrences. Note that the count trie does not need any outputs, since the only pattern template that the current iteration explores is defined by {\tt pat\_len} and {\tt pat\_dot} parameters. This allows us to store pointers to pattern counts in the {\tt aux} field of occupied trie nodes instead. @c struct count_trie *init_count_trie(size_t trie_capacity, size_t counts_capacity){ struct count_trie *ct = malloc(sizeof(struct count_trie)); if (ct == NULL){ return NULL; } ct->t = init_trie(trie_capacity); if (ct->t == NULL){ free(ct); return NULL; } if (!put_first_level(ct->t)){ free(ct->t); free(ct); return NULL; } ct->cnts = init_pattern_counts(counts_capacity); if (ct->cnts == NULL){ free(ct->t); free(ct); return NULL; } return ct; } void destroy_count_trie(struct count_trie *ct){ destroy_trie(ct->t); destroy_pattern_counts(ct->cnts); free(ct); } @* String buffer. We use buffer for storing text, usually lines read from input files. The structure contains following fields: \item{$\bullet$} {\bf capacity}: the maximum length of the string that the buffer can hold (including the ending '0x00'), \item{$\bullet$} {\bf size}: the current length of string stored in the buffer, \item{$\bullet$} {\bf data}: the array storing the string, \item{$\bullet$} {\bf eof}: if the buffer is used to read lines from a file, this flag indicates whether the end of file was reached.. @c struct string_buffer *init_buffer(size_t capacity){ struct string_buffer *buf = malloc(sizeof(struct string_buffer)); if (buf == NULL) { fputs("Allocation error\n", stderr); return NULL; } buf->capacity = capacity; buf->size = 0; buf->data = malloc(capacity*sizeof(char)); buf->eof = false; if (buf->data == NULL) { fputs("Allocation error\n", stderr); free(buf); return NULL; } buf->data[0] = '\0'; return buf; } struct string_buffer *resize_buffer(struct string_buffer *buf, size_t new_capacity){ char *new_ptr = realloc(buf->data, new_capacity); if (new_ptr == NULL) { fputs("Allocation error\n", stderr); return NULL; } buf->data = new_ptr; buf->capacity = new_capacity; return buf; } void reset_buffer(struct string_buffer *buf){ buf->eof = false; buf->size = 0; buf->data[0] = '\0'; } void destroy_buffer(struct string_buffer *buf){ free(buf->data); free(buf); } @ append\_char. Puts the given character to the current end of buffer. Note that the operation may overwrite the ending '0x00', and the caller should ensure it is not the case. Returns true upon successful write. @c bool append_char(struct string_buffer *buf, char c){ if (buf->size + 1 >= buf->capacity) { if (resize_buffer(buf, 2*buf->capacity) == NULL) { return false; } } buf->data[buf->size] = c; buf->size++; return true; } @ append\_string. Copies the given string to the current end of buffer. Provided string contains the '0x00' at its end by design, and the character is copied to the buffer as well. Returns true if the copying finished successfully. @c bool append_string(struct string_buffer *buf, const char *str){ size_t len = strlen(str) + 1; if (buf->size + len >= buf->capacity) { if (resize_buffer(buf, 2*(buf->size + len)) == NULL) { return false; } } strcpy(&buf->data[buf->size], str); buf->size += len; return true; } @* Translate table. We use this structure to hold the information about character mapping. The translate file documents the relationships between "lowercase" letters and their "uppercase" variants, and the translate table allows easy conversion. Technically the structure is just a combination of 2 substructures: \item{$\bullet$} {\bf mapping}: a trie that stores all the characters (both variants), \item{$\bullet$} {\bf alphabet}: a string buffer containing all lowercase characters separated by '0x00'. In the {\tt aux} field of the trie nodes corresponding to a character we store index of the beginning of its lowercase representation in {\tt alphabet}. @c struct translate_table *init_tr_table(size_t mapping_capacity, size_t alphabet_capacity){ struct translate_table *tt = malloc(sizeof(struct translate_table)); if (tt == NULL){ fprintf(stderr, "Allocation error\n"); return NULL; } struct trie *mapping = init_trie(mapping_capacity); if (mapping == NULL){ free(tt); return NULL; } if (!put_first_level(mapping)){ destroy_trie(mapping); free(tt); return NULL; } struct string_buffer *alphabet = init_buffer(alphabet_capacity); if (alphabet == NULL){ destroy_trie(mapping); free(tt); return NULL; } tt->mapping = mapping; tt->alphabet = alphabet; tt->index_to_alphabet = malloc(alphabet_capacity * sizeof(size_t)); if (tt->index_to_alphabet == NULL){ destroy_trie(tt->mapping); destroy_buffer(tt->alphabet); free(tt); return NULL; } tt->letter_count = 0; tt->letter_capacity = alphabet_capacity; if (!append_char(tt->alphabet, '\0')){ free(tt->index_to_alphabet); destroy_trie(tt->mapping); destroy_buffer(tt->alphabet); free(tt); return NULL; } return tt; } void destroy_tr_table(struct translate_table *tt){ destroy_trie(tt->mapping); destroy_buffer(tt->alphabet); free(tt->index_to_alphabet); free(tt); } @ get\_lower. Returns the lowercase representation of the given letter, or NULL if the letter does not exist in the translate table. @c char *get_lower(struct translate_table *tt, const char *letter){ size_t index = traverse_trie(tt->mapping, letter); if (index == 0 || tt->mapping->aux[index] > tt->letter_count || tt->mapping->aux[index] == 0){ return NULL; } size_t alphabet_offset = tt->index_to_alphabet[tt->mapping->aux[index]]; if (alphabet_offset >= tt->alphabet->size){ return NULL; } return tt->alphabet->data + alphabet_offset; } @ get\_letter\_index. Returns the index of the given letter in the translate table, or 0 if the letter does not exist. @c size_t get_letter_index(struct translate_table *tt, char *letter){ size_t index = traverse_trie(tt->mapping, letter); if (index == 0 || tt->mapping->aux[index] > tt->letter_count || tt->mapping->aux[index] == 0){ return 0; } return tt->mapping->aux[index]; } @ convert\_index. Converts a letter index to a byte sequence and appends it to the word. While the index is greater than 254, appends 0xFF byte to the word and subtracts 254 from the index. Finally appends the byte with the same value as the remainder. @c bool convert_index(size_t index, struct word *word){ if (index == 0){ return true; } size_t ff_count = (index-1) / 254; size_t remainder = ((index - 1) % 254) + 1; for (size_t i = 0; i < ff_count; i++){ if (!append_char_to_word(word, (char) 0xff)){ return false; } } if (!append_char_to_word(word, (char) remainder)){ return false; } return true; } @ convert\_byte\_sequence. Converts a byte sequence to a translate table index. This is the reverse operation of convert\_index. @c size_t convert_byte_sequence(char **sequence){ if (sequence == NULL || *sequence == NULL){ return 0; } size_t ff_count = 0; while (**sequence == (char) 0xff){ ff_count++; (*sequence)++; } size_t remainder = (uint8_t) **sequence; (*sequence)++; return ff_count * 254 + remainder; } @* Params. This structure encompasses all the parameters that influence the computation of \utfpatgen{}. We can sort them by their scope into {\bf global} (usually defined once and used throughout the whole run), {\bf level-specific} (defined specifically for each hyphenation level), and {\bf pass-specific} (defined specifically for each iteration within a level): \item{$\bullet$} {\bf left\_hyphen\_min} (global): the length of prefix of a word that should not be hyphenated, \item{$\bullet$} {\bf right\_hyphen\_min} (global): the length of suffix of a word that should not be hyphenated, \item{$\bullet$} {\bf bad\_hyphen} (global): the symbol representing incorrectly placed hyphen in the dictionary, \item{$\bullet$} {\bf missed\_hyphen} (global): the symbol representing missing hyphen in the dictionary, \item{$\bullet$} {\bf good\_hyphen} (global): the symbol representing correctly placed hyphen in the dictionary, \item{$\bullet$} {\bf hyph\_start} (global): the lowest hyphenation level to be generated, \item{$\bullet$} {\bf hyph\_finish} (global): the highest hyphenation level to be generated, \item{$\bullet$} {\bf word\_weight} (global): the default weight of a hyphen in a word from dictionary, \item{$\bullet$} {\bf dictionary\_file} (global): a pointer to opened file with dictionary entries, \item{$\bullet$} {\bf pattern\_file} (global): a pointer to opened file with initial set of patterns, \item{$\bullet$} {\bf output\_file} (global): a pointer to opened file where the final set of patterns will be written, \item{$\bullet$} {\bf translate\_file} (global): a pointer to opened file with character mapping and language-specific parameters, \item{$\bullet$} {\bf hyph\_level} (level-specific): the current hyphenation level, \item{$\bullet$} {\bf pat\_start} (level-specific): the shortest pattern length, \item{$\bullet$} {\bf pat\_finish} (level-specific): the longest pattern length, \item{$\bullet$} {\bf good\_wt} (level-specific): the supporting occurrence weight, \item{$\bullet$} {\bf bad\_wt} (level-specific): the contradicting occurrence weight, \item{$\bullet$} {\bf thresh} (level-specific): the pattern acceptance threshold, \item{$\bullet$} {\bf good\_dot} (level-specific): the hyphen type considered as correct, \item{$\bullet$} {\bf bad\_dot} (level-specific): the hyphen type considered as incorrect, \item{$\bullet$} {\bf pat\_len} (pass-specific): the current pattern length, \item{$\bullet$} {\bf pat\_dot} (pass-specific): the current dot position. @c struct params *init_params(){ struct params *p = malloc(sizeof(struct params)); if (p == NULL) { fputs("Allocation error\n", stderr); return NULL; } p->left_hyphen_min = 2; p->right_hyphen_min = 3; p->bad_hyphen = '.'; p->missed_hyphen = '-'; p->good_hyphen = '*'; p->word_weight = 1; p->dictionary_file = NULL; p->pattern_file = NULL; p->output_file = NULL; p->translate_file = NULL; return p; } void reset_params(struct params *p){ p->left_hyphen_min = 2; p->right_hyphen_min = 3; p->bad_hyphen = '.'; p->missed_hyphen = '-'; p->good_hyphen = '*'; } void destroy_params(struct params *p){ if (p->dictionary_file != NULL){ fclose(p->dictionary_file); } if (p->pattern_file != NULL){ fclose(p->pattern_file); } if (p->output_file != NULL){ fclose(p->output_file); } if (p->translate_file != NULL){ fclose(p->translate_file); } free(p); } @* Pass stats. This structure groups together variables with mostly statistical meaning. We use them to map the progress and quality of the pattern generation, and present it on standard output. Following fields are monitored: \item{$\bullet$} {\bf good\_pat\_cnt}: the number of new good patterns from an iteration, \item{$\bullet$} {\bf bad\_pat\_cnt}: the number of new bad patterns from an iteration, \item{$\bullet$} {\bf good\_cnt}: the number of hyphens from the dictionary correctly identified with current set of patterns, \item{$\bullet$} {\bf bad\_cnt}: the number of hyphens from the dictionary incorrectly identified with current set of patterns, \item{$\bullet$} {\bf miss\_cnt}: the number of hyphens from the dictionary missed with current set of patterns, \item{$\bullet$} {\bf level\_pattern\_cnt}: the number of patterns generated on current hyphenation level, \item{$\bullet$} {\bf max\_level}: the highest hyphenation level found in the initial patterns, \item{$\bullet$} {\bf more\_to\_come}: a flag indicating that undecided patterns were found on current hyphenation level. @* Stack. We decided to implement a stack-like structure to break the recursive functions from \patgen{}. In most cases the stack holds data related to previous iterations of a method, for instance, the trie bases on the path to current node. There are only 3 fields: \item{$\bullet$} {\bf capacity}: the highest index that can be accessed, \item{$\bullet$} {\bf top}: the index where the last value was inserted, \item{$\bullet$} {\bf data}: the array of values currently on the stack. @c struct stack *init_stack(size_t capacity){ struct stack *s = malloc(sizeof(struct stack)); if (s == NULL) { fprintf(stderr, "Allocation error\n"); return NULL; } s->data = malloc(capacity * sizeof(size_t)); if (s->data == NULL){ fprintf(stderr, "Allocation error\n"); free(s->data); free(s); return NULL; } s->capacity = capacity; s->top = 0; return s; } struct stack *resize_stack(struct stack *s, size_t new_capacity){ size_t *new_stack = realloc(s->data, new_capacity * sizeof(size_t)); if (new_stack == NULL){ fprintf(stderr, "Allocation error\n"); destroy_stack(s); return NULL; } s->data = new_stack; s->capacity = new_capacity; return s; } void destroy_stack(struct stack *s){ free(s->data); free(s); } @ put\_on\_stack. Appends the given value to the top of the stack. Returns true if the insertion finishes successfully. @c bool put_on_stack(struct stack *s, size_t value){ if (s->top >= s->capacity){ size_t new_capacity = 2 * (s->top); if (resize_stack(s, new_capacity) == NULL){ return false; } } s->data[s->top] = value; s->top++; return true; } @ get\_top\_value. Returns the value currently at {\tt top} index. Does not change anything on the stack. @c size_t get_top_value(struct stack *s){ if (s->top == 0){ return 0; } return s->data[s->top - 1]; } @ set\_top\_value. Changes the value at {\tt top} index to the given one. @c void set_top_value(struct stack *s, size_t value){ if (s->top == 0){ return; } s->data[s->top - 1] = value; } @* Word. The structure that holds information about a word and its hyphens. It consists of 7 fields: \item{$\bullet$} {\bf capacity}: the maximum length of the word in bytes, \item{$\bullet$} {\bf size}: current length of the word in bytes, \item{$\bullet$} {\bf length}: current length of the word in characters, \item{$\bullet$} {\bf lowercase}: the word translated to lowercase characters according to the translate table, and surrounded by {\tt EDGE\_OF\_WORD} symbols, \item{$\bullet$} {\bf true\_hyphens}: array of hyphen types and their weights retrieved from the dictionary, \item{$\bullet$} {\bf found\_hyphens}: array of hyphenation level identified by current set of patterns, \item{$\bullet$} {\bf no\_more}: array of flags marking the dot positions that can be skipped during processing. Array index $n$ corresponds to the dot position between characters $n$ and $n+1$. The lowest two bits of a {\tt true\_hyphens} value describe the hyphen type, the rest represent its weight (multiplied by 4). Note that the {\tt lowercase} array is not strictly a string as we do not require the closing '0x00'. @c struct word *init_word(size_t capacity){ struct word *word = malloc(sizeof(struct word)); if (word == NULL){ return NULL; } word->translated = calloc(capacity, sizeof(char)); if (word->translated == NULL){ free(word); return NULL; } word->true_hyphens = calloc(capacity, sizeof(size_t)); if (word->true_hyphens == NULL){ free(word->translated); free(word); return NULL; } word->found_hyphens = calloc(capacity, sizeof(uint8_t)); if (word->found_hyphens == NULL){ free(word->translated); free(word->true_hyphens); free(word); return NULL; } word->no_more = calloc(capacity, sizeof(bool)); if (word->no_more == NULL){ free(word->translated); free(word->true_hyphens); free(word->found_hyphens); free(word); return NULL; } word->size = 0; word->length = 0; word->capacity = capacity; return word; } struct word *resize_word(struct word *word, size_t new_capacity){ char *new_translated = realloc(word->translated, new_capacity * sizeof(char)); if (new_translated == NULL) { fprintf(stderr, "Allocation error\n"); return NULL; } word->translated = new_translated; size_t *new_true_hyphens = realloc(word->true_hyphens, new_capacity * sizeof(size_t)); if (new_true_hyphens == NULL) { fprintf(stderr, "Allocation error\n"); return NULL; } word->true_hyphens = new_true_hyphens; uint8_t *new_found_hyphens = realloc(word->found_hyphens, new_capacity * sizeof(uint8_t)); if (new_found_hyphens == NULL) { fprintf(stderr, "Allocation error\n"); return NULL; } word->found_hyphens = new_found_hyphens; bool *new_no_more = realloc(word->no_more, new_capacity * sizeof(bool)); if (new_no_more == NULL) { fprintf(stderr, "Allocation error\n"); return NULL; } word->no_more = new_no_more; size_t diff = new_capacity - word->capacity; memset(word->translated + word->capacity, '\0', diff * sizeof(char)); memset(word->true_hyphens + word->capacity, 0, diff * sizeof(size_t)); memset(word->found_hyphens + word->capacity, 0, diff * sizeof(uint8_t)); memset(word->no_more + word->capacity, false, diff * sizeof(bool)); word->capacity = new_capacity; return word; } void reset_word(struct word *word){ word->length = 0; word->size = 0; memset(word->translated, 0, word->capacity*sizeof(char)); memset(word->true_hyphens, 0, word->capacity*sizeof(size_t)); memset(word->found_hyphens, 0, word->capacity*sizeof(uint8_t)); memset(word->no_more, false, word->capacity*sizeof(bool)); } void destroy_word(struct word *word){ free(word->translated); free(word->true_hyphens); free(word->found_hyphens); free(word->no_more); free(word); } size_t get_true_hyphen(struct word *word, size_t index){ if (index >= word->size){ return 0; } return word->true_hyphens[index]; } bool set_true_hyphen(struct word *word, size_t index, size_t value){ if (index >= word->size){ return false; } word->true_hyphens[index] = value; return true; } uint8_t get_found_hyphen(struct word *word, size_t index){ if (index >= word->size) { return 0; } return word->found_hyphens[index]; } bool set_found_hyphen(struct word *word, size_t index, uint8_t value){ if (index >= word->size){ return false; } word->found_hyphens[index] = value; return true; } bool get_no_more(struct word *word, size_t index){ if (index >= word->size){ return false; } return word->no_more[index]; } bool set_no_more(struct word *word, size_t index, bool value){ if (index >= word->size){ return false; } word->no_more[index] = value; return true; } @ append\_char\_to\_word. Puts the given byte to the end of word. Returns true if the insertion was successful. @c bool append_char_to_word(struct word *word, char c){ if (word->size >= word->capacity - 1){ if (!resize_word(word, 2 * word->capacity)){ return false; } } word->translated[word->size] = c; word->size++; if ((uint8_t) c != 0xff){ word->length++; } return true; } @* Pattern. This structure's purpose is to hold information about a pattern parsed from the pattern file. It has similar fields as {\tt word}, but is stripped of the arrays it does not need: \item{$\bullet$} {\bf capacity}: the maximum length of the pattern in bytes, \item{$\bullet$} {\bf size}: current length of the pattern in bytes, \item{$\bullet$} {\bf length}: current length of the pattern in characters, \item{$\bullet$} {\bf text}: the pattern translated to lowercase characters according to the translate table, \item{$\bullet$} {\bf hyphens}: array of hyphenation levels of the pattern. @c struct pattern *init_pattern(size_t capacity){ struct pattern *pat = malloc(sizeof(struct pattern)); if(pat == NULL){ return NULL; } pat->text = calloc(capacity, sizeof(char)); if(pat->text == NULL){ free(pat); return NULL; } pat->hyphens = calloc(capacity, sizeof(uint8_t)); if(pat->hyphens == NULL){ free(pat->text); free(pat); return NULL; } pat->length = 0; pat->size = 0; pat->capacity = capacity; return pat; } struct pattern *resize_pattern(struct pattern *pat, size_t new_capacity){ char* new_text = realloc(pat->text, new_capacity*sizeof(char)); uint8_t* new_hyphens= realloc(pat->hyphens, new_capacity*sizeof(uint8_t)); if (new_text == NULL || new_hyphens == NULL){ fprintf(stderr,"Allocation error\n"); return NULL; } pat->text= new_text; pat->hyphens= new_hyphens; memset(pat->text + pat->capacity, '\0', (new_capacity - pat->capacity)*sizeof(char)); memset(pat->hyphens + pat->capacity, 0, (new_capacity - pat->capacity)*sizeof(uint8_t)); pat->capacity= new_capacity; return pat; } void reset_pattern(struct pattern *pat){ pat->length = 0; pat->size = 0; memset(pat->text, '\0', pat->capacity * sizeof(char)); memset(pat->hyphens, 0, pat->capacity * sizeof(uint8_t)); } void destroy_pattern(struct pattern *pat){ free(pat->text); free(pat->hyphens); free(pat); } uint8_t get_hyphen(struct pattern *pat, size_t index){ if (index >= pat->capacity){ return 0; } return pat->hyphens[index]; } bool set_hyphen(struct pattern *pat, size_t index, uint8_t value){ if (index >= pat->capacity){ return false; } pat->hyphens[index] = value; return true; } @ convert\_index\_to\_pattern. Converts a letter index to a byte sequence and appends it to the pattern. While the index is greater than 254, appends 0xFF byte to the pattern and subtracts 254 from the index. Finally appends the byte with the same value as the remainder. @c bool convert_index_to_pattern(size_t index, struct pattern *pat){ if (index == 0) return true; size_t ff_count = (index-1) / 254; size_t remainder = ((index - 1) % 254) + 1; size_t total_bytes = ff_count + 1; if(pat->size >= pat->capacity - total_bytes){ if(!resize_pattern(pat, 2*(pat->capacity + total_bytes))){ return false; } } for (size_t i = 0; i < ff_count; i++){ pat->text[pat->size] = (char) 0xff; pat->size++; } pat->text[pat->size] = (char) remainder; pat->size++; return true; } @** Index. List of used identifiers