Hierarchical Character Embeddings: Learning Phonological and Semantic Representations in Languages of Logographic Origin using Recursive Neural Networks

Minh Nguyen, Gia H. Ngo, and Nancy F. Chen

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Minh Nguyen is currently with University of California - Davis, but part of the work was done at the Institute for Infocomm Research, A*STAR. Gia H. Ngo is currently with Cornell University, but part of this work was done at the Institute for Infocomm Research, A*STAR. Nancy F. Chen is currently with the Institute for Infocomm Research, A*STAR. Nancy F. Chen is the corresponding author (nancychen@alum.mit.edu).

Logographs (Chinese characters) have recursive structures (i.e., hierarchies of sub-units in logographs) that contain phonological and semantic information, and developmental psychology literature suggests that native speakers leverage these structures to learn how to read. Exploiting these structures could potentially lead to better embeddings that can benefit many downstream tasks. We propose building hierarchical logograph (character) embeddings from logograph recursive structures using treeLSTM, a recursive neural network. Using a recursive neural network imposes a prior on the mapping from logographs to embeddings, since the network must read in the sub-units in logographs according to the order specified by the recursive structures. Based on human behavior in language learning and reading, we hypothesize that modeling logographs’ structures using a recursive neural network should be beneficial. To verify this claim, we consider two tasks: (1) predicting logographs’ Cantonese pronunciation from logographic structures and (2) language modeling. Empirical results show that the proposed hierarchical embeddings outperform baseline approaches. Diagnostic analysis suggests that hierarchical embeddings constructed using treeLSTM are less sensitive to distractors and thus more robust, especially on complex logographs.

recursive structure, morphology, logograph, embeddings, neural networks.

I. Introduction

Logographic structures contain phonological and semantic information about the logographs []. Language learners usually exploit logographic structures to learn logographs’ pronunciation by focusing on salient sub-units of logographs that hint at pronunciations []. Being able to focus on sub-units of logographs might explain how humans can remember the pronunciation and meanings of thousands of distinct characters. Figure 1 shows how logographic structures encode phonological and semantic information. The 氶 sub-unit (position 6) hints at the nucleus and coda in the logographs’ pronunciation. In addition, the 火 sub-unit (position 5; Footnote 1: 火 is written as 灬 when it is at the bottom position) suggests that the logographs containing this sub-unit must be related to fire. For the four logographs 蒸, 烝, 丞, 氶, the structure of one logograph is nested within that of the preceding logograph. For example, 烝 is nested within 蒸. Modeling this hierarchy should allow models to pick out 氶 as the most relevant sub-unit for determining the pronunciation of 蒸, 烝, 丞, 氶 and 火 as the most relevant sub-unit for determining the semantics of 蒸 and 烝.

Figure 1. An example of logographic structure. The left panel shows a binary tree representing the logograph 蒸. The leaf nodes (positions 2, 5, 6, 7) are sub-units forming the logograph (analogous to letters forming English words). The inner nodes (positions 1, 3, 4) are composition operators (such as vertical stacking) applied to children nodes. The logograph 蒸 is formed by composing all the nodes in the tree in a bottom-up fashion. The sub-trees rooted at positions 3, 4, 5, 6, 7 also form logographs (烝, 丞, 氶, 火, 一). The right table shows the logographs’ meanings and their pronunciation in Cantonese.

Given the link between logographic structures and their phonology and semantics, we investigate methods to construct logograph (character) embeddings that are useful for different downstream tasks. We consider two tasks: (1) predicting logographs’ Cantonese pronunciation from logographic structures and (2) language modeling. The pronunciation prediction task requires the embeddings to contain phonological information, while language modeling requires the embeddings to contain semantic information. We propose constructing hierarchical logograph (character) embeddings from the logographs’ recursive structures using treeLSTM []. The treeLSTM model exploits structures explicitly since it must read in sub-units in the logographs according to the order specified by the recursive structures. We compare hierarchical embeddings against two different approaches that are commonly used to construct embeddings. The first approach is standard embeddings [], in which logographs are mapped to representations without utilizing the logographs’ structures. The second approach is to construct logograph embeddings from linearized structures using LSTM []. The second approach only exploits structures implicitly since the structural information is in the input data and not in the model. Without a lot of training data, this approach is prone to overfitting and to learning solutions that may not generalize well.

Modeling structures is expected to help models generalize better, especially when there is limited training data []. Modeling structures has led to improvements in multiple tasks such as machine translation [], sentiment analysis [], natural language inference [], and parsing []. Despite these successes, there are also cases where there is little improvement []. The lack of improvement could arise because either (1) the models cannot exploit structures effectively or (2) the structures do not provide information relevant to the tasks. Thus, it is important to ensure both high-quality structure annotations and the ability of models to exploit structures effectively so as to improve overall task performance. However, ensuring consistently high-quality annotation is not simple, especially for complex tasks where multiple annotations are plausible. The quality of structure annotation may vary between training sets and test sets or even among examples within the training sets. Variation in annotations of training samples may happen due to disagreement between human annotators. Variation in annotations between training and test samples may happen when models are trained on annotations provided by humans but are tested on annotations provided by parsers that were trained to mimic human annotators. In contrast, for logographic structures, annotations are consistent since they are constructed automatically using a rule-based parser. The rules [] are defined by human experts from the Ideographic Rapporteur Group, a committee advising the Unicode Consortium about logographs, so the annotation should be of reasonably high quality (Footnote 2: The Kyoto University’s CHaracter Information Service Environment (CHISE) project: http://www.chise.org/). Hence, compared to other tasks which utilize structures, tasks involving logographs could benefit more from effective modeling of structures.

In Section II, we introduce the model used to construct the hierarchical embeddings. We apply the proposed hierarchical character embeddings to two distinct tasks. The first is pronunciation prediction (Section III), which serves as a case study to isolate the effects of modeling recursive structures. The second is language modeling (Section IV), which is a useful auxiliary task: it characterizes many aspects of language beyond semantics (including syntactic structure and discourse processing), and it can be used to pretrain models for many other tasks [], so it has many downstream applications. However, due to the multifaceted nature of the language modeling task, it is hard to analyze the result qualitatively. Finally, we discuss our work in relation to other work.

II. Model

II-A. Rule-based Parser

Decomposition of logographs into sub-units is necessary to locate the sub-units hinting at the phonetic or semantic information. Logographs are decomposed recursively using a rule-based parser. The substitution rules used by the parser are defined by human experts from the Ideographic Rapporteur Group. A substitution rule defines a mapping from one logograph to sub-units and a geometric operator (Ideographic Description Character) which denotes the relative position of the sub-units. The output of the parser is a binary tree as shown in Figure .

At the start, there is only the root node, which is the logograph itself. The parser extends the tree by recursively replacing nodes in the tree with sub-trees defined by the substitution rules. The root of each sub-tree is the geometric operator and the children of the sub-tree are the sub-units. This process is repeated until no node in the tree can be decomposed further.

Figure 2. Construction of logographic recursive structure using the rule-based parser. In this example, only four rules are used, which are shown at the bottom. The rule used at each decomposition step is in red.

Figure 2 shows how the structure (represented as a binary tree) is constructed for the logograph 仕. At step 1, there is a root node 仕 with no children. At step 2, using the rule in red, the node 仕 is replaced by a geometric operator (horizontal stacking) and two children nodes. At step 3, using the rules in red, the nodes are further simplified into 人, 十 and 一. The process terminates at step 4, where there are four leaf nodes with three distinct values (人, 丨, and 一) which cannot be simplified further. There are 505 sub-units which can be leaf nodes. These sub-units are not hand-picked, so whether or not the representation of a sub-unit carries phonetic or semantic information is learned automatically during training. Hence, the hierarchical embeddings can be used in different tasks. The phonetic and semantic sub-units can be at depth 1 (children of the root node) or they can reside deeper in the trees.
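The recursive decomposition described above can be sketched in a few lines of Python. This is only an illustration, not the parser used in the paper: the `RULES` table is a hypothetical stand-in for the expert-defined substitution rules, the specific operators chosen for each rule are assumptions, and the names `Node`, `parse`, and `leaves` are introduced here for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                      # a geometric operator or a sub-unit
    left: Optional["Node"] = None
    right: Optional["Node"] = None


# Hypothetical substitution rules (assumed for illustration only).
# A 3-tuple is (operator, left sub-unit, right sub-unit);
# a 1-tuple replaces a sub-unit by an equivalent form.
RULES = {
    "仕": ("⿰", "亻", "士"),
    "亻": ("人",),
    "士": ("⿱", "十", "一"),
    "十": ("⿻", "一", "丨"),
}


def parse(char: str) -> Node:
    """Recursively expand `char` until no rule applies to any node."""
    rule = RULES.get(char)
    if rule is None:                # no rule: `char` is a leaf sub-unit
        return Node(char)
    if len(rule) == 1:              # plain substitution, no operator
        return parse(rule[0])
    op, left, right = rule
    return Node(op, parse(left), parse(right))


def leaves(node: Node) -> list:
    """Collect leaf labels from left to right."""
    if node.left is None and node.right is None:
        return [node.label]
    return leaves(node.left) + leaves(node.right)


tree = parse("仕")
```

Under these assumed rules, the tree for 仕 ends with four leaves taking three distinct values, in line with the walk-through above.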

To construct logograph embeddings from trees, one can use bag-of-words models, sequence models, or tree-structured models []. Since the ordering of sub-units within logographs is important in determining the logographs’ pronunciation and meaning, order-agnostic models such as bag-of-words models are sub-optimal for constructing logograph embeddings. Since sequence models and tree-structured models are sensitive to the ordering of sub-units, they can be used to construct logograph embeddings. Sequence models such as recurrent neural networks (RNNs), in particular LSTM [], can be used with tree inputs by first linearizing the trees into sequences. In contrast, recursive neural networks, such as treeLSTM, can consume tree inputs directly to yield the logograph embeddings. We compared LSTM and bi-directional LSTM (biLSTM), which are structure-agnostic, with treeLSTM, which is designed to model tree structures.

II-B. Constructing Embeddings Using LSTM

At each position $t$ in a linearized tree of length $T$, $x_t$, $c_t$, and $h_t$ are the input, cell value, and hidden state of the LSTM, respectively. The last hidden state, $h_T$, is used as the logograph embedding. Figure 3(a) shows the LSTM model.

II-C. Constructing Embeddings Using Bi-directional LSTM

The biLSTM consists of two LSTMs, the forward LSTM and the backward LSTM, which read the linearized trees in opposite directions. The logograph embedding, $h_T$, is formed by concatenating the last hidden states of the backward and forward LSTMs, i.e., $h^b_T$ and $h^f_T$.

II-D. Constructing Embeddings Using CNN

The model structure is similar to that of []. The inputs to the model are also sequences formed by linearizing the trees of logographs. The CNN model consists of 7 parallel 1D convolutional layers with kernel sizes from 1 to 7. Each convolutional layer has 200 filters. The convolutional layers are followed by max-pooling layers. After that, the outputs are concatenated and fed through a fully-connected layer. The output of the fully-connected layer is the logograph embedding.

II-E. Constructing Hierarchical Embeddings Using treeLSTM

At each node $n$ of the binary tree with two children $l$ and $r$, $x_n$, $c_n$, and $h_n$ are the input, cell value, and hidden state of the treeLSTM, respectively. $i$, $f_l$, $f_r$, and $o$ are the input gate, left forget gate, right forget gate, and output gate, respectively. The forward pass of a treeLSTM unit is given by:

$$
\begin{aligned}
i &= \sigma\left(U_l^{i} h_l + U_r^{i} h_r + V^{i} x_n + V_l^{i} x_l + V_r^{i} x_r\right) \\
f_l &= \sigma\left(U_l^{f_l} h_l + U_r^{f_l} h_r + V^{f_l} x_n + V_l^{f_l} x_l + V_r^{f_l} x_r\right) \\
f_r &= \sigma\left(U_l^{f_r} h_l + U_r^{f_r} h_r + V^{f_r} x_n + V_l^{f_r} x_l + V_r^{f_r} x_r\right) \\
o &= \sigma\left(U_l^{o} h_l + U_r^{o} h_r + V^{o} x_n + V_l^{o} x_l + V_r^{o} x_r\right) \\
\tilde{c} &= \tanh\left(U_l^{\tilde{c}} h_l + U_r^{\tilde{c}} h_r + V^{\tilde{c}} x_n + V_l^{\tilde{c}} x_l + V_r^{\tilde{c}} x_r\right) \\
c_n &= i \odot \tilde{c} + f_l \odot c_l + f_r \odot c_r \\
h_n &= o \odot \tanh(c_n)
\end{aligned}
\tag{1}
$$

The hidden state of the root node, $h_{root}$, is considered as the representation of the entire tree. Figure 3(b) shows the treeLSTM model.
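To make the node update concrete, the following is a minimal pure-Python sketch of a single treeLSTM unit that follows the forward-pass equations term by term. It is not the authors' implementation: the parameter layout (a dictionary keyed by gate name), the initialization range, and the use of zero vectors for the children in the usage line are all assumptions made for this example.

```python
import math
import random


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def add(*vecs):
    """Element-wise sum of several vectors."""
    return [sum(vals) for vals in zip(*vecs)]


def hadamard(a, b):
    """Element-wise product of two vectors."""
    return [p * q for p, q in zip(a, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


GATES = ("i", "f_l", "f_r", "o", "c_tilde")


def init_params(d_in, d_hid, seed=0):
    """Assumed layout: each gate owns U_l, U_r (hidden) and V, V_l, V_r (input)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {g: {"U_l": mat(d_hid, d_hid), "U_r": mat(d_hid, d_hid),
                "V": mat(d_hid, d_in), "V_l": mat(d_hid, d_in),
                "V_r": mat(d_hid, d_in)} for g in GATES}


def pre_activation(p, h_l, h_r, x_n, x_l, x_r):
    """U_l h_l + U_r h_r + V x_n + V_l x_l + V_r x_r for one gate."""
    return add(matvec(p["U_l"], h_l), matvec(p["U_r"], h_r),
               matvec(p["V"], x_n), matvec(p["V_l"], x_l), matvec(p["V_r"], x_r))


def tree_lstm_cell(params, x_n, x_l, x_r, h_l, c_l, h_r, c_r):
    """One node update: returns (h_n, c_n) from the node input and child states."""
    args = (h_l, h_r, x_n, x_l, x_r)
    i = sigmoid(pre_activation(params["i"], *args))
    f_l = sigmoid(pre_activation(params["f_l"], *args))
    f_r = sigmoid(pre_activation(params["f_r"], *args))
    o = sigmoid(pre_activation(params["o"], *args))
    c_tilde = tanh(pre_activation(params["c_tilde"], *args))
    c_n = add(hadamard(i, c_tilde), hadamard(f_l, c_l), hadamard(f_r, c_r))
    h_n = hadamard(o, tanh(c_n))
    return h_n, c_n


# Usage with zero child inputs and states (an assumption for this sketch).
params = init_params(d_in=3, d_hid=4)
zx, zh = [0.0] * 3, [0.0] * 4
h_root, c_root = tree_lstm_cell(params, [1.0, 0.0, 0.0], zx, zx, zh, zh, zh, zh)
```

Applying this cell bottom-up over the tree, from the leaves to the root, yields the root hidden state that serves as the hierarchical embedding.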

Figure 3. LSTM and treeLSTM models: (a) LSTM, (b) treeLSTM. The last hidden layer $h_7$ of LSTM is the logograph embedding. The root hidden layer $h_7$ of treeLSTM is the logograph hierarchical embedding.

II-F. Implementation Details

One problem with tree-structured models is that training is very slow []. It is hard to batch the training samples as they might have different tree shapes []. As a result, training with a batch size of one is very common for tree-structured models [], which fails to maximize parallel computation and thus leads to slow training. Instead, we used dynamic batching to speed up training and inference. Dynamic batching in PyTorch (Footnote 3: https://devblogs.nvidia.com/recursive-neural-networks-pytorch/) has been used to create batches of nodes on the fly to speed up training and inference of the SPINN model []. In our experiments, using a batch size of 128 results in more than 10 times faster training and inference. In addition, we only considered binary trees and converted any ternary nodes (nodes with three children) to two nested binary nodes. This reduces the number of parameters that the model has to learn, thereby improving learning efficiency. The tree representation is sensitive to the order of the children nodes, as swapping the left child and the right child in a tree results in a character with potentially different meaning and pronunciation. Thus, we need separate weight matrices for each of the children. As such, modeling both binary and ternary nodes would require from 3 to 5 weight matrices, whereas modeling binary trees only requires 2 weight matrices. Since the amount of data is limited, we preferred models with fewer parameters and thus converted all ternary nodes to binary nodes.
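The conversion of ternary nodes into nested binary nodes can be illustrated as follows. The text does not say how the nesting is done, so this sketch assumes right-nesting and reuses the same operator for the inner node; both choices, the tuple-based tree encoding, and the example decomposition are assumptions for illustration.

```python
def binarize(tree):
    """Convert a tree with binary or ternary nodes into a binary tree.

    A tree is either a leaf (a string) or a tuple
    (operator, child1, child2) or (operator, child1, child2, child3).
    """
    if isinstance(tree, str):
        return tree
    op, *children = tree
    children = [binarize(child) for child in children]
    if len(children) == 2:
        return (op, children[0], children[1])
    if len(children) == 3:
        first, second, third = children
        # Assumed scheme: nest the last two children under a copy of the operator.
        return (op, first, (op, second, third))
    raise ValueError("expected 2 or 3 children, got %d" % len(children))


# Illustrative ternary decomposition (left, middle, right).
ternary = ("⿲", "彳", "圭", "亍")
binary = binarize(ternary)
```

After this step every inner node has exactly two children, so only the left and right weight matrices are needed.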

III. Experiments: Pronunciation Prediction

We compared embeddings produced using treeLSTM against LSTM and biLSTM. treeLSTM operates directly on the tree form of the logograph in order to exploit the recursive structure of logographs most effectively. In contrast, LSTM and biLSTM use more implicit structural information of the logograph in the form of linearized trees. Since standard embeddings do not consider logographic structures, every input logograph is distinct so this approach cannot learn similarities between logographs. Hence, we did not compare the hierarchical embeddings against standard embeddings.

III-A. Data

The data was extracted from the UniHan database (Footnote 4: https://www.unicode.org/charts/unihan.html), which is a pronunciation database of characters of Han logographic languages. Each entry consists of a character and its pronunciations in various languages such as Cantonese and Mandarin. For entries with multiple pronunciations, since the dominant pronunciation is not indicated, we randomly picked one of the variants. For this task, the input is the logographic character and the output is the Cantonese pronunciation. The pronunciation includes onset, nucleus, and coda. As far as we know, lexical tones are not directly determined by logographic structures, so we did not include lexical tones as prediction targets.

There are two types of logographs used in Cantonese, namely traditional and simplified characters. Simplified characters, as the name implies, are derived from their traditional counterparts by removing or replacing some complex sub-units with simpler ones. Non-simplified characters include both traditional characters and the subset of Chinese characters that are identical for traditional and simplified counterparts. Hence, simplified and traditional Chinese characters are quite different in terms of unique sub-units and their complexity.

III-B. Setup

A common weakness of deep learning models is that they often merely memorize patterns and do not generalize well on unseen data []. LSTM has the same weakness as it performs well when there is abundant training data and test distribution is the same as the training distribution []. When the test and training distributions are different, LSTM does not perform as well. Strong generalization requires models to extrapolate to out-of-distribution data points rather than to interpolate using data points within distribution [].

To test the generalizability of standard LSTM and treeLSTM, the original UniHan dataset was split into training and test sets in three different scenarios described in Table I. In the first scenario, the training and test sets had the same distribution: both contained traditional and simplified characters. In the second scenario, the test set contained only simplified characters and the training set contained non-simplified characters. In the third scenario, the distributions were different and the training data was limited: the test set contained only simplified characters while the training set contained the corresponding traditional characters.

Table I. Number of characters (logographs) used for training and testing in each scenario. Tr: Traditional, Sp: Simplified
Scenario	Training	Validation	Test
1. Tr, Sp $→$ Tr, Sp	16000	2400	2400
2. Non-Sp $→$ Sp	16000	2400	2400
3. Tr $→$ Sp	2302	200	2400

The third scenario is inspired by the fact that humans are able to predict pronunciations of simplified characters given the corresponding traditional characters, although they may rely on word context. Given that human performance is high, it should not be impossible for models to generalize to simplified characters even when trained solely on traditional characters. By contrasting results obtained from scenarios 1 and 2, we could determine whether the models merely memorized patterns or learned the underlying rules to predict pronunciation as humans do, since models that merely memorize patterns would do well in scenario 1 but not in scenario 2. In addition, contrasting scenarios 2 and 3 would hint at how models perform in low-resource settings with limited training data, as well as whether the bias induced by the logographic structures is useful for improving model generalization. It should be noted that scenario 1 is the ideal case in which one is very careful in collecting data and performs data normalization. If data is collected indiscriminately, one can end up in scenario 2.

III-C. Task-specific Layer

The task-specific layer uses the logograph embedding to predict the logograph’s pronunciation, which includes onset, nucleus and coda. The probability of each sub-syllabic unit’s pronunciation is given by:

$$
\begin{aligned}
CD &= \mathrm{softmax}\left(W_{CD}\, h\right), \\
NU &= \mathrm{softmax}\left(W_{NU}\, [h, CD]\right), \\
ON &= \mathrm{softmax}\left(W_{ON}\, [h, CD, NU]\right)
\end{aligned}
$$

where $W_{CD}$, $W_{NU}$ and $W_{ON}$ are weights of the fully-connected layer specific to each sub-syllabic unit.
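The chained prediction above (coda first, then nucleus conditioned on the coda, then onset conditioned on both) can be sketched in pure Python. This is an illustrative reading of the equations rather than the authors' code; the function names, the toy dimensions, and the all-zero weights in the usage lines are assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Matrix-vector product that checks every row matches the input width."""
    assert all(len(row) == len(x) for row in W), "dimension mismatch"
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def predict_syllable(h, W_cd, W_nu, W_on):
    """Return (onset, nucleus, coda) distributions for embedding `h`.

    [a, b] in the equations denotes concatenation, done here with list `+`.
    """
    cd = softmax(matvec(W_cd, h))
    nu = softmax(matvec(W_nu, h + cd))
    on = softmax(matvec(W_on, h + cd + nu))
    return on, nu, cd


# Toy sizes (assumed): embedding of size 2, with 3 codas, 2 nuclei, 4 onsets.
h = [1.0, -1.0]
W_cd = [[0.0] * 2 for _ in range(3)]
W_nu = [[0.0] * (2 + 3) for _ in range(2)]
W_on = [[0.0] * (2 + 3 + 2) for _ in range(4)]
on, nu, cd = predict_syllable(h, W_cd, W_nu, W_on)
```

Note how the input width of each weight matrix grows with the number of previously predicted units, which is what makes the prediction order a modeling choice.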

The setup for treeLSTM to predict a logograph’s pronunciation using hierarchical embeddings is shown in Figure 4.

Figure 4. Phonological prediction model using hierarchical embeddings. (A) The input logograph is decomposed into the logographic structure using the rule-based parser. (B) treeLSTM constructs hierarchical embedding from the structure. (C) The embedding is then used to predict the pronunciation.

III-D. Metrics

We evaluated models’ performance using string error rate (SER) and token error rate (TER). A wrongly predicted phoneme (onset, nucleus or coda) was counted as one token error. An output containing at least one token error was counted as one string error. We used the modified Obuchowski statistical test [] to assess the significance of differences in predictive performance.
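As a concrete reading of these two definitions, the following sketch computes SER and TER over a list of predictions. The function name and the toy pronunciation symbols are placeholders introduced for the example, not data from the paper.

```python
def error_rates(predictions, references):
    """Return (SER, TER) in percent.

    Each prediction and reference is an (onset, nucleus, coda) tuple.
    """
    assert len(predictions) == len(references) and references
    token_errors = 0
    string_errors = 0
    total_tokens = 0
    for pred, ref in zip(predictions, references):
        wrong = sum(p != r for p, r in zip(pred, ref))
        token_errors += wrong           # every wrong phoneme is a token error
        string_errors += wrong > 0      # any wrong phoneme makes a string error
        total_tokens += len(ref)
    ser = 100.0 * string_errors / len(references)
    ter = 100.0 * token_errors / total_tokens
    return ser, ter


# Toy example with placeholder symbols: one of two outputs has one wrong nucleus.
refs = [("z", "i", "ng"), ("s", "i", "ng")]
preds = [("z", "i", "ng"), ("s", "a", "ng")]
ser, ter = error_rates(preds, refs)
```

Since each output has three tokens, a single wrong phoneme raises SER by a full string but TER by only one third of that, which is why SER is always at least as large as TER.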

III-E. Hyperparameters

The size of the hidden layers was fixed at 256. We used dropout [] on input and hidden layers to prevent overfitting. We optimized the models using the Adam [] optimizer. The batch size was 128. For each model, we searched for the best learning rate and dropout rate using grid search. The learning rate ranged from $3 \times 10^{-2}$ to $1 \times 10^{-4}$. The dropout rate ranged from 0.0 to 0.5.

III-F. Linearization Order

Since there are multiple ways to linearize trees into sequences, in this section we investigated which linearization order is optimal for the models. We compared three different schemes: in-order, pre-order, and post-order linearization. We paired each of the five sequence models with the three linearization schemes, resulting in fifteen different combinations. For each combination, we conducted a hyperparameter search on the development set. The lowest TER for each combination is reported in Table II.

Table II. Lowest TER on development set for different models and linearization schemes
	Pre-order	Post-order	In-order
LSTM 1-layer	34.14	34.60	34.58
LSTM 2-layer	33.69	34.00	33.76
biLSTM 1-layer	34.46	35.04	34.90
biLSTM 2-layer	33.88	34.17	33.94
CNN	36.54	36.95	37.02

For all the models, the difference in performance between the linearization schemes is quite small. However, across all models, pre-order linearization is slightly better than post-order and in-order linearization. Hence, for the subsequent experiments, we use pre-order linearization to convert trees to sequences.
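The three linearization schemes compared above are the standard binary-tree traversals. The sketch below shows them on a small tree encoded as nested tuples; the encoding and the example tree (with its particular operators) are assumptions for illustration.

```python
def pre_order(tree):
    """Operator first, then left sub-tree, then right sub-tree."""
    if isinstance(tree, str):
        return [tree]
    op, left, right = tree
    return [op] + pre_order(left) + pre_order(right)


def in_order(tree):
    """Left sub-tree, then operator, then right sub-tree."""
    if isinstance(tree, str):
        return [tree]
    op, left, right = tree
    return in_order(left) + [op] + in_order(right)


def post_order(tree):
    """Left sub-tree, then right sub-tree, then operator."""
    if isinstance(tree, str):
        return [tree]
    op, left, right = tree
    return post_order(left) + post_order(right) + [op]


# Illustrative tree: (operator, left, right) with string leaves.
tree = ("⿰", "人", ("⿱", ("⿻", "一", "丨"), "一"))

pre_seq = pre_order(tree)
in_seq = in_order(tree)
post_seq = post_order(tree)
```

For a binary tree in which every operator has exactly two children, the pre-order and post-order sequences each determine the tree uniquely, whereas an in-order sequence can be ambiguous without brackets. Whether this explains the small differences in the table above is not established here.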

III-G. Results

Table III shows the prediction results by LSTM, biLSTM, and treeLSTM for the three experimental scenarios listed in Table I. In scenarios 1 and 2, biLSTM performed slightly worse than LSTM, so we only compared LSTM against treeLSTM. In scenario 1, where the training and test distributions were the same, treeLSTM yields 1.8% ($p = 2 \times 10^{-4}$) and 2.0% ($p = 6 \times 10^{-5}$) lower absolute TER (5.4% and 6.0% relative TER) than 1-layer and 2-layer LSTM respectively. treeLSTM also yields 1.6% ($p = 0.06$) and 0.6% ($p = 0.4$) lower absolute SER (2.7% and 1.0% relative SER) than 1-layer and 2-layer LSTM respectively. The trends are similar when individual output units (i.e., onset, nucleus, coda) are considered. This result is unlikely to be due to treeLSTM having a higher capacity, since the 2-layer LSTM had more parameters than treeLSTM.
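The absolute and relative reductions quoted for scenario 1 follow directly from the reported error rates. As a worked check, the sketch below recomputes them from the scenario 1 SER and TER values in the results table; the helper name `reductions` is introduced here for illustration.

```python
def reductions(baseline, proposed):
    """Return (absolute, relative) error reduction in percent, rounded to 0.1."""
    absolute = baseline - proposed
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# Scenario 1 error rates as reported in the results table.
TREE_TER, TREE_SER = 31.3, 56.9
LSTM1_TER, LSTM1_SER = 33.1, 58.5
LSTM2_TER, LSTM2_SER = 33.3, 57.5

ter_vs_lstm1 = reductions(LSTM1_TER, TREE_TER)
ter_vs_lstm2 = reductions(LSTM2_TER, TREE_TER)
ser_vs_lstm1 = reductions(LSTM1_SER, TREE_SER)
ser_vs_lstm2 = reductions(LSTM2_SER, TREE_SER)
```

The relative reduction divides the absolute gap by the baseline error rate, so the same absolute gain counts for more when the baseline error is lower.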

Table III. Cantonese phoneme prediction error rates (%). Tr: Traditional, Sp: Simplified. On.: onset, Nu.: nucleus, Cd.: coda
	SER	TER	On.	Nu.	Cd.
Scenario 1: Tr, Sp $→$ Tr, Sp
LSTM 1-layer	58.5	33.1	42.8	37.5	19.0
LSTM 2-layer	57.5	33.3	42.8	38.3	18.9
biLSTM 1-layer	59.1	33.4	43.7	37.2	19.3
biLSTM 2-layer	57.8	32.9	42.5	36.9	19.2
CNN	62.1	35.9	45.0	41.3	21.4
treeLSTM	56.9	31.3	40.9	35.7	17.3
Scenario 2: Non-Sp $→$ Sp
LSTM 1-layer	73.5	48.5	57.3	53.0	35.3
LSTM 2-layer	71.3	45.8	55.5	50.0	32.0
biLSTM 1-layer	74.1	48.4	57.2	53.0	35.0
biLSTM 2-layer	71.5	47.0	56.0	50.9	34.0
CNN	79.1	52.1	62.4	56.9	37.1
treeLSTM	69.6	43.8	51.8	48.6	31.0
Scenario 3: Tr $→$ Sp
LSTM 1-layer	77.2	55.5	62.2	59.5	44.8
LSTM 2-layer	77.4	57.7	65.2	61.3	46.4
biLSTM 1-layer	73.5	51.6	57.9	55.2	41.8
biLSTM 2-layer	75.7	55.4	62.0	60.5	43.7
CNN	70.5	48.1	54.1	49.7	40.5
treeLSTM	68.8	47.7	53.7	50.7	38.9

When training and test distributions are different (scenario 2), models that have better inductive bias should perform better []. For example, the convolution operation in a convolutional neural network (CNN) has a translation-equivariance bias []. This bias enforces that the representation of an object is the same regardless of its position in an image. This bias makes CNNs generalize much better and require fewer training samples than fully-connected neural networks. For logographs, the inductive bias is that the interaction between sub-units is local in space. This inductive bias is enforced in the treeLSTM model since a child node only interacts with its sibling. As a result, the hierarchical embeddings are much more data-efficient than the LSTM. The results shown in Table III indicate that treeLSTM can generalize better than LSTM models even when the test set has out-of-distribution samples. treeLSTM yields 4.7% ($p < 1 \times 10^{-12}$) and 2.0% ($p = 6 \times 10^{-4}$) lower absolute TER (9.6% and 4.3% relative TER) than 1-layer LSTM and 2-layer LSTM respectively. In addition, treeLSTM yields 3.9% ($p = 3 \times 10^{-6}$) and 1.7% ($p = 3 \times 10^{-2}$) lower absolute SER (5.3% and 2.3% relative SER) than 1-layer LSTM and 2-layer LSTM respectively. The trends are similar when individual sub-syllabic classes (i.e., onset, nucleus, coda) are considered.

When training and test distributions are different and the amount of training data is limited, good inductive biases are even more important for obtaining good generalization. Comparing scenarios 2 and 3, treeLSTM is less affected than LSTM by the limited training data. In the limited training data regime, 2-layer LSTM clearly overfits badly compared to 1-layer LSTM and treeLSTM. It is interesting to note that the CNN model is the most competitive baseline in scenario 3, although it is worse than the LSTM and biLSTM when there is more data (scenarios 1 and 2). However, compared to the CNN, the treeLSTM still has lower SER ($p = 0.07$) and TER ($p = 0.5$).

III-H. Ablation

We conducted ablation experiments to see how much the models depend on the composition operators. Without the operators, the LSTM, biLSTM, and CNN cannot discern the hierarchical grouping of sub-units. On the other hand, even without the composition operations, the treeLSTM model still receives some structural information from ordering of the sub-units in a tree.

Table IV. Results on the test set of scenario 1 with (+) and without (-) composition operators
	+ operators		- operators
Model	SER	TER	SER	TER
LSTM 1-layer	58.5	33.1	62.0	35.5
LSTM 2-layer	57.5	33.3	59.4	34.3
biLSTM 1-layer	59.1	33.4	63.8	36.5
biLSTM 2-layer	57.8	32.9	61.3	35.5
CNN	62.1	35.9	67.2	40.2
treeLSTM	56.9	31.3	57.3	32.0

To implement the case where there are no composition operators in the input, for the LSTM, biLSTM, and CNN models, the operators were removed from the input sequences. For the treeLSTM model, all the $V$ terms were removed from the equations of the inner nodes. We searched for the best hyperparameters for each model using the development set of scenario 1. We picked scenario 1 because it is the most common way to split data into training/validation/test sets, i.e., the standard split. The result is shown in Table IV. The composition operators do provide salient information for the task, since taking them out results in worse performance across all models (as reflected by increases in error rates). However, because the treeLSTM performance degrades only slightly, it is more likely that treeLSTM learns to compose the sub-units chiefly from the tree structure.
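For the sequence models, removing the operators amounts to filtering the linearized input. The sketch below assumes that the composition operators are encoded as Unicode Ideographic Description Characters in the range U+2FF0 to U+2FFB and that each token is a single character; the paper's actual preprocessing may differ.

```python
# Assumed range of Ideographic Description Characters used as operators.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFB


def is_operator(token):
    """True if `token` is a single Ideographic Description Character."""
    return len(token) == 1 and IDC_FIRST <= ord(token) <= IDC_LAST


def strip_operators(sequence):
    """Drop composition operators, keeping sub-units in their original order."""
    return [token for token in sequence if not is_operator(token)]


# Illustrative pre-order sequence with three operators and four sub-units.
with_ops = ["⿰", "人", "⿱", "⿻", "一", "丨", "一"]
without_ops = strip_operators(with_ops)
```

Once the operators are gone, a sequence model sees only the flat order of sub-units, which is consistent with the larger degradation observed for the LSTM, biLSTM, and CNN models.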

III-I. Prediction Order of Output Phonemes

The phonetic sub-units in Chinese characters usually predict nucleus and coda more reliably than onset. This trend can be seen in Figure 1, where all the nuclei and codas are the same across the first four characters, which share the same phonetic sub-unit. However, the most effective ordering of input and output in machine learning may not align with human intuition. For example, reversing the order of the input sentence boosted the performance of machine translation [], while swapping the order of onsets and nuclei in Thai syllables boosted the performance of English-to-Thai transliteration []. We adopted the Coda-Nucleus-Onset prediction order in this paper, as shown in Section III-C. However, we also tried a different prediction order, Onset-Nucleus-Coda. We replaced the task-specific layer of the proposed model and searched for the optimal hyperparameters. The model with the best hyperparameters was then applied to the test set. Empirically, we observed little difference in performance between the two orders.

Table V Comparing different orders of predicting output phonemes. treeLSTM results on the test set of scenario 1
Output order	SER	TER	On.	Nu.	Cd.
Coda-Nucleus-Onset	56.9	31.3	40.9	35.7	17.3
Onset-Nucleus-Coda	57.3	32.0	42.1	36.8	17.2
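
The two prediction orders compared in Table V differ only in how the three target phonemes of a syllable are arranged. The sketch below illustrates this under two assumptions that are not stated in this section: that a pronunciation such as “f ui #” lists onset, nucleus, and coda in that order, and that “#” marks an empty coda. The function name `order_targets` is hypothetical.

```python
def order_targets(onset, nucleus, coda, order="CNO"):
    """Arrange the three phoneme targets in the requested prediction order.

    order is a string over the letters O (onset), N (nucleus), C (coda).
    """
    slots = {"O": onset, "N": nucleus, "C": coda}
    return [slots[key] for key in order]


cno = order_targets("f", "ui", "#")          # Coda-Nucleus-Onset
onc = order_targets("f", "ui", "#", "ONC")   # Onset-Nucleus-Coda
print(cno, onc)
```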

IV Experiments: Language Modeling

We evaluated how well the hierarchical embeddings can improve language modeling in Chinese. We compare hierarchical embeddings against standard embeddings to quantify the usefulness of sub-unit semantic information since hierarchical embeddings are imbued with semantic information from the sub-units while standard embeddings are not.

IV-A Data

As the characters (logographs) in the output of language models are not independent, it is difficult to design meaningful statistical tests to evaluate the effectiveness of our proposed approach. Instead, we chose five different datasets: three using simplified characters (Chinese Penn Treebank (CTB) Version 5.1 [], Beijing University (PKU) dataset [], and Microsoft Research (MSR) dataset []) and two using traditional characters (City University of Hong Kong (CITYU) dataset [] and Academia Sinica (AS) dataset []). Consistent improvements across these datasets would suggest that the proposed hierarchical embeddings are effective. Table VI shows the data split for each of the datasets. Data splits for the CTB and PKU datasets are taken from [] (footnote 5: https://s3.eu-west-2.amazonaws.com/k-kawakami/seg.zip).

Table VI Number of sentences in the training, validation, and test sets in each of the datasets.
Dataset	Training	Validation	Test
CTB (Simplified)	50,734	349	345
PKU (Simplified)	17,149	1,841	1,790
MSR (Simplified)	83,000	3,924	3,985
CITYU (Traditional)	51,000	2,019	1,493
AS (Traditional)	690,000	18,953	14,431

IV-B Setup

We used the AWD-LSTM (ASGD Weight-Dropped LSTM) model [] as the core of the language modeling experiments. The input to AWD-LSTM is either hierarchical embeddings (Figure 5(b)) or standard character embeddings (Figure 5(a)). We considered the standard character embeddings as the baseline. We trained the model on the training set for a fixed number of epochs and used the validation set to select the best model. The performance of the best model was evaluated on the test set after training finished.

Figure 5 Language model (LM). (a) Standard embeddings (baseline). (b) Hierarchical embeddings (proposed).

IV-C Metrics

We evaluated models’ performance using perplexity (PPL) and bits-per-character (BPC). BPC is a standard evaluation metric for character-level LMs [].

$$\mathrm{BPC} = -\frac{1}{|x|} \sum_{t=1}^{|x|} \log_2 p(x_t \mid x_{<t})$$

$$\mathrm{PPL} = 2^{\mathrm{BPC}}$$

where $x$ is the whole corpus, $x_t$ is the character at position $t$, and $|x|$ is the length of the corpus.
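
As a worked illustration of the two metrics, the sketch below computes BPC from a list of per-character model probabilities and converts it to perplexity. The probabilities are made-up values, and the function names are hypothetical, not taken from the authors' code.

```python
import math


def bits_per_character(probs):
    """Average negative log2 probability of the observed characters.

    probs[t] is the model probability p(x_t | x_<t) of the character
    that actually occurs at position t.
    """
    if not probs:
        raise ValueError("need at least one character")
    return -sum(math.log2(p) for p in probs) / len(probs)


def perplexity(bpc):
    """Character-level perplexity implied by a BPC value."""
    return 2.0 ** bpc


# Made-up probabilities for a four-character corpus.
toy_probs = [0.5, 0.25, 0.125, 0.125]
bpc = bits_per_character(toy_probs)   # (1 + 2 + 3 + 3) / 4 = 2.25 bits
ppl = perplexity(bpc)
print(bpc, ppl)
```

Because the two metrics are tied by PPL = 2^BPC, each BPC value in the results table determines the corresponding perplexity; for example, a BPC of 4.259 corresponds to a perplexity of roughly 19.1.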

IV-D Hyperparameters

The same hyperparameters were used across the datasets. We optimized the models using the Adam [] optimizer for 300 epochs. The learning rate was set to 0.002 and was divided by 10 after 250 epochs. The size of the hidden layer was fixed at 1000. The size of the embedding was fixed at 200. The AWD-LSTM has three hidden layers with sizes 1000, 1000, and 200, respectively. We used dropout [] on input and hidden layers to prevent overfitting. Dropout rates were set to 0.1, 0.1, and 0.25 for the input, hidden, and output layers of the AWD-LSTM. L2 weight decay was set to $1.2 \times 10^{-6}$. Weight dropout was set to 0.5. The batch size was 100. To improve computational speed, only the embeddings of characters appearing in the training batch were updated. During testing, the embeddings were constructed once and then cached; hence, using hierarchical embeddings was nearly as fast as using standard embeddings. The caching technique was similar to [].
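
The settings above can be collected into a small configuration together with the step learning-rate schedule. This is only a sketch: the dictionary keys and the function `learning_rate` are hypothetical names, and it assumes that epochs are counted from 0 and that the decay takes effect from epoch index 250 onward, which the text does not specify.

```python
# Values are taken from the text; the key names are illustrative only.
HPARAMS = {
    "optimizer": "adam",
    "epochs": 300,
    "base_lr": 0.002,
    "lr_decay_epoch": 250,
    "lr_decay_factor": 10.0,
    "embedding_size": 200,
    "hidden_sizes": (1000, 1000, 200),
    "dropout": {"input": 0.1, "hidden": 0.1, "output": 0.25},
    "weight_dropout": 0.5,
    "weight_decay": 1.2e-6,
    "batch_size": 100,
}


def learning_rate(epoch, hp=HPARAMS):
    """Step schedule: base rate first, then divided by the decay factor.

    Assumption: epochs are 0-indexed and the decay starts at lr_decay_epoch.
    """
    if epoch >= hp["lr_decay_epoch"]:
        return hp["base_lr"] / hp["lr_decay_factor"]
    return hp["base_lr"]


print(learning_rate(0), learning_rate(249), learning_rate(250))
```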

IV-E Results

Table VII Language modeling performance on test sets from different datasets. hier-emb: hierarchical embedding, glyph-emb: glyph embeddings, baseline: standard embeddings, ext: additional bias term in treeLSTM. Results for the LSTM, Segmental Neural LM, and the glyph embeddings were taken from the original papers. We also reimplemented the glyph embeddings for a fairer comparison.
Model	Perplexity	BPC
Dataset: CTB (Simplified)
LSTM []	30.78	4.944
Segmental Neural LM []	28.56	4.836
AWD-LSTM, baseline	19.14	4.259
AWD-LSTM, hier-emb	18.71	4.226
AWD-LSTM, hier-emb, ext	18.85	4.237
Dataset: PKU (Simplified)
LSTM []	73.66	6.203
Segmental Neural LM []	59.01	5.883
AWD-LSTM, baseline	55.42	5.792
AWD-LSTM, hier-emb	53.96	5.754
AWD-LSTM, hier-emb, ext	56.09	5.810
Dataset: MSR (Simplified)
GRU []	47.53	5.571
GRU, glyph-emb []	47.75	5.577
GRU, reimplemented	34.27	5.099
GRU, glyph-emb, reimplemented	34.76	5.119
AWD-LSTM, baseline	22.28	4.478
AWD-LSTM, glyph-emb	22.52	4.493
AWD-LSTM, hier-emb	22.64	4.501
AWD-LSTM, hier-emb, ext	22.25	4.476
Dataset: CITYU (Traditional)
AWD-LSTM, baseline	70.48	6.139
AWD-LSTM, hier-emb	68.47	6.097
AWD-LSTM, hier-emb, ext	68.93	6.107
Dataset: AS (Traditional)
AWD-LSTM, baseline	45.99	5.523
AWD-LSTM, hier-emb	46.88	5.551
AWD-LSTM, hier-emb, ext	45.91	5.521

Table VII shows the language modeling results. We also report results on the CTB and PKU datasets from [] and results on the MSR dataset from []. The results from [] can be compared with ours since they are evaluated on the same data splits. However, direct comparison is unfair for [] because our models are bigger than theirs. The results from [] cannot be compared with ours because their data split is not publicly available and therefore differs from ours. Thus, we reimplemented the glyph embeddings for a fairer comparison. The glyph embedding model architecture is similar to that used in the original paper []. We include the results for the Segmental Neural LM model only for reference and did not reimplement this model because it depends on multitask training, which is different from the other models. Our results agree with the conclusion from [] that the glyph embeddings are slightly worse than standard embeddings regardless of the baseline (GRU or AWD-LSTM). In every dataset, the better of the two hierarchical embedding variants (with or without the ext term) outperformed the standard embeddings, regardless of whether the dataset uses simplified or traditional characters.

V Relation to Other Work

V-A Exploiting Recursive Structures

Exploiting recursive structures has been shown to be beneficial in many NLP tasks such as sentiment analysis [], text simplification [], and machine translation []. These models are usually trained using human-annotated structures but may be tested on structures annotated automatically by parsers when human annotation is not available. This mismatch in annotation quality could worsen the performance of these models and could partially explain why exploiting structures in NLP tasks has not always led to better results. For example, recursive models [] were not as good as the biLSTM in the sentiment analysis task []. To address the mismatch in annotation quality, new models which can both produce and exploit structures have been introduced []. In our case, annotation quality is consistent across the training and test sets; thus, better ways of modeling structures led to better results.

V-B Building Logographic Embeddings

In languages like Mandarin, Japanese or Cantonese, logographs are characters and the number of characters is in the thousands. In contrast, alphabetic languages usually have far fewer characters (e.g. 26 characters for English). The large number of characters in languages of logographic origin makes character-level modeling inefficient and worsens the problem of out-of-vocabulary words and characters. However, alphabetic languages and languages of logographic origin are often treated the same way, disregarding their intrinsically marked differences []. Modeling logograph sub-units can alleviate these issues since there are fewer sub-units and they can be used to construct out-of-vocabulary words and characters. This is consistent with how learners of languages of logographic origin can comprehend the meaning or pronunciation of a logograph from its constituent sub-units []. Hence, leveraging structures of logographs can be useful in capturing semantic [] or phonological information [].

There is much prior work on building embeddings of logographs. The first approach is to apply a convolutional neural network (CNN) to the visual rendering of logographs []. The second approach is to combine sub-unit embeddings with the logograph embeddings. Sub-unit embeddings can be learned independently of logograph embeddings [] using Skip-Gram or CBOW models [] or learned jointly with logograph embeddings []. The third approach is to apply a CNN or RNN to the sequence of sub-units [].

Our work is most similar to the third approach. However, while our approach exploits the recursive structures of logographs, most work in this area ignores structures or only considers the structures implicitly.

V-C Incorporating Morphology into Embeddings

In languages like English, popular models for learning word embeddings assign a distinct vector to each word, ignoring word morphology (how characters, a word’s sub-units, form a word). This approach uses solely the context surrounding words to learn the embeddings, which may be a limitation in languages with a large vocabulary and many rare words since the context may be insufficient to learn good embeddings. Building logographic (character) embeddings in languages of logographic origin has the same difficulty since there are a lot of logographs (characters) and many of them are rare.

To incorporate morphology into word embedding learning, [] proposed building word embeddings by averaging bags of character n-grams. This method may be agnostic to the order of characters if the n-gram length is short. Others have used RNN [] or CNN [] to better incorporate word morphology information into word embeddings. Unlike English words, which are linear sequences of characters, logographs are recursive structures of sub-units. Hence, using models operating on sequences such as RNN or LSTM may not be optimal.

Rare word/character embeddings can be improved by leveraging similarity in morphology between rare words and common words. [] proposed building embeddings of new words from pre-trained embeddings by learning a mapping from characters to embeddings. However, in this line of approach, the embeddings are fixed, which may not be useful for tasks that require information not captured in the embeddings pre-trained via unsupervised language modeling. In work from [], the embeddings are learned jointly with the task models so that the embeddings contain useful information for the task. Our hierarchical embeddings can be trained on task-specific data, making them potentially useful for many different tasks.

VI Discussion

Figure 6 Visualizing the construction of the logograph embedding for 賄 (bribery) by LSTM (a) and treeLSTM (b). The central panels show the hidden states $h_i$. The left columns show the input sub-units. The right columns show the predicted pronunciations using the hidden states $h_i$. The bottom rows of the right columns are the predicted pronunciations for the logographs (“f ui #” for both LSTM and treeLSTM). Ground-truth pronunciation is “f ui #”. Panels: (a) LSTM prediction, (b) treeLSTM prediction.

Figure 7 Visualizing the construction of the logograph embedding for 鴽 (quail) by LSTM (a) and treeLSTM (b). The central panels show the hidden states $h_i$. The left columns show the input sub-units. The right columns show the predicted pronunciations using the hidden states $h_i$. The bottom rows of the right columns are the predicted pronunciations for the logographs (“m u #” for LSTM and “j yu #” for treeLSTM). Ground-truth pronunciation is “j yu #”. While LSTM made a mistake, treeLSTM predicted the correct pronunciation. Panels: (a) LSTM prediction, (b) treeLSTM prediction.

VI-A Left-right Bias in Pronunciation Prediction

More than 80% of frequently used Han logographs are semantic-phonetic compounds []. These compounds consist of sub-units that might contain phonetic or semantic information []. Pronunciation of these compounds could conceivably be predicted from the phonetic sub-units. Amongst semantic-phonetic compounds, logographs with the left-right arrangement (in which the semantic sub-unit is on the left and the phonetic sub-unit is on the right) are the most common. For logographs with the left-right arrangement, a good model of logograph pronunciation should prefer the right child of the root node (the likely phonetic sub-unit) when predicting pronunciation. To check whether the hierarchical embeddings prefer the left child or the right child, we compared the norm of the left forget gate against the norm of the right forget gate. The right child is preferred if the norm of the right forget gate is larger.
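
The comparison of forget-gate norms described above can be sketched as follows. The gate activations are made-up numbers, the use of the L2 norm is an assumption (the text only says "norm"), and `prefers_right` is a hypothetical helper rather than the authors' analysis code.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a gate activation vector."""
    return math.sqrt(sum(v * v for v in vec))


def prefers_right(f_left, f_right):
    """True if the root node's right forget gate has the larger norm."""
    return l2_norm(f_right) > l2_norm(f_left)


# Made-up (left gate, right gate) activations at the root of three logographs.
root_gates = [
    ([0.1, 0.2, 0.1], [0.9, 0.8, 0.7]),
    ([0.6, 0.7, 0.9], [0.2, 0.1, 0.3]),
    ([0.3, 0.3, 0.3], [0.5, 0.6, 0.4]),
]
n_right = sum(prefers_right(left, right) for left, right in root_gates)
print(f"{n_right} of {len(root_gates)} logographs prefer the right child")
```

Counting how often the right gate wins over a set of left-right logographs gives a proportion of the kind reported as "Prefer Right" in the analysis.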

Table VIII Number of times the model using hierarchical embeddings predicts the phonetic sub-unit is on the right of a logograph that follows the left-right arrangement. The scenarios were described in Table . Tr: Traditional, Sp: Simplified.
Scenario	Left-Right	Prefer Right
Tr., Sp. $→$ Tr., Sp.	1657	1543 (93%)
Non-Sp. $→$ Sp.	1686	1589 (94%)
Tr. $→$ Sp.	1686	1643 (97%)

In Table VIII, the second column shows the number of logographs following the left-right arrangement in each scenario. The third column shows how many of these logographs have the right child preferred over the left child. The hierarchical embeddings prefer the right child most of the time (93% to 97%) in all three scenarios. Thus, the learned hierarchical embeddings consider the right sub-units to be more relevant for pronunciation prediction for the majority of compound logographs with the left-right arrangement. This is consistent with human intuition. Since humans depend on this intuition to infer pronunciation and it seems to work well, this suggests that the hierarchical embeddings might have learned a general solution.

VI-B Robustness to Distractors in Pronunciation Prediction

By overfitting to common patterns at the expense of more difficult, infrequent samples that require deeper understanding, statistical models can perform well as measured by some aggregate metrics []. A common pattern useful for predicting pronunciation is that phonetic sub-units usually occur at the end of the linearized sequences. A general model would be able to find where the phonetic sub-units are in the sequences. A model that only attends to the end of sequences would make wrong predictions when the phonetic sub-units are not at the end of the sequences.

To determine how the models predict, we visualize the hidden states of LSTM and treeLSTM. The visualization for biLSTM is not shown since it performed worse than LSTM. For both LSTM and treeLSTM, the last hidden state (e.g. $h_{15}$ in Figure ) is considered the logograph embedding. The intermediate embeddings (e.g. $h_1$ to $h_{14}$) are embeddings of the subsequences of sub-units for LSTM and embeddings of the subtrees of sub-units for treeLSTM. The hidden states (embeddings) evolve to contain more phonetic information as more sub-units are read, as indicated by the generally increasing magnitude of the hidden states (corresponding to darker bands). When the magnitude of the hidden states is small (corresponding to faint bands), the hidden states do not have enough information to predict pronunciation confidently. We also obtained the prediction corresponding to each hidden state by feeding the hidden states ($h_1$ to $h_{15}$) to the task-specific layer, in order to determine at which step the embeddings contained enough phonetic information to make the correct pronunciation prediction.
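
The probing procedure described above, feeding every intermediate hidden state to the task-specific layer, can be sketched as below. Everything here is illustrative: the hidden states are made-up vectors, `toy_task_layer` is a hypothetical stand-in for the trained layer, and summarizing the magnitude of a hidden state by its L2 norm is an assumption.

```python
import math


def magnitude(h):
    """L2 norm of one hidden state (assumed proxy for band darkness)."""
    return math.sqrt(sum(v * v for v in h))


def probe(hidden_states, task_layer, target):
    """Predict from every hidden state and find the first correct step.

    Returns the list of per-step predictions and the 1-based index of the
    first step whose prediction equals the target (None if never correct).
    """
    predictions = [task_layer(h) for h in hidden_states]
    first_correct = next(
        (step for step, pred in enumerate(predictions, start=1) if pred == target),
        None,
    )
    return predictions, first_correct


def toy_task_layer(h):
    """Hypothetical stand-in for the trained task-specific layer."""
    return "f ui #" if sum(h) > 1.0 else "?"


# Made-up hidden states h_1..h_4 for a single logograph.
hidden_states = [[0.1, 0.0], [0.3, 0.2], [0.8, 0.5], [0.9, 0.9]]
predictions, first_correct = probe(hidden_states, toy_task_layer, "f ui #")
magnitudes = [round(magnitude(h), 2) for h in hidden_states]
print(predictions, first_correct, magnitudes)
```

Comparing the first correct step with the final prediction shows whether a model keeps a correct intermediate prediction or loses it later in the sequence.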

Figure 6 shows how the models predict the pronunciation of the logograph 賄 (bribery). This is a common example as the phonetic sub-units are on the right (corresponding to the end of the linearized sequence). While both models predict correctly, they used the logograph’s structural representation differently. LSTM had to observe the whole sequence to predict correctly, as suggested by the build-up in magnitude of the embeddings until the end of the sequence. For treeLSTM, the pattern of the embeddings’ magnitude is consistent with the hierarchical structure of the input logograph with two subtrees, 貝 and 有. Specifically, not only was the final pronunciation prediction of 賄 correct (“f ui #”), but the pronunciations of the subtrees (貝 and 有) were also correct (“b ui #” and “j au #” respectively).

Figure 7 shows a rare example where the phonetic sub-units are not at the end of the sequence. LSTM made the correct prediction after observing the relevant parts (up to the second-last input token) but soon forgot the correct prediction, as it might focus more on the end of the sequence. This mistake indicates that LSTM might have learned a heuristic instead of a general strategy. On the other hand, treeLSTM predicted the pronunciation correctly by seemingly focusing on the relevant part (如) of the logograph and ignoring the less relevant tokens. Thus, imposing a prior on the mapping from logographs to embeddings by using a recursive network seems to lead to a solution that may generalize better to more challenging cases.

VI-C Infrequent Characters’ Embeddings in Language Modeling

Table IX Nearest neighbors in embedding space of infrequent words. The meaning and Mandarin pronunciation are shown next to the characters. The common sub-units between the logographs and their neighbors in the embedding space are color-coded. Red sub-units carry semantic information. Blue sub-units carry phonetic information.
Character	Standard Embedding	Hierarchical Embedding
蛛, spider, “zh u #”	a plant, “ch a ng”	蛐, cricket, “q u #”
	drawer, “t i #”	蚶, ark clam, “q u #”
	scold, “ch i #”	蚜, louse, “y a #”
砦, fort, “zh ai #”	omit, “sh e ng”	柴, firewood, “ch ai #”
	south, “n a n”	紫, purple, “z i #”
	blanket, “b ei #”	雌, female, “c i #”
jade belt, “p ei #”	this, “c i #”	pendant, “p ei #”
	army, “j u n”	imperfect pearl, “j i #”
	gate, “m e n”	watering can, “g ua n”
celery, “q i n”	powerful, “z a ng”	axe, “f u #”
	note, “zh a #”	fragrance, “x u n”
	not, “m a #”	lush, “m ao #”

Hierarchical embeddings could learn better representations of infrequent characters than standard embeddings could, since the latter ignore the morphology within characters. Using the embeddings learned in the language modeling experiments, we looked for the characters that are most similar (nearest neighbors) to the infrequent characters in the embedding space. If the nearest neighbors are semantically or phonologically close, then we are more certain that the learned embeddings are sensible. The distance between embedding vectors is calculated using cosine similarity. Table IX shows that the infrequent characters and their nearest neighbors are relatively close in meaning when using hierarchical embeddings.
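
The nearest-neighbor lookup described above can be sketched with cosine similarity over a small embedding table. The three-dimensional vectors are made-up values chosen only for illustration, the entries `unrelated_1` and `unrelated_2` are placeholders, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=3):
    """Return the k characters most similar to the query character."""
    q = embeddings[query]
    scored = [
        (cosine_similarity(q, vec), ch)
        for ch, vec in embeddings.items()
        if ch != query
    ]
    scored.sort(reverse=True)
    return [ch for _, ch in scored[:k]]


# Made-up embeddings; real ones would come from the trained language model.
toy_embeddings = {
    "蛛": [0.9, 0.1, 0.0],
    "蛐": [0.8, 0.2, 0.1],
    "蚜": [0.7, 0.1, 0.2],
    "unrelated_1": [0.0, 0.9, 0.4],
    "unrelated_2": [0.1, 0.2, 0.9],
}
neighbors = nearest_neighbors("蛛", toy_embeddings, k=2)
print(neighbors)
```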

For standard embeddings, infrequent characters and their neighbors are generally unrelated. For example, spider is unrelated to a plant, drawer, or scold. It is possible that with little training data, the infrequent characters’ embeddings stay close to their randomly initialized values and hence are far away from related characters in the embedding space. For hierarchical embeddings, infrequent characters are more related to their neighbors. The relatedness between infrequent characters and their neighbors can be semantic or phonological. For example, the first row in Table IX shows characters (蛛, 蛐, 蚶, 蚜) that share the same semantic sub-units (shown in red). Accordingly, spider is semantically related to cricket, ark clam, and louse since they are all small invertebrates. The second row in Table IX shows another example in which the characters (砦, 柴, 紫, 雌) have the same phonetic sub-units (shown in blue). Correspondingly, “zh ai #” is phonologically related (having similar pronunciation) to “ch ai #”, “z i #”, and “c i #”.

However, the hierarchical embeddings are not always accurate, and the last row of Table IX shows an interesting failure. The cosine distance suggests that (celery, “q i n”) and (axe, “f u #”) are related. Although both characters have a common sub-unit, the sub-unit carries phonological information (color-coded as blue) in the case of (celery, “q i n”) while it carries semantic information (color-coded as red) in the case of (axe, “f u #”).

VI-D Automated Feature Granularities Selection

The granularity of the input features derived from logographs could have a major impact on model performance. The input features could be as granular as individual strokes, which results in a small vocabulary. Different permutations of the strokes can form unique ideographs and expand the vocabulary. The choice of the vocabulary set has a major impact on sequential models like RNN, as a big vocabulary makes training slow and makes it hard for the model to generalize. On the other hand, a small vocabulary leads to longer sequences and makes it harder for models to learn. [] showed that a big vocabulary yields lower perplexity for language modeling of Japanese, and a big vocabulary implies that each token is a meaningful unit that carries semantic information []. Moreover, different sub-unit granularities might be more suitable for different logographic languages. For example, ideographs are more suitable as input tokens for Chinese, while individual strokes are better suited for Japanese []. [] chose to extract visual features of logographs instead of symbolic features to avoid specifying the level of granularity when decomposing a character.

In Figures 6 and 7, LSTM treats all input features with relatively equal importance, as evidenced by relatively high activation values across most hidden states. In contrast, the structural constraints imposed by treeLSTM resulted in a more automated selection of input features, in which most of the high activations concentrate at the hidden states of the sub-trees’ roots. In other words, treeLSTM seemed to have learned to build representations relevant to the task at the right level of granularity. Learning the right features via structures instead of delicate feature engineering is an advantage that should be explored further for RNN models.

VI-E Intuitive Exploitation of Input Structures

Various works have suggested that incorporating syntactic structures is tricky and does not always improve results. For example, in subject-verb agreement modeling, a model could easily ignore syntactic information from the input data, and so syntactic constraints must be explicitly injected into the model’s architecture []. Doing so would make it easier for the model to discern certain relationships of interest (subject-verb agreement) by shortening the path between relevant sub-units (subject and verb in a sentence) []. Hence, being explicit in modeling structures may be the key to obtaining performance gains. Similarly, our work showed that modeling structures explicitly (using treeLSTM) is better than modeling them implicitly (using LSTM) in terms of model performance.

Models that learn task-specific trees from data could be better than models that use conventional parsers to obtain the trees []. However, the learned trees are usually shallow and hard to interpret []. Shallow trees make the paths between related tokens shorter but they do not always result in better performance. For the binary trees of logographs, the tree depth is unlikely to account for the improved performance because the trees are not balanced binary trees (which are shallowest). The improved performance is more likely due to the inductive bias from using logographic structures. We showed that by exploiting structures in a way consistent with human intuition, treeLSTM could arrive at the general and correct solution in a more data-efficient and effective manner for pronunciation prediction and language modeling tasks concerning logographs (Chinese characters). Better interpretability due to the model following human intuition provides some confidence that the model is general and is not exploiting statistical biases in the data.

VI-F Potential Applications and Extensions

To tackle the out-of-vocabulary problem, it is common to apply pre-processing steps such as replacing infrequent characters or characters unseen during training with the UNK token. However, these pre-processing steps could potentially remove information stemming from the usage of the infrequent characters. Hierarchical embeddings enable modeling Chinese text directly without these pre-processing steps. By treating Chinese characters as recursive structures of common sub-units instead of independent tokens, hierarchical embeddings make it possible for a model to have a much bigger vocabulary. Hierarchical embeddings also make learning representations of infrequent characters easier by leveraging the similarity between the structures of infrequent characters and those of common characters. Thus, models that use hierarchical embeddings may be able to capture the intention behind the usage of infrequent characters. Furthermore, hierarchical embeddings can also be used to model Japanese Kanji, which are logographs created using the same principles as Chinese logographs.
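
The idea that a character absent from the vocabulary can still receive an embedding from its sub-units can be sketched with a toy recursive composition. This is not a treeLSTM: the `compose` function is a placeholder that simply averages its inputs, the sub-unit and operator vectors are made-up values, and the tree for 賄 (left-right arrangement of 貝 and 有) is a simplified, hypothetical decomposition. The cache illustrates constructing each embedding once and reusing it.

```python
# Made-up vectors for sub-units and for one composition operator.
SUBUNIT_VECTORS = {"貝": [1.0, 0.0], "有": [0.0, 1.0]}
OPERATOR_VECTORS = {"⿰": [0.5, 0.5]}


def compose(op_vec, left_vec, right_vec):
    """Placeholder composition: elementwise mean (not a treeLSTM cell)."""
    return [(o + l + r) / 3.0 for o, l, r in zip(op_vec, left_vec, right_vec)]


def embed(tree, cache):
    """Recursively embed a tree given as a leaf string or (op, left, right)."""
    if isinstance(tree, str):
        return SUBUNIT_VECTORS[tree]
    if tree not in cache:
        op, left, right = tree
        cache[tree] = compose(
            OPERATOR_VECTORS[op], embed(left, cache), embed(right, cache)
        )
    return cache[tree]


# The whole character has no vector of its own; it is built from its parts.
tree_for_character = ("⿰", "貝", "有")
cache = {}
vector = embed(tree_for_character, cache)
vector_again = embed(tree_for_character, cache)  # served from the cache
print(vector, len(cache))
```

Because only sub-units and operators need their own parameters, any character whose decomposition uses known sub-units can be embedded without an UNK replacement.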

We used hierarchical embeddings in the pronunciation prediction task and the language modeling task. However, other NLP tasks may also benefit from using hierarchical embeddings, as previous work exploiting logograph structures has shown promising results in tasks such as machine translation [] or textual error detection []. In particular, hierarchical embeddings may be useful in named entity recognition (NER), where infrequent characters may be used in names, or in poetry generation, where characters need to rhyme.

The current work can be extended in a few different ways. Although treeLSTM was used to construct the hierarchical embeddings, it is possible that self-attention models such as Transformer [] might lead to even better performance, as they could learn patterns in trees that the current models could not. However, since self-attention models usually have a lot of parameters, they may overfit given the small amount of training data. It would be interesting to see how well a bigger and more powerful model such as Transformer can model logographic structures given limited data. For language modeling, using hierarchical embeddings does not constrain the choice of the language model. The AWD-LSTM can be replaced by a more powerful model such as Transformer [] or BERT []. It is possible that hierarchical embeddings may have little benefit for models such as Transformer or BERT, which can exploit contextual information to deduce the representations of the infrequent characters. However, exploiting both contextual information and structural similarity to obtain better embeddings for infrequent characters should theoretically be better than relying on contextual information alone.

VII Conclusion

Exploiting recursive structures of logographs to build logographic embeddings can lead to embeddings that yield both better results and better interpretability. We showed both quantitative and qualitative evidence that exploiting recursive structures boosts accuracy in logographs’ pronunciation prediction: the hierarchical embeddings are better than the embeddings constructed by LSTM, biLSTM, and CNN. Hierarchical embeddings also consistently outperformed standard embeddings in language modeling on five different datasets. Inspecting the inner workings of the models also revealed that treeLSTM conceivably resembles how humans perform reading tasks, suggesting that exploiting structures not only improves performance, but might also help us develop more interpretable models. Although this paper only considers two tasks, building better logographic (character) embeddings by exploiting recursive structures can potentially benefit other tasks such as machine translation.

Acknowledgment

The authors would like to thank the anonymous reviewers for their constructive feedback to improve the paper. In addition, the delightful discussions with Ai Ti Aw and Ed Hovy are also much appreciated.

References

[1] J. H.-w. Hsiao and R. Shillcock, “Analysis of a Chinese phonetic compound database: Implications for orthographic processing,” Journal of psycholinguistic research, vol. 35, no. 5, 2006.
[2] C. S.-H. Ho and P. Bryant, “Phonological skills are important in learning to read Chinese.” Developmental psychology, vol. 33, no. 6, 1997.
[3] K. S. Tai, R. Socher, and C. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” in Proceedings of ACL, 2015.
[4] X. Zhu, P. Sobihani, and H. Guo, “Long short-term memory over recursive structures,” in Proceedings of ICML, 2015.
[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proceedings of NIPS, 2013.
[6] A. Graves, “Generating Sequences With Recurrent Neural Networks,” CoRR, vol. abs/1308.0850, 2013.
[7] H. G. Ngo, N. F. Chen, S. Sivadas, B. Ma, and H. Li, “A Minimal-Resource Transliteration Framework for Vietnamese,” in Proceedings of INTERSPEECH, 2014.
[8] H. G. Ngo, N. F. Chen, M. Nguyen, B. Ma, and H. Li, “Phonology-augmented statistical transliteration for low-resource languages,” in Proceedings of INTERSPEECH, 2015.
[9] H. G. Ngo, M. Nguyen, and N. F. Chen, “Phonology-augmented statistical framework for machine transliteration using limited linguistic resources,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 1, 2019.
[10] K. Yamada and K. Knight, “A syntax-based statistical translation model,” in Proceedings of ACL, 2001.
[11] A. Eriguchi, K. Hashimoto, and Y. Tsuruoka, “Tree-to-sequence attentional neural machine translation,” in Proceedings of ACL, 2016.
[12] R. Miyazaki and M. Komachi, “Japanese sentiment classification using a tree-structured Long Short-Term Memory with attention,” in Proceedings of PACLIC, 2018.
[13] S. R. Bowman, J. Gauthier, A. Rastogi, R. Gupta, C. Manning, and C. Potts, “A fast unified model for parsing and sentence understanding,” in Proceedings of ACL, 2016.
[14] C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith, “Recurrent neural network grammars,” in Proceedings of NAACL-HLT, 2016.
[15] X. Zhang, L. Lu, and M. Lapata, “Top-down Tree Long Short-Term Memory Networks,” in Proceedings of NAACL-HLT, 2016.
[16] J. Li, T. Luong, D. Jurafsky, and E. Hovy, “When are tree structures necessary for deep learning of representations?” in Proceedings of EMNLP, 2015.
[17] W. Lan and W. Xu, “Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering,” in Proceedings of COLING, 2018.
[18] T. Morioka, “CHISE: Character processing based on character ontology,” in International Conference on Large-Scale Knowledge Resources. Springer, 2008.
[19] P. Ramachandran, P. Liu, and Q. Le, “Unsupervised pretraining for sequence to sequence learning,” in Proceedings of EMNLP, 2017.
[20] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of NAACL-HLT, 2018, pp. 2227–2237.
[21] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
[22] J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” in Proceedings of ACL, 2018.
[23] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[24] B. Li, A. Drozd, T. Liu, and X. Du, “Subword-level composition functions for learning word embeddings,” in Proceedings of the Second Workshop on Subword and Character Level Models in NLP, 2018.
[25] O. Irsoy and C. Cardie, “Deep recursive neural networks for compositionality in language,” in Proceedings of NIPS, 2014, pp. 2096–2104.
[26] G. Neubig, Y. Goldberg, and C. Dyer, “On-the-fly operation batching in dynamic computation graphs,” in Proceedings of NIPS, 2017, pp. 3971–3981.
[27] R. Jia and P. Liang, “Adversarial examples for evaluating reading comprehension systems,” in Proceedings of EMNLP, 2017, pp. 2021–2031.
[28] B. Lake and M. Baroni, “Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks,” in Proceedings of ICML, 2018, pp. 2879–2888.
[29] J. Mitchell, P. Stenetorp, P. Minervini, and S. Riedel, “Extrapolation in NLP,” in Proceedings of the Workshop on Generalization in the Age of Deep Learning, 2018, pp. 28–33.
[30] Z. Yang, X. Sun, and J. W. Hardin, “A note on the tests for clustered matched-pair binary data,” Biometrical journal, vol. 52, no. 5, 2010.
[31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” JMLR, vol. 15, 2014.
[32] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proceedings of ICLR, 2015.
[33] D. Haussler, “Quantifying inductive bias: AI learning algorithms and Valiant’s learning framework,” Artificial intelligence, vol. 36, no. 2, 1988.
[34] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[35] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of NIPS, 2014.
[36] M. Nguyen, H. G. Ngo, and N. F. Chen, “Regulating Orthography-Phonology Relationship for English to Thai Transliteration,” in Proceedings of the Sixth Named Entity Workshop, 2016, pp. 83–87.
[37] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, “The Penn Chinese TreeBank: Phrase structure annotation of a large corpus,” Natural language engineering, vol. 11, no. 2, 2005.
[38] T.
Emerson, “The second International Chinese Word Segmentation Bakeoff,” in Proceedings of the fourth SIGHAN workshop on Chinese language Processing, 2005. [39] 39 K. Kawakami, C. Dyer, and P. Blunsom, “Unsupervised Word Discovery with Segmental Neural Language Models,” CoRR, vol. abs/1811.09353, 2018. [40] 40 S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing LSTM language models,” in Proceedings of the International Conference on Learning Representations, ICLR, 2018. [41] 41 L. Wang, C. Dyer, A. W. Black, I. Trancoso, R. Fermandez, S. Amir, L. Marujo, and T. Luis, “Finding function in form: Compositional character models for open vocabulary word representation,” in Proceedings of EMNLP, 2015. [42] 42 F. Z. Dai and Z. Cai, “Glyph-aware Embedding of Chinese Characters,” in Proceedings of the First Workshop on Subword and Character Level Models in NLP, 2017, pp. 64–69. [43] 43 A. Siddharthan and A. Mandya, “Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules,” in Proceedings of EACL, 2014. [44] 44 C. Quirk, A. Menezes, and C. Cherry, “Dependency treelet translation: Syntactically informed phrasal SMT,” in Proceedings of ACL, 2005. [45] 45 T. Nakazawa, J. Richardson, and S. Kurohashi, “Insertion position selection model for flexible non-terminals in dependency tree-to-tree machine translation,” in Proceedings of EMNLP, 2016. [46] 46 H. Chen, S. Huang, D. Chiang, and J. Chen, “Improved neural machine translation with a syntax-aware encoder and decoder,” in Proceedings of ACL, 2017. [47] 47 R. Socher, J. Pennington, E. H. Huang, A. Ng, and C. Manning, “Semi-supervised recursive autoencoders for predicting sentiment distributions,” in Proceedings of EMNLP, 2011. [48] 48 R. Socher, B. Huval, C. Manning, and A. Ng, “Semantic compositionality through recursive matrix-vector spaces,” in Proceedings of EMNLP, 2012. [49] 49 R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. 
Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of EMNLP, 2013. [50] 50 A. Eriguchi, Y. Tsuruoka, and K. Cho, “Learning to parse and translate improves neural machine translation,” in Proceedings of ACL, 2017, pp. 72–78. [51] 51 D. Yogatama, P. Blunsom, C. Dyer, E. Grefenstette, and L. Wang, “Learning to compose words into sentences with reinforcement learning,” in Proceedings of ICLR, 2017. [52] 52 L. Zhang and M. Komachi, “Neural machine translation of logographic language using sub-character level information,” in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018. [53] 53 T.-R. Su and H.-Y. Lee, “Learning Chinese word representations from glyphs of characters,” in Proceedings of EMNLP, 2017. [54] 54 S. Yan, S. Shuming, and L. Jing, “Joint learning embeddings for Chinese words and their components via ladder structured networks,” in Proceedings of IJCAI, 2018. [55] 55 M. Nguyen, H. G. Ngo, and N. F. Chen, “Multimodal neural pronunciation modeling for spoken languages with logographic origin,” in Proceedings of EMNLP, 2018. [56] 56 F. Liu, H. Lu, C. Lo, and G. Neubig, “Learning character-level compositionality with visual features,” in Proceedings of ACL, 2017, pp. 2059–2068. [57] 57 Y. Toyama, M. Miwa, and Y. Sasaki, “Utilizing Visual Forms of Japanese Characters for Neural Review Classification,” in Proceedings of IJCNLP, vol. 2, 2017. [58] 58 X. Shi, J. Zhai, X. Yang, Z. Xie, and C. Liu, “Radical embedding: Delving deeper to Chinese radicals,” in Proceedings of ACL, vol. 2, 2015. [59] 59 H. Peng, E. Cambria, and X. Zou, “Radical-based hierarchical embeddings for Chinese sentiment analysis at sentence level,” in The 30th International FLAIRS conference. Marco Island, 2017, pp. 347–352. [60] 60 J. Yu, X. Jian, H. Xin, and Y. Song, “Joint embeddings of chinese words, characters, and fine-grained subcharacter components,” in Proceedings of EMNLP, 2017. [61] 61 R. Yin, Q. Wang, P. 
Li, R. Li, and B. Wang, “Multi-granularity Chinese word embedding,” in Proceedings of EMNLP, 2016. [62] 62 Y. Ke and M. Hagiwara, “Radical-level ideograph encoder for RNN-based sentiment analysis of Chinese and Japanese,” in Asian Conference on Machine Learning, 2017, pp. 561–573. [63] 63 M. Karpinska, B. Li, A. Rogers, and A. Drozd, “Subcharacter information in Japanese embeddings: When is it worth it?” in Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, 2018. [64] 64 C. Dong, J. Zhang, C. Zong, M. Hattori, and H. Di, “Character-based LSTM-CRF with radical-level features for Chinese named entity recognition,” in Natural Language Understanding and Intelligent Applications. Springer, 2016. [65] 65 H. Han, X. Yang, L. Wu, H. Yan, Z. Gao, Y. Feng, and G. Townsend, “Dual long short-term memory networks for sub-character representation learning,” CoRR, vol. abs/1712.08841, 2017. [66] 66 H. Zhuang, C. Wang, C. Li, Q. Wang, and X. Zhou, “Natural Language Processing Service Based on Stroke-Level Convolutional Networks for Chinese Text Classification,” in Web Services (ICWS), 2017 IEEE International Conference on, 2017. [67] 67 S. Cao, W. Lu, J. Zhou, and X. Li, “cw2vec: Learning Chinese word embeddings with stroke n-grams,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, pp. 5053–5061. [68] 68 P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of ACL, vol. 5, 2017. [69] 69 J. Zhao, S. Mudgal, and Y. Liang, “Generalizing word embeddings using bag of subwords,” in Proceedings of EMNLP, 2018, pp. 601–606. [70] 70 Y. Kim, Y. Jernite, D. Sontag, and A. Rush, “Character-aware neural language models,” in Proceedings of AAAI, 2016. [71] 71 S. Papay, S. Padó, and N. T. Vu, “Addressing low-resource scenarios with character-aware embeddings,” in Proceedings of the Second Workshop on Subword and Character Level Models in NLP, 2018. [72] 72 Y. 
Pinter, R. Guthrie, and J. Eisenstein, “Mimicking word embeddings using subword RNNs,” in Proceedings of EMNLP, 2017. [73] 73 Y. Kim, K.-M. Kim, J.-M. Lee, and S. Lee, “Learning to generate word representations using subword information,” in Proceedings of COLING, 2018. [74] 74 T. Schick and H. Schütze, “Attentive Mimicking: Better word embeddings by attending to informative contexts,” in Proceedings of NAACL-HLT, 2019. [75] 75 Y. Li and J. Kang, “Analysis of phonetics of the ideophonetic characters in Modern Chinese,” Information analysis of usage of characters in modern Chinese, pp. 84–98, 1993. [76] 76 V. Nguyen, J. Brooke, and T. Baldwin, “Sub-character neural language modelling in Japanese,” in Proceedings of the First Workshop on Subword and Character Level Models in NLP, 2017. [77] 77 A. Kuncoro, C. Dyer, J. Hale, D. Yogatama, S. Clark, and P. Blunsom, “LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better,” in Proceedings of ACL, 2018. [78] 78 J. Björne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, “Extracting complex biological events with rich graph-based feature sets,” in Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, 2009, pp. 10–18. [79] 79 J. Choi, K. M. Yoo, and S.-G. Lee, “Learning to compose task-specific tree structures,” in Proceedings of AAAI, 2018, pp. 5094–5101. [80] 80 A. Williams, A. Drozdov*, and S. R. Bowman, “Do latent tree learning models identify meaningful structure in sentences?” Transactions of ACL, vol. 6, 2018. [81] 81 L. Zhang and M. Komachi, “Chinese-Japanese Unsupervised Neural Machine Translation Using Sub-character Level Information,” CoRR, vol. abs/1903.00149, 2019. [82] 82 K.-Y. Chen, H.-M. Wang, and H.-H. Chen, “A probabilistic framework for Chinese spelling check,” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 14, no. 4, pp. 15:1–15:17, 2015. [83] 83 A. Vaswani, N. 
Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of NIPS, 2017, pp. 5998–6008.