POS-Tagger for English-Vietnamese Bilingual Corpus

HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pp. 88-95, Edmonton, May-June 2003

Dinh Dien
Information Technology Faculty of Vietnam National University of HCMC,
20/C2 Hoang Hoa Tham, Ward 12, Tan Binh Dist., HCM City, Vietnam
[email protected]

Hoang Kiem
Center of Information Technology Development of Vietnam National University of HCMC,
227 Nguyen Van Cu, District 5, HCM City
[email protected]

Abstract

Corpus-based Natural Language Processing (NLP) tasks for such popular languages as English, French, etc. have been well studied with satisfactory achievements. In contrast, corpus-based NLP tasks for unpopular languages (e.g. Vietnamese) are at a deadlock due to the absence of annotated training data for these languages. Furthermore, hand-annotation of even reasonably well-determined features such as part-of-speech (POS) tags has proved to be labor intensive and costly. In this paper, we suggest a solution to partially overcome the annotated resource shortage in Vietnamese by building a POS-tagger for an automatically word-aligned English-Vietnamese parallel corpus (named EVC). This POS-tagger makes use of the Transformation-Based Learning (TBL) method to bootstrap the POS-annotation results of the English POS-tagger by exploiting the POS-information of the corresponding Vietnamese words via their word-alignments in EVC. Then, we directly project POS-annotations from the English side to the Vietnamese side via the available word alignments. This POS-annotated Vietnamese corpus will be manually corrected to become annotated training data for Vietnamese NLP tasks such as POS-tagger, Phrase-Chunker, Parser, Word-Sense Disambiguator, etc.

1 Introduction

POS-tagging is assigning to each word of a text the proper POS tag in its context of appearance. Although each word can be classified into various POS-tags, in a defined context it can only be attributed a definite POS. As an example, in the sentence “I can can a can”, the POS-tagger must be able to produce the following: “I/PRO can/AUX can/V a/DET can/N”.

In order to proceed with POS-tagging, various methods can be used, such as Hidden Markov Models (HMM); Memory-based models (Daelemans, 1996); Transformation-based Learning (TBL) (Brill, 1995); Maximum Entropy; decision trees (Schmid, 1994a); Neural networks (Schmid, 1994b); and so on. Among these, the methods based on machine learning in general, and TBL in particular, prove effective and are very popular at present.

To achieve good results, the abovementioned methods must be equipped with exactly annotated training corpora. Such training corpora for popular languages (e.g. English, French, etc.) are available (e.g. Penn Tree Bank, SUSANNE, etc.). Unfortunately, so far, there has been no such annotated training data available for Vietnamese POS-taggers. Furthermore, building manually annotated training data is very expensive (for example, over 1 million dollars and many person-years were invested in the Penn Tree Bank). To overcome this drawback, this paper presents a solution to indirectly build such an annotated training corpus for Vietnamese by taking advantage of an available English-Vietnamese bilingual corpus named EVC (Dinh Dien, 2001b). This EVC has been automatically word-aligned (Dinh Dien et al., 2002a).

Our approach in this work is to use a bootstrapped POS tagger for English to annotate the English side of a word-aligned parallel corpus, then directly project the tag annotations to the second language (Vietnamese) via existing word-alignments (Yarowsky and Ngai, 2001). In this work, we made use of the TBL method and the SUSANNE training corpus to train our English POS-tagger. The remainder of this paper is organized as follows:

- POS-Tagging by TBL method: introduction to the original TBL, the improved fTBL, and the traditional English POS-Tagger by TBL.
- English-Vietnamese bilingual Corpus (EVC): resources of EVC, word-alignment of EVC.
- Bootstrapping English-POS-Tagger: bootstrapping the English POS-Tagger by the POS-tags of the corresponding Vietnamese words, and its evaluation.
- Projecting English POS-tag annotations to the Vietnamese side, and its evaluation.
- Conclusion: conclusions, limitations and future developments.

2 POS-Tagging by TBL method

Transformation-Based Learning (TBL) was proposed by Eric Brill in 1993 in his doctoral dissertation (Brill, 1993) on the foundation of the structural linguistics of Z.S. Harris. TBL has been applied with success in various natural language processing tasks (mainly classification tasks). In 2001, Radu Florian and Grace Ngai proposed fast Transformation-Based Learning (fTBL) (Florian and Ngai, 2001a) to improve the learning speed of TBL without affecting the accuracy of the original algorithm.

The central idea of TBL is to start with some simple (or sophisticated) solution to the problem (called baseline tagging), and step-by-step apply optimal transformation rules (which are extracted from an annotated training corpus at each step) to improve the solution (changing incorrect tags into correct ones). The algorithm stops when no more optimal transformation rule is selected or the data is exhausted. The optimal transformation rule is the one which results in the largest benefit (it repairs as many incorrect tags into correct tags as possible).

A striking particularity of TBL in comparison with other learning methods is that it is perceptive and symbolic: linguists are able to observe and intervene in all the learning and implementing processes as well as in the intermediary and final results. Besides, TBL allows the inheritance of the tagging results of another system (considered as the baseline or initial tagging), with the correction of that result based on the transformation rules learned through the training period.

TBL operates by applying transformation rules in order to change wrong tags into right ones. All these rules obey templates specified by humans. In these templates, we need to specify the factors affecting the tagging. In order to evaluate the optimal transformation rules, TBL needs the annotated training corpus (the corpus to which the correct tags have been attached, usually referred to as the golden corpus) to compare the result of the current tagging to the correct tags in the training corpus. In the executing period, these optimal rules are used for tagging new corpora (in their sorted order), and these new corpora must also be assigned baseline tags in the same way as in the training period. These linguistic annotation tags can be morphological ones (sentence boundary, word boundary), POS tags, syntactical tags (phrase chunker), sense tags, grammatical relation tags, etc.

POS-tagging was the first application of TBL and is the most popular one; it has been extended to various languages (e.g. Korean, Spanish, German, etc.) (Curran, 1999). The approach of the TBL POS-tagger is simple but effective, and it reaches an accuracy competitive with other powerful POS-taggers. The TBL algorithm for POS-tagging can be briefly described in two periods as follows:

* The training period:
  - Starting with the annotated training corpus (called the golden corpus, which has been assigned correct POS tag annotations), TBL copies this golden corpus into a new unannotated corpus (called the current corpus, from which the POS tag annotations are removed).
  - TBL assigns an initial POS-tag to each word in the corpus. This initial tag is the most likely tag for a word if the word is known, and is guessed based upon properties of the word if the word is not known.
  - TBL applies each instance of each candidate rule (following the format of templates designed by human beings) in the current corpus. These rules change the POS tags of words based upon the contexts they appear in. TBL evaluates the result of applying each candidate rule by comparing the current result of POS-tag annotations with that of the golden corpus, in order to choose the best rule, which has the highest mark. These best rules are repeatedly extracted until there is no more optimal rule (no rule's mark is higher than a preset threshold). These optimal rules form an ordered sequence.

* The executing period:
  - Starting with the new unannotated text, TBL assigns an initial POS-tag to each word in the text in a way similar to that of the training period.
  - The sequence of optimal rules (extracted in the training period) is applied, which changes the POS tag annotations based upon the contexts they appear in. These rules are applied deterministically in the order they appear in the sequence.

In addition to the above-mentioned TBL algorithm that is applied in the supervised POS-tagger, Brill (1997) also presented an unsupervised POS-tagger that is trained on unannotated corpora. The accuracy of the unsupervised POS-tagger was reported to be lower than that of the supervised POS-tagger.

Because the goal of our work is to build POS-tag annotated training data for Vietnamese, we need an annotated corpus with as high an accuracy as possible. So, we will concentrate on the supervised POS-tagger only.

For full details of TBL and fTBL, please refer to Eric Brill (1993, 1995) and Radu Florian and Grace Ngai (2001a).

3 English-Vietnamese Bilingual Corpus

The bilingual corpus that needs POS-tagging in this paper is named EVC (English-Vietnamese Corpus). This corpus is collected from many different resources of bilingual texts (such as books, dictionaries, corpora, etc.) in selected fields such as Science, Technology, and daily conversation (see table 1). After the bilingual texts were collected from the different resources, this parallel corpus was normalized with respect to form (text-only), tone marks (diacritics), Vietnamese character code (TCVN-3), character font (VN-Times), etc. Next, this corpus was sentence-aligned and spell-checked semi-automatically. An example of the unannotated EVC is as follows:

*D02:01323: Jet planes fly about nine miles high.
+D02:01323: Các phi cơ phản lực bay cao khoảng chín dặm.

Here, the codes at the beginning of each line refer to the corresponding sentence in the EVC corpus. For full details of building this EVC corpus (e.g. collecting, normalizing, sentence alignment, spelling checker, etc.), please refer to Dinh Dien (2001b).

Next, this bilingual corpus was automatically word-aligned by a hybrid model combining the semantic class-based model with the GIZA++ model. An example of the word-alignment result is shown in figure 1 below. The accuracy of the word-alignment of this parallel corpus has been reported as approximately 87% in (Dinh Dien et al., 2002b). For full details of the word alignment of this EVC corpus (precision, recall, coverage, etc.), please refer to (Dinh Dien et al., 2002a).

The result of this word-aligned parallel corpus has been used in various Vietnamese NLP tasks, such as training the Vietnamese word segmenter (Dinh Dien et al., 2001a), word sense disambiguation (Dinh Dien, 2002b), etc.

Remarkably, this EVC includes the SUSANNE corpus (Sampson, 1995), a golden corpus that has been manually annotated with such necessary English linguistic annotations as lemma, POS tags, chunking tags, syntactic trees, etc. This English corpus has been translated into Vietnamese by English teachers of the Foreign Language Department of Vietnam University of HCM City. In this paper, we make use of this valuable annotated corpus as the training corpus for our bootstrapped English POS-tagger.

No.  Resources                  Pairs of    English     Vietnamese     Length           Percent
                                sentences   words       morpho-words   (English words)  (words/EVC)
1.   Computer books                 9,475     165,042       239,984        17.42            7.67
2.   LLOCE dictionary              33,078     312,655       410,760         9.45           14.53
3.   EV bilingual dictionaries    174,906   1,110,003     1,460,010         6.35           51.58
4.   SUSANNE corpus                 6,269     131,500       181,781        20.98            6.11
5.   Electronics books             12,120     226,953       297,920        18.73           10.55
6.   Children’s Encyclopedia        4,953      79,927       101,023        16.14            3.71
7.   Other books                    9,210     126,060       160,585        13.69            5.86
     Total                        250,011   2,152,140     2,852,063         8.59          100%

Table 1. Resources of EVC corpus

Jet planes fly about nine miles high
Các phi cơ phản lực bay cao khoảng chín dặm

Figure 1. An example of a word-aligned pair of sentences in EVC corpus

4 Our Bootstrapped English POS-Tagger

So far, existing POS-taggers for (mono-lingual) English have been well developed with satisfactory achievements, and it is very difficult (nearly impossible for us) to improve their results. Actually, those existing advanced POS-taggers have exhaustively exploited all linguistic information in English texts, and there is no way for us to improve the English POS-tagger in the case of such monolingual English texts. By contrast, in bilingual texts, we are able to make use of the second language’s linguistic information in order to improve the POS-tag annotations of the first language.

Our solution is motivated by I.Dagan, I.Alon and S.Ulrike (1991) and W.Gale, K.Church and D.Yarowsky (1992). They proposed the use of bilingual corpora to avoid hand-tagging of training data. Their premise is that “different senses of a given word often translate differently in another language (for example, pen in English is stylo in French for its writing implement sense, and enclos for its enclosure sense). By using a parallel aligned corpus, the translation of each occurrence of a word such as pen can be used to automatically determine its sense”. This remark is true not only for word sense but also for POS-tag, and it is more exact in such typologically different languages as English vs. Vietnamese.

In fact, POS-tag annotations of English words as well as of Vietnamese words are often ambiguous, but they are not often ambiguous in exactly the same way (table 4). For example, “can” in English may be “Aux” for the ability sense, “V” for the sense of putting into a container, and “N” for the container sense, and there is hardly any existing POS-tagger which can tag that word “can” exactly in all different contexts. Nevertheless, if that “can” in English is already word-aligned with a corresponding Vietnamese word, it will be POS-disambiguated easily by the Vietnamese word’s POS-tags. For example, if “can” is aligned with “có thể”, it must be an Auxiliary; if it is aligned with “đóng hộp”, then it must be a Verb; and if it is aligned with “cái hộp”, then it must be a Noun.

However, not all Vietnamese POS-tag information is useful and deterministic. The big question here is when and how we make use of the Vietnamese POS-tag information. Our answer is to have this English POS-tagger trained by the TBL method (section 2) with the SUSANNE training corpus (section 3). After training, we extract an ordered sequence of optimal transformation rules. We use these rules to improve an existing English POS-tagger (as the baseline tagger) for tagging the words of the English side of the word-aligned EVC corpus. This English POS-tagging result is then projected to the Vietnamese side via word-alignments in order to form a new Vietnamese training corpus annotated with POS-tags.

4.1 The English POS-Tagger by TBL method

To make the presentation clearer, we re-use the notations in the introduction to the fnTBL-toolkit of Radu Florian and Grace Ngai (2001b) as follows:

• $\chi$ : denotes the space of samples: the set of words which need POS-tagging. In English, it is simple to recognize the word boundary, but in Vietnamese (an isolating language), it is rather complicated. Therefore, it has been presented in another work (Dinh Dien, 2001a).

• $C$ : the set of possible POS-classifications $c$ (the tagset). For example: noun (N), verb (V), adjective (A), ... For English, we made use of the Penn TreeBank tagset, and for the Vietnamese tagset, we use the POS-tagset mapping table (see appendix A).

• $S = \chi \times C$ : the space of states: the cross-product between the sample space (words) and the classification space (tagset), where each point is a couple (word, tag).

• $\pi$ : a predicate defined on the space $S^{+}$, that is, on a sequence of states. Predicate $\pi$ follows the specified templates of transformation rules. In the POS-tagger for English, this predicate only consists of English factors which affect the POS-tagging process, for example $\bigcup_{\exists i \in [-m,+n]} Word_i$ or $\bigcup_{\exists i \in [-m,+n]} Tag_i$ or $\bigcup_{\exists i \in [-m,+n]} Word_i \wedge Tag_j$. Here, $Word_i$ is the morphology of the $i$th word from the current word. Consistently with the example rules in section 4.3, negative values of $i$ refer to preceding words (left side), and positive ones to following words (right side). $i$ ranges within the window from $-m$ to $+n$. In this English-Vietnamese bilingual POS-tagger, we add new elements, including $VTag_0$ and $\exists VTag_0$, to those predicates. $VTag_0$ is the Vietnamese POS-tag corresponding to the current English word via its word-alignment. These Vietnamese POS-tags are determined by the most frequent tag according to the Vietnamese dictionary.

• A rule $r$ is defined as a couple $(\pi, c)$ which consists of a predicate $\pi$ and a tag $c$. Rule $r$ is written in the form $\pi \Rightarrow c$. This means that the rule $r = (\pi, c)$ will be applied to the sample $x$ if the predicate $\pi$ is satisfied on it, whereupon the tag of $x$ will be changed into the new tag $c$.

• Given a state $s = (x, c)$ and a rule $r = (\pi, c')$, the result state $r(s)$, which is obtained by applying rule $r$ to $s$, is defined as:

\[ r(s) = \begin{cases} s & \text{if } \pi(s) = \text{False} \\ (x, c') & \text{if } \pi(s) = \text{True} \end{cases} \]

• $T$ : the set of training samples, which have been assigned the correct tags. Here we made use of the SUSANNE golden corpus (Sampson, 1995), whose POS-tagset was converted into the PTB tagset.

• The score associated with a rule $r = (\pi, c)$ is usually the difference in performance (on the training data) that results from applying the rule, as follows:

\[ Score(r) = \sum_{s \in T} score(r(s)) - \sum_{s \in T} score(s) \]

\[ score((x, c)) = \begin{cases} 1 & \text{if } c = True(x) \\ 0 & \text{if } c \neq True(x) \end{cases} \]

4.2 The TBL algorithm for POS-Tagging

The TBL algorithm for POS-tagging can be briefly described as follows (see the flowchart in figure 2):

Step 1: Baseline tagging: initialize each sample $x$ in the SUSANNE training data with its most likely POS-tag $c$. For English, we made use of the available English tagger (and parser) of Eugene Charniak (1997) at Brown University (version 2001). For Vietnamese, it is the set of possible part-of-speech tags (ordered by the appearance probability of each part of speech in the dictionary). We call the starting training data $T_0$.

Step 2: Considering all the transformations (rules) $r$ applicable to the training data $T_k$ at iteration $k$, choose the one with the highest $Score(r)$ and apply it to the training data to obtain the new corpus $T_{k+1}$. We have: $T_{k+1} = r(T_k) = \{ r(s) \mid s \in T_k \}$. If there is no more transformation rule which satisfies $Score(r) > \beta$, the algorithm stops. $\beta$ is the threshold, which is preset and adjusted according to the actual situation.

Step 3: $k = k + 1$.

Step 4: Repeat from step 2.

Step 5: Apply every extracted rule $r$, in order, to the new corpus EVC after this corpus has been POS-tagged with baseline tags in the same way as in the training period.

* Convergence of the algorithm: call $e_k$ the number of errors (the differences between the tagging result after applying rule $r$ and the correct tags in the golden corpus at iteration $k$). We have $e_{k+1} = e_k - Score(r)$; since $Score(r) > 0$, $e_{k+1} < e_k$ for all $k$, and since $e_k \in \mathbb{N}$, the algorithm converges after a finite number of steps.

* Complexity of the algorithm: $O(n \cdot t \cdot c)$, where $n$ is the size of the training set (number of words); $t$ is the size of the set of possible transformation rules (number of candidate rules); and $c$ is the number of corpus positions at which a rule's applying condition (predicate $\pi$) is satisfied.

[Figure 2. Flowchart of TBL-algorithm in POS-tagger for EVC corpus. Recoverable box labels: word-aligned bilingual SUSANNE corpus; remove POS-tags; unannotated corpus; corresponding Vietnamese POS-tags; Brown POS-tagger (baseline tagger); templates; candidate transformation rules; current annotated corpus; corpus annotated by candidate rules; compare and evaluate; mark > β (Y/N); optimal rules; sequence of optimal rules; End.]

4.3 Experiment and Results of Bootstrapped English POS-Tagger

After the training period, this system extracts an ordered sequence of optimal transformation rules in the following format, for example:

\[ ((tag_{-1} = TO) \wedge (tag_0 = NN)) \Rightarrow tag_0 \leftarrow VB \]
\[ ((Word_0 = \text{"can"}) \wedge (VTag_0 = MD) \wedge (tag_0 = VB)) \Rightarrow tag_0 \leftarrow MD \]
\[ ((\exists i \in [-3, -1] \mid Tag_i = MD) \wedge (tag_0 = VBP)) \Rightarrow tag_0 \leftarrow VB \]

These rules are intuitive and easy for human beings to understand. For example, the 2nd rule is understood as follows: “if the POS-tag of the current word is VB (Verb) and its word-form is “can” and its corresponding Vietnamese word-tag is MD (Modal), then the POS-tag of the current word will be changed into MD”.

We have experimented with this method on the EVC corpus with the SUSANNE training corpus. To evaluate this method, we held back a 6,000-word part of the training corpus (which was not used in the training period), and we achieved the following POS-tagging results:

Step                                          Correct tags  Incorrect tags  Precision
Baseline tagging (Brown POS-tagger)                   5724             276      95.4%
TBL-POS-tagger (bootstrapping by
corresponding Vietnamese POS-tag)                     5850             150      97.5%

Table 2. The result of Bootstrapped POS-tagger for English side in EVC.

It is thanks to exploiting the information of the corresponding Vietnamese POS that the English POS-tagging results are improved. If we use only the available English information, it is very difficult for us to improve the output of the Brown POS-tagger. Despite the POS-tagging improvement, the results can hardly be said to be fully satisfactory, due to the following reasons:

* The accuracy of the automatic word-alignment is only 87% (Dinh Dien et al., 2002a).
* It is not always true that the use of Vietnamese POS-information is effective enough to disambiguate the POS of English words (please refer to table 3).

From the statistics in table 3 below, the information of Vietnamese POS-tags can be characterized as follows:

- Cases 1, 2, 3, 4: no need for any disambiguation of English POS-tags.
- Cases 5, 7: full disambiguation of English POS-tags (the majority).
- Cases 6, 8, 9: partial disambiguation of English POS-tags by the TBL method.

No.  English POS-tags      Vietnamese POS-tags   Contrast English vs. Vietnamese POS-tags  Percent (%)
1.   One POS-tag only      One POS-tag only      Two POS-tags are identical                       25.2
2.   One POS-tag only      One POS-tag only      Two POS-tags are different                        1.2
3.   One POS-tag only      More than 1 POS-tag   One common POS-tag only                           5.3
4.   One POS-tag only      More than 1 POS-tag   No common POS-tag                                 3.5
5.   More than 1 POS-tag   One POS-tag only      One common POS-tag only                          50.5
6.   More than 1 POS-tag   One POS-tag only      No common POS-tag                                 2.8
7.   More than 1 POS-tag   More than 1 POS-tag   One common POS-tag only                           6.1
8.   More than 1 POS-tag   More than 1 POS-tag   More than 1 common POS-tag                        4.1
9.   More than 1 POS-tag   More than 1 POS-tag   No common POS-tag                                 1.3

Table 3. Contrast POS-tag of English and Vietnamese in the word-aligned EVC

5 Projecting English POS-Tags to Vietnamese

After obtaining English POS-tag annotations with high precision, we proceed to directly project those POS-tag annotations from the English side to the Vietnamese side. Our solution is motivated by a similar work of David Yarowsky and Grace Ngai (2001). This projection is based on the available word-alignments in the automatically word-aligned English-Vietnamese parallel corpus.

Nevertheless, due to the typological difference between English (an inflecting language) and Vietnamese (an isolating language), direct projection is not a simple 1-1 map but may be a complex m-n map:

- Regarding grammatical meanings, English usually makes use of inflectional facilities, such as suffixes, to express grammatical meanings. For example: -s → plural, -ed → past, -ing → continuous, ’s → possessive case, etc. Vietnamese, in contrast, often makes use of function words and word order. For example: “các”, “những” → plural, “đã” → past, “đang” → continuous, “của” → possessive case, etc.
- Regarding lexicalization, some words in English must be represented by a phrase in Vietnamese and vice-versa. For example: “cow” and “ox” in English are rephrased into the two-word phrases “bò cái” (female one) and “bò đực” (male one) in Vietnamese; or “nghé” in Vietnamese is rephrased into the two words “buffalo calf” in English.

The result of projecting is shown in table 4 below.

In addition, the tagsets of the two languages are different. Due to the characteristics of each language, we must use two different tagsets for POS-tagging. For English, we made use of the available POS-tagset of the Penn TreeBank, while for Vietnamese, we made use of the POS-tagset in the standard Vietnamese dictionary of Hoang Phe (1998) and other new tags. So, we must have an English-Vietnamese consensus tagset map (please refer to Appendix A).

English     Jet       planes        fly  about   nine  miles  high
E-tag       NN        NNS           VBP  IN      CD    NNS    RB
Vietnamese  phản lực  (các) phi cơ  bay  khoảng  chín  dặm    cao
V-tag       N         N             V    IN      CD    N      R

Table 4. An example of English POS-tagging in parallel corpus EVC

Regarding the evaluation of POS-tag projections, because so far there has been no POS-annotated corpus available for Vietnamese, we had to manually build a small golden corpus for Vietnamese POS-tagging with approximately 1000 words for evaluation. The results of Vietnamese POS-tagging are shown in table 5 below:

Method                                        Correct tags  Incorrect tags  Precision
Baseline tagging (use information of
POS-tag in dictionary)                                 823             177      82.3%
Projecting from English side in EVC                    946              54      94.6%

Table 5. The result of projecting POS-tags from English side to Vietnamese in EVC.

6 Conclusion

We have presented POS-tagging for an automatically word-aligned English-Vietnamese parallel corpus by POS-tagging the English words first and then projecting the tags to the Vietnamese side. The English POS-tagging is done in 2 steps: the basic tagging step is achieved through the available POS-tagger (Brown), and the correction step is achieved through the TBL learning method, in which the information on the corresponding Vietnamese words is used through the available word-alignments in the EVC.

The result of POS-tagging of Vietnamese in the English-Vietnamese bilingual corpus plays a meaningful role in the automatic building of a training corpus for Vietnamese processors that need parts of speech (such as Vietnamese POS-taggers, Vietnamese parsers, etc.). By making use of the typological differences between the languages and of the word-alignments in the bilingual corpus for mutual disambiguation, we are still able to improve on the English POS-tagging results of the currently powerful English POS-taggers.

Currently, we are improving the speed of the training period by using the Fast TBL algorithm instead of the original TBL.

In the future, we will improve this serial POS-tagging into parallel POS-tagging for both English and Vietnamese simultaneously, after we obtain the exact Vietnamese POS-tags in the parallel corpus of SUSANNE.

Acknowledgements

We would like to thank Prof. Eduard Hovy (ISI/USC, USA) for his guidance as external advisor on this research.

References

E. Brill. 1993. A Corpus-based approach to Language Learning, PhD-thesis, Pennsylvania Uni., USA.

E. Brill. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, 21(4), pp. 543-565.

E. Brill. 1997. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. In Natural Language Processing Using Very Large Corpora. Kluwer Academic Press.

J. Curran. 1999. Transformation-Based Learning in Shallow Natural Language Processing, Honours Thesis, Basser Department of Computer Science, University of Sydney, Sydney, Australia.

E. Charniak. 1997. Statistical parsing with a context-free grammar and word statistics, in Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI Press/MIT Press, Menlo Park.

I. Dagan, I.Alon, and S.Ulrike. 1991. Two languages are more informative than one. In Proceedings of the 29th Annual ACL, Berkeley, CA, pp. 130-137.

W. Daelemans, J. Zavrel, P. Berck, S. Gillis. 1996. MTB: A Memory-Based Part-of-Speech Tagger Generator. In Proceedings of 4th Workshop on Very Large Corpora, Copenhagen.

D. Dien, H. Kiem, and N.V. Toan. 2001a. Vietnamese Word Segmentation, Proceedings of NLPRS’01 (The 6th Natural Language Processing Pacific Rim Symposium), Tokyo, Japan, 11/2001, pp. 749-756.

D. Dien. 2001b. Building an English-Vietnamese bilingual corpus, Master thesis in Comparative Linguistics, University of Social Sciences and Humanity of HCM City, Vietnam.

D. Dien, H.Kiem, T.Ngan, X.Quang, Q.Hung, P.Hoi, V.Toan. 2002a. Word alignment in English-Vietnamese bilingual corpus, Proceedings of EALPIIT’02, Hanoi, Vietnam, 1/2002, pp. 3-11.

D. Dien, H.Kiem. 2002b. Building a training corpus for word sense disambiguation in the English-to-Vietnamese Machine Translation, Proceedings of Workshop on Machine Translation in Asia, COLING-02, Taiwan, 9/2002, pp. 26-32.

R. Florian, and G. Ngai. 2001a. Transformation-Based Learning in the fast lane, Proceedings of North America ACL-2001.

R. Florian, and G. Ngai. 2001b. Fast Transformation-Based Learning Toolkit. Technical Report.

W. Gale, K.W.Church, and D. Yarowsky. 1992. Using bilingual materials to develop word sense disambiguation methods. In Proceedings of the Int. Conf. on Theoretical and Methodological Issues in MT, pp. 101-112.

H. Phe. 1998. Từ điển tiếng Việt (Vietnamese Dictionary). Center of Lexicography. Da Nang Publisher.

G. Sampson. 1995. English for the Computer: The SUSANNE Corpus and Analytic Scheme, Clarendon Press (Oxford University Press).

H. Schmid. 1994a. Probabilistic POS Tagging using Decision Trees, Proceedings of International Conference on New methods in Language Processing, Manchester, UK.

H. Schmid. 1994b. POS Tagging with Neural Networks, Proceedings of International Conference on Computational Linguistics, Kyoto, Japan, pp. 172-176.

D. Yarowsky and G. Ngai. 2001. Induce, Multilingual POS Tagger and NP bracketer via projection on aligned corpora, Proceedings of NAACL-01.

Appendix A. English-Vietnamese consensus POS-tagset mapping table

English POS                                   Vietnamese POS
CC (Coordinating conjunction)                 CC
CD (Cardinal number)                          CD
DT (Determiner)                               DT
EX (Existential)                              V
FW (Foreign word)                             FW
IN (Preposition)                              IN
JJ (Adjective)                                A
JJR (Adjective, comparative)                  A
JJS (Adjective, superlative)                  A
LS (List item marker)                         LS
MD (Modal)                                    MD
NN (Noun, singular or mass)                   N
NNS (Noun, plural)                            N
NP (Proper noun, singular)                    N
NPS (Proper noun, plural)                     N
PDT (Predeterminer)                           DT
POS (Possessive ending)                       “của”
PP (Personal pronoun)                         P
PP$ (Possessive pronoun)                      “của” P
RB (Adverb)                                   R
RBR (Adverb, comparative)                     R
RBS (Adverb, superlative)                     R
RP (Particle)                                 RP
SYM (Symbol)                                  SYM
TO (''to'')                                   -
UH (Interjection)                             UH
VB (Verb, base form)                          V
VBD (Verb, past tense)                        V
VBG (Verb, gerund or present participle)      V
VBN (Verb, past participle)                   V
VBP (Verb, non-3rd person singular present)   V
VBZ (Verb, 3rd person singular present)       V
WDT (Wh-determiner)                           P
WP (Wh-pronoun)                               P
WP$ (Possessive wh-pronoun)                   “của” P
WRB (Wh-adverb)                               R
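As an illustration of how the consensus mapping in Appendix A might be combined with word-alignments to project tags (section 5), the following is a minimal Python sketch, not the authors' implementation. The names `EN_TO_VI_TAG`, `project_tags` and the `"UNK"` fallback are hypothetical, the dictionary holds only a subset of the appendix rows, and alignments are assumed to be given as simple (English index, Vietnamese index) pairs; complex m-n alignments are not handled beyond "the last aligned English word wins".

```python
# Subset of the Appendix A consensus map (English PTB tag -> Vietnamese tag).
EN_TO_VI_TAG = {
    "CD": "CD", "IN": "IN", "MD": "MD",
    "JJ": "A", "JJR": "A", "JJS": "A",
    "NN": "N", "NNS": "N", "NP": "N", "NPS": "N",
    "RB": "R", "RBR": "R", "RBS": "R",
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
}


def project_tags(en_tags, vi_words, alignment, fallback="UNK"):
    """Project English POS-tags onto Vietnamese words.

    alignment is a list of (english_index, vietnamese_index) pairs.
    Unaligned Vietnamese words keep the (hypothetical) fallback tag.
    """
    vi_tags = [fallback] * len(vi_words)
    for en_i, vi_i in alignment:
        # Simplification: if several English words align to one Vietnamese
        # word, the last pair in the list decides the projected tag.
        vi_tags[vi_i] = EN_TO_VI_TAG.get(en_tags[en_i], fallback)
    return list(zip(vi_words, vi_tags))


# The sentence pair of Table 4, with Vietnamese units listed in the
# column order of that table so that the alignment is position by position.
en_tags = ["NN", "NNS", "VBP", "IN", "CD", "NNS", "RB"]
vi_words = ["phản lực", "phi cơ", "bay", "khoảng", "chín", "dặm", "cao"]
alignment = [(i, i) for i in range(len(en_tags))]

projected = project_tags(en_tags, vi_words, alignment)
for word, tag in projected:
    print(word, tag)
```

On this example the projected tags agree with the V-tag row of Table 4. A real projection step would also need a policy for function words such as “các” and for the m-n cases discussed in section 5.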

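The greedy training loop of section 4.2, with the rule score of section 4.1 and the bilingual predicate element VTag0, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the fnTBL toolkit or the authors' code: the names `Token`, `Rule`, `apply_rule`, `rule_score` and `train` are hypothetical, candidate rules are listed by hand instead of being generated from templates, the toy baseline and gold tags are invented for the example, and each rule's predicate is evaluated on the tags as they stood before that rule fired.

```python
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    word: str
    vtag: str  # tag of the aligned Vietnamese word (VTag0)


class Rule(NamedTuple):
    name: str
    predicate: Callable[[List[Token], List[str], int], bool]
    new_tag: str


def apply_rule(rule, tokens, tags):
    """Return new tags: positions where the predicate holds get rule.new_tag."""
    out = list(tags)
    for i in range(len(tokens)):
        if rule.predicate(tokens, tags, i):  # evaluated on the old tags
            out[i] = rule.new_tag
    return out


def count_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))


def rule_score(rule, tokens, tags, gold):
    """Score(r): correct tags after applying r minus correct tags before."""
    return count_correct(apply_rule(rule, tokens, tags), gold) - count_correct(tags, gold)


def train(tokens, baseline, gold, candidates, beta=0):
    """Greedily pick the best rule until no rule scores above beta."""
    tags, learned = list(baseline), []
    while True:
        best = max(candidates, key=lambda r: rule_score(r, tokens, tags, gold), default=None)
        if best is None or rule_score(best, tokens, tags, gold) <= beta:
            return learned, tags
        tags = apply_rule(best, tokens, tags)
        learned.append(best)


# Toy data for "I can can a can" (tags chosen for illustration only).
tokens = [Token("I", "P"), Token("can", "MD"), Token("can", "V"),
          Token("a", "DT"), Token("can", "N")]
baseline = ["PP", "VB", "VB", "DT", "NN"]
gold = ["PP", "MD", "VB", "DT", "NN"]

candidates = [
    # (tag_-1 = TO) and (tag_0 = NN)  =>  tag_0 <- VB
    Rule("TO+NN->VB",
         lambda tok, tg, i: i > 0 and tg[i - 1] == "TO" and tg[i] == "NN", "VB"),
    # (Word_0 = "can") and (VTag_0 = MD) and (tag_0 = VB)  =>  tag_0 <- MD
    Rule("can+VTag=MD",
         lambda tok, tg, i: tok[i].word == "can" and tok[i].vtag == "MD" and tg[i] == "VB", "MD"),
]

learned, final_tags = train(tokens, baseline, gold, candidates)
print([r.name for r in learned], final_tags)
```

In this toy run only the bilingual rule improves the baseline, which mirrors the argument of section 4 that the aligned Vietnamese tag supplies information the English context alone does not. The loop rescans every candidate at each iteration, so it makes no attempt at the speed-ups of fTBL.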