POS-Tagger for English-Vietnamese Bilingual Corpus
HLT-NAACL 2003 Workshop: Building and Using Parallel Texts
Data Driven Machine Translation and Beyond, pp. 88-95
Edmonton, May-June 2003
Dinh Dien
Information Technology Faculty of Vietnam National University of HCMC,
20/C2 Hoang Hoa Tham, Ward 12, Tan Binh Dist., HCM City, Vietnam
[email protected]

Hoang Kiem
Center of Information Technology Development of Vietnam National University of HCMC,
227 Nguyen Van Cu, District 5, HCM City
[email protected]
Abstract

Corpus-based Natural Language Processing (NLP) tasks for such popular languages as English, French, etc. have been well studied, with satisfactory achievements. In contrast, corpus-based NLP tasks for less widely studied languages (e.g. Vietnamese) are at a deadlock because annotated training data for these languages are absent. Furthermore, hand-annotation of even reasonably well-determined features such as part-of-speech (POS) tags has proved to be labor-intensive and costly. In this paper, we suggest a solution that partially overcomes the shortage of annotated resources in Vietnamese by building a POS-tagger for an automatically word-aligned English-Vietnamese parallel corpus (named EVC). This POS-tagger uses the Transformation-Based Learning (TBL) method to bootstrap the POS-annotation results of the English POS-tagger by exploiting the POS information of the corresponding Vietnamese words via their word-alignments in EVC. Then, we directly project the POS annotations from the English side to the Vietnamese side via the available word alignments. This POS-annotated Vietnamese corpus will be manually corrected to become annotated training data for Vietnamese NLP tasks such as POS-tagging, phrase chunking, parsing, word-sense disambiguation, etc.

1 Introduction

POS-tagging is the task of assigning to each word of a text the proper POS tag in its context of appearance. Although each word can be classified under various POS tags, in a given context it can only be attributed a single POS. As an example, for the sentence "I can can a can", the POS-tagger must be able to produce the following: "I/PRO can/AUX can/V a/DET can/N".

To perform POS-tagging, various methods can be used, such as Hidden Markov Models (HMM); memory-based models (Daelemans, 1996); Transformation-Based Learning (TBL) (Brill, 1995); Maximum Entropy; decision trees (Schmid, 1994a); neural networks (Schmid, 1994b); and so on. Among them, the methods based on machine learning in general, and TBL in particular, have proved effective and are very popular at present.

To achieve good results, the above-mentioned methods must be supplied with accurately annotated training corpora. Such training corpora are available for popular languages such as English and French (e.g. Penn Tree Bank, SUSANNE, etc.). Unfortunately, so far no such annotated training data have been available for Vietnamese POS-taggers. Furthermore, building manually annotated training data is very expensive (for example, over 1 million dollars and many person-years were invested in the Penn Tree Bank). To overcome this drawback, this paper presents a solution for indirectly building such an annotated training corpus for Vietnamese by taking advantage of an available English-Vietnamese bilingual corpus named EVC (Dinh Dien, 2001b). This EVC has been automatically word-aligned (Dinh Dien et al., 2002a). Our approach in this work is to use a bootstrapped POS tagger for English to annotate the English side of a word-aligned parallel corpus, then directly project the tag annotations to the second language (Vietnamese) via existing word-alignments (Yarowsky and Ngai, 2001). In this work, we made use of the TBL method and the SUSANNE training corpus to train our English POS-tagger. The remainder of this paper is organized as follows:

• POS-Tagging by TBL method: an introduction to the original TBL, the improved fTBL, and the traditional English POS-tagger by TBL.
• English-Vietnamese bilingual Corpus (EVC): resources of EVC, word-alignment of EVC.
• Bootstrapping English POS-Tagger: bootstrapping the English POS-tagger by the POS-tags of the corresponding Vietnamese words, and its evaluation.
• Projecting English POS-tag annotations to the Vietnamese side, and its evaluation.
• Conclusion: conclusions, limitations and future developments.
2 POS-Tagging by TBL method

Transformation-Based Learning (TBL) was proposed by Eric Brill in 1993 in his doctoral dissertation (Brill, 1993), on the foundation of the structural linguistics of Z.S. Harris. TBL has been applied with success to various natural language processing tasks (mainly classification tasks). In 2001, Radu Florian and Grace Ngai proposed fast Transformation-Based Learning (fTBL) (Florian and Ngai, 2001a) to improve the learning speed of TBL without affecting the accuracy of the original algorithm.

The central idea of TBL is to start with some simple (or sophisticated) solution to the problem (called baseline tagging) and, step by step, apply optimal transformation rules (which are extracted from an annotated training corpus at each step) to improve the solution (changing incorrect tags into correct ones). The algorithm stops when no more optimal transformation rule can be selected or the data is exhausted. The optimal transformation rule is the one that yields the largest benefit (it repairs as many incorrect tags as possible).

A striking characteristic of TBL in comparison with other learning methods is that it is transparent and symbolic: linguists are able to observe and intervene in the whole learning and application process, as well as in the intermediate and final results. Besides, TBL allows the tagging results of another system to be inherited (as the baseline or initial tagging) and then corrected by the transformation rules learned during the training period.

TBL operates by applying transformation rules in order to change wrong tags into right ones. All these rules obey templates specified by humans. In these templates, we need to specify the factors affecting the tagging. In order to evaluate candidate transformation rules, TBL needs an annotated training corpus (a corpus to which the correct tags have been attached, usually referred to as the golden corpus), so that the result of the current tagging can be compared with the correct tags in the training corpus. In the executing period, the optimal rules are used for tagging new corpora (in their sorted order), and these new corpora must also be assigned baseline tags in the same way as in the training period. The linguistic annotation tags can be morphological ones (sentence boundary, word boundary), POS tags, syntactic tags (phrase chunks), sense tags, grammatical relation tags, etc.

POS-tagging was the first application of TBL and is the most popular one; it has been extended to various languages (e.g. Korean, Spanish, German, etc.) (Curran, 1999). The approach of the TBL POS-tagger is simple but effective, and it reaches an accuracy competitive with other powerful POS-taggers. The TBL algorithm for POS-tagging can be briefly described in two periods as follows:

* The training period:
  - Starting with the annotated training corpus (also called the golden corpus, which has been assigned correct POS tag annotations), TBL copies this golden corpus into a new unannotated corpus (called the current corpus, from which the POS tag annotations have been removed).
  - TBL assigns an initial POS-tag to each word in the corpus. This initial tag is the most likely tag for the word if the word is known, and is guessed from properties of the word if the word is not known.
  - TBL applies each instance of each candidate rule (following the format of the templates designed by humans) to the current corpus. These rules change the POS tags of words based upon the contexts they appear in. TBL evaluates the result of applying each candidate rule by comparing the current POS-tag annotations with those of the golden corpus, in order to choose the best rule, i.e. the one with the highest score. Best rules are extracted repeatedly until there is no more optimal rule (no rule scores higher than a preset threshold). These optimal rules form an ordered sequence.
* The executing period:
  - Starting with new unannotated text, TBL assigns an initial POS-tag to each word in the text in a way similar to that of the training period.
  - The sequence of optimal rules (extracted in the training period) is applied, changing the POS tag annotations based upon the contexts they appear in. These rules are applied deterministically in the order they appear in the sequence.

In addition to the above-mentioned TBL algorithm, which is applied in the supervised POS-tagger, Brill (1997) also presented an unsupervised POS-tagger that is trained on unannotated corpora. The accuracy of the unsupervised POS-tagger was reported to be lower than that of the supervised POS-tagger.

Because the goal of our work is to build POS-tag annotated training data for Vietnamese, we need an annotated corpus with the highest possible accuracy. We will therefore concentrate on the supervised POS-tagger only. For full details of TBL and fTBL, please refer to Eric Brill (1993, 1995) and Radu Florian and Grace Ngai (2001a).
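The executing period described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation or the fnTBL toolkit: the names `Rule` and `apply_rules` are hypothetical, the two rules are hand-written rather than learned, and each rule is evaluated against a snapshot of the tags taken before that rule fires (one of several possible application conventions).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A predicate looks at the words, the current tags and a position.
Predicate = Callable[[Sequence[str], Sequence[str], int], bool]


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule container: if predicate holds at i, tag[i] becomes new_tag."""
    predicate: Predicate
    new_tag: str


def apply_rules(words: Sequence[str],
                baseline_tags: Sequence[str],
                rules: Sequence[Rule]) -> List[str]:
    """Apply an ordered rule sequence deterministically to baseline tags."""
    tags = list(baseline_tags)
    for rule in rules:
        # Assumption: a rule sees the tags as they were before it started,
        # so its own changes do not feed back within the same pass.
        snapshot = tuple(tags)
        for i in range(len(words)):
            if rule.predicate(words, snapshot, i):
                tags[i] = rule.new_tag
    return tags


# Hand-written example rules (not learned from data).
RULES = [
    # Previous tag is TO and current tag is NN  =>  VB
    Rule(lambda w, t, i: i > 0 and t[i - 1] == "TO" and t[i] == "NN", "VB"),
    # A modal within the three preceding positions and current tag VBP  =>  VB
    Rule(lambda w, t, i: t[i] == "VBP"
         and any(t[j] == "MD" for j in range(max(0, i - 3), i)), "VB"),
]

tagged_a = apply_rules(["I", "want", "to", "fly"], ["PP", "VBP", "TO", "NN"], RULES)
tagged_b = apply_rules(["We", "can", "fly"], ["PP", "MD", "VBP"], RULES)
```

Because the rules are applied strictly in sequence, reordering `RULES` can change the output, which is why the training period must record the order in which rules were selected.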
3 English – Vietnamese Bilingual Corpus

The bilingual corpus that needs POS-tagging in this paper is named EVC (English – Vietnamese Corpus). This corpus was collected from many different sources of bilingual texts (such as books, dictionaries, corpora, etc.) in selected fields such as science, technology and daily conversation (see table 1). After the bilingual texts were collected from the different sources, this parallel corpus was normalized with respect to form (text-only), tone marks (diacritics), Vietnamese character code (TCVN-3), character font (VN-Times), etc. Next, the corpus was sentence-aligned and spell-checked semi-automatically. An example from the unannotated EVC is as follows:

*D02:01323: Jet planes fly about nine miles high.
+D02:01323: Các phi cơ phản lực bay cao khoảng chín dặm.

Here, the codes at the beginning of each line refer to the corresponding sentence in the EVC corpus. For full details of building this EVC corpus (e.g. collecting, normalizing, sentence alignment, spelling checker, etc.), please refer to Dinh Dien (2001b).

Next, this bilingual corpus was automatically word-aligned by a hybrid model combining the semantic class-based model with the GIZA++ model. An example of the word-alignment result is shown in figure 1 below. The accuracy of word-alignment of this parallel corpus has been reported to be approximately 87% in (Dinh Dien et al., 2002b). For full details of the word alignment of this EVC corpus (precision, recall, coverage, etc.), please refer to (Dinh Dien et al., 2002a).

The resulting word-aligned parallel corpus has been used in various Vietnamese NLP tasks, such as training the Vietnamese word segmenter (Dinh Dien et al., 2001a), word sense disambiguation (Dinh Dien, 2002b), etc.

Remarkably, this EVC includes the SUSANNE corpus (Sampson, 1995), a golden corpus that has been manually annotated with such necessary English linguistic annotations as lemmas, POS tags, chunking tags, syntactic trees, etc. This English corpus has been translated into Vietnamese by English teachers of the Foreign Language Department of Vietnam University of HCM City. In this paper, we will make use of this valuable annotated corpus as the training corpus for our bootstrapped English POS-tagger.
No.  Resources                   Number of   Number of    Number of      Length     Percent
                                 pairs of    English      Vietnamese     (English   (words/
                                 sentences   words        morpho-words   words)     EVC)
1.   Computer books                  9,475      165,042        239,984     17.42       7.67
2.   LLOCE dictionary               33,078      312,655        410,760      9.45      14.53
3.   EV bilingual dictionaries     174,906    1,110,003      1,460,010      6.35      51.58
4.   SUSANNE corpus                  6,269      131,500        181,781     20.98       6.11
5.   Electronics books              12,120      226,953        297,920     18.73      10.55
6.   Children’s Encyclopedia         4,953       79,927        101,023     16.14       3.71
7.   Other books                     9,210      126,060        160,585     13.69       5.86
     Total                         250,011    2,152,140      2,852,063      8.59     100%
Table 1. Resources of EVC corpus
Jet planes fly about nine miles high
Các phi cơ phản lực bay cao khoảng chín dặm
Figure 1. An example of a word-aligned pair of sentences in EVC corpus
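As an illustration of the sentence-pair format shown in the unannotated EVC example of this section, the following Python sketch groups lines into English-Vietnamese pairs by their sentence code. It rests on assumptions read off that single example: that "*" marks the English line, "+" marks the Vietnamese line, and the code has the form of two colon-separated fields. The name `read_pairs` is hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed line shape: marker, code of the form "D02:01323", colon, sentence text.
LINE_RE = re.compile(r"^([*+])([^:]+:[^:]+):\s*(.*)$")


def read_pairs(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Return {sentence code: (English sentence, Vietnamese sentence)}."""
    english: Dict[str, str] = {}
    vietnamese: Dict[str, str] = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        marker, code, text = match.groups()
        # Assumption: "*" is the English side, "+" the Vietnamese side.
        target = english if marker == "*" else vietnamese
        target[code] = text
    # Keep only codes for which both sides are present.
    return {code: (english[code], vietnamese[code])
            for code in english if code in vietnamese}


sample = [
    "*D02:01323: Jet planes fly about nine miles high.",
    "+D02:01323: Các phi cơ phản lực bay cao khoảng chín dặm.",
]
pairs = read_pairs(sample)
```

Keying both sides on the shared code keeps the pairing robust even if the two lines of a pair are not adjacent in the file.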
4 Our Bootstrapped English POS-Tagger

So far, existing POS-taggers for (monolingual) English have been well developed, with satisfactory achievements, and it is very difficult (nearly impossible for us) to improve their results. Indeed, those advanced POS-taggers have exhaustively exploited the linguistic information in English texts, and we see no way to improve English POS-tagging from monolingual English texts alone. By contrast, in bilingual texts we are able to make use of the second language's linguistic information in order to improve the POS-tag annotations of the first language.

Our solution is motivated by I. Dagan, I. Alon and S. Ulrike (1991) and W. Gale, K. Church and D. Yarowsky (1992). They proposed the use of bilingual corpora to avoid hand-tagging of training data. Their premise is that "different senses of a given word often translate differently in another language (for example, pen in English is stylo in French for its writing implement sense, and enclos for its enclosure sense). By using a parallel aligned corpus, the translation of each occurrence of a word such as pen can be used to automatically determine its sense". This remark holds not only for word senses but also for POS tags, and it holds even more strongly for typologically different languages such as English and Vietnamese.

In fact, the POS-tag annotations of English words as well as of Vietnamese words are often ambiguous, but their ambiguities often do not coincide (table 4). For example, "can" in English may be "Aux" in its ability sense, "V" in its "to make a container" sense, and "N" in its "a container" sense, and there is hardly any existing POS-tagger that can tag the word "can" correctly in all contexts. Nevertheless, if that "can" is already word-aligned with a corresponding Vietnamese word, it can easily be POS-disambiguated by the Vietnamese word's POS-tags. For example, if "can" is aligned with "có thể", it must be an auxiliary; if it is aligned with "đóng hộp", it must be a verb; and if it is aligned with "cái hộp", it must be a noun.

However, not all Vietnamese POS-tag information is useful and deterministic. The big question here is when and how to make use of the Vietnamese POS-tag information. Our answer is to train this English POS-tagger by the TBL method (section 2) with the SUSANNE training corpus (section 3). After training, we extract an ordered sequence of optimal transformation rules. We use these rules to improve the output of an existing English POS-tagger (used as the baseline tagger) when tagging the words of the English side of the word-aligned EVC corpus. This English POS-tagging result is then projected to the Vietnamese side via the word-alignments in order to form a new Vietnamese training corpus annotated with POS-tags.

4.1 The English POS-Tagger by TBL method

To make the presentation clearer, we re-use the notation of the introduction to the fnTBL toolkit of Radu Florian and Grace Ngai (2001b), as follows:

• $\chi$: the space of samples, i.e. the set of words which need POS-tagging. In English, it is simple to recognize word boundaries, but in Vietnamese (an isolating language) it is rather complicated; that problem has therefore been presented in another work (Dinh Dien, 2001a).

• $C$: the set of possible POS classifications $c$ (the tagset), for example noun (N), verb (V), adjective (A), ... For English, we made use of the Penn TreeBank tagset, and for Vietnamese we use the tagset of the POS-tagset mapping table (see appendix A).

• $S = \chi \times C$: the space of states, i.e. the cross-product of the sample space (words) and the classification space (tagset), where each point is a pair (word, tag).

• $\pi$: a predicate defined on the space $S^{+}$, i.e. on sequences of states. The predicate $\pi$ follows the specified templates of transformation rules. In the POS-tagger for English, this predicate consists only of English factors which affect the POS-tagging process, for example

$$\exists i \in [-m, +n] \mid Word_i \quad \text{or} \quad \exists i \in [-m, +n] \mid Tag_i \quad \text{or} \quad \exists i \in [-m, +n] \mid Word_i \wedge Tag_j$$

where $Word_i$ is the word form of the $i$-th word from the current word. Negative values of $i$ refer to preceding words (to its left) and positive values to following words (to its right), as in the rule examples, where $tag_{-1}$ is the tag of the previous word; $i$ ranges within the window from $-m$ to $+n$. In this English-Vietnamese bilingual POS-tagger, we add new elements, including $VTag_0$ and $\exists VTag_0$, to those predicates. $VTag_0$ is the Vietnamese POS-tag corresponding to the current English word via its word-alignment. These Vietnamese POS-tags are determined by the most frequent tag according to the Vietnamese dictionary.

• A rule $r$ is defined as a pair $(\pi, c')$ consisting of a predicate $\pi$ and a tag $c'$. Rule $r$ is written in the form $\pi \Rightarrow c'$. This means that the rule $r = (\pi, c')$ is applied to the sample $x$ if the predicate $\pi$ is satisfied on it, in which case the tag of $x$ is changed to the new tag $c'$.

• Given a state $s = (x, c)$ and a rule $r = (\pi, c')$, the result state $r(s)$, obtained by applying rule $r$ to $s$, is defined as:

$$r(s) = \begin{cases} s & \text{if } \pi(s) = \text{False} \\ (x, c') & \text{if } \pi(s) = \text{True} \end{cases}$$
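To make the role of the aligned Vietnamese tag concrete, here is a small Python sketch of a single rule applied to a single state, using the "can" example discussed in this section. It is a simplified illustration, not the authors' code: the predicate inspects one state only (rather than a sequence of states), the class names `State` and `Rule` are hypothetical, and the Vietnamese tags assigned to "có thể" (MD) and "đóng hộp" (V) are assumed dictionary tags chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class State:
    """One sample with its current English tag and aligned Vietnamese tag."""
    word: str             # English word x
    tag: str              # current English POS tag c
    vtag: Optional[str]   # tag of the aligned Vietnamese word, None if unaligned


@dataclass(frozen=True)
class Rule:
    """A (predicate, new tag) pair: the state is retagged when the predicate holds."""
    predicate: Callable[[State], bool]
    new_tag: str

    def apply(self, s: State) -> State:
        # Unchanged state if the predicate is false, retagged copy otherwise.
        return replace(s, tag=self.new_tag) if self.predicate(s) else s


# "can" currently tagged VB, aligned Vietnamese word tagged MD  =>  MD
can_is_modal = Rule(
    predicate=lambda s: s.word.lower() == "can" and s.vtag == "MD" and s.tag == "VB",
    new_tag="MD",
)

aligned_with_co_the = State(word="can", tag="VB", vtag="MD")    # assumed tag for "có thể"
aligned_with_dong_hop = State(word="can", tag="VB", vtag="V")   # assumed tag for "đóng hộp"

fixed = can_is_modal.apply(aligned_with_co_the)
untouched = can_is_modal.apply(aligned_with_dong_hop)
```

Only the occurrence whose aligned Vietnamese word carries the modal tag is changed; the occurrence aligned with a verb keeps its English verb tag.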
• $T$: the set of training samples, which have been assigned the correct tags. Here we made use of the SUSANNE golden corpus (Sampson, 1995), whose POS-tagset was converted into the PTB tagset.

• The score associated with a rule $r$ is usually the difference in performance (on the training data) that results from applying the rule, as follows:

$$Score(r) = \sum_{s \in T} score(r(s)) - \sum_{s \in T} score(s)$$

$$score((x, c)) = \begin{cases} 1 & \text{if } c = True(x) \\ 0 & \text{if } c \neq True(x) \end{cases}$$

Here $True(x)$ is read as the correct tag of the sample $x$ in the golden corpus.

4.2 The TBL algorithm for POS-Tagging

The TBL algorithm for POS-tagging can be briefly described as follows (see the flowchart in figure 2):

Step 1: Baseline tagging: initialize each sample $x$ in the SUSANNE training data with its most likely POS-tag $c$. For English, we made use of the available English tagger (and parser) of Eugene Charniak (1997) at Brown University (version 2001). For Vietnamese, it is the set of possible part-of-speech tags (ordered by the probability of appearance of each part of speech in the dictionary). We call the starting training data $T_0$.

Step 2: Consider all the transformations (rules) $r$ applicable to the training data $T_k$ at iteration $k$, choose the one with the highest $Score(r)$, and apply it to the training data to obtain the new corpus $T_{k+1}$. We have: $T_{k+1} = r(T_k) = \{ r(s) \mid s \in T_k \}$. If there is no remaining transformation rule that satisfies $Score(r) > \beta$, the algorithm stops. $\beta$ is a threshold, which is preset and adjusted according to the situation in practice.

Step 3: $k = k + 1$.

Step 4: Repeat from step 2.

Step 5: Apply every extracted rule $r$, in the order of extraction, to the new EVC corpus after this corpus has been POS-tagged with baseline tags in the same way as in the training period.

* Convergence of the algorithm: let $e_k$ be the number of errors (the differences between the tagging result after applying rule $r$ and the correct tags in the golden corpus) at iteration $k$. We have $e_{k+1} = e_k - Score(r)$; since $Score(r) > 0$, $e_{k+1} < e_k$ for all $k$, and since $e_k \in \mathbb{N}$, the algorithm converges after a finite number of steps.
* Complexity of the algorithm: O(n*t*c), where n is the size of the training set (number of words); t is the size of the set of possible transformation rules (number of candidate rules); and c is the size of the part of the corpus that satisfies the rule's application condition (number of occurrences satisfying the predicate $\pi$).

4.3 Experiment and Results of Bootstrapped English POS-Tagger

After the training period, the system extracts an ordered sequence of optimal transformation rules in the following format, for example:

$$((tag_{-1} = \text{TO}) \wedge (tag_0 = \text{NN})) \Rightarrow tag_0 \leftarrow \text{VB}$$

$$((Word_0 = \text{"can"}) \wedge (VTag_0 = \text{MD}) \wedge (tag_0 = \text{VB})) \Rightarrow tag_0 \leftarrow \text{MD}$$

$$((\exists i \in [-3, -1] \mid Tag_i = \text{MD}) \wedge (tag_0 = \text{VBP})) \Rightarrow tag_0 \leftarrow \text{VB}$$

These rules are intuitive and easy for humans to understand. For example, the 2nd rule reads as follows: "if the POS-tag of the current word is VB (Verb), its word form is "can", and the tag of its corresponding Vietnamese word is MD (Modal), then the POS-tag of the current word is changed to MD".

We have experimented with this method on the EVC corpus with the SUSANNE training corpus. To evaluate the method, we held back a 6,000-word part of the training corpus (which was not used in the training period) and obtained the following POS-tagging results:

Step                                      Correct tags   Incorrect tags   Precision
Baseline tagging (Brown POS-tagger)           5724            276           95.4%
TBL-POS-tagger (bootstrapping by              5850            150           97.5%
corresponding Vietnamese POS-tag)
Table 2. The result of Bootstrapped POS-tagger for English side in EVC.

It is thanks to exploiting the information of the corresponding Vietnamese POS tags that the English POS-tagging results are improved. If we use only the available English information, it is very difficult for us to improve the output of the Brown POS-tagger. Despite the POS-tagging improvement, the results can hardly be said to be fully satisfactory, for the following reasons:
* The accuracy of the automatic word-alignment is only 87% (Dinh Dien et al., 2002a).
* The Vietnamese POS information is not always sufficient to disambiguate the POS of English words (please refer to table 3).

From the statistics in table 3 below, the contribution of the Vietnamese POS-tags can be summarized as follows:
- Cases 1, 2, 3, 4: no disambiguation of English POS-tags is needed.
- Cases 5, 7: full disambiguation of English POS-tags (the majority).
- Cases 6, 8, 9: partial disambiguation of English POS-tags by the TBL method.
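The greedy training loop of section 4.2, together with the rule score of section 4.1, can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' system or the fnTBL toolkit: candidate rules are supplied by hand instead of being instantiated from templates, the whole corpus is a single sentence, and the names `train_tbl`, `apply_rule` and `n_correct` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Predicate = Callable[[Sequence[str], Sequence[str], int], bool]
Rule = Tuple[str, Predicate, str]  # (readable name, predicate, new tag)


def apply_rule(rule: Rule, words: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Return a retagged copy; the predicate reads the tags before the change."""
    _, predicate, new_tag = rule
    return [new_tag if predicate(words, tags, i) else tags[i]
            for i in range(len(words))]


def n_correct(tags: Sequence[str], gold: Sequence[str]) -> int:
    """Number of positions whose tag agrees with the golden corpus."""
    return sum(t == g for t, g in zip(tags, gold))


def train_tbl(words: Sequence[str], baseline: Sequence[str], gold: Sequence[str],
              candidates: Sequence[Rule], beta: int = 0) -> Tuple[List[str], List[str]]:
    """Greedily pick the best-scoring rule until no rule scores above beta."""
    if beta < 0:
        raise ValueError("beta must be non-negative, otherwise the loop may not end")
    tags = list(baseline)
    learned: List[str] = []
    while candidates:
        current = n_correct(tags, gold)
        # Score(r): gain in correct tags obtained by applying r to the current corpus.
        scored = [(n_correct(apply_rule(r, words, tags), gold) - current, r)
                  for r in candidates]
        best_score, best_rule = max(scored, key=lambda pair: pair[0])
        if best_score <= beta:
            break  # no remaining rule satisfies Score(r) > beta
        tags = apply_rule(best_rule, words, tags)
        learned.append(best_rule[0])
    return learned, tags


words = ["I", "want", "to", "fly"]
baseline = ["PP", "VBP", "TO", "NN"]
gold = ["PP", "VBP", "TO", "VB"]
candidates: List[Rule] = [
    ("TO+NN=>VB", lambda w, t, i: i > 0 and t[i - 1] == "TO" and t[i] == "NN", "VB"),
    ("VBP=>VB", lambda w, t, i: t[i] == "VBP", "VB"),  # harmful on this data
]
learned, final_tags = train_tbl(words, baseline, gold, candidates)
```

Each accepted rule strictly increases the number of correct tags, which is bounded by the corpus size, so the loop ends after finitely many iterations; this mirrors the convergence argument given for the algorithm.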
[Flowchart not reproduced. Box labels: word-aligned bilingual SUSANNE corpus; remove POS-tags; unannotated corpus; Brown POS-tagger (baseline tagger); Vietnamese corresponding POS-tags; templates; current annotated corpus; candidate transformation rules; corpus annotated by candidate rules; compare & evaluate; mark > β (Y / N); optimal rules; sequence of optimal rules; end.]
Figure 2. Flowchart of TBL-algorithm in POS-tagger for EVC corpus
No.  English POS-tags     Vietnamese POS-tags   Contrast English vs. Vietnamese POS-tags   Percent %
1.   One POS-tag only     One POS-tag only      Two POS-tags are identical                    25.2
2.   One POS-tag only     One POS-tag only      Two POS-tags are different                     1.2
3.   One POS-tag only     More than 1 POS-tag   One common POS-tag only                        5.3
4.   One POS-tag only     More than 1 POS-tag   No common POS-tag                              3.5
5.   More than 1 POS-tag  One POS-tag only      One common POS-tag only                       50.5
6.   More than 1 POS-tag  One POS-tag only      No common POS-tag                              2.8
7.   More than 1 POS-tag  More than 1 POS-tag   One common POS-tag only                        6.1
8.   More than 1 POS-tag  More than 1 POS-tag   More than 1 common POS-tag                     4.1
9.   More than 1 POS-tag  More than 1 POS-tag   No common POS-tag                              1.3
Table 3. Contrast POS-tag of English and Vietnamese in the word-aligned EVC
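As a quick arithmetic check on Table 3, the percentages can be summed over the three groups of cases that the text distinguishes (no disambiguation needed, full disambiguation, partial disambiguation). The short Python sketch below only adds up the figures of the table; the group names are labels chosen here for illustration.

```python
# Percentages copied from Table 3, keyed by case number.
percent = {1: 25.2, 2: 1.2, 3: 5.3, 4: 3.5, 5: 50.5, 6: 2.8, 7: 6.1, 8: 4.1, 9: 1.3}

# Case groups as listed in the text accompanying Table 3.
groups = {
    "no_disambiguation_needed": (1, 2, 3, 4),
    "full_disambiguation": (5, 7),
    "partial_disambiguation": (6, 8, 9),
}

# Sum each group, rounding to one decimal to avoid floating-point noise.
totals = {name: round(sum(percent[case] for case in cases), 1)
          for name, cases in groups.items()}
grand_total = round(sum(percent.values()), 1)
```

Under this grouping, the fully disambiguated cases account for a little over half of the aligned words, about a third need no disambiguation, and the remainder is only partially disambiguated.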
5 Projecting English POS-Tags to Vietnamese

After obtaining English POS-tag annotations with high precision, we proceed to directly project those annotations from the English side to the Vietnamese side. Our solution is motivated by the similar work of David Yarowsky and Grace Ngai (2001). This projection is based on the available word-alignments in the automatically word-aligned English-Vietnamese parallel corpus.

Nevertheless, due to the typological difference between English (an inflecting language) and Vietnamese (an isolating language), direct projection is not a simple 1-1 mapping but may be a complex m-n mapping:

• Regarding grammatical meanings, English usually makes use of inflectional devices such as suffixes to express grammatical meanings, for example: -s → plural, -ed → past, -ing → continuous, 's → possessive case, etc., whilst Vietnamese often makes use of function words and word order, for example: "các", "những" → plural, "đã" → past, "đang" → continuous, "của" → possessive case, etc.
• Regarding lexicalization, some words in English must be represented by a phrase in Vietnamese and vice versa. For example, "cow" and "ox" in English are each rendered by two words in Vietnamese, "bò cái" (female one) and "bò đực" (male one); conversely, "nghé" in Vietnamese is rendered by the two words "buffalo calf" in English.

The result of projecting is shown in table 4 below.

In addition, the tagsets of the two languages are different. Due to the characteristics of each language, we must use two different tagsets for POS-tagging. For English, we made use of the available POS-tagset of the Penn TreeBank, while for Vietnamese we made use of the POS-tagset in the standard Vietnamese dictionary of Hoang Phe (1998) and other new tags. We therefore need an English-Vietnamese consensus tagset map (please refer to Appendix A).

English      Jet        planes         fly   about    nine   miles   high
E-tag        NN         NNS            VBP   IN       CD     NNS     RB
Vietnamese   phản lực   (các) phi cơ   bay   khoảng   chín   dặm     cao
V-tag        N          N              V     IN       CD     N       R
Table 4. An example of English POS-tagging in parallel corpus EVC

Regarding the evaluation of the POS-tag projection: because so far no POS-annotated corpus has been available for Vietnamese, we had to manually build a small golden corpus for Vietnamese POS-tagging, of approximately 1000 words, for evaluation. The results of Vietnamese POS-tagging are shown in table 5 below:

Method                                      Correct tags   Incorrect tags   Precision
Baseline tagging (use information of            823            177           82.3%
POS-tag in dictionary)
Projecting from English side in EVC             946             54           94.6%
Table 5. The result of projecting POS-tags from English side to Vietnamese in EVC.

6 Conclusion

We have presented POS-tagging for an automatically word-aligned English-Vietnamese parallel corpus, in which the English words are POS-tagged first and the tags are then projected to the Vietnamese side. The English POS-tagging is done in 2 steps: the basic tagging step is performed by the available POS-tagger (Brown), and the correction step is performed by the TBL learning method, in which the information on the corresponding Vietnamese words is used through the available word-alignment in the EVC.

The result of POS-tagging Vietnamese in the English-Vietnamese bilingual corpus plays a meaningful role in automatically building a training corpus for Vietnamese processors that need parts of speech (such as Vietnamese POS-taggers, Vietnamese parsers, etc.). By making use of the typological differences between the languages and the word-alignments in the bilingual corpus for mutual disambiguation, we are still able to improve on the English POS-tagging results of currently powerful English POS-taggers.

Currently, we are improving the speed of the training period by using the Fast TBL algorithm instead of the original TBL.

In the future, we will extend this serial POS-tagging to parallel POS-tagging of both English and Vietnamese simultaneously, after we obtain the exact Vietnamese POS-tags in the parallel corpus of SUSANNE.

Acknowledgements

We would like to thank Prof. Eduard Hovy (ISI/USC, USA) for his guidance as external advisor on this research.
References

E. Brill. 1993. A Corpus-based approach to Language Learning, PhD-thesis, Pennsylvania Uni., USA.
E. Brill. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, 21(4), pp. 543-565.
E. Brill. 1997. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. In Natural Language Processing Using Very Large Corpora. Kluwer Academic Press.
J. Curran. 1999. Transformation-Based Learning in Shallow Natural Language Processing, Honours Thesis, Basser Department of Computer Science, University of Sydney, Sydney, Australia.
E. Charniak. 1997. Statistical parsing with a context-free grammar and word statistics, in Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI Press/MIT Press, Menlo Park.
I. Dagan, I. Alon, and S. Ulrike. 1991. Two languages are more informative than one. In Proceedings of the 29th Annual ACL, Berkeley, CA, pp. 130-137.
W. Daelemans, J. Zavrel, P. Berck, S. Gillis. 1996. MTB: A Memory-Based Part-of-Speech Tagger Generator. In Proceedings of 4th Workshop on Very Large Corpora, Copenhagen.
D. Dien, H. Kiem, and N.V. Toan. 2001a. Vietnamese Word Segmentation, Proceedings of NLPRS'01 (The 6th Natural Language Processing Pacific Rim Symposium), Tokyo, Japan, 11/2001, pp. 749-756.
D. Dien. 2001b. Building an English-Vietnamese bilingual corpus, Master thesis in Comparative Linguistics, University of Social Sciences and Humanity of HCM City, Vietnam.
D. Dien, H. Kiem, T. Ngan, X. Quang, Q. Hung, P. Hoi, V. Toan. 2002a. Word alignment in English – Vietnamese bilingual corpus, Proceedings of EALPIIT'02, Hanoi, Vietnam, 1/2002, pp. 3-11.
D. Dien, H. Kiem. 2002b. Building a training corpus for word sense disambiguation in the English-to-Vietnamese Machine Translation, Proceedings of Workshop on Machine Translation in Asia, COLING-02, Taiwan, 9/2002, pp. 26-32.
R. Florian and G. Ngai. 2001a. Transformation-Based Learning in the fast lane, Proceedings of North America ACL-2001.
R. Florian and G. Ngai. 2001b. Fast Transformation-Based Learning Toolkit. Technical Report.
W. Gale, K.W. Church, and D. Yarowsky. 1992. Using bilingual materials to develop word sense disambiguation methods. In Proceedings of the Int. Conf. on Theoretical and Methodological Issues in MT, pp. 101-112.
H. Phe. 1998. Từ điển tiếng Việt (Vietnamese Dictionary). Center of Lexicography. Da Nang Publisher.
G. Sampson. 1995. English for the Computer: The SUSANNE Corpus and Analytic Scheme, Clarendon Press (Oxford University Press).
H. Schmid. 1994a. Probabilistic POS Tagging using Decision Trees, Proceedings of International Conference on New methods in Language Processing, Manchester, UK.
H. Schmid. 1994b. POS Tagging with Neural Networks, Proceedings of International Conference on Computational Linguistics, Kyoto, Japan, pp. 172-176.
D. Yarowsky and G. Ngai. 2001. Induce, Multilingual POS Tagger and NP bracketer via projection on aligned corpora, Proceedings of NAACL-01.

Appendix A. English-Vietnamese consensus POS-tagset mapping table

English POS                                       Vietnamese POS
CC (Coordinating conjunction)                     CC
CD (Cardinal number)                              CD
DT (Determiner)                                   DT
EX (Existential)                                  V
FW (Foreign word)                                 FW
IN (Preposition)                                  IN
JJ (Adjective)                                    A
JJR (Adjective, comparative)                      A
JJS (Adjective, superlative)                      A
LS (List item marker)                             LS
MD (Modal)                                        MD
NN (Noun, singular or mass)                       N
NNS (Noun, plural)                                N
NP (Proper noun, singular)                        N
NPS (Proper noun, plural)                         N
PDT (Predeterminer)                               DT
POS (Possessive ending)                           "của"
PP (Personal pronoun)                             P
PP$ (Possessive pronoun)                          "của" P
RB (Adverb)                                       R
RBR (Adverb, comparative)                         R
RBS (Adverb, superlative)                         R
RP (Particle)                                     RP
SYM (Symbol)                                      SYM
TO ("to")                                         -
UH (Interjection)                                 UH
VB (Verb, base form)                              V
VBD (Verb, past tense)                            V
VBG (Verb, gerund or present participle)          V
VBN (Verb, past participle)                       V
VBP (Verb, non-3rd person singular present)       V
VBZ (Verb, 3rd person singular present)           V
WDT (Wh-determiner)                               P
WP (Wh-pronoun)                                   P
WP$ (Possessive wh-pronoun)                       "của" P
WRB (Wh-adverb)                                   R
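The direct projection of Section 5 combined with the tagset mapping of Appendix A can be sketched in Python as below, using the sentence of Table 4. This is a deliberately simplified illustration, not the authors' procedure: the function name `project_tags` is hypothetical, the mapping is only an excerpt of Appendix A, the Vietnamese side is assumed to be already segmented into the units shown in Table 4, the alignment is written by hand, and m-n alignments are handled naively (the first projected tag wins).

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Excerpt of the Appendix A mapping (English PTB tag -> Vietnamese tag).
EN_TO_VN: Dict[str, str] = {
    "NN": "N", "NNS": "N", "VBP": "V", "IN": "IN", "CD": "CD", "RB": "R",
}


def project_tags(en_tags: Sequence[str],
                 alignment: Sequence[Tuple[int, int]],
                 n_vietnamese: int) -> List[Optional[str]]:
    """Project English tags to Vietnamese positions via (en_index, vi_index) links."""
    vi_tags: List[Optional[str]] = [None] * n_vietnamese
    for en_index, vi_index in alignment:
        mapped = EN_TO_VN.get(en_tags[en_index])
        # Naive choice for m-n links: keep the first tag projected to a position.
        if mapped is not None and vi_tags[vi_index] is None:
            vi_tags[vi_index] = mapped
    return vi_tags


english = ["Jet", "planes", "fly", "about", "nine", "miles", "high"]
en_tags = ["NN", "NNS", "VBP", "IN", "CD", "NNS", "RB"]
# Assumed segmentation of "Các phi cơ phản lực bay cao khoảng chín dặm".
vietnamese = ["các phi cơ", "phản lực", "bay", "cao", "khoảng", "chín", "dặm"]
# Hand-written links following the pairing shown in Table 4.
alignment = [(0, 1), (1, 0), (2, 2), (3, 4), (4, 5), (5, 6), (6, 3)]

vi_tags = project_tags(en_tags, alignment, len(vietnamese))
```

Unaligned Vietnamese positions stay `None` in this sketch, which makes it explicit where manual correction or a dictionary fallback would still be needed.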