Word Segmentation for Vietnamese Text Categorization: An Online Corpus Approach
Abstract—This paper extends a novel Vietnamese
segmentation approach for text categorization. Instead of using
an annotated training corpus or a lexicon, both of which are still
lacking in Vietnam, we use statistical information extracted directly
from a commercial search engine, together with a genetic algorithm, to
find the most reasonable segmentation. The extracted information is
the document frequency of the segmented words. We conduct extensive
experiments to find the most appropriate mutual information formula
for the word segmentation step. Our experimental results on
segmentation and categorization, obtained from online news abstracts,
show that our approach is very promising....
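
As a rough illustration of how document frequencies could feed a mutual information score, the sketch below computes standard pointwise mutual information (PMI) for a candidate two-syllable word. The function name, its parameters, the counts in the usage lines, and the choice of PMI itself are all assumptions made for illustration; the abstract does not state which mutual information formula is adopted.

```python
import math


def pmi_from_doc_freq(df_xy, df_x, df_y, total_docs):
    """Pointwise mutual information estimated from document frequencies.

    df_xy      -- number of documents containing the candidate word "x y"
    df_x, df_y -- number of documents containing each syllable alone
    total_docs -- assumed size of the indexed collection (hypothetical)
    """
    # Without any observed co-occurrence there is no evidence for the word.
    if min(df_xy, df_x, df_y) <= 0 or total_docs <= 0:
        return float("-inf")
    p_xy = df_xy / total_docs
    p_x = df_x / total_docs
    p_y = df_y / total_docs
    # PMI = log( p(x, y) / (p(x) * p(y)) )
    return math.log(p_xy / (p_x * p_y))


# Hypothetical counts, not taken from the paper.
strong = pmi_from_doc_freq(50, 100, 100, 10_000)  # syllables co-occur often
weak = pmi_from_doc_freq(1, 100, 100, 10_000)     # co-occur about as chance predicts
```

A higher score suggests the syllables form a word rather than appearing together by chance. In a genetic-algorithm setting such a score could contribute to the fitness of a candidate segmentation, although the exact fitness function is not given here.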