
Word Segmentation for Vietnamese Text Categorization: An online corpus approach

Abstract: This paper extends a novel Vietnamese word segmentation approach for text categorization. Instead of relying on an annotated training corpus or a lexicon, both of which are still scarce for Vietnamese, we use statistical information extracted directly from a commercial search engine, together with a genetic algorithm, to find the most reasonable segmentation. The extracted information is the document frequency of the segmented words. We conduct extensive experiments to determine the most appropriate mutual information formula for the word segmentation step. Our experimental results on segmentation and categorization, obtained from online news abstracts, show that the approach is very promising....
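The idea of scoring a candidate segmentation from document frequencies can be illustrated with a minimal sketch. The abstract does not state which mutual information formula the authors selected, so this example assumes standard pointwise mutual information. The names `pmi` and `score_segmentation`, the `doc_freq` table, and all counts are hypothetical stand-ins for values that would come from a search engine. The genetic algorithm search over candidate segmentations is not shown.

```python
import math

def pmi(df_joint, df_parts, n_docs):
    """Pointwise mutual information of a multi-syllable candidate word,
    estimated from document frequencies (assumed formula, not the paper's)."""
    p_joint = df_joint / n_docs
    p_indep = 1.0
    for df in df_parts:
        p_indep *= df / n_docs
    if p_joint == 0 or p_indep == 0:
        return float("-inf")  # candidate never observed
    return math.log(p_joint / p_indep)

def score_segmentation(words, doc_freq, n_docs):
    """Sum the PMI of every multi-syllable word in one candidate segmentation."""
    total = 0.0
    for word in words:
        syllables = word.split()
        if len(syllables) == 1:
            continue  # a single syllable carries no association score
        total += pmi(doc_freq.get(word, 0),
                     [doc_freq.get(s, 0) for s in syllables],
                     n_docs)
    return total

# Made-up document frequencies, for illustration only.
N_DOCS = 1_000_000
DOC_FREQ = {"học": 200_000, "sinh": 100_000,
            "học sinh": 50_000, "sinh học": 30_000}

# Two candidate segmentations of the syllable sequence "học sinh học sinh học".
candidate_a = ["học sinh", "học", "sinh học"]
candidate_b = ["học", "sinh học", "sinh", "học"]
score_a = score_segmentation(candidate_a, DOC_FREQ, N_DOCS)
score_b = score_segmentation(candidate_b, DOC_FREQ, N_DOCS)
```

Under these toy counts, `candidate_a` scores higher than `candidate_b` because it groups more syllables into words whose parts co-occur more often than chance. A search procedure such as a genetic algorithm would use a score of this kind as its fitness function.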