Combining Coregularization and Consensus-based Self-Training for Multilingual Text Categorization
Massih-Reza Amini (1), Cyril Goutte (1), Nicolas Usunier (2)
(1) National Research Council Canada, 123, boulevard Alexandre Taché, Gatineau, Canada
(2) Laboratoire d'Informatique de Paris 6, 104, avenue du Président Kennedy, 75016 Paris
We investigate the problem of learning document classifiers in a
multilingual setting, from collections where labels are only
partially available. We address this problem in the framework of
multiview learning, where different languages correspond to
different views of the same document, combined with semi-supervised
learning in order to benefit from unlabeled documents. We rely on
two techniques, coregularization and consensus-based self-training,
that combine multiview and semi-supervised learning in different
ways. Our approach trains a monolingual classifier on each view, such that the classifiers' decisions agree as much as possible over a set of unlabeled examples, and iteratively labels new examples from a second unlabeled training set based on consensus across the language-specific classifiers. We derive
a boosting-based training algorithm for this task, and analyze the impact of the number of views on semi-supervised learning performance using a multilingual extension of the Reuters RCV1/RCV2 corpus covering five languages. Our experiments show that
coregularization and consensus-based self-training are complementary
and that their combination is especially effective in the
interesting and very common situation where there are few views
(languages) and few labeled documents available.
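To fix ideas, the two components can be sketched as follows (the notation here is illustrative only and is not taken from the paper: $x=(x^{(1)},\dots,x^{(V)})$ denotes a document with one representation per language view, $f_v$ the classifier for view $v$, $\ell$ a supervised loss, $\mathcal{L}$ and $\mathcal{U}$ the labeled and unlabeled sets, and $\lambda$ a trade-off weight). Coregularization corresponds to an objective of the form
\[
\min_{f_1,\ldots,f_V}\;\sum_{v=1}^{V}\sum_{(x,y)\in\mathcal{L}}\ell\big(f_v(x^{(v)}),y\big)\;+\;\lambda\sum_{x\in\mathcal{U}}\;\sum_{1\le v<v'\le V}\big(f_v(x^{(v)})-f_{v'}(x^{(v')})\big)^2 ,
\]
where the second term penalizes disagreement between views on unlabeled documents. Consensus-based self-training then iteratively augments the labeled set: an example $x$ drawn from a second unlabeled set is assigned the pseudo-label $\hat{y}=\operatorname{sign}\big(\tfrac{1}{V}\sum_{v=1}^{V}f_v(x^{(v)})\big)$ and added to $\mathcal{L}$ only when all view-specific classifiers agree on its label.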