Description
This program is an implementation of the RankBoost algorithm [Freund et al., 2003] trained with partially labeled data for bipartite ranking (Semi-supervised RankBoost), as described in [Amini et al., 2008]. The algorithm is an inductive ranking algorithm that builds a prediction function from two training sets: one labeled and one unlabeled. In a first stage, the algorithm loops over the labeled set and assigns, for each labeled training example, the same relevance judgment to the most similar examples from the unlabeled training set. An extended version of the RankBoost algorithm then produces a scoring function that minimizes the average number of incorrectly ordered pairs of (relevant, irrelevant) examples over the labeled training set and the tentatively labeled part of the unlabeled data.
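The first stage above can be sketched as follows. This is an illustrative Python sketch, not the actual C++ implementation: the function names (cosine_sim, propagate_labels) and the choice of cosine similarity are assumptions for the sake of the example.

```python
import math

def cosine_sim(x, y):
    """Cosine similarity between two sparse vectors given as {feature: value} dicts."""
    dot = sum(v * y.get(f, 0.0) for f, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def propagate_labels(labeled, unlabeled, k=3):
    """For each (example, rel) in `labeled`, copy rel to the k most similar
    unlabeled examples; returns a list of (example, tentative_rel) pairs."""
    tentative = []
    for x, rel in labeled:
        ranked = sorted(range(len(unlabeled)),
                        key=lambda i: cosine_sim(x, unlabeled[i]),
                        reverse=True)
        for i in ranked[:k]:
            tentative.append((unlabeled[i], rel))
    return tentative
```

The tentatively labeled pairs returned here are what the boosting stage then treats as extra (relevant, irrelevant) training material.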
Download and Installation
The program is free for scientific use only. If you publish results based on this program, please acknowledge its use by referring to:
M.-R. Amini, V. Truong, C. Goutte. A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled Data. Proceedings of the 31st International ACM SIGIR (SIGIR 2008), 2008. [ PDF]
This program was developed on Linux with g++ and the source code is available from:
http://ama.liglab.fr/~amini/SSRankBoost/semisup_rankboost.tar.bz2
After downloading the file, unpack it:
> bzip2 -cd semisup_rankboost.tar.bz2 | tar xvf -
you need to compile the program in the new directory semisup_rankboost/
> make
After compilation, two executables are created:
- ssrankboost-learn (for training the model)
- ssrankboost-test (for testing it)
Training and testing
Both the train and test modules operate on a feature:value representation of examples:
- Rel feature:value feature:value
where Rel (one of 0, -1, 1) is the relevance judgment of an example: 0 if this judgment is unknown (for unlabeled examples), -1 (or 1) if the example is judged irrelevant (resp. relevant) to the given topic. In semisup_rankboost/, the directory example/ contains two files, training_set and test_set, provided for test purposes: training_set contains 10,000 partially labeled examples (10 labeled and 9,990 unlabeled) and test_set contains 8,000 test examples. (These datasets are built over a subset of the RCV1 Reuters collection.)
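A minimal parser for this line format makes the representation concrete. This is a sketch, not the program's actual C++ reader, which may differ in details such as whitespace handling:

```python
def parse_line(line):
    """Parse one example line 'Rel feature:value feature:value ...'
    into (rel, {feature_id: value})."""
    tokens = line.split()
    rel = int(tokens[0])  # 0 = unlabeled, -1 = irrelevant, 1 = relevant
    features = {}
    for tok in tokens[1:]:
        fid, val = tok.split(":")
        features[int(fid)] = float(val)
    return rel, features

# Example: a relevant example with three active features.
rel, feats = parse_line("1 3:0.5 17:1.2 28:1.0")
```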
Train the model:
> ssrankboost-learn [options] input_file parameter_file
Options are:
-l (float) | The discount factor regularizing the impact of the objective term over unlabeled examples in the learning process. If -l is not specified, the program reduces to the supervised RankBoost algorithm for bipartite ranking [Freund et al., 2003],
-k (integer) | The number of unlabeled examples that receive the same relevance judgment as their nearest labeled neighbor (default 3),
-n (integer) | The number of candidate thresholds over features - stumps (default 10),
-t (integer) | The number of boosting iterations (default 50),
-? | Help
Test the model:
> ssrankboost-test input_file parameter_file
Example
Running the program in the fully supervised case [Freund et al., 2003] on a part of the Reuters collection provided in the directory example/, using the default parameters (the default number of candidate thresholds and 50 boosting iterations), gives
> ssrankboost-learn example/training_set Param-LabeledCase
Scanning ...
Vocabulary size: 21531, labeled examples: 10
Training ...
1 --> Rt=0.75000000 alpha= 0.972955075 i*= 28 theta*=1.00000000
2 --> Rt=0.34449112 alpha= 0.767976553 i*=1242 theta*=1.00000000
3 --> Rt=0.23282148 alpha= 0.806175652 i*= 28 theta*=1.00000000
4 --> Rt=0.15818393 alpha= 0.905899061 i*= 53 theta*=1.00000000
5 --> Rt=0.08907593 alpha= 0.884357773 i*= 657 theta*=2.80000000
6 --> Rt=0.05405791 alpha= 0.942351016 i*= 63 theta*=1.00000000
7 --> Rt=0.02698971 alpha= 0.806420658 i*= 3 theta*=1.00000000
8 --> Rt=0.01684374 alpha= 0.794356020 i*= 28 theta*=1.00000000
9 --> Rt=0.01171223 alpha= 0.908774891 i*= 657 theta*=2.80000000
10 --> Rt=0.00648682 alpha= 0.867918291 i*= 53 theta*=1.00000000
11 --> Rt=0.00403677 alpha= 0.937479989 i*= 28 theta*=1.00000000
12 --> Rt=0.00206520 alpha= 0.826522449 i*= 3 theta*=1.00000000
13 --> Rt=0.00124185 alpha= 0.792700753 i*= 28 theta*=1.00000000
14 --> Rt=0.00086342 alpha= 0.904145093 i*= 53 theta*=1.00000000
15 --> Rt=0.00047831 alpha= 0.857645425 i*= 657 theta*=2.80000000
16 --> Rt=0.00030285 alpha= 0.936779973 i*= 28 theta*=1.00000000
17 --> Rt=0.00015668 alpha= 0.839317614 i*= 3 theta*=1.00000000
18 --> Rt=0.00009211 alpha= 0.792598917 i*= 28 theta*=1.00000000
19 --> Rt=0.00006397 alpha= 0.902250267 i*= 657 theta*=2.80000000
20 --> Rt=0.00003531 alpha= 0.848599909 i*= 53 theta*=1.00000000
21 --> Rt=0.00002271 alpha= 0.936674053 i*= 28 theta*=1.00000000
22 --> Rt=0.00001183 alpha= 0.847885517 i*= 3 theta*=1.00000000
23 --> Rt=0.00000685 alpha= 0.792655228 i*= 28 theta*=1.00000000
24 --> Rt=0.00000475 alpha= 0.899772802 i*= 53 theta*=1.00000000
25 --> Rt=0.00000262 alpha= 0.843674267 i*= 657 theta*=2.80000000
26 --> Rt=0.00000170 alpha= 0.936658854 i*= 28 theta*=1.00000000
27 --> Rt=0.00000089 alpha= 0.853603705 i*=1242 theta*=1.00000000
28 --> Rt=0.00000051 alpha= 0.792703244 i*= 28 theta*=1.00000000
29 --> Rt=0.00000035 alpha= 0.898799536 i*= 657 theta*=2.80000000
30 --> Rt=0.00000019 alpha= 0.839749236 i*= 53 theta*=1.00000000
31 --> Rt=0.00000013 alpha= 0.936636595 i*= 63 theta*=1.00000000
32 --> Rt=0.00000007 alpha= 0.857409980 i*= 3 theta*=1.00000000
33 --> Rt=0.00000004 alpha= 0.792724235 i*= 28 theta*=1.00000000
34 --> Rt=0.00000003 alpha= 0.897733607 i*= 53 theta*=1.00000000
35 --> Rt=0.00000001 alpha= 0.837604335 i*= 657 theta*=2.80000000
36 --> Rt=0.00000001 alpha= 0.936622647 i*= 28 theta*=1.00000000
37 --> Rt=0.00000001 alpha= 0.859915534 i*= 3 theta*=1.00000000
38 --> Rt=0.00000000 alpha= 0.792735846 i*= 28 theta*=1.00000000
39 --> Rt=0.00000000 alpha= 0.897300106 i*= 657 theta*=2.80000000
40 --> Rt=0.00000000 alpha= 0.835938833 i*= 53 theta*=1.00000000
41 --> Rt=0.00000000 alpha= 0.936606921 i*= 28 theta*=1.00000000
42 --> Rt=0.00000000 alpha= 0.861564152 i*=1242 theta*=1.00000000
43 --> Rt=0.00000000 alpha= 0.792739615 i*= 28 theta*=1.00000000
44 --> Rt=0.00000000 alpha= 0.896850614 i*= 53 theta*=1.00000000
45 --> Rt=0.00000000 alpha= 0.835020098 i*= 657 theta*=2.80000000
46 --> Rt=0.00000000 alpha= 0.936597700 i*= 28 theta*=1.00000000
47 --> Rt=0.00000000 alpha= 0.862640647 i*=1242 theta*=1.00000000
48 --> Rt=0.00000000 alpha= 0.792742061 i*= 63 theta*=1.00000000
49 --> Rt=0.00000000 alpha= 0.896662342 i*= 657 theta*=2.80000000
50 --> Rt=0.00000000 alpha= 0.834316604 i*= 53 theta*=1.00000000
Here, Rt corresponds to Zt in [Freund et al., 2003], eq. 4; i* and theta* are the chosen feature and corresponding threshold that determine the base ranking function at a given iteration ([Freund et al., 2003], eq. 9); and alpha corresponds to the boosting weight ([Freund et al., 2003], eq. 6).
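One boosting round can be sketched in the spirit of RankBoost with threshold stumps [Freund et al., 2003]: pick the (feature, threshold) stump maximizing |r|, where r is the distribution-weighted margin over (relevant, irrelevant) pairs, and set alpha = 0.5 ln((1+r)/(1-r)). This sketch enumerates pairs directly, whereas the actual implementation uses the efficient factored form; the smoothing constant eps is an assumption to keep the logarithm finite.

```python
import math

def weak_score(x, feature, theta):
    """Threshold stump: 1 if the feature value exceeds theta, else 0."""
    return 1.0 if x.get(feature, 0.0) > theta else 0.0

def best_stump(pos, neg, D, candidates, eps=1e-8):
    """Select the candidate (feature, theta) maximizing |r|, where
    r = sum over pairs D[(i, j)] * (h(pos[i]) - h(neg[j])),
    then compute alpha = 0.5 * ln((1 + r) / (1 - r)) with eps smoothing."""
    best = None
    for feature, theta in candidates:
        r = sum(D[(i, j)] * (weak_score(p, feature, theta) - weak_score(n, feature, theta))
                for i, p in enumerate(pos) for j, n in enumerate(neg))
        if best is None or abs(r) > abs(best[2]):
            best = (feature, theta, r)
    feature, theta, r = best
    alpha = 0.5 * math.log((1 + r + eps) / (1 - r + eps))
    return feature, theta, alpha
```

In the training log above, i* and theta* correspond to the selected stump and alpha to the computed weight at each iteration.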
Testing this model gives
> ssrankboost-test example/test_set Param-LabeledCase
T9P=0.400000 MAP@500=0.037637 REC@500=0.086915 AUC=0.547948
Here, T9P is the precision at 50.
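The AUC reported above equals the fraction of (relevant, irrelevant) test pairs on which the relevant example receives the higher score, i.e. one minus the ranking loss the algorithm minimizes. A minimal sketch (ties counted as half, by convention):

```python
def auc(scores_pos, scores_neg):
    """Fraction of (relevant, irrelevant) pairs correctly ordered by score."""
    total = len(scores_pos) * len(scores_neg)
    correct = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
                  for sp in scores_pos for sn in scores_neg)
    return correct / total

# Example: 3 of 4 pairs correctly ordered.
auc([0.9, 0.7], [0.8, 0.1])  # -> 0.75
```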
Now, using the additional unlabeled examples in the training process with a discount factor of 0.1 (keeping the same default numbers of iterations and stumps as before, and giving the same relevance judgment to the 3 nearest unlabeled neighbors of each labeled example):
> ssrankboost-learn -l 0.1 example/training_set Param-SSupCase
Scanning ...
Vocabulary size: 21531, labeled examples: 10, unlabeled examples: 9990, number of unlabeled examples labeled by the KNN algorithm: 30
Training ...
An output similar to the previous supervised learning case is then displayed, with the difference that Rt now corresponds to the semi-supervised ranking loss bound ([Amini et al., 2008], eq. 7) and the semi-supervised boosting weights are obtained as in ([Amini et al., 2008], eq. 6).
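The role of the discount factor (option -l) can be sketched as a weighted sum of two pairwise ranking losses: the loss on the labeled set plus a lambda-discounted loss on the pseudo-labeled part of the unlabeled data. This is an illustrative simplification of the actual objective ([Amini et al., 2008], eq. 7); the function names here are assumptions.

```python
def rank_loss(scores_pos, scores_neg):
    """Fraction of (relevant, irrelevant) pairs incorrectly ordered
    (ties counted as errors in this sketch)."""
    bad = sum(1 for sp in scores_pos for sn in scores_neg if sp <= sn)
    return bad / (len(scores_pos) * len(scores_neg))

def semi_supervised_loss(f, labeled_pos, labeled_neg, pseudo_pos, pseudo_neg, lam=0.1):
    """Labeled ranking loss plus a lambda-discounted loss on pseudo-labeled data."""
    l_lab = rank_loss([f(x) for x in labeled_pos], [f(x) for x in labeled_neg])
    l_unl = rank_loss([f(x) for x in pseudo_pos], [f(x) for x in pseudo_neg])
    return l_lab + lam * l_unl
```

A small lambda thus lets the (noisy) pseudo-labeled pairs guide learning without letting them dominate the few reliably labeled pairs.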
Testing the semi-supervised model gives
> ssrankboost-test example/test_set Param-SSupCase
T9P=0.820000 MAP@500=0.143034 REC@500=0.177173 AUC=0.754527
Disclaimer
This program is publicly available for research use only. It must not be distributed for commercial use, and the author is not responsible for any (mis)use of this algorithm.
Acknowledgements
The author is grateful to Reuters for making the RCV1/RCV2 data available and for granting permission to distribute processed versions of it; the examples used in this release come from part of the RCV1 collection.
Bibliography
[Amini et al., 2008] Massih-Reza Amini, Vinh Truong , Cyril Goutte. A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled Data. Proceedings of the 31st International ACM SIGIR (SIGIR 2008), 2008.
[Freund et al., 2003] Yoav Freund, Raj Iyer, Robert E. Schapire, Yoram Singer. An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research (JMLR), 2003.