Classification mode
To do a straight classification run, use the settings:
parameter(
c DESCRIBE DATA
1 mdim=4682, nsample0=81, nclass=3, maxcat=1,
1 ntest=0, labelts=0, labeltr=1,
c
c SET RUN PARAMETERS
2 mtry0=150, ndsize=1, jbt=1000, look=100, lookcls=1,
2 jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c SET IMPORTANCE OPTIONS
3 imp=0, interact=0, impn=0, impfast=0,
c
c SET PROXIMITY COMPUTATIONS
4 nprox=0, nrnn=5,
c
c SET OPTIONS BASED ON PROXIMITIES
5 noutlier=0, nscale=0, nprot=0,
c
c REPLACE MISSING VALUES
6 code=-999, missfill=0, mfixrep=0,
c
c GRAPHICS
7 iviz=1,
c
c SAVING A FOREST
8 isaverf=0, isavepar=0, isavefill=0, isaveprox=0,
c
c RUNNING A SAVED FOREST
9 irunrf=0, ireadpar=0, ireadfill=0, ireadprox=0)
Note: because the sample size is small, 1000 trees (jbt=1000) are grown for reliability, using mtry0=150. The results are not sensitive to mtry0 over the range 50-200. Since look=100, the out-of-bag (oob) error is output every 100 trees as the percentage misclassified:
100 2.47
200 2.47
300 2.47
400 2.47
500 1.23
600 1.23
700 1.23
800 1.23
900 1.23
1000 1.23
(Note: an error rate of 1.23% implies that 1 of the 81 cases was misclassified.)
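The oob percentages above can be checked against the sample size with a little arithmetic. The following Python sketch converts a reported percentage back to a count of cases, assuming nsample0=81 as in the parameter list. The helper name misclassified_count is hypothetical and is not part of the random forests program.

```python
def misclassified_count(error_pct, nsample):
    """Return the whole number of cases implied by an oob error percentage."""
    return round(error_pct / 100.0 * nsample)


NSAMPLE0 = 81  # from the parameter list above

for pct in (2.47, 1.23):
    n = misclassified_count(pct, NSAMPLE0)
    # Recompute the percentage from the count to confirm the round trip.
    print(pct, n, round(100.0 * n / NSAMPLE0, 2))
```

Read this way, the listing corresponds to 2 misclassified cases through 400 trees and 1 misclassified case from 500 trees onward.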
Variable importance
The variable importances are critical. The run computing importances is done by switching imp=0 to imp=1 in the above parameter list. The output has four columns: the gene number, the raw importance score, the z-score obtained by dividing the raw score by its standard error, and the significance level.
The 25 highest gene importances are listed, sorted by their z-scores. To write the output to a disk file, set impout=1 and give a name to the corresponding output file. If impout is set to 2, the results are written to the screen and you will see a display similar to the one immediately below:
gene raw z-score significance
number score
667 1.414 1.069 0.143
689 1.259 0.961 0.168
666 1.112 0.903 0.183
668 1.031 0.849 0.198
682 0.820 0.803 0.211
878 0.649 0.736 0.231
1080 0.514 0.729 0.233
1104 0.514 0.718 0.237
879 0.591 0.713 0.238
895 0.519 0.685 0.247
3621 0.552 0.684 0.247
3529 0.650 0.683 0.247
3404 0.453 0.661 0.254
623 0.286 0.655 0.256
3617 0.498 0.654 0.257
650 0.505 0.650 0.258
645 0.380 0.644 0.260
3616 0.497 0.636 0.262
938 0.421 0.635 0.263
915 0.426 0.631 0.264
669 0.484 0.626 0.266
663 0.550 0.625 0.266
723 0.334 0.610 0.271
685 0.405 0.605 0.272
3631 0.402 0.603 0.273
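The significance values in the listing are consistent with the one-sided upper-tail probability of a standard normal distribution evaluated at the z-score. This is an inference from the printed numbers rather than something the program documentation states here, and the function name upper_tail is hypothetical. A minimal Python sketch of the check:

```python
import math


def upper_tail(z):
    """One-sided upper-tail probability P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# (gene number, z-score, printed significance) copied from the listing above
rows = [(667, 1.069, 0.143), (689, 0.961, 0.168), (666, 0.903, 0.183)]

for gene, z, printed in rows:
    # Compare the printed significance with the normal tail probability.
    print(gene, z, printed, round(upper_tail(z), 3))
```

Under this reading, the smallest significance in the listing, 0.143 for gene 667, is simply the tail probability at z = 1.069.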
Using important variables
Another useful option is an automatic rerun using only the variables that were most important in the original run. Say we want the second run to use only the 15 most important variables found in the first run. Then in the options change mdim2nd=0 to mdim2nd=15, keep imp=1, and compile. Directing output to the screen, you will see the same output as above for the first run, followed by the output for the second run. The importances are then output for the 15 variables used in the second run:
gene raw z-score significance
number score
3621 6.235 2.753 0.003
1104 6.059 2.709 0.003
3529 5.671 2.568 0.005
666 7.837 2.389 0.008
3631 4.657 2.363 0.009
667 7.005 2.275 0.011
668 6.828 2.255 0.012
689 6.637 2.182 0.015
878 4.733 2.169 0.015
682 4.305 1.817 0.035
644 2.710 1.563 0.059
879 1.750 1.283 0.100
686 1.937 1.261 0.104
1080 0.927 0.906 0.183
623 0.564 0.847 0.199
Variable interactions
Another option is to look at interactions between variables. If variable m1 is correlated with variable m2, then a split on m1 will decrease the probability of a nearby split on m2. The distance between splits on any two variables is compared with the theoretical value expected if the variables were independent. The latter is subtracted from the former; a large resulting value indicates a repulsive interaction. To get this output, change interact=0 to interact=1, leaving imp=1 and mdim2nd=10.
The output begins with a two-column code list giving the gene number that corresponds to each of the ids 1-10 (which gene is number 1 in the table, and so on). The interactions, rounded to the nearest integer, are then given in the matrix that follows the code list:
id 1 2 3 4 5 6 7 8 9 10
1 0 13 2 4 8 -7 3 -1 -7 -2
2 13 0 11 14 11 6 3 -1 6 1
3 2 11 0 6 7 -4 3 1 1 -2
4 4 14 6 0 11 -2 1 -2 2 -4
5 8 11 7 11 0 -1 3 1 -8 1
6 -7 6 -4 -2 -1 0 7 6 -6 -1
7 3 3 3 1 3 7 0 24 -1 -1
8 -1 -1 1 -2 1 6 24 0 -2 -3
9 -7 6 1 2 -8 -6 -1 -2 0 -5
10 -2 1 -2 -4 1 -1 -1 -3 -5 0
There are large interactions between gene 2 and genes 1, 3, 4, and 5, and between genes 7 and 8 (numbered by their ids in the table).
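Reading the matrix by eye can be cross-checked with a short scan of its upper triangle. The Python sketch below copies the matrix from the output above and lists the pairs whose interaction is at least a chosen cutoff. The cutoff of 10 and the function name large_pairs are assumptions made for illustration; the program itself does not define a threshold.

```python
# Interaction matrix copied from the output above; rows and columns are ids 1-10.
M = [
    [0, 13, 2, 4, 8, -7, 3, -1, -7, -2],
    [13, 0, 11, 14, 11, 6, 3, -1, 6, 1],
    [2, 11, 0, 6, 7, -4, 3, 1, 1, -2],
    [4, 14, 6, 0, 11, -2, 1, -2, 2, -4],
    [8, 11, 7, 11, 0, -1, 3, 1, -8, 1],
    [-7, 6, -4, -2, -1, 0, 7, 6, -6, -1],
    [3, 3, 3, 1, 3, 7, 0, 24, -1, -1],
    [-1, -1, 1, -2, 1, 6, 24, 0, -2, -3],
    [-7, 6, 1, 2, -8, -6, -1, -2, 0, -5],
    [-2, 1, -2, -4, 1, -1, -1, -3, -5, 0],
]


def large_pairs(matrix, threshold):
    """Return (id_a, id_b, value) for upper-triangle entries >= threshold, largest first."""
    n = len(matrix)
    pairs = [
        (i + 1, j + 1, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] >= threshold
    ]
    # sorted() is stable, so ties keep their row-major order.
    return sorted(pairs, key=lambda p: -p[2])


# The matrix is symmetric, so scanning the upper triangle covers every pair once.
assert all(M[i][j] == M[j][i] for i in range(10) for j in range(10))

for a, b, value in large_pairs(M, threshold=10):  # cutoff of 10 is an arbitrary choice
    print(a, b, value)
```

With this cutoff the scan returns the pairs 7-8 (24), 2-4 (14), 1-2 (13), 2-3 (11), 2-5 (11), and 4-5 (11), so the pair 4-5 reaches the same value as two of the pairs involving gene 2.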

