
Random Forests: A case study - microarray data [1]

Author: unknown  Date: 2004-12-09
 

  Data Mining Research Institute

Classification mode

To do a straight classification run, use the settings:

        parameter(
c               DESCRIBE DATA
     1          mdim=4682, nsample0=81, nclass=3, maxcat=1,
     1          ntest=0, labelts=0, labeltr=1,
c
c               SET RUN PARAMETERS
     2          mtry0=150, ndsize=1, jbt=1000, look=100, lookcls=1,
     2          jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c               SET IMPORTANCE OPTIONS
     3          imp=0, interact=0, impn=0, impfast=0,
c
c               SET PROXIMITY COMPUTATIONS
     4          nprox=0, nrnn=5,
c
c               SET OPTIONS BASED ON PROXIMITIES
     5          noutlier=0, nscale=0, nprot=0,
c
c               REPLACE MISSING VALUES  
     6          code=-999, missfill=0,  mfixrep=0, 
c
c               GRAPHICS
     7          iviz=1,
c
c               SAVING A FOREST
     8          isaverf=0, isavepar=0, isavefill=0, isaveprox=0,
c
c               RUNNING A SAVED FOREST
     9          irunrf=0, ireadpar=0, ireadfill=0, ireadprox=0)

	  
Note: since the sample size is small, 1000 trees are grown with mtry0=150 for reliability. The results are not sensitive to mtry0 over the range 50-200. Since look=100, the oob (out-of-bag) error, expressed as the percentage of cases misclassified, is output every 100 trees:
    100 2.47 
    200 2.47 
    300 2.47 
    400 2.47 
    500 1.23 
    600 1.23 
    700 1.23 
    800 1.23 
    900 1.23 
    1000 1.23 
	 


(Note: an error rate of 1.23% means that 1 of the 81 cases was misclassified.)
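The percentages in the oob listing can be checked by hand, since each one is a count of misclassified cases divided by nsample0=81. A minimal Python sketch of that arithmetic follows; the helper name `oob_error_percent` is hypothetical and is not part of the Random Forests code.

```python
def oob_error_percent(n_misclassified: int, n_samples: int) -> float:
    """Return the misclassification rate as a percentage, rounded to 2 decimals."""
    return round(100.0 * n_misclassified / n_samples, 2)


N_SAMPLES = 81  # nsample0 in the parameter list

# Print the percentage that corresponds to 1 and to 2 misclassified cases.
for n_wrong in (1, 2):
    print(n_wrong, oob_error_percent(n_wrong, N_SAMPLES))
```

Under this reading, the 2.47 reported up to 400 trees corresponds to 2 misclassified cases, and the 1.23 reported from 500 trees onward corresponds to 1.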

Variable importance

The variable importances are critical. To compute them, change imp=0 to imp=1 in the parameter list above and rerun. The output has four columns:

	gene number 
	the raw importance score 
	the z-score obtained by dividing the raw score by its standard error 
	the significance level. 


The 25 genes with the highest importances are listed, sorted by z-score. To write the output to a disk file, set impout=1 and give a name to the corresponding output file. If impout=2, the results are written to the screen and you will see a display similar to the one immediately below:

gene       raw     z-score  significance
number    score
  667     1.414     1.069     0.143
  689     1.259     0.961     0.168
  666     1.112     0.903     0.183
  668     1.031     0.849     0.198
  682     0.820     0.803     0.211
  878     0.649     0.736     0.231
 1080     0.514     0.729     0.233
 1104     0.514     0.718     0.237
  879     0.591     0.713     0.238
  895     0.519     0.685     0.247
 3621     0.552     0.684     0.247
 3529     0.650     0.683     0.247
 3404     0.453     0.661     0.254
  623     0.286     0.655     0.256
 3617     0.498     0.654     0.257
  650     0.505     0.650     0.258
  645     0.380     0.644     0.260
 3616     0.497     0.636     0.262
  938     0.421     0.635     0.263
  915     0.426     0.631     0.264
  669     0.484     0.626     0.266
  663     0.550     0.625     0.266
  723     0.334     0.610     0.271
  685     0.405     0.605     0.272
 3631     0.402     0.603     0.273
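
The significance levels in the display above look consistent with the one-sided upper-tail probability of a standard normal distribution evaluated at the z-score. That reading is an assumption inferred from the numbers, not something the text states. A minimal Python sketch that compares a few listed values with that tail probability (the function name `upper_tail_normal` is hypothetical):

```python
import math


def upper_tail_normal(z: float) -> float:
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# (gene number, z-score, listed significance) copied from the display above.
rows = [(667, 1.069, 0.143), (689, 0.961, 0.168), (3631, 0.603, 0.273)]

# Print the listed value next to the recomputed tail probability.
for gene, z, listed in rows:
    print(gene, z, listed, round(upper_tail_normal(z), 3))
```

If the assumption holds, the two printed values in each row should agree to within rounding.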

Using important variables

Another useful option is an automatic rerun that uses only the variables that were most important in the original run. Say we want the second run to use only the 15 most important variables found in the first run. Then, in the options, change mdim2nd=0 to mdim2nd=15, keep imp=1, and compile. Directing output to the screen, you will see the same output as above for the first run, followed by the output for the second run. The importances are then output for the 15 variables used in the second run:

gene       raw     z-score  significance
number    score
 3621     6.235     2.753     0.003
 1104     6.059     2.709     0.003
 3529     5.671     2.568     0.005
  666     7.837     2.389     0.008
 3631     4.657     2.363     0.009
  667     7.005     2.275     0.011
  668     6.828     2.255     0.012
  689     6.637     2.182     0.015
  878     4.733     2.169     0.015
  682     4.305     1.817     0.035
  644     2.710     1.563     0.059
  879     1.750     1.283     0.100
  686     1.937     1.261     0.104
 1080     0.927     0.906     0.183
  623     0.564     0.847     0.199

Variable interactions

Another option is to look at interactions between variables. If variable m1 is correlated with variable m2, then a split on m1 will decrease the probability of a nearby split on m2. The distance between splits on any two variables is compared with the value expected in theory if the variables were independent. The latter is subtracted from the former; a large resulting value indicates a repulsive interaction. To get this output, change interact=0 to interact=1, with imp=1 and mdim2nd=10.

The output begins with a two-column code list that gives the gene number corresponding to each of the ids 1-10, that is, which gene number is number 1 in the table, and so on. The interactions, rounded to the nearest integer, are given in the matrix that follows the list:

		
     1   2   3   4   5   6   7   8   9  10 
 1   0  13   2   4   8  -7   3  -1  -7  -2 
 2  13   0  11  14  11   6   3  -1   6   1 
 3   2  11   0   6   7  -4   3   1   1  -2 
 4   4  14   6   0  11  -2   1  -2   2  -4 
 5   8  11   7  11   0  -1   3   1  -8   1 
 6  -7   6  -4  -2  -1   0   7   6  -6  -1 
 7   3   3   3   1   3   7   0  24  -1  -1 
 8  -1  -1   1  -2   1   6  24   0  -2  -3 
 9  -7   6   1   2  -8  -6  -1  -2   0  -5 
10  -2   1  -2  -4   1  -1  -1  -3  -5   0 

There are large interactions between gene 2 and genes 1, 3, 4, 5, and between genes 7 and 8 (genes are referred to here by their ids in the table).
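
Reading the large entries off the matrix by eye is easy to get wrong, so the scan can be sketched in Python. The matrix is copied from the table above; the function name `large_pairs` and the cutoff of 10 are hypothetical choices made for illustration and are not part of the Random Forests output.

```python
# Interaction matrix copied from the table above (rows and columns are ids 1-10).
INTERACTIONS = [
    [0, 13, 2, 4, 8, -7, 3, -1, -7, -2],
    [13, 0, 11, 14, 11, 6, 3, -1, 6, 1],
    [2, 11, 0, 6, 7, -4, 3, 1, 1, -2],
    [4, 14, 6, 0, 11, -2, 1, -2, 2, -4],
    [8, 11, 7, 11, 0, -1, 3, 1, -8, 1],
    [-7, 6, -4, -2, -1, 0, 7, 6, -6, -1],
    [3, 3, 3, 1, 3, 7, 0, 24, -1, -1],
    [-1, -1, 1, -2, 1, 6, 24, 0, -2, -3],
    [-7, 6, 1, 2, -8, -6, -1, -2, 0, -5],
    [-2, 1, -2, -4, 1, -1, -1, -3, -5, 0],
]


def large_pairs(matrix, threshold):
    """Return (id_a, id_b, value) for each pair above the diagonal with value >= threshold."""
    n = len(matrix)
    return [
        (i + 1, j + 1, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] >= threshold
    ]


THRESHOLD = 10  # hypothetical cutoff, chosen for illustration only

for id_a, id_b, value in large_pairs(INTERACTIONS, THRESHOLD):
    print(id_a, id_b, value)
```

With this cutoff the scan lists the pairs 1-2, 2-3, 2-4, 2-5, and 7-8 named in the text, and also the pair 4-5, whose entry is 11.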
