Hello World question (about entropy and feature selection)

Hi, I am new here and this is my first post. I have a classification
problem and I would like to use features to solve it.

Suppose I have two datasets, positive and negative, and I pre-select a
batch of features. Now I want to weight the features according to how
they are distributed across the datasets. There are several methods for
this weighting; two of them are the chi-square test and entropy. I use
entropy. The smallest entropy value (0) means a feature occurs
exclusively in one dataset, while the largest (1 for binary
classification) means the feature occurs equally often in both
datasets. For example, f1 (+5, -0) has entropy 0 because 5 positive
instances contain it while no negative instance does. f2 (+5, -5), on
the other hand, has entropy 1.
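
The entropy described above can be sketched in a few lines of Python.
This is only an illustration: the function name feature_entropy is
made up here, and the base-2 logarithm is assumed so that the maximum
for two classes is 1.

```python
import math

def feature_entropy(pos_count, neg_count):
    """Entropy (in bits) of the class split among instances containing a feature."""
    total = pos_count + neg_count
    if total == 0:
        raise ValueError("feature does not occur in either dataset")
    entropy = 0.0
    for count in (pos_count, neg_count):
        if count > 0:  # the term 0 * log(0) is taken to be 0
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

print(feature_entropy(5, 0))  # f1 (+5, -0) -> 0.0
print(feature_entropy(5, 5))  # f2 (+5, -5) -> 1.0
```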

While entropy can be useful, I found that it is not a good weighting
function on its own. Suppose I have two features, f1 (+5, -0) and f2
(+100, -0). The entropy values for f1 and f2 are the same (0). But
suppose there are 100 positive instances and 100 negative instances in
total. Obviously f2 is better than f1 because it is not only
unambiguous but also very common in the positive set. Based on this
observation I would like to give f2 a higher weight than f1.

This requires the entropy and the support to work together. Does anyone
have an idea for a weighting or scoring function that combines the
two?
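
To make the question concrete, here is one naive combination, offered
only as a sketch and not as an established method: multiply the purity
(1 - entropy) by the fraction of the majority class that contains the
feature. The name support_weighted_score and the choice of a plain
product are assumptions made for illustration.

```python
import math

def class_entropy(pos_count, neg_count):
    """Entropy (in bits) of the class split among instances containing a feature."""
    total = pos_count + neg_count
    entropy = 0.0
    for count in (pos_count, neg_count):
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

def support_weighted_score(pos_count, neg_count, n_pos, n_neg):
    """Hypothetical score: purity (1 - entropy) times the larger per-class support."""
    if pos_count + neg_count == 0:
        return 0.0  # a feature that never occurs carries no weight
    purity = 1.0 - class_entropy(pos_count, neg_count)
    support = max(pos_count / n_pos, neg_count / n_neg)
    return purity * support

# 100 positive and 100 negative instances in total
print(support_weighted_score(5, 0, 100, 100))    # f1 -> 0.05
print(support_weighted_score(100, 0, 100, 100))  # f2 -> 1.0
```

With this score f2 ranks above f1 even though both have entropy 0, and
a feature split evenly between the classes scores 0 however common it
is.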

Thank you very much.

Sticker

How is it that feature 'f1' could have only 5 positive and 0 negative
observations?  If there are 200 instances total, what value do they
have for 'f1'?  Is 'f1' missing?

Yes. I mean there are 200 instances in + and 200 in - (400 instances
in total). f1 is only observed in 5 + instances and 0 - instances. For
the other 195 + instances f1 is missing, and all 200 - instances are
missing f1 as well. f1 can be an arbitrary feature. For instance, f1
is a symptom of a certain disease.
Hope that makes things clearer.

Thank you

Sticker wrote:
> Yes. I mean there are 200 instances in + and 200 in - (400 instances
> in total). f1 is only observed in 5 + instances and 0 - instances. For
> the other 195 + instances f1 is missing, and all 200 - instances are
> missing f1 as well. f1 can be an arbitrary feature. For instance, f1
> is a symptom of a certain disease.

The next step is to clarify the nature of the missing values.  If
missing values are correlated with the target variable, then perhaps it
is better to consider them as a third symbol and re-calculate entropy?

-Will Dwinnell
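
One way to read this suggestion, sketched under assumptions: treat
"missing" as its own symbol (for a presence-only feature that leaves
two symbols, observed and missing) and compute the information gain of
the feature over the whole dataset. Information gain is a standard
measure, but applying it here is an interpretation, and the function
names are made up for illustration.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            result -= p * math.log2(p)
    return result

def information_gain(pos_with, neg_with, n_pos, n_neg):
    """Class entropy minus the entropy left after splitting on observed/missing."""
    n = n_pos + n_neg
    groups = [
        (pos_with, neg_with),                  # feature observed
        (n_pos - pos_with, n_neg - neg_with),  # feature missing
    ]
    conditional = sum((sum(g) / n) * entropy(g) for g in groups if sum(g) > 0)
    return entropy((n_pos, n_neg)) - conditional

# 200 positive and 200 negative instances in total
print(information_gain(5, 0, 200, 200))    # small gain: pure but rare
print(information_gain(200, 0, 200, 200))  # largest gain: pure and common
```

Because the missing group enters the calculation, a rare feature can
no longer score as well as a common one just by being unambiguous.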

 

How can I measure the correlation between two instances when they are
sequences rather than relational data records? The sequences are
proteins, let's say; relational data records are database table rows
with columns and values.


I think this is similar to the similarity measure for web sessions.
Someone has used the cosine function as the measure, and I'm still
trying to figure out a way to measure the similarity of web sessions.
How?

To Jackie,
I am not so sure about the definition of web sessions. Do you mean web
logs? I know how the content of two web pages is compared using the
cosine similarity function: the sequences of words are turned into
bags of words (removing duplicated words and possibly discarding their
order). This is similar to measuring itemsets. Alternatively you can
use Euclidean distance or the dot product as well.
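
As a rough sketch of the bag-of-words idea, assuming duplicates are
dropped so that each page or session becomes a set of tokens, the
cosine similarity reduces to the overlap divided by the geometric mean
of the two set sizes. The page names below are invented for
illustration.

```python
import math

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token sequences treated as sets (binary vectors)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0  # nothing to compare against
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

# Hypothetical web sessions as sequences of visited pages
session_a = ["home", "search", "product", "search", "cart"]
session_b = ["home", "product", "cart", "checkout"]
print(cosine_similarity(session_a, session_b))  # -> 0.75
```

Note that this discards both order and repeat visits, which may or may
not be acceptable for web sessions.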

Sequence similarity is handled by the Smith-Waterman algorithm (local
alignment) and the Needleman-Wunsch algorithm (global alignment).
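
A minimal sketch of both scores, with simplifying assumptions: a flat
match/mismatch score and a linear gap penalty, whereas real protein
alignment normally uses a substitution matrix such as BLOSUM and
affine gap costs. The function name and default parameter values are
chosen only for illustration.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1, local=False):
    """Needleman-Wunsch (global) or Smith-Waterman (local) alignment score."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    # Local: best cell anywhere. Global: bottom-right corner.
    return best if local else score[-1][-1]

print(alignment_score("ACGT", "AGT"))                        # global -> 2
print(alignment_score("XXACGTXX", "YYACGTYY", local=True))   # local  -> 4
```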

My question is about the entropies and supports of the features. I do
not know how to combine them into a better scoring function.

 

 
