Hi, I am new here and this is my first post. I have a classification
problem and I would like to solve it using weighted features.
Suppose I have two datasets, positive and negative, and I have
pre-selected a batch of features. Now I want to weight the features
according to the datasets. There are several methods for this weighting;
two of them are the chi-square test and entropy. I use entropy. The
smallest entropy (=0) says a feature occurs exclusively in one dataset,
while the largest entropy (=1 for binary classification) means the
feature occurs equally often in both datasets. For example, f1 (+5, -0)
has entropy 0 because 5 positive instances contain it while no negative
instance does. f2 (+5, -5), on the other hand, has entropy 1.
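To make the definition above concrete, here is a minimal sketch of the
entropy computed from a feature's (+, -) counts. The function name
class_entropy is only an illustrative choice.

```python
import math

def class_entropy(pos, neg):
    """Shannon entropy (in bits) of the +/- split among the instances
    that contain a feature."""
    total = pos + neg
    if total == 0:
        raise ValueError("feature occurs in no instance")
    h = 0.0
    for count in (pos, neg):
        if count > 0:  # a zero count contributes nothing (0 * log 0 := 0)
            p = count / total
            h -= p * math.log2(p)
    return h

print(class_entropy(5, 0))  # f1 (+5, -0) -> 0.0
print(class_entropy(5, 5))  # f2 (+5, -5) -> 1.0
```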
While entropy can be useful, I found that it is not a good weighting
function on its own. Suppose I have two features, f1 (+5, -0) and
f2 (+100, -0). The entropy values for f1 and f2 are the same (=0). But
suppose there are 100 positive instances and 100 negative instances in
total. Obviously f2 is better than f1 because it is not only unambiguous
but also very common in the positive set. Based on this observation, I
would like to give f2 a higher weight than f1.
This requires the entropy and the support to work together. Does anyone
have an idea for a weighting or scoring function that combines the
two?
Thank you very much.
Sticker
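One simple way to make entropy and support work together, offered only
as a sketch and not as an established method, is to multiply the purity
(1 - entropy) by the fraction of all instances that contain the feature.
The name support_weighted_score and the product form are assumptions
made for illustration.

```python
import math

def class_entropy(pos, neg):
    """Shannon entropy (in bits) of the +/- split among instances
    that contain the feature."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h

def support_weighted_score(pos, neg, n_total):
    """Hypothetical heuristic: purity (1 - entropy) times support."""
    if pos + neg == 0:
        return 0.0  # a feature that never occurs gets no weight
    purity = 1.0 - class_entropy(pos, neg)
    support = (pos + neg) / n_total
    return purity * support

# 100 positive and 100 negative instances in total, as in the question.
n_total = 200
f1 = support_weighted_score(5, 0, n_total)
f2 = support_weighted_score(100, 0, n_total)
print(f1, f2)  # f2 scores higher than f1 although both have entropy 0
```

With this product, a feature with entropy 1 always scores 0, and among
equally pure features the more common one wins. Other combinations are
possible; this is just the most direct one.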
How is it that feature 'f1' could have only 5 positive and 0 negative
observations? If there are 200 instances total, what value do they
have for 'f1'? Is 'f1' missing?
Yes. I mean, suppose there are 200 instances in + and 200 in -
(400 instances in total). f1 is observed in only 5 + instances and 0 -
instances. For the other 195 + instances f1 is missing, and all 200 -
instances are missing f1 as well. f1 can be an arbitrary feature. For
instance, f1 could be a symptom of a certain disease.
Hope that makes things clearer.
Thank you
Sticker wrote:
> Yes. I mean if there are totally 200 instances in + and 200 in -
> (totally 400 instances). f1 is only observed in 5 + instances and 0 -
> instance. For the 195 other + instances f1 is missing and 200 -
> instances are missing f1 as well. f1 can be an arbitrary feature. For
> instance, f1 is an symptom of a certain disease.
The next step is to clarify the nature of the missing values. If
missing values are correlated with the target variable, then perhaps it
is better to consider them as a third symbol and re-calculate entropy?
-Will Dwinnell
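The suggestion above, treating "missing" as a symbol of its own and
re-calculating entropy, can be sketched as the information gain of the
feature: the class entropy minus the entropy that remains once the
feature's symbol is known. The function names and the table layout
(symbol mapped to (+, -) counts) are assumptions made for illustration.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a tuple of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(table):
    """table maps each feature symbol (including 'missing') to its
    (+, -) instance counts."""
    pos = sum(p for p, _ in table.values())
    neg = sum(n for _, n in table.values())
    n_total = pos + neg
    # Entropy left over after splitting on the symbol, weighted by
    # how many instances carry each symbol.
    conditional = sum(
        (p + q) / n_total * entropy((p, q))
        for p, q in table.values()
        if p + q > 0
    )
    return entropy((pos, neg)) - conditional

# 200 + and 200 - instances, as in the post quoted above.
f1 = {"present": (5, 0), "missing": (195, 200)}
f2 = {"present": (100, 0), "missing": (100, 200)}
print(information_gain(f1), information_gain(f2))
```

Because the "missing" row carries most of the instances for a rare
feature, f1 gains very little while f2 gains much more, so support is
taken into account without a separate term.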
How can one measure the correlation (similarity) between two instances
when they are sequences rather than relational data records? The
sequences are proteins, let's say; relational data records are database
table rows with columns and values.
I think this is similar to measuring similarity between web sessions.
Someone has used the cosine function as the measure, and I'm still
trying to figure out a way to measure the similarity of web sessions.
How?
To Jackie,
I am not so sure about the definition of web sessions. Do you mean web
logs? I know how the content of two web pages is compared using the
cosine similarity function. The sequences of words are turned into bags
of words (removing duplicated words and possibly discarding their
order). This is similar to comparing itemsets. Alternatively, you can
use Euclidean distance or the dot product as well.
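As a rough sketch of the bag-of-words comparison described above: once
duplicates and order are dropped, each sequence is a set, and the cosine
of the two binary vectors reduces to the overlap divided by the
geometric mean of the set sizes. The page names in the example are made
up.

```python
import math

def cosine_similarity(items_a, items_b):
    """Cosine similarity of two sequences treated as sets
    (duplicates and order are ignored)."""
    a, b = set(items_a), set(items_b)
    if not a or not b:
        return 0.0  # convention chosen here for empty input
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical web sessions given as lists of visited pages.
session_1 = ["home", "search", "item", "search", "cart"]
session_2 = ["home", "item", "cart", "checkout"]
print(cosine_similarity(session_1, session_2))
```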
Sequence similarity is handled by the Smith-Waterman algorithm (local
alignment) and the Needleman-Wunsch algorithm (global alignment).
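For reference, here is a minimal sketch of both alignment scores. The
scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption
chosen for brevity; real protein comparisons typically use a
substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Scores are floored at 0 so an alignment can restart anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

print(global_alignment_score("ACGT", "AGT"))
print(local_alignment_score("XXACGTYY", "ACGT"))
```

Both functions keep only two rows of the dynamic-programming table, so
they return the score but not the alignment itself.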
My question is about the entropies and supports of the features. I do
not know how to combine them into a better scoring function.