RSS
热门关键字:  数据挖掘  数据仓库  商业智能  人工智能  搜索引擎

序列分析

来源: 作者:unkonwn 时间:2004-12-11 点击:

The input data is a set of sequences, called data-sequences. Each data sequence is a ordered list of transactions(or itemsets), where each transaction is a sets of items (literals). Typically there is a transaction-time associated with each transaction. A sequential pattern also consists of a list of sets of items. The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data sequences that contain the pattern.

数据挖掘研究院

An example of such a pattern is that customers typically rent ``Star Wars′′, then ``Empire Strikes Back′′, and then ``Return of the Jedi′′. Note that these rentals need not be consecutive. Customers who rent some other videos in between also support this sequential pattern. Elements of a sequential pattern need not be simple items. ``Fitted Sheet and flat sheet and pillow cases′′, followed by ``comforter′′, followed by ``drapes and ruffles′′ is an example of a sequential pattern in which the elements are sets of items. This problem was initially motivated by applications in the retailing industry, including attached mailing, add-on sales, and customer satisfaction. But the results apply to many scientific and business domains. For instance, in the medical domain, a data-sequence may correspond to the symptoms or diseases of a patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed during a visit to the doctor. The patterns discovered using this data could be used in disease research to help identify symptoms/diseases that precede certain diseases.


 
数据挖掘研究院

Algorithms for Finding Sequential Patterns 数据挖掘研究院

Various groups working in this field have suggested algorithms for mining sequential patterns. Listed below are two algorithms proposed by IBM′s Quest data team.[6]

Terminology : The length of a sequence is the number of itemsets in the sequence. A sequence of length k is called a k-sequence. The sequence formed by the concatenation of two sequences x and y is denoted as x.y. The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction. Thus the itemset i and the 1-sequence i have the same support. An itemset with minimum support is called a large itemset or litemset. Note that each itemset in a large sequence must have minimum support. Hence, any large sequence must be a list of litemsets. In the algorithms, Lk denotes the set of all large k-sequences, and Ck the set of candidate k-sequences. 数据挖掘研究院

  • Algorithm
The problem of mining sequential patterns can be split into the following phases:
  • Sort Phase. This step implicitly converts the original transaction database into a database of sequences.
  • Litemset Phase. In this phase we find the set of all litemsets L. We are also simultaneously finding the set of all large 1-sequences, since this set is just l | l L.
  • Transformation Phase. We need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, we transform each customer sequence into an alternative representation. In a transformed customer sequence, each transaction is replaced by the set of all litemsets contained in that transaction. If a transaction does not contain any litemset, it is not retained in the transformed sequence. If a customer sequence does not contain any litemset, this sequence is dropped from the transformed database. However, it still contributes to the count of total number of customers. A customer sequence is now represented by a list of sets of litemsets.
  • Sequence Phase. Use the set of litemsets to find the desired sequences. Algorithms for this phase below.
  • Maximal Phase. Find the maximal sequences among the set of large sequences. In some algorithms this phase is combined with the sequence phase to reduce the time wasted in counting non maximal sequences.
The general structure of the algorithms for the sequence phase is that they make multiple passes over the data. In each pass, we start with a seed set of large sequences. We use the seed set for generating new potentially large sequences, called candidate sequences. We find the support for these candidate sequences during the pass over the data. At the end of the pass, we determine which of the candidate sequences are actually large. These large candidates become the seed for the next pass. In the first pass, all 1-sequences with minimum support, obtained in the litemset phase, form the seed set.

There are two families of algorithms- count-all and count-some. The count-all algorithms count all the large sequences, including non-maximal sequences. The non-maximal sequences must then be pruned out (in the maximal phase). AprioriAll listed below is a count-all algorithm, based on the Apriori algorithm for finding large itemsets presented in chapter2. Apriori- Some is a count-some algorithm. The intuition behind these algorithms is that since we are only interested in maximal sequences, we can avoid counting sequences which are contained in a longer sequence if we first count longer sequences. However, we have to be careful not to count a lot of longer sequences that do not have minimum support. Otherwise, the time saved by not counting sequences contained in a longer sequence may be less than the time wasted counting sequences without minimum support that would never have been counted because their subsequences were not large.


 
数据挖掘研究院

  • Algorithm AprioriAll

L1 = large 1-sequences; // Result of litemset phase
for ( k = 2; Lk-1 0; k++) do
begin
数据挖掘研究院

    Ck = New Candidates generated from Lk-1 (see below)
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c.
    Lk = Candidates in Ck with minimum support.
end
Answer = Maximal Sequences in k Lk ;
AprioriAll Algorithm

The algorithm is given above. In each pass, we use the large sequences from the previous pass to generate the candidate sequences and then measure their support by making a pass over the database. At the end of the pass, the support of the candidates is used to determine the large sequences. In the first pass, the output of the litemset phase is used to initialize the set of large 1-sequences. The candidates are stored in hash-tree to quickly find all candidates contained in a customer sequence.

Apriori Candidate Generation

The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. It works as follows. First join Lk-1 with Lk-1

数据挖掘研究院

insert into Ck 数据挖掘研究院

select p.litemset1 , ..., p.litemsetk-1 , q.litemsetk-1

数据挖掘实验室

from Lk-1 p, Lk-1 q

where p.litemset1 = q.litemset1 , . . ., 数据挖掘研究院

p.litemsetk-2 = q.litemsetk-2 ;

Next delete all sequences c Ck such that some (k-1)-subsequence of c is not in Lk-1 数据挖掘实验室

Example 数据挖掘研究院

Consider a database with the customer-sequences shown below in Fig4.1. We have not shown the original database in this example. The customer sequences are in transformed form where each transaction has been replaced by the set of litemsets contained in the transaction and the litemsets have been replaced by integers. The minimum support has been specified to be 40% (i.e. 2 customer sequences). The first pass over the database is made in the litemset phase, and we determine the large 1-sequences shown in Fig. 4.2. The large sequences together with their support at the end of the second, third, and fourth passes are also shown in the same figure. No candidate is generated for the fifth pass. The maximal large sequences would be the three sequences 1 2 3 4 , 1 3 5 and 4 5 .

1 5 2 3 4 数据挖掘研究院

1 3 4 3 5 数据挖掘研究院

1} 2} 3} 4}

数据挖掘研究院

1} 3} 5} 数据挖掘研究院

4} 5}

数据挖掘研究院

Fig4.1: Customer Sequences
 
 

  • Algorithm AprioriSome
This algorithm is given below. In the forward pass, we only count sequences of certain lengths. For example, we might count sequences of length 1, 2, 4 and 6 in the forward phase and count sequences of length 3 and 5 in the backward phase. The function next takes as parameter the length of sequences counted in the last pass and returns the length of sequences to be counted in the next pass. Thus, this function determines exactly which sequences are counted, and balances the tradeoff between the time wasted in counting non-maximal sequences versus counting extensions of small candidate sequences. One extreme is next(k) = k + 1 (k is the length for which candidates were counted last), when all non-maximal sequences are counted, but no extensions of small candidate sequences are counted. In this case, AprioriSome degenerates into AprioriAll. The other extreme is a function like next(k) = 100 * k, when almost no non-maximal large sequence is counted, but lots of ex- tensions of small candidates are counted. 数据挖掘研究院
 

// Forward Phase 数据挖掘实验室

L1 = large 1-sequences; // Result of litemset phase 数据挖掘研究院

C1 = L1 ; // so that we have a nice loop condition 数据挖掘研究院

last = 1; // we last counted Clast 数据挖掘研究院

for ( k = 2; Ck-1 0 and Llast 0; k++) do

数据挖掘研究院

begin

数据挖掘研究院

if (Lk-1 known) then 数据挖掘研究院

Ck= New candidates generated from Lk-1 ; 数据挖掘研究院

else

数据挖掘研究院

Ck= New candidates generated from Ck-1 ; 数据挖掘研究院

if (k == next(last) ) then begin 数据挖掘实验室

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck that are contained in c. 数据挖掘实验室

Lk = Candidates in Ck with minimum support.

last = k;

数据挖掘研究院

end 数据挖掘研究院

end 数据挖掘研究院

// Backward Phase

数据挖掘研究院

for ( k-- ; k >=1; k==) do

数据挖掘研究院

if (Lk not found in forward phase) then begin 数据挖掘研究院

Delete all sequences in Ck contained in some Li , i>k; 数据挖掘研究院

foreach customer-sequence in DT do 数据挖掘研究院

Increment the count of all candidates in Ck that are contained in c. 数据挖掘实验室

Lk = Candidates in Ck with minimum support. 数据挖掘实验室

end

数据挖掘实验室

else // Lk already known

Delete all sequences in Lk contained in some Li , i > k. 数据挖掘实验室

Answer = k Lk ;

AprioriSome Algorithm

Let hitk denote the ratio of the number of large k-sequences to the number of candidate k-sequences (i.e., |Lk | / |Ck|). The next function we used in the experiments is given below. The intuition behind the heuristic is that as the percentage of candidates counted in the current pass which had minimum support increases, the time wasted by counting extensions of small candidates when we skip a length goes down. 数据挖掘研究院

function next(k: integer) 数据挖掘研究院

begin 数据挖掘实验室

if (hitk < 0.666) return k + 1;

数据挖掘研究院

elsif (hitk <0.75) return k + 2;

elsif (hitk < 0.80) return k + 3;

数据挖掘研究院

elsif (hitk < 0.85) return k + 4;

数据挖掘研究院

else return k + 5;

end

数据挖掘研究院

We use the apriori-generate function given above to generate new candidate sequences. However, in the kth pass, we may not have the large sequence set Lk-1 available as we did not count the (k-1)-candidate sequences. In that case, we use the candidate set Ck-1 to generate Ck. Correctness is maintained because Ck-1 Lk-1

In the backward phase, we count sequences for the lengths we skipped over during the forward phase, after first deleting all sequences contained in some large sequence. These smaller sequences cannot be in the answer because we are only interested in maximal sequences. We also delete the large sequences found in the forward phase that are non-maximal.

数据挖掘研究院

Example 数据挖掘研究院

Using again the database used in the example for the AprioriAll algorithm, we find the large 1-sequences (L1) shown in Fig. 4.2 below in the litemset phase (during the first pass over the database). Take for illustration simplicity, f(k) = 2k. In the second pass, we count C2 to get L2 (Fig. 4.2). After the third pass, apriori-generate is called with L2 as argument to get C3 . We do not count C3 , and hence do not generate L3 . Next, apriori-generate is called with C3 to get C4. After counting C4 to get L4 (Fig. 4.2), we try generating C5, which turns out to be empty. 数据挖掘研究院

We then start the backward phase. Nothing gets deleted from L4 since there are no longer sequences. We had skipped counting the support for sequences in C3 in the forward phase. After deleting those sequences in C3 that are subsequences of sequences in L4, i.e., subsequences of 1 2 3 4 , we are left with the sequences 1 3 5 and 3 4 5. These would be counted to get 1 3 5 as a maximal large 3-sequence. Next, all the sequences in L2 except 4 5 are deleted since they are contained in some longer sequence. For the same reason, all sequences in L1 are also deleted.

数据挖掘研究院

L 1

1-Sequences Support
1 4
2 2
3 4
4 4
5 4
  数据挖掘研究院

L 2

2-Sequences Support
1 2 2
1 3 4
1 4 3
1 5 3
2 3 2
2 4 2
3 4 3
3 5 2
4 5 2
 

数据挖掘实验室

L 3

3-Sequences Support
1 2 3 2
1 2 4 2
1 3 4 3
1 3 5 2
2 3 4 2
 

数据挖掘研究院

L 4

4-Sequences Support
1 2 3 4 2
 
 
 
数据挖掘研究院

Fig 4.3 : Large Sequences
 
  • Relative Performance of the two Algorithms
As expected, the execution times of all the algorithms increase as the support is decreased because of a large increase in the number of large sequences in the result. The apriori-generate does not count any candidate sequence that contains any subsequence which is not large. The major advantage of AprioriSome over AprioriAll is that it avoids counting many non-maximal sequences. However, this advantage is reduced because of two reasons. First, candidates Ck in AprioriAll are generated using Lk-1, whereas AprioriSome sometimes uses Ck-1 for this purpose. Since Ck-1 Lk-1, the number of candidates generated using AprioriSome can be larger. Second, although AprioriSome skips over counting candidates of some lengths, they are generated nonetheless and stay memory resident. If memory gets filled up, AprioriSome is forced to count the last set of candidates generated even if the heuristic suggests skipping some more candidate sets. This effect decreases the skipping distance between the two candidate sets that are indeed counted, and AprioriSome starts behaving more like AprioriAll. For lower supports, there are longer large sequences, and hence more non-maximal sequences, and AprioriSome does better.
最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?