The input data is a set of sequences, called data-sequences. Each data sequence is a ordered list of transactions(or itemsets), where each transaction is a sets of items (literals). Typically there is a transaction-time associated with each transaction. A sequential pattern also consists of a list of sets of items. The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data sequences that contain the pattern.
An example of such a pattern is that customers typically rent ``Star Wars′′, then ``Empire Strikes Back′′, and then ``Return of the Jedi′′. Note that these rentals need not be consecutive. Customers who rent some other videos in between also support this sequential pattern. Elements of a sequential pattern need not be simple items. ``Fitted Sheet and flat sheet and pillow cases′′, followed by ``comforter′′, followed by ``drapes and ruffles′′ is an example of a sequential pattern in which the elements are sets of items. This problem was initially motivated by applications in the retailing industry, including attached mailing, add-on sales, and customer satisfaction. But the results apply to many scientific and business domains. For instance, in the medical domain, a data-sequence may correspond to the symptoms or diseases of a patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed during a visit to the doctor. The patterns discovered using this data could be used in disease research to help identify symptoms/diseases that precede certain diseases.
数据挖掘研究院
Algorithms for Finding Sequential Patterns 数据挖掘研究院
Various groups working in this field have suggested algorithms for mining sequential patterns. Listed below are two algorithms proposed by IBM′s Quest data team.[6]
Terminology : The length of a sequence is the number of itemsets in the sequence. A sequence of length k is called a k-sequence. The sequence formed by the concatenation of two sequences x and y is denoted as x.y. The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction. Thus the itemset i and the 1-sequence i have the same support. An itemset with minimum support is called a large itemset or litemset. Note that each itemset in a large sequence must have minimum support. Hence, any large sequence must be a list of litemsets. In the algorithms, Lk denotes the set of all large k-sequences, and Ck the set of candidate k-sequences. 数据挖掘研究院
The problem of mining sequential patterns can be split into the following phases:- Sort Phase. This step implicitly converts the original transaction database into a database of sequences.
- Litemset Phase. In this phase we find the set of all litemsets L. We are also simultaneously finding the set of all large 1-sequences, since this set is just l | l L.
- Transformation Phase. We need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, we transform each customer sequence into an alternative representation. In a transformed customer sequence, each transaction is replaced by the set of all litemsets contained in that transaction. If a transaction does not contain any litemset, it is not retained in the transformed sequence. If a customer sequence does not contain any litemset, this sequence is dropped from the transformed database. However, it still contributes to the count of total number of customers. A customer sequence is now represented by a list of sets of litemsets.
- Sequence Phase. Use the set of litemsets to find the desired sequences. Algorithms for this phase below.
- Maximal Phase. Find the maximal sequences among the set of large sequences. In some algorithms this phase is combined with the sequence phase to reduce the time wasted in counting non maximal sequences.
There are two families of algorithms- count-all and count-some. The count-all algorithms count all the large sequences, including non-maximal sequences. The non-maximal sequences must then be pruned out (in the maximal phase). AprioriAll listed below is a count-all algorithm, based on the Apriori algorithm for finding large itemsets presented in chapter2. Apriori- Some is a count-some algorithm. The intuition behind these algorithms is that since we are only interested in maximal sequences, we can avoid counting sequences which are contained in a longer sequence if we first count longer sequences. However, we have to be careful not to count a lot of longer sequences that do not have minimum support. Otherwise, the time saved by not counting sequences contained in a longer sequence may be less than the time wasted counting sequences without minimum support that would never have been counted because their subsequences were not large.
数据挖掘研究院
L1 = large 1-sequences; // Result of litemset phase
for ( k = 2; Lk-1 0; k++) do
begin 数据挖掘研究院
- Ck = New Candidates generated from Lk-1 (see below)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck that are contained in c.
Lk = Candidates in Ck with minimum support.
Answer = Maximal Sequences in k Lk ;
The algorithm is given above. In each pass, we use the large sequences from the previous pass to generate the candidate sequences and then measure their support by making a pass over the database. At the end of the pass, the support of the candidates is used to determine the large sequences. In the first pass, the output of the litemset phase is used to initialize the set of large 1-sequences. The candidates are stored in hash-tree to quickly find all candidates contained in a customer sequence.
Apriori Candidate Generation
The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. It works as follows. First join Lk-1 with Lk-1
insert into Ck 数据挖掘研究院
select p.litemset1 , ..., p.litemsetk-1 , q.litemsetk-1
from Lk-1 p, Lk-1 q
where p.litemset1 = q.litemset1 , . . ., 数据挖掘研究院
p.litemsetk-2 = q.litemsetk-2 ;
Next delete all sequences c Ck such that some (k-1)-subsequence of c is not in Lk-1 数据挖掘实验室
Example 数据挖掘研究院
Consider a database with the customer-sequences shown below in Fig4.1. We have not shown the original database in this example. The customer sequences are in transformed form where each transaction has been replaced by the set of litemsets contained in the transaction and the litemsets have been replaced by integers. The minimum support has been specified to be 40% (i.e. 2 customer sequences). The first pass over the database is made in the litemset phase, and we determine the large 1-sequences shown in Fig. 4.2. The large sequences together with their support at the end of the second, third, and fourth passes are also shown in the same figure. No candidate is generated for the fifth pass. The maximal large sequences would be the three sequences 1 2 3 4 , 1 3 5 and 4 5 .
1 5 2 3 4 数据挖掘研究院
1 3 4 3 5 数据挖掘研究院
1} 2} 3} 4}
1} 3} 5} 数据挖掘研究院
4} 5}
Fig4.1: Customer Sequences
// Forward Phase 数据挖掘实验室
L1 = large 1-sequences; // Result of litemset phase 数据挖掘研究院
C1 = L1 ; // so that we have a nice loop condition 数据挖掘研究院
last = 1; // we last counted Clast 数据挖掘研究院
for ( k = 2; Ck-1 0 and Llast 0; k++) do
begin
if (Lk-1 known) then 数据挖掘研究院
Ck= New candidates generated from Lk-1 ; 数据挖掘研究院
else
Ck= New candidates generated from Ck-1 ; 数据挖掘研究院
if (k == next(last) ) then begin 数据挖掘实验室
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck that are contained in c. 数据挖掘实验室
Lk = Candidates in Ck with minimum support.
last = k;
end 数据挖掘研究院
end 数据挖掘研究院
// Backward Phase
for ( k-- ; k >=1; k==) do
if (Lk not found in forward phase) then begin 数据挖掘研究院
Delete all sequences in Ck contained in some Li , i>k; 数据挖掘研究院
foreach customer-sequence in DT do 数据挖掘研究院
Increment the count of all candidates in Ck that are contained in c. 数据挖掘实验室
Lk = Candidates in Ck with minimum support. 数据挖掘实验室
end
else // Lk already known
Delete all sequences in Lk contained in some Li , i > k. 数据挖掘实验室
Answer = k Lk ;
Let hitk denote the ratio of the number of large k-sequences to the number of candidate k-sequences (i.e., |Lk | / |Ck|). The next function we used in the experiments is given below. The intuition behind the heuristic is that as the percentage of candidates counted in the current pass which had minimum support increases, the time wasted by counting extensions of small candidates when we skip a length goes down. 数据挖掘研究院
function next(k: integer) 数据挖掘研究院
begin 数据挖掘实验室
if (hitk < 0.666) return k + 1;
elsif (hitk <0.75) return k + 2;
elsif (hitk < 0.80) return k + 3;
elsif (hitk < 0.85) return k + 4;
else return k + 5;
end
We use the apriori-generate function given above to generate new candidate sequences. However, in the kth pass, we may not have the large sequence set Lk-1 available as we did not count the (k-1)-candidate sequences. In that case, we use the candidate set Ck-1 to generate Ck. Correctness is maintained because Ck-1 Lk-1
In the backward phase, we count sequences for the lengths we skipped over during the forward phase, after first deleting all sequences contained in some large sequence. These smaller sequences cannot be in the answer because we are only interested in maximal sequences. We also delete the large sequences found in the forward phase that are non-maximal.
Example 数据挖掘研究院
Using again the database used in the example for the AprioriAll algorithm, we find the large 1-sequences (L1) shown in Fig. 4.2 below in the litemset phase (during the first pass over the database). Take for illustration simplicity, f(k) = 2k. In the second pass, we count C2 to get L2 (Fig. 4.2). After the third pass, apriori-generate is called with L2 as argument to get C3 . We do not count C3 , and hence do not generate L3 . Next, apriori-generate is called with C3 to get C4. After counting C4 to get L4 (Fig. 4.2), we try generating C5, which turns out to be empty. 数据挖掘研究院
We then start the backward phase. Nothing gets deleted from L4 since there are no longer sequences. We had skipped counting the support for sequences in C3 in the forward phase. After deleting those sequences in C3 that are subsequences of sequences in L4, i.e., subsequences of 1 2 3 4 , we are left with the sequences 1 3 5 and 3 4 5. These would be counted to get 1 3 5 as a maximal large 3-sequence. Next, all the sequences in L2 except 4 5 are deleted since they are contained in some longer sequence. For the same reason, all sequences in L1 are also deleted.
L 1
| 1-Sequences | Support |
| 1 | 4 |
| 2 | 2 |
| 3 | 4 |
| 4 | 4 |
| 5 | 4 |
L 2
| 2-Sequences | Support |
| 1 2 | 2 |
| 1 3 | 4 |
| 1 4 | 3 |
| 1 5 | 3 |
| 2 3 | 2 |
| 2 4 | 2 |
| 3 4 | 3 |
| 3 5 | 2 |
| 4 5 | 2 |
L 3
| 3-Sequences | Support |
| 1 2 3 | 2 |
| 1 2 4 | 2 |
| 1 3 4 | 3 |
| 1 3 5 | 2 |
| 2 3 4 | 2 |
L 4
| 4-Sequences | Support |
| 1 2 3 4 | 2 |
数据挖掘研究院

