The pattern growth algorithm is an alternative method for mining frequent itemsets, designed to overcome the limitations of the Apriori algorithm. While Apriori is widely used, it can become inefficient when the minimum support threshold is low or when frequent patterns are long, because it repeatedly generates large numbers of candidate itemsets and requires multiple scans of the database to verify their support. Both candidate generation and repeated database scanning are computationally expensive. Pattern growth addresses these problems by avoiding candidate generation altogether and by compressing the database into a compact structure that can be mined more efficiently.

The key idea of the pattern growth approach is to use a data structure called an FP-tree (Frequent Pattern Tree) to represent the database in a compressed form. The algorithm operates in two main phases. In the first phase, the database is scanned to identify frequent items and build the FP-tree. In the second phase, frequent itemsets are extracted directly from this tree through a process of traversal and partitioning. Because the tree stores the essential frequency information, further database scans are not required, which significantly improves performance. 

An FP-tree is a specialized tree structure. Each node in the tree has the form i:c, where i represents an item and c is a counter indicating how many transactions share that prefix path. Edges connect items that appear together in transactions, following a fixed order so that transactions with common prefixes share the same path in the tree. In addition, a header table is maintained, containing one entry for each frequent item. Nodes containing the same item are linked together through additional pointers, forming a chain for that item. This structure allows quick access to all occurrences of an item in the tree. 
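The node layout and header table described above can be sketched in Python as follows; the class and field names are illustrative, not taken from any particular library:

```python
class FPNode:
    """One FP-tree node of the form i:c described above."""

    def __init__(self, item, parent=None):
        self.item = item        # the item i stored at this node
        self.count = 0          # the counter c: transactions sharing this prefix
        self.parent = parent    # link toward the root, used to read prefix paths
        self.children = {}      # item -> child FPNode further down the tree
        self.node_link = None   # next node holding the same item (header chain)

# The header table holds one entry per frequent item, pointing at the first
# node of that item's chain; following node_link pointers from there reaches
# every occurrence of the item in the tree.
header_table = {}  # item -> first FPNode containing that item
```

The `node_link` pointers are what make the mining phase fast: all paths containing a given item can be collected without searching the whole tree.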

The construction of the FP-tree requires two passes over the database. During the first pass, the algorithm counts the support of each item. Items whose support is below the minimum support threshold are discarded, since they cannot be part of any frequent itemset. The remaining frequent items are then sorted in descending order of their support. For each transaction, only these frequent items are kept, and they are arranged in this order. Sorting ensures that transactions with similar item prefixes will overlap in the tree, increasing compression. 
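As a small sketch of this first pass, with a made-up toy database and `min_support` value chosen purely for illustration:

```python
from collections import Counter

transactions = [["b", "a", "c"], ["a", "c"], ["d", "a"], ["b", "a"]]
min_support = 2

# First pass: count the support of every item across the database.
counts = Counter(item for t in transactions for item in t)

# Discard infrequent items; they cannot be part of any frequent itemset.
frequent = {item for item, c in counts.items() if c >= min_support}

def prepare(transaction):
    """Keep only frequent items, ordered by descending support
    (ties broken alphabetically so the order is deterministic)."""
    kept = [i for i in transaction if i in frequent]
    return sorted(kept, key=lambda i: (-counts[i], i))

prepared = [prepare(t) for t in transactions]
# ["b", "a", "c"] becomes ["a", "b", "c"]: "a" has the highest support,
# and "d" is dropped entirely because it appears only once.
```

After this pass, the transactions `["b", "a", "c"]` and `["b", "a"]` both start with the prefix `["a", "b"]`, which is exactly the overlap the sorting step is meant to create.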

In the second pass, the FP-tree is built. A root node is created, and each transaction is processed again, considering only its frequent items in the sorted order. For the first transaction, a path from the root is created, and the count of each node is set to one. For subsequent transactions, the algorithm checks whether part of the transaction matches an existing path. If a common prefix is found, the counters of the corresponding nodes are incremented. If not, new branches are created. For example, if two transactions both begin with the same item, they share the initial part of the path, and only the counters increase. This process results in a tree that compactly represents all frequent items and their co-occurrences. 
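A minimal sketch of this insertion step, using plain dictionaries instead of a full node class so the example stays self-contained; the transactions below are assumed to be already filtered and sorted as in the first pass:

```python
def insert(root, items):
    """Insert one prepared transaction, sharing any existing prefix path."""
    node = root
    for item in items:
        child = node["children"].get(item)
        if child is None:
            # No matching path continues with this item: start a new branch.
            child = {"count": 0, "children": {}}
            node["children"][item] = child
        # Shared prefix (or freshly created node): increment its counter.
        child["count"] += 1
        node = child

root = {"count": 0, "children": {}}
for t in [["a", "b", "c"], ["a", "c"], ["a"], ["a", "b"]]:
    insert(root, t)

# All four transactions begin with "a", so they share that node:
# root -> a:4, with branches b:2 -> c:1 and c:1 below it.
```

Note how the four transactions collapse into a tree of only four item nodes, which is the compression effect described above.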

The FP-tree thus captures the essential structure of the database while ignoring infrequent items. Each transaction corresponds to a path in the tree, and the more transactions share prefixes, the greater the compression. In some cases, the entire database can be represented by a very small tree, possibly fitting in main memory. In the best case, all transactions share a common prefix, resulting in a single path. 

Once the FP-tree is constructed, the second phase of the algorithm begins: mining frequent itemsets from the tree. Instead of generating candidate itemsets, the algorithm explores the tree recursively. For each item in the header table, a conditional pattern base is built, consisting of the prefix paths that lead from the root to that item, each weighted by the count of that item's node. From this base, a conditional FP-tree is constructed, and the process is repeated on these smaller trees, growing frequent patterns by appending the current suffix item to the frequent items found in its conditional tree. This recursive partitioning continues until the conditional trees are empty and no further frequent itemsets can be found.
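The recursive mining phase can be sketched as follows. To keep the example short, this version simulates the conditional pattern bases on weighted transaction lists instead of walking an explicit FP-tree and header table, but it follows the same partitioning logic: each suffix item restricts the database to its prefix paths, which are then mined recursively:

```python
from collections import Counter

def precedes(j, item, counts):
    """True if j comes before item in descending-support order
    (ties broken alphabetically), i.e. j lies on item's prefix paths."""
    return counts[j] > counts[item] or (counts[j] == counts[item] and j < item)

def fp_growth(weighted, min_support, suffix=()):
    """Mine frequent itemsets from (transaction, weight) pairs."""
    counts = Counter()
    for items, weight in weighted:
        for i in items:
            counts[i] += weight
    result = {}
    for item, support in counts.items():
        if support < min_support:
            continue
        result[tuple(sorted(suffix + (item,)))] = support
        # Conditional pattern base: the prefix of every transaction that
        # contains this item, carried over with the transaction's weight.
        base = [([j for j in items if precedes(j, item, counts)], weight)
                for items, weight in weighted if item in items]
        result.update(fp_growth(base, min_support, suffix + (item,)))
    return result

patterns = fp_growth(
    [(["a", "b", "c"], 1), (["a", "c"], 1), (["a"], 1), (["a", "b"], 1)],
    min_support=2)
# {a}:4, {b}:2, {c}:2, {a,b}:2 and {a,c}:2 are frequent;
# {b,c} occurs only once and is pruned.
```

Each recursive call works on a strictly smaller conditional database, which mirrors how conditional FP-trees shrink until they are empty.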

The pattern growth approach is generally more efficient than Apriori because it avoids generating large numbers of candidate itemsets and reduces database scans. By working with a compressed in-memory structure, it significantly decreases I/O costs and computation time. This makes it especially suitable for large datasets and situations where support thresholds are low, leading to many potential patterns. 

Reference: 

Vaisman, A., & Zimányi, E. (2014). Data warehouse systems: Design and implementation. Springer.