felipekunzler/frequent-itemset-mining-spark

Fork: 9 Star: 16 (更新于 2025-01-02 10:42:03)

license: 暂无

Language: Scala .

Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.

介绍
版本
相关

Frequent Itemset Mining in Scala and Spark

Implementation of Apriori and FP-Growth in Scala, as well as their counterpart distributed implementations YAFIM, RApriori and DFPS using Spark.

This work has been developed as part of my CS undergraduate thesis where three previously proposed distributed algorithms were implemented in Spark and compared on Amazon EMR.

Algorithms

Sequential implementation of Apriori and FP-Growth in Scala.
- Apriori.scala
- FPGrowth.scala and FPTree.scala
- Apriori using a Hash Tree for filtering: AprioriHashTree.scala
- Hash Tree implementation: HashTree.scala
Distributed implementations of Apriori and FP-Growth using Spark.
- YAFIM.scala
- YAFIM using a Hash Tree for filtering: YAFIMHashTree.scala
- RApriori.scala
- DFPS.scala
Distributed algorithms originally proposed by the following papers:

Experiments

The three distributed FIM algorithms were compared in terms of execution time and normalized speedup. All experiments were run on Amazon EMR using m5.xlarge (dual core) machines.

Execution time in seconds. Datasets replicated 9 times and increasing cluster size.

Normalized speedup factor, if the number of cores is increased to 10, the ideal speedup factor is 10 (meaning that the execution time has decreased by 10 times). Datasets replicated 9 times and increasing cluster size (X axis measured in cores).

Building and running

Install and set Java 8 to your path, e.g. sdk install java 8.0.252.hs-adpt and sdk use java 8.0.252.hs-adpt
Instsall sbt, e.g. sdk install sbt
Compile the project using sbt compile
Adjust the absolute path for the provided datasets in src/main/resources/defaultfim.properties
Run the experiments with sbt run and select the main class experiments.Runner

Sample output (execution times in seconds):

+-----------------+------------+-------+-------+-------+-------+------+
|           Class |    Dataset | Run 1 | Run 2 | Run 3 |  Mean |   SD |
+-----------------+------------+-------+-------+-------+-------+------+
| AprioriHashTree |   mushroom | 14.90 | 13.96 | 14.65 | 14.50 | 0.40 |
|        FPGrowth |   mushroom |  0.63 |  0.55 |  0.64 |  0.61 | 0.04 |
|   YAFIMHashTree |   mushroom |  7.26 |  6.64 |  7.92 |  7.27 | 0.52 |
|        RApriori |   mushroom |  6.71 |  6.79 |  7.45 |  6.98 | 0.33 |
|            DFPS |   mushroom |  2.49 |  2.31 |  2.27 |  2.36 | 0.10 |
|                 |            |       |       |       |       |      |
| AprioriHashTree | T10I4D100K |  3.68 |  2.39 |  2.42 |  2.83 | 0.60 |
|        FPGrowth | T10I4D100K |  3.26 |  3.27 |  3.06 |  3.20 | 0.10 |
|   YAFIMHashTree | T10I4D100K |  1.96 |  1.90 |  1.89 |  1.92 | 0.03 |
|        RApriori | T10I4D100K |  1.89 |  1.80 |  1.75 |  1.81 | 0.06 |
|            DFPS | T10I4D100K |  4.13 |  4.07 |  4.14 |  4.11 | 0.03 |
+-----------------+------------+-------+-------+-------+-------+------+

+------------+-----------------+----------+---------------+----------+------+
|            | AprioriHashTree | FPGrowth | YAFIMHashTree | RApriori | DFPS |
+------------+-----------------+----------+---------------+----------+------+
|   mushroom |           14.50 |     0.61 |          7.27 |     6.98 | 2.36 |
| T10I4D100K |            2.83 |     3.20 |          1.92 |     1.81 | 4.11 |
+------------+-----------------+----------+---------------+----------+------+

Additionally, algorithms can be run individually by instantiating its respective class. Example:

object Main {
  def main(args: Array[String]): Unit = {
    
    val dfps: SparkFIM = new DFPS()
    
    val fileName = "/projects/spark-scala/frequent-itemset-mining-spark/src/main/resources/datasets/mushroom.txt"
    val frequentItemsets = dfps.execute(fileName, separator = " ", minSupport = 0.65)
    
    Util.printItemsets(frequentItemsets)
    
  }
}

Would output:

Elapsed time: 2.06 seconds. Class: DFPS.
Found 39 itemsets
[1] {34}, {36}, {39}, {85}, {86}, {90}
[2] {34, 36}, {34, 39}, {34, 85}, {34, 86}, {34, 90}, {36, 85}, {36, 86}, {36, 90}, {39, 85}, {39, 86}, {85, 86}, {85, 90}, {86, 90}
[3] {34, 36, 85}, {34, 36, 86}, {34, 36, 90}, {34, 39, 85}, {34, 39, 86}, {34, 85, 86}, {34, 85, 90}, {34, 86, 90}, {36, 85, 86}, {36, 85, 90}, {36, 86, 90}, {39, 85, 86}, {85, 86, 90}
[4] {34, 36, 85, 86}, {34, 36, 85, 90}, {34, 36, 86, 90}, {34, 39, 85, 86}, {34, 85, 86, 90}, {36, 85, 86, 90}
[5] {34, 36, 85, 86, 90}

最近版本更新:(数据更新于 2024-08-28 00:44:53)

主题(topics):

apriori, dfps, fp-growth, rapriori, scala, spark, yafim

felipekunzler/frequent-itemset-mining-spark同语言 Scala最近更新仓库

2024-10-05 03:06:12 delta-io/delta-sharing

2024-09-25 11:49:48 enso-org/enso

2024-09-05 00:48:36 delta-io/delta

2024-08-30 10:16:51 microsoft/SynapseML

2024-05-08 03:46:22 twitter/finagle

2024-03-16 01:53:23 databricks-industry-solutions/smolder