felipekunzler/frequent-itemset-mining-spark
Fork: 9 Star: 16 (更新于 2025-01-02 10:42:03)
license: 暂无
Language: Scala .
Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.
Frequent Itemset Mining in Scala and Spark
Implementation of Apriori and FP-Growth in Scala, as well as their counterpart distributed implementations YAFIM, RApriori and DFPS using Spark.
This work has been developed as part of my CS undergraduate thesis where three previously proposed distributed algorithms were implemented in Spark and compared on Amazon EMR.
Algorithms
-
Sequential implementation of Apriori and FP-Growth in Scala.
- Apriori.scala
- FPGrowth.scala and FPTree.scala
- Apriori using a Hash Tree for filtering: AprioriHashTree.scala
- Hash Tree implementation: HashTree.scala
-
Distributed implementations of Apriori and FP-Growth using Spark.
- YAFIM.scala
- YAFIM using a Hash Tree for filtering: YAFIMHashTree.scala
- RApriori.scala
- DFPS.scala
-
Distributed algorithms originally proposed by the following papers:
Experiments
The three distributed FIM algorithms were compared in terms of execution time and normalized speedup. All experiments were run on Amazon EMR using m5.xlarge (dual core) machines.
- Execution time in seconds. Datasets replicated 9 times and increasing cluster size.
- Normalized speedup factor, if the number of cores is increased to 10, the ideal speedup factor is 10 (meaning that the execution time has decreased by 10 times). Datasets replicated 9 times and increasing cluster size (X axis measured in cores).
Building and running
- Install and set Java 8 to your path, e.g.
sdk install java 8.0.252.hs-adpt
andsdk use java 8.0.252.hs-adpt
- Instsall sbt, e.g.
sdk install sbt
- Compile the project using
sbt compile
- Adjust the absolute path for the provided datasets in
src/main/resources/defaultfim.properties
- Run the experiments with
sbt run
and select the main classexperiments.Runner
Sample output (execution times in seconds):
+-----------------+------------+-------+-------+-------+-------+------+
| Class | Dataset | Run 1 | Run 2 | Run 3 | Mean | SD |
+-----------------+------------+-------+-------+-------+-------+------+
| AprioriHashTree | mushroom | 14.90 | 13.96 | 14.65 | 14.50 | 0.40 |
| FPGrowth | mushroom | 0.63 | 0.55 | 0.64 | 0.61 | 0.04 |
| YAFIMHashTree | mushroom | 7.26 | 6.64 | 7.92 | 7.27 | 0.52 |
| RApriori | mushroom | 6.71 | 6.79 | 7.45 | 6.98 | 0.33 |
| DFPS | mushroom | 2.49 | 2.31 | 2.27 | 2.36 | 0.10 |
| | | | | | | |
| AprioriHashTree | T10I4D100K | 3.68 | 2.39 | 2.42 | 2.83 | 0.60 |
| FPGrowth | T10I4D100K | 3.26 | 3.27 | 3.06 | 3.20 | 0.10 |
| YAFIMHashTree | T10I4D100K | 1.96 | 1.90 | 1.89 | 1.92 | 0.03 |
| RApriori | T10I4D100K | 1.89 | 1.80 | 1.75 | 1.81 | 0.06 |
| DFPS | T10I4D100K | 4.13 | 4.07 | 4.14 | 4.11 | 0.03 |
+-----------------+------------+-------+-------+-------+-------+------+
+------------+-----------------+----------+---------------+----------+------+
| | AprioriHashTree | FPGrowth | YAFIMHashTree | RApriori | DFPS |
+------------+-----------------+----------+---------------+----------+------+
| mushroom | 14.50 | 0.61 | 7.27 | 6.98 | 2.36 |
| T10I4D100K | 2.83 | 3.20 | 1.92 | 1.81 | 4.11 |
+------------+-----------------+----------+---------------+----------+------+
Additionally, algorithms can be run individually by instantiating its respective class. Example:
object Main {
def main(args: Array[String]): Unit = {
val dfps: SparkFIM = new DFPS()
val fileName = "/projects/spark-scala/frequent-itemset-mining-spark/src/main/resources/datasets/mushroom.txt"
val frequentItemsets = dfps.execute(fileName, separator = " ", minSupport = 0.65)
Util.printItemsets(frequentItemsets)
}
}
Would output:
Elapsed time: 2.06 seconds. Class: DFPS.
Found 39 itemsets
[1] {34}, {36}, {39}, {85}, {86}, {90}
[2] {34, 36}, {34, 39}, {34, 85}, {34, 86}, {34, 90}, {36, 85}, {36, 86}, {36, 90}, {39, 85}, {39, 86}, {85, 86}, {85, 90}, {86, 90}
[3] {34, 36, 85}, {34, 36, 86}, {34, 36, 90}, {34, 39, 85}, {34, 39, 86}, {34, 85, 86}, {34, 85, 90}, {34, 86, 90}, {36, 85, 86}, {36, 85, 90}, {36, 86, 90}, {39, 85, 86}, {85, 86, 90}
[4] {34, 36, 85, 86}, {34, 36, 85, 90}, {34, 36, 86, 90}, {34, 39, 85, 86}, {34, 85, 86, 90}, {36, 85, 86, 90}
[5] {34, 36, 85, 86, 90}
最近版本更新:(数据更新于 2024-08-28 00:44:53)
主题(topics):
apriori, dfps, fp-growth, rapriori, scala, spark, yafim
felipekunzler/frequent-itemset-mining-spark同语言 Scala最近更新仓库
2024-10-05 03:06:12 delta-io/delta-sharing
2024-09-25 11:49:48 enso-org/enso
2024-09-05 00:48:36 delta-io/delta
2024-08-30 10:16:51 microsoft/SynapseML
2024-05-08 03:46:22 twitter/finagle
2024-03-16 01:53:23 databricks-industry-solutions/smolder