Difference between Apache Mahout and Weka

Hello all, what is the difference between Apache Mahout and Weka when it comes to processing unstructured data?

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is also suitable for developing new machine learning schemes.

Apache Mahout, on the other hand, is an open source project of the Apache Software Foundation (ASF) whose primary goal is to create scalable machine learning algorithms that are free to use under the Apache License. In other words, it is a suite of machine learning libraries designed to be scalable and robust.

Weka allows deep analysis of smaller data sets that fit in the memory of the node on which the tool runs. It facilitates deep analytics because it has a wide set of ML algorithms. But it cannot work on very large data sets, such as terabytes or petabytes of data, because it is not distributed: it can only scale vertically, by adding resources to a single machine.

Mahout, however, can scale to large data sets because it implements its algorithms on top of Hadoop, the open source MapReduce implementation. It is therefore horizontally scalable: you add more nodes to handle more data. Mahout provides algorithms for clustering, classification and recommendation.
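To make the idea of horizontal scaling a bit more concrete, here is a minimal sketch in plain Java of the map-then-reduce pattern: each split of the data is summarized independently, and the small partial results are merged afterwards. It does not use the Hadoop or Mahout APIs; the class and method names (SplitMean, summarize, mean) are made up for illustration, and a real job would run each split on a different node.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Hadoop or Mahout code.
class SplitMean {
    // "Map" step: summarize one split of the data as {sum, count}.
    static double[] summarize(double[] split) {
        double sum = 0;
        for (double v : split) {
            sum += v;
        }
        return new double[] {sum, split.length};
    }

    // "Reduce" step: merge the partial summaries into a global mean.
    static double mean(List<double[]> partials) {
        double sum = 0;
        double count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Two splits stand in for data stored on two different nodes.
        List<double[]> partials = Arrays.asList(
            summarize(new double[] {1, 2, 3}),
            summarize(new double[] {4, 5}));
        System.out.println(mean(partials)); // prints 3.0
    }
}
```

Because only the small {sum, count} summaries have to be brought together, the full data set never needs to fit in the memory of one machine, which is the property Weka lacks.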

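In both cases, the data has to be converted into name-value pairs before machine learning can be performed on it; for unstructured data such as text, that means extracting features first. So your decision will depend mainly on the size of your data: Weka if it fits in memory on one machine, Mahout if it does not.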
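As a rough illustration of what converting unstructured data into name-value pairs can look like, the sketch below turns a piece of free text into (term, count) pairs using only the Java standard library. This is an assumed, simplified bag-of-words scheme, not the actual input format of either tool, and the BagOfWords class and toFeatures method are hypothetical names.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Weka or Mahout.
class BagOfWords {
    // Convert free text into name-value pairs: term -> number of occurrences.
    static Map<String, Integer> toFeatures(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Lowercase the text and split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(toFeatures("Mahout scales out; Weka scales up."));
        // prints {mahout=1, out=1, scales=2, up=1, weka=1}
    }
}
```

Either tool would then need these pairs written out in its own input format, so treat this only as the general shape of the pre-processing step.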