NOTE: Bright support for Hadoop and Big Data ended with Bright Cluster Manager 8.1. These instructions will not work on Bright Cluster Manager 8.2 and newer.
Apache Mahout is a suite of machine learning libraries. Depending on the algorithm, Mahout can work with or without Hadoop.
We will show how Mahout can be added to a Bright cluster that already has a Hadoop instance installed. In this case the instance is named “CDH5.2.1” and uses Cloudera CDH 5.2.1. An example is then given of using Mahout to run MapReduce jobs.
Download Apache Mahout
Execute the following commands on the active head node as the root user:
# cd /tmp/
# curl -O <URL TO ARCHIVE> # was http://archive.cloudera.com/cdh5/cdh/5/mahout-0.9-cdh5.2.1.tar.gz
# cd /cm/shared/apps/hadoop/Cloudera
# tar xvzf /tmp/mahout-0.9-cdh5.2.1.tar.gz
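Before moving on, it can be worth confirming that the archive unpacked where the later steps expect it. The following is a minimal sketch and not part of the original procedure. The function name check_mahout_install is hypothetical, and the sketch assumes the launcher script sits at bin/mahout under the extracted directory, which is the path used by the job command later in this article.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the Mahout launcher exists and is
# executable. Assumes the launcher lives at <dir>/bin/mahout.
check_mahout_install() {
    mahout_home=$1
    if [ -x "$mahout_home/bin/mahout" ]; then
        echo "OK: $mahout_home/bin/mahout"
    else
        echo "MISSING: $mahout_home/bin/mahout" >&2
        return 1
    fi
}

# Example call (run as root on the head node):
# check_mahout_install /cm/shared/apps/hadoop/Cloudera/mahout-0.9-cdh5.2.1
```

A non-zero return value suggests that the tar command was run from a different directory than intended.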
Grant access to HDFS for user “foobar”
Granting access will create a directory /user/foobar in HDFS.
# cmsh
% user use foobar
% set hadoophdfsaccess cdh5.2.1
% commit
Prepare execution of Mahout test
For the Naive Bayes classifier test, a sample of Wikipedia articles in XML format will be used.
NOTE: The URL may change. If it does, use an alternate XML dump from https://dumps.wikimedia.org/enwiki/latest/
# su - foobar
$ curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles11.xml-p6899367p7054859.bz2
$ bunzip2 enwiki-latest-pages-articles11.xml-p6899367p7054859.bz2
$ module load hadoop/CDH5.2.1/Cloudera/2.5.0-cdh5.2.1
$ hdfs dfs -mkdir /user/foobar/wiki
$ hdfs dfs -copyFromLocal enwiki-latest-pages-articles11.xml-p6899367p7054859 /user/foobar/wiki
$ hdfs dfs -ls /user/foobar/wiki
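Because the dump URL may change, the file name in the commands above may need to be replaced in several places. As a rough sketch, the names can instead be derived once from the URL. The helper names dump_file_from_url and hdfs_input_path are hypothetical and are not part of Bright or Hadoop. The /user/<name>/wiki layout is taken from the steps above.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Print the name the dump has after download and bunzip2:
# the last URL component with a trailing ".bz2" removed.
dump_file_from_url() {
    archive=${1##*/}          # strip everything up to the last "/"
    echo "${archive%.bz2}"    # strip the compression suffix, if present
}

# Print the HDFS path of a dump file under the user's wiki directory.
hdfs_input_path() {
    echo "/user/$1/wiki/$2"
}

# Example (commented out; needs network access and a loaded hadoop module):
# url=https://dumps.wikimedia.org/enwiki/latest/<dump>.bz2
# f=$(dump_file_from_url "$url")
# curl -O "$url" && bunzip2 "$f.bz2"
# hdfs dfs -copyFromLocal "$f" "$(hdfs_input_path foobar "$f")"
```

With this approach only the url variable has to change when a different dump is chosen.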
Execute Mahout job (as YARN application) and check result
# su - foobar
$ /cm/shared/apps/hadoop/Cloudera/mahout-0.9-cdh5.2.1/bin/mahout seqwiki -i /user/foobar/wiki/enwiki-latest-pages-articles11.xml-p6899367p7054859 -o /user/foobar/wiki/seqfiles
$ hdfs dfs -ls /user/foobar/wiki/seqfiles
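Listing the output directory shows what was written, but not whether the job finished. One possible scripted check is sketched below. It assumes the job used the default MapReduce output committer, which normally writes a _SUCCESS marker file into the output directory only when the job completes. The function name job_succeeded is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a MapReduce output directory in HDFS
# contains the _SUCCESS marker. Assumes the default output committer,
# which writes that marker on successful job completion.
job_succeeded() {
    out_dir=$1
    # "hdfs dfs -test -e PATH" exits 0 if PATH exists.
    if hdfs dfs -test -e "$out_dir/_SUCCESS"; then
        echo "job output complete: $out_dir"
    else
        echo "no _SUCCESS marker in $out_dir" >&2
        return 1
    fi
}

# Example call (as user foobar, with the hadoop module loaded):
# job_succeeded /user/foobar/wiki/seqfiles
```

If the marker is absent, the YARN application logs are the next place to look.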