How can I add Apache Mahout to my Hadoop instance?

NOTE: Bright support for Hadoop and Big Data ended with Bright Cluster Manager 8.1. These instructions will not work on Bright Cluster Manager 8.2 and newer.

NOTE: Cloudera now requires a username and password for all downloads. Obtaining those credentials is beyond the scope of this knowledge base article.

Apache Mahout is a suite of machine learning libraries. Depending on the algorithm, Mahout can work with or without Hadoop.

This article shows how Mahout can be added to a Bright cluster that already has a Hadoop instance installed. In this example the instance is named “CDH5.2.1” and runs Cloudera CDH 5.2.1. An example of using Mahout to run MapReduce jobs is also given.

Download Apache Mahout

Execute the following commands on the active head node as the root user:

# cd /tmp/
# curl -O <URL TO ARCHIVE> # was http://archive.cloudera.com/cdh5/cdh/5/mahout-0.9-cdh5.2.1.tar.gz
# cd /cm/shared/apps/hadoop/Cloudera
# tar xvzf /tmp/mahout-0.9-cdh5.2.1.tar.gz
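
As a quick sanity check after unpacking, a small shell function along the following lines could confirm that the launcher script is in place. This is only a sketch: the function name check_mahout_install is hypothetical, and the default path assumes the archive unpacked to the mahout-0.9-cdh5.2.1 directory used later in this article.

```shell
# Hypothetical helper: verify that a Mahout directory contains an executable launcher.
check_mahout_install() {
    # Assumed default: the directory the tar command is expected to create.
    mahout_home="${1:-/cm/shared/apps/hadoop/Cloudera/mahout-0.9-cdh5.2.1}"
    if [ -x "$mahout_home/bin/mahout" ]; then
        echo "OK: $mahout_home/bin/mahout"
    else
        echo "MISSING: $mahout_home/bin/mahout" >&2
        return 1
    fi
}
```

Calling check_mahout_install with no argument checks the default path; a different directory can be passed as the first argument.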

Grant access to HDFS for user “foobar”

Granting access will create a directory /user/foobar in HDFS.

# cmsh
% user use foobar
% set hadoophdfsaccess cdh5.2.1 
% commit

Prepare execution of Mahout test

For the Naive Bayes classifier test, a sample of Wikipedia articles in XML format will be used.

NOTE: The URL may change. If it does, use an alternate XML dump from https://dumps.wikimedia.org/enwiki/latest/

# su - foobar
$ curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles11.xml-p6899367p7054859.bz2
$ bunzip2 enwiki-latest-pages-articles11.xml-p6899367p7054859.bz2
$ module load hadoop/CDH5.2.1/Cloudera/2.5.0-cdh5.2.1 
$ hdfs dfs -mkdir /user/foobar/wiki
$ hdfs dfs -copyFromLocal enwiki-latest-pages-articles11.xml-p6899367p7054859 /user/foobar/wiki
$ hdfs dfs -ls /user/foobar/wiki
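
Because the dump URL may change, it can help to keep it in one variable and derive the file names from it rather than retyping them in each command. The following is a minimal sketch of that idea; the function name dump_names and the variable DUMP_URL are hypothetical and not part of Bright or Hadoop.

```shell
# Hypothetical helper: derive the compressed and uncompressed file names from a dump URL.
dump_names() {
    url="$1"
    archive="${url##*/}"     # keep only the part after the last slash
    xml="${archive%.bz2}"    # drop the .bz2 suffix, as bunzip2 does
    echo "$archive $xml"
}

# Example with the URL used in this article (replace it if the dump has moved).
DUMP_URL="https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles11.xml-p6899367p7054859.bz2"
dump_names "$DUMP_URL"
```

The second name printed is the file that would be passed to hdfs dfs -copyFromLocal and later to the Mahout job.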

Execute Mahout job (as YARN application) and check result

# su - foobar
$ /cm/shared/apps/hadoop/Cloudera/mahout-0.9-cdh5.2.1/bin/mahout seqwiki -i /user/foobar/wiki/enwiki-latest-pages-articles11.xml-p6899367p7054859 -o /user/foobar/wiki/seqfiles
$ hdfs dfs -ls /user/foobar/wiki/seqfiles
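
For scripted runs, the check of the result could be wrapped in a function that returns a non-zero status when the output directory is absent. This is a sketch under the assumption that the hdfs command is on the PATH (for example after the module load shown earlier); the function name check_seqfiles is hypothetical.

```shell
# Hypothetical helper: report whether the seqwiki output directory exists in HDFS.
# Assumes the hdfs command is available in the current environment.
check_seqfiles() {
    out_dir="${1:-/user/foobar/wiki/seqfiles}"
    # "hdfs dfs -test -d" exits with status 0 when the path is a directory.
    if hdfs dfs -test -d "$out_dir"; then
        echo "found $out_dir"
    else
        echo "no output directory at $out_dir" >&2
        return 1
    fi
}
```

With no argument, check_seqfiles tests the output path used in the command above.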

Updated on June 7, 2022
