Apache Hadoop Deployment for EC2
From Cloud Computing Wiki - Kaavo
Overview :
Like the other Sample System Definitions provided with IMOD, hadoop-multinode-amazon is provided as an example. Users are expected to customize the provided solutions for their own needs. This page assumes that the user has a basic understanding of deploying and starting systems in IMOD, including deploying systems from the sample templates and configuring them with the required parameters. For detailed instructions, please watch the accompanying 15-minute video.
Please also review how to use Apache Hadoop (http://hadoop.apache.org/). Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.
We tested this setup using a Fedora 8 AMI; however, you can use your own custom image with a different flavor or version of Linux. If you run into issues with a different flavor or version of Linux, please post them on http://forums.kaavo.com. Also refer to the list of supported Linux versions for monitoring in Installing_Monitoring_Agents; monitoring may not work on unsupported versions.
What does it do?
Deploy a fully functional Hadoop cluster with a single click.
What deployment time Actions are included in the System Definition?
Bring online a fully functional Hadoop cluster with 3 servers: 1 in the Hadoop master server group with the role master, and 2 in the Hadoop slave server group with the role slave.
What does it do?
Automatic data processing at scheduled intervals in the cloud using Hadoop and Kaavo IMOD. When this system is scheduled to run at regular intervals, it starts automatically, processes the new files, and shuts itself down when everything is done.
List of actions :
To attach an EBS volume to the master node, specify volume-id, device-name and mount-path in the attach EBS volume section:
<command type="ec2" name="attach-ebs-vol">[volume-id=][device-name=][mount-path=]</command>
A file system must already exist on the volume, i.e. mkfs must have been run manually at least once after the volume was attached to an instance.
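The one-time mkfs step could look roughly like the sketch below. It is not part of the System Definition: the function name, the ext3 default and the APPLY switch are assumptions made for illustration, and the device name must match the device-name you pass to attach-ebs-vol. By default the function only prints the mkfs command, because a failed blkid probe (for example when not running as root) would otherwise look like a blank volume.

```shell
# Sketch: create a file system on a freshly attached EBS volume, once.
# Prints the mkfs command by default; set APPLY=1 to actually run it (as root).
# The ext3 default is an assumption; use whatever your AMI supports.
format_if_blank() {
  local device="$1"
  local fs_type="${2:-ext3}"

  # blkid succeeds when it recognizes an existing file system on the device
  if blkid "$device" >/dev/null 2>&1; then
    echo "file system already present on $device, skipping mkfs"
    return 0
  fi

  if [ "${APPLY:-0}" = "1" ]; then
    mkfs -t "$fs_type" "$device"
  else
    echo "mkfs -t $fs_type $device"
  fi
}

# Example (device name is hypothetical):
#   format_if_blank /dev/sdf          # shows what would be run
#   APPLY=1 format_if_blank /dev/sdf  # runs mkfs if no file system is found
```

Run it once by hand after the first attach; later deployments can then mount the same volume without reformatting it.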
- copy-user-creation-script-on-master : This action copies the Hadoop user creation script to the master node.
- copy-user-creation-script-on-slave : This action copies the Hadoop user creation script to the slave node.
- generate-key-ssh-on-master : Hadoop requires SSH access to manage its nodes, i.e. the remote machines plus the local machine if you want to use Hadoop on it. For our multi-node setup of Hadoop, we therefore need to configure SSH access to localhost and from master to slave. This action generates an SSH key on the master.
- generate-key-ssh-on-slave : Hadoop requires SSH access to manage its nodes, i.e. the remote machines plus the local machine if you want to use Hadoop on it. For our multi-node setup of Hadoop, we therefore need to configure SSH access to localhost and from slave to master. This action generates an SSH key on the slave.
- copy-keyfile-from-master-to-slave : This action copies the SSH public key (id_rsa.pub) from the master node to the slave node.
- copy-keyfile-from-slave-to-master : This action copies the SSH public key (id_rsa.pub) from the slave node to the master node.
- install-java-hadoop-on-master : Hadoop requires a working Java 1.5.x (aka 5.0.x) installation. However, using Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop. Therefore this action installs Java 1.6 on the master node, then downloads the latest version of Hadoop and configures it.
You need to provide a value for the following parameter:
<parameter name="hadoop_tmp_dir" type="literal" value="put_hadoop_data_directory_path"/>
For example: \/mnt\/data-store
Set this to the directory path where Hadoop will store its processing data.
- install-java-hadoop-on-slave : Hadoop requires a working Java 1.5.x (aka 5.0.x) installation. However, using Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop. Therefore this action installs Java 1.6 on the slave node, then downloads the latest version of Hadoop and configures it.
You need to provide a value for the following parameter:
<parameter name="hadoop_tmp_dir" type="literal" value="put_hadoop_data_directory_path"/>
For example: \/mnt\/data-store
Set this to the directory path where Hadoop will store its processing data.
- start-hadoop-cluster : This action starts the Hadoop cluster.
- running-mapreduce-job-on-master : We are using the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
- add-script-to-cronjob
- copy-javaclient-for-teardown : To tear down the system, we use a Java client that calls the Kaavo web service to shut down the specified system.
You need to provide values for the following parameters:
<parameter name="user" type="literal" value="put_login_user_name"/>
<parameter name="password" type="literal" value="put_login_password"/>
<parameter name="systemName" type="literal" value="put_system_name"/>
- tear-down-hadoop-system : When everything is done, this action shuts down the system by executing the Java client.
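For readers who want to reproduce the SSH-related actions in the list above by hand (generate-key-ssh-on-* and copy-keyfile-from-*), the following is a minimal sketch of what such steps typically involve. It is not the script shipped with the System Definition; the function names and the example paths are hypothetical.

```shell
# Sketch: passphrase-less SSH keys between Hadoop nodes.
# Function names and example paths are illustrative assumptions.

# Create an RSA key pair without a passphrase unless one already exists.
generate_key() {
  local ssh_dir="$1"
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
  [ -f "$ssh_dir/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa"
}

# Append a public key to authorized_keys, skipping keys already present.
authorize_key() {
  local pub_file="$1"
  local ssh_dir="$2"
  local key
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
  touch "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
  key=$(cat "$pub_file")
  # -x matches whole lines only, -F treats the key as a fixed string
  grep -qxF -- "$key" "$ssh_dir/authorized_keys" ||
    printf '%s\n' "$key" >> "$ssh_dir/authorized_keys"
}

# On the master, as the Hadoop user:
#   generate_key "$HOME/.ssh"
#   authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh"   # SSH to localhost
# On the slave, once the master's id_rsa.pub has been copied over
# (the /tmp path is hypothetical):
#   authorize_key /tmp/master_id_rsa.pub "$HOME/.ssh"
```

The same two calls, with the roles swapped, cover the slave-to-master direction.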
Formatting the name node :
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
/usr/local/hadoop/bin/hadoop namenode -format
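Because formatting is only needed the first time, and reformatting discards the existing HDFS metadata, it can help to guard the command. The sketch below assumes the name node keeps its metadata under the default location inside hadoop_tmp_dir (dfs/name/current/VERSION); that layout is an assumption about the Hadoop configuration in use, and the function name and HADOOP_BIN variable are hypothetical.

```shell
# Sketch: format the name node only if it has not been formatted yet.
# Assumes the default dfs.name.dir layout under hadoop_tmp_dir.
HADOOP_BIN="${HADOOP_BIN:-/usr/local/hadoop/bin/hadoop}"

format_namenode_once() {
  local tmp_dir="$1"
  # A VERSION file appears once the name node directory has been formatted
  if [ -f "$tmp_dir/dfs/name/current/VERSION" ]; then
    echo "name node already formatted, skipping"
    return 0
  fi
  "$HADOOP_BIN" namenode -format
}

# Example, using the data directory from the hadoop_tmp_dir parameter:
#   format_namenode_once /mnt/data-store
```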
Testing :
Download any book into the /mnt/data-store/newfiles folder, for example:
cd /mnt/data-store/newfiles
wget http://3rdparty-tools.s3.amazonaws.com/gutenburg/pg132.txt
After a minute, Hadoop will process this file and move it to the processedfiles folder.
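The polling behaviour described here (pick up new files, process them, move them aside) can be sketched as below. This is not the script installed by add-script-to-cronjob, whose contents are not shown on this page; the function name, the wrapper script path and the exact location of the processedfiles folder are assumptions.

```shell
# Sketch: run a job command on every file in new_dir and move the file
# to done_dir only if the job succeeded. Names are illustrative.
process_new_files() {
  local new_dir="$1"
  local done_dir="$2"
  shift 2
  local f
  mkdir -p "$done_dir"
  for f in "$new_dir"/*; do
    [ -f "$f" ] || continue        # skips the literal pattern if the dir is empty
    # "$@" is the job command; the file path is passed as its last argument
    if "$@" "$f"; then
      mv "$f" "$done_dir/"
    fi
  done
}

# Example (the wrapper script is hypothetical and would submit the
# WordCount job for one input file):
#   process_new_files /mnt/data-store/newfiles /mnt/data-store/processedfiles \
#     /path/to/run-wordcount.sh
#
# A crontab entry that runs such a script every minute would look like:
#   * * * * * /path/to/process-new-files.sh
```

Leaving failed files in place means the next cron run retries them instead of silently dropping input.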
Check the processed files:
/usr/local/hadoop/bin/hadoop dfs -ls
/usr/local/hadoop/bin/hadoop dfs -ls processed_file_output
/usr/local/hadoop/bin/hadoop dfs -ls processed_file_output/20110305
/usr/local/hadoop/bin/hadoop dfs -cat processed_file_output/20110305-0923/part-r-00000
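To sanity-check the result, the word<TAB>count lines in part-r-00000 can be compared with a local count of the same input file. The helpers below are a rough sketch with hypothetical names; the local count assumes the example job splits words on whitespace and is case-sensitive, so treat any mismatch as a prompt to check that assumption first.

```shell
# Sketch: local helpers for inspecting WordCount-style output.

# Count whitespace-separated words on stdin; print "word<TAB>count" by word.
wordcount_local() {
  tr -s '[:space:]' '\n' | grep -v '^$' | LC_ALL=C sort | uniq -c |
    awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Keep the n most frequent words from "word<TAB>count" lines on stdin.
top_words() {
  local n="$1"
  local tab
  tab=$(printf '\t')
  # Sort by count (field 2, numeric, descending), then by word for ties
  LC_ALL=C sort -t "$tab" -k2,2nr -k1,1 | head -n "$n"
}

# Examples:
#   wordcount_local < /mnt/data-store/processedfiles/pg132.txt | top_words 10
#   /usr/local/hadoop/bin/hadoop dfs -cat \
#     processed_file_output/20110305-0923/part-r-00000 | top_words 10
```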