Recently, I got a new laptop and therefore needed to do a fresh installation of Hive and Hadoop. There are quite a few steps along the way. Here are my notes:
Set up SSH
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
However, I already have a key pair that is protected by a passphrase. To avoid overwriting it, I need to generate a separate key pair and use that one to access localhost:
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/chao/.ssh/id_rsa): /Users/chao/.ssh/id_rsa_local
...
$ cat ~/.ssh/id_rsa_local.pub >> ~/.ssh/authorized_keys
Then, add (or create the file if it doesn’t exist) this to the ~/.ssh/config:
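An entry along these lines should work (assuming the key was saved as ~/.ssh/id_rsa_local, per the step above):

```
Host localhost 0.0.0.0
  IdentityFile ~/.ssh/id_rsa_local
```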
This tells SSH to use the key id_rsa_local
when accessing localhost, or 0.0.0.0
. This is necessary to start a one-node Hadoop cluster.
With this, now try:
ssh localhost
It should succeed.
Install Hadoop package
For this step, you'll need to download the Hadoop package and set up some configuration files. I use the CDH 5.4.x Hadoop distribution from here. After the download finishes, uncompress the file and put it in a proper location (e.g., /opt). Then, add the following to your .bash_profile:
export HADOOP_CLASSPATH=
export HADOOP_HEAPSIZE=2000
export HADOOP_HOME=/path/to/hadoop/dir
export PATH=$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH
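As a quick sanity check (using a hypothetical install path; substitute your own), you can verify that both Hadoop bin dirs made it onto the PATH:

```shell
# Hypothetical install location; replace with your actual Hadoop dir
export HADOOP_HOME=/opt/hadoop-2.6.0-cdh5.4.7
export PATH="$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH"

# Both sbin and bin should now appear on the PATH
echo "$PATH" | tr ':' '\n' | grep -c "$HADOOP_HOME"   # → 2 (sbin and bin)
```

Once the real install is in place, `which hadoop` and `hadoop version` should then resolve correctly.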
Also, you’ll need to add configs to a few files under $HADOOP_HOME/etc/hadoop
:
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/path/to/namenode/dir</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/path/to/datanode/dir</value>
</property>
mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
yarn-site.xml:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
For hdfs-site.xml, you'll need to specify the name and data dirs. I use a .hadoop dir under my home directory, with datanode and namenode dirs under it.
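Concretely, assuming the ~/.hadoop layout just described, the dirs can be created with:

```shell
# Create the local dirs that dfs.namenode.name.dir and
# dfs.datanode.data.dir point at (mkdir -p is a no-op if they exist)
mkdir -p ~/.hadoop/namenode ~/.hadoop/datanode
```

The corresponding hdfs-site.xml values would then be the absolute paths to these two dirs.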
Also, (at least for CDH 5.4.x) there's a bug in $HADOOP_HOME/libexec/hadoop-config.sh that causes YARN to use a hardcoded /bin/java instead of /usr/bin/java. This can be fixed by following the comments in this JIRA.
Now, you should be able to start HDFS and YARN by:
start-dfs.sh
start-yarn.sh
Check whether the services are started by using jps
. You should see something like the following:
3296 NodeManager
3111 SecondaryNameNode
3017 DataNode
3210 ResourceManager
2943 NameNode
If the NameNode doesn't start, check whether the datanode and namenode dirs have been created yet. Also, you can try
hadoop namenode -format
to reformat the namenode dir.
Install MySQL
On Mac, MySQL can be installed via Homebrew. Simply use the following command:
brew install mysql
After this, you can start the MySQL server by:
mysql.server restart
Next, you'll need to create a user for Hive. Go to the MySQL CLI and type:
> CREATE DATABASE metastore;
> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'some_pass';
> USE metastore;
> GRANT ALL ON *.* TO 'hive'@'localhost' IDENTIFIED BY 'some_pass';
After setting up the MySQL database, you'll need to run:
bin/schematool -dbType mysql -initSchema
to initialize the database for Hive usage. Note that you only need to run this ONCE at the beginning.
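To double-check that initialization succeeded (this assumes Hive is already built and you're in its install dir, per the next section), schematool can also report the schema version it recorded in MySQL:

```shell
# Prints the metastore schema version stored in the MySQL database
bin/schematool -dbType mysql -info
```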
Install Hive
First, you can clone the Hive repository here. Then, supposing you put the Hive repo under $HOME/git/hive, build Hive with:
mvn clean install -DskipTests -Phadoop-2
(Note that for upstream Hive the -Phadoop-2
flag is no longer necessary).
You also need to set $HIVE_HOME
env variable. For this I usually set it to the path under: $HOME/git/hive/packaging/target/$HIVE_VERSION-bin/$HIVE_VERSION-bin
, where $HIVE_VERSION
is the version of Hive you’re using. For instance: apache-hive-1.1.0-cdh5.4.7
.
Also remember to add $HIVE_HOME/bin
to your $PATH
.
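Putting that together, the .bash_profile additions would look something like this (the version string is just the example from above; substitute the one your build actually produces):

```shell
# Example version string; replace with your actual Hive build version
HIVE_VERSION=apache-hive-1.1.0-cdh5.4.7
export HIVE_HOME="$HOME/git/hive/packaging/target/$HIVE_VERSION-bin/$HIVE_VERSION-bin"
export PATH="$HIVE_HOME/bin:$PATH"
```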
There are a few Hive configurations that need to be changed. First, in $HIVE_HOME/conf/hive-site.xml, you should have at least the following:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- useSSL to false to avoid warning messages -->
  <value>jdbc:mysql://localhost/metastore?useSSL=false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
which tells Hive to use MySQL as the underlying meta DB, with metastore as the database to access.
To launch Hive, first you’ll need to start the MySQL server:
mysql.server start
To start HiveMetastore, do:
hive --service metastore &> /tmp/hive/metastore.log &
To start HiveServer2, do:
hive --service hiveserver2 &> /tmp/hive/hiveserver2.log &
Then you can launch Beeline and connect to localhost on port 10000
, like normal.
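For example, a connection from the command line might look like this (the user name here is arbitrary, since no authentication has been configured):

```shell
# Connect Beeline to HiveServer2 on the default port 10000
beeline -u jdbc:hive2://localhost:10000 -n "$USER"
```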