Recently, I got a new laptop and therefore needed to do a fresh installation of Hive and Hadoop. There are quite a few steps along the way. Here are my notes:
Set up SSH
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
However, I already have a key pair that is protected by a passphrase. To avoid overwriting it, I need to generate a separate key pair and use that one to access localhost:
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/chao/.ssh/id_rsa): /Users/chao/.ssh/id_rsa_local
...
$ cat ~/.ssh/id_rsa_local.pub >> ~/.ssh/authorized_keys
Then, add (or create the file if it doesn’t exist) this to the ~/.ssh/config:
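An entry along these lines should work (assuming the key was saved as ~/.ssh/id_rsa_local, per the step above):

```
Host localhost 0.0.0.0
  IdentityFile ~/.ssh/id_rsa_local
```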
This tells SSH to use the key id_rsa_local
when accessing localhost, or 0.0.0.0
. This is necessary to start a one-node Hadoop cluster.
With this, now try:
ssh localhost
It should succeed.
Install Hadoop package
For this step, you'll need to download the Hadoop package and set up some configuration files. I use the CDH 5.4.x Hadoop distribution from here. After the download finishes, uncompress the file and put it in a proper location (e.g., /opt). Then, add the following to your .bash_profile:
export HADOOP_CLASSPATH=
export HADOOP_HEAPSIZE=2000
export HADOOP_HOME=/path/to/hadoop/dir
export PATH=$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH
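As a quick sanity check (using a hypothetical install path; substitute your own), you can verify that both Hadoop bin dirs made it onto the PATH:

```shell
# Hypothetical install location; replace with your actual Hadoop dir
export HADOOP_HOME=/opt/hadoop-2.6.0-cdh5.4.7
export PATH="$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH"

# Both sbin and bin should now appear on the PATH
echo "$PATH" | tr ':' '\n' | grep -c "$HADOOP_HOME"   # → 2 (sbin and bin)
```

Once the real install is in place, `which hadoop` and `hadoop version` should then resolve correctly.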
Also, you’ll need to add configs to a few files under $HADOOP_HOME/etc/hadoop
:
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/path/to/namenode/dir</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/path/to/datanode/dir</value>
</property>
mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
yarn-site.xml:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
For hdfs-site.xml, you'll need to specify the name and data dirs. I use a .hadoop dir under my home directory, with datanode and namenode dirs under it.
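Concretely, assuming the ~/.hadoop layout just described, the dirs can be created with:

```shell
# Create the local dirs that dfs.namenode.name.dir and
# dfs.datanode.data.dir point at (mkdir -p is a no-op if they exist)
mkdir -p ~/.hadoop/namenode ~/.hadoop/datanode
```

The corresponding hdfs-site.xml values would then be the absolute paths to these two dirs.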
Also, (at least for CDH 5.4.x) there's a bug in $HADOOP_HOME/libexec/hadoop-config.sh that causes YARN to use a hardcoded /bin/java instead of /usr/bin/java. This can be fixed by following the comments in this JIRA.
Now, you should be able to start HDFS and YARN by:
start-dfs.sh
start-yarn.sh
Check whether the services are started by using jps
. You should see something like the following:
3296 NodeManager
3111 SecondaryNameNode
3017 DataNode
3210 ResourceManager
2943 NameNode
If the NameNode doesn't start, check whether the datanode and namenode dirs have been created yet. Also, you can try
hadoop namenode -format
to reformat the namenode dir.
Install MySQL
On Mac, MySQL can be installed via Homebrew. Simply use the following command:
brew install mysql
After this, you can start the MySQL server by:
mysql.server restart
Next, you'll need to create a user for Hive. Go to the MySQL CLI and type:
> CREATE DATABASE metastore;
> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'some_pass';
> USE metastore;
> GRANT ALL ON *.* TO 'hive'@'localhost' IDENTIFIED BY 'some_pass';
After setting up the MySQL database, you'll need to run:
bin/schematool -dbType mysql -initSchema
to initialize the database for Hive usage. Note that you only need to run this ONCE at the beginning.
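To double-check that initialization succeeded (this assumes Hive is already built and you're in its install dir, per the next section), schematool can also report the schema version it recorded in MySQL:

```shell
# Prints the metastore schema version stored in the MySQL database
bin/schematool -dbType mysql -info
```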
Install Hive
First, you can clone the Hive repository here. Then, supposing you put the Hive repo under $HOME/git/hive, build Hive with:
mvn clean install -DskipTests -Phadoop-2
(Note that for upstream Hive the -Phadoop-2
flag is no longer necessary).
You also need to set $HIVE_HOME
env variable. For this I usually set it to the path under: $HOME/git/hive/packaging/target/$HIVE_VERSION-bin/$HIVE_VERSION-bin
, where $HIVE_VERSION
is the version of Hive you’re using. For instance: apache-hive-1.1.0-cdh5.4.7
.
Also remember to add $HIVE_HOME/bin
to your $PATH
.
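Putting that together, the .bash_profile additions would look something like this (the version string is just the example from above; substitute the one your build actually produces):

```shell
# Example version string; replace with your actual Hive build version
HIVE_VERSION=apache-hive-1.1.0-cdh5.4.7
export HIVE_HOME="$HOME/git/hive/packaging/target/$HIVE_VERSION-bin/$HIVE_VERSION-bin"
export PATH="$HIVE_HOME/bin:$PATH"
```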
There are a few Hive configurations that need to be changed. First, in $HIVE_HOME/conf/hive-site.xml, you should have at least the following:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- useSSL to false to avoid warning messages -->
  <value>jdbc:mysql://localhost/metastore?useSSL=false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
which tells Hive to use MySQL as the underlying meta DB, with metastore as the database to access.
To launch Hive, first you’ll need to start the MySQL server:
mysql.server start
To start HiveMetastore, do:
hive --service metastore &> /tmp/hive/metastore.log &
To start HiveServer2, do:
hive --service hiveserver2 &> /tmp/hive/hiveserver2.log &
Then you can launch Beeline and connect to localhost on port 10000
, like normal.
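For example, a connection from the command line might look like this (the user name here is arbitrary, since no authentication has been configured):

```shell
# Connect Beeline to HiveServer2 on the default port 10000
beeline -u jdbc:hive2://localhost:10000 -n "$USER"
```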