Hadoop documentation

From MetaCentrum

(Czech version)

Basic information

Hadoop is software for the parallel processing of large amounts of data using the MapReduce paradigm. It is designed to distribute data processing across many computational nodes and to collect the results from those nodes.

Currently it is possible to launch Hadoop from within the MetaCentrum cloud infrastructure.

Installed SW

BigTop Distribution 1.5.0.

Hadoop 2.10.1 - distributed storage and processing of very large data sets

HBase 1.5.0 - distributed, scalable, big data store

Hive 2.3.6 - data warehouse software for reading, writing, and managing large datasets

Spark 2.4.5 - fast and general engine for large-scale data processing

Creating Hadoop in the Cloud

The all-in-one single-machine Hadoop requires more than 4 GB of RAM (e.g. flavor standard.large).

Notes (for more see OpenStack documentation at MUNI):

  • security groups: the default security group is used unless specified otherwise
  • ssh key: generate a key pair in OpenStack, or upload the public part of an existing key

OpenStack CLI - single machine

Starting the machine:

# example of parameters
image=debian-9-x86_64_hadoop
network=auto_allocated_network
sshkey=openstack
public_network=public-muni-147-251-124-GROUP
name=hadoop

# machine
openstack server create --image "$image" --flavor standard.large --network "$network" --key-name "$sshkey" "$name"

# public IP
ip=$(openstack floating ip create -c floating_ip_address -f value "$public_network")
openstack server add floating ip "$name" "$ip"

Login to the machine:

ssh debian@$ip

Start the installation (runs for about 5 minutes):

sudo /usr/local/sbin/hadoop-single-setup.sh

OpenStack GUI (Horizon) - single machine

Compute -> Images -> "debian-9-x86_64_hadoop" -> Launch

  • Details -> Instance Name: name
  • Flavor: at least standard.large (> 4GB RAM)
  • Networks: auto_allocated_network
  • Security Groups: at least port 22 (ssh) must be open, but everything can be open to the public (authentication is enabled)
  • Key Pair: required to access the machine
  • Launch instance

Compute -> Instances -> (machine) -> Associate floating IP

Login to the machine:

ssh debian@$ip

Start the installation (runs for about 5 minutes):

sudo /usr/local/sbin/hadoop-single-setup.sh

Terraform - clustered

Download terraform code:

git clone https://gitlab.cesnet.cz/702/HADOOP/terraform

Create mycluster.auto.tfvars file (replace mydomain, PUBLIC_IP and SSH_KEYPAIR_NAME by real values):

domain = "mydomain"
floating_ip = "PUBLIC_IP"
n = 3
security_trusted_cidr = [
    "0.0.0.0/0",
    "::/0",
]
ssh = "SSH_KEYPAIR_NAME"
# local_network = "auto_allocated_network"

Launch terraform:

terraform init
terraform apply

See also terraform#build-cluster.

Usage

How to login

  1. ssh to the public floating IP, using the ssh key (where $ip is the public IP):
     ssh debian@$ip
  2. get a ticket
     kinit < password.txt

A local Kerberos server is running within the cluster:

  • realm: HADOOP
  • principal: debian@HADOOP
  • password: in the file /home/debian/password.txt
  • local domain: terra
  • configuration: /etc/krb5.conf.d/hadoop

Basic test of the installation

HDFS listing:

 hdfs dfs -ls /

Testsuite:

 /opt/hadoop-tests/run-tests.sh

Remote access

Kerberos client

How to set up the local client:

  1. check/add a line includedir /etc/krb5.conf.d/ in /etc/krb5.conf
  2. script (where $ip is the public floating IP, or the server name if specified in /etc/hosts - see #Web accessibility):
    cat > /tmp/hadoop <<EOF
    [realms]
        HADOOP = {
            kdc = $ip
            admin_server = $ip
            default_domain = terra
        }
    
    [domain_realm]
        .terra = HADOOP
        terra = HADOOP
    EOF
    sudo mv /tmp/hadoop /etc/krb5.conf.d/
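
Step 1 above can also be scripted idempotently. A minimal sketch; the helper function and its name are illustrative, and the real target file is /etc/krb5.conf (which requires root to modify):

```shell
# Ensure the krb5.conf file contains the includedir line from step 1,
# appending it only when missing so repeated runs are harmless.
ensure_includedir() {           # $1 = path to krb5.conf
  grep -qx 'includedir /etc/krb5.conf.d/' "$1" \
    || echo 'includedir /etc/krb5.conf.d/' >> "$1"
}
# real usage (needs root): ensure_includedir /etc/krb5.conf
```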

Web accessibility

Data can be accessed through the web (WebHDFS, read-only), and cluster service information is also available. Web access to Hadoop requires the SPNEGO protocol (Kerberos over HTTP(S)).

Hadoop must be accessed using the DNS name corresponding to the locally used name (eg hadoop.terra). You need to add an entry to /etc/hosts (where $ip is the public IP address):

echo "$ip hadoop.terra" >> /etc/hosts

Then enable the SPNEGO protocol in the browser:

  • firefox [1]:
    1. about:config
    2. set network.negotiate-auth.trusted-uris=terra ("terra" is locally used domain)
    3. for Windows:
      1. install Kerberos
      2. set also network.auth.use-sspi=false
  • chrome: [2]
    • parameter --auth-server-whitelist="terra" ("terra" is locally used domain)
    • permanent setting: make file /etc/opt/chrome/policies/managed/hadoop-metacentrum.json ("terra" is locally used domain):
      {
        "AuthServerWhitelist": "terra",
        "AuthNegotiateDelegateWhitelist": "terra"
      }
    • Windows: SPNEGO and Kerberos are probably not supported

Getting a ticket (password is on the cloud machine in ~debian/password.txt):

kinit debian@HADOOP

Note: beware of networks behind NAT; there you must use an address-less ticket (option -A)

Result: the Hadoop web interfaces (e.g. http://hadoop.terra:50070 for HDFS) should now open in the browser (replace "hadoop.terra" with the actual server name).

curl

Access uses SPNEGO via the curl parameters --negotiate -u : .

First you need to get a ticket:

kinit debian@HADOOP

Note: beware of networks behind NAT; there you must use an address-less ticket (option -A)

File information (where $HDFS_PATH is the path to HDFS):

curl -i --negotiate -u : "http://hadoop.terra:50070/webhdfs/v1/$HDFS_PATH?op=LISTSTATUS"

Download file (where $HDFS_PATH is the path to HDFS):

curl -L --negotiate -u : "http://hadoop.terra:50070/webhdfs/v1/$HDFS_PATH?op=OPEN" -o file

Upload the file using two steps (where $HDFS_PATH is the HDFS path, $LOCAL_FILE is the local file):

# 1) create the file; a reference to the data location is returned in the 'Location' header
curl --negotiate -u : -i -X PUT "http://hadoop.terra:50070/webhdfs/v1${HDFS_PATH}?op=CREATE" | tee curl.log
# URL from the 'Location' header
data_url="`grep ^Location: curl.log | cut -c11- | tr -d '\r'`"

# 2) data upload
curl -i -T ${LOCAL_FILE} ${data_url}
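
The header extraction in step 1 depends on the exact capitalization and width of the `Location: ` prefix. A slightly more tolerant variant, wrapped in a helper function for illustration (the sed pattern and function name are assumptions, not part of the original recipe):

```shell
# Extract the redirect URL from the saved response headers (curl.log from
# step 1), accepting "Location:" or "location:" and stripping the trailing CR.
extract_location() {            # $1 = file with response headers
  sed -n 's/^[Ll]ocation: *//p' "$1" | tr -d '\r' | head -n 1
}
# usage: data_url=$(extract_location curl.log)
```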

Hadoop

Information

  • max. job length is limited:
    • max. refresh time of user Kerberos ticket
    • max. lifetime of Hadoop user tokens, 7 days

Examples

On the front-end, access via Kerberos:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10

hdfs dfs -put /etc/hadoop/conf input
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output 'dfs[a-z.]+'
hdfs dfs -cat output/*

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomwriter out/

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 100 gendata
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort gendata sorted
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate gendata reportdata

Getting logs

yarn application -list
yarn logs -applicationId APPLICATION_ID

HBase

See the Hadoop HBase page.

Hive

Information

Access administration:

  • handled at the HDFS level ==> after a database is created, access can be controlled via the rights of the HDFS directory /user/hive/warehouse/${DATABASE}
  • by default, a newly created database is readable and writable by all Hadoop users
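
The HDFS-level control mentioned above can look like the following sketch; the database name "mydb" and the mode 750 are illustrative values, not defaults of this installation (the hdfs command needs a valid Kerberos ticket, so the call is guarded):

```shell
DATABASE=mydb                                   # hypothetical database name
WAREHOUSE_DIR="/user/hive/warehouse/${DATABASE}"

# Restrict the database directory to its owner and group;
# the guard makes this a no-op where no hdfs client is installed.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -chmod -R 750 "$WAREHOUSE_DIR"
fi
```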

Connectivity

  • beeline client ("$HIVE_DB" should be replaced by database name):
 beeline -u "jdbc:hive2://`hostname -f`:10000/$HIVE_DB;principal=hive/`hostname -f`@HADOOP"
  • beeline client - connection with command '!connect' ("HIVE_DB" should be replaced by database name):
 beeline
   !connect jdbc:hive2://hadoop.terra:10000/HIVE_DB;principal=hive/hadoop.terra@HADOOP  
  • java code (requires the hive-jdbc driver on the classpath):
String url = "jdbc:hive2://hadoop.terra:10000/HIVE_DB;principal=hive/hadoop.terra@HADOOP";
Connection con = DriverManager.getConnection(url);
ResultSet rs = con.createStatement().executeQuery("SHOW TABLES");
  • hive client (deprecated; beeline is the supported client):
hive

For details, run hive --help.

Integration with HBase

https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

Integration with Spark

See #Examples.

Examples

HIVE_DB="`id -un`_test"
JDBC_URL="jdbc:hive2://`hostname -f`:10000/$HIVE_DB;principal=hive/`hostname -f`@HADOOP"

# creating database <USER>_test
beeline -u $JDBC_URL -e "CREATE DATABASE $HIVE_DB"

# usage of <USER>_test
beeline -u $JDBC_URL
  CREATE TABLE pokes (foo INT, bar STRING);
  CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
  SHOW TABLES;
  SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
  DESCRIBE invites;
  
  INSERT INTO pokes VALUES (1, 'A'), (2, 'B');
  INSERT INTO invites PARTITION (ds='2015-01-01') SELECT * FROM pokes;
  INSERT INTO invites PARTITION (ds='2015-12-11') SELECT * FROM pokes;

Spark

Information

Supported modes:

  • Spark through YARN (cluster): --master yarn --deploy-mode cluster
  • Spark through YARN (client): --master yarn --deploy-mode client
  • local Spark: --master local

Examples

  • calculating pi
spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn `ls /usr/lib/spark/examples/jars/spark-examples*.jar | head -n 1` 10
# (replace APPLICATION_ID by real ID)
yarn logs -applicationId APPLICATION_ID
  • calling map/reduce
    (replace LOGIN by user name)
hdfs dfs -put /etc/passwd

spark-shell --master yarn
  val file = sc.textFile("hdfs:///user/LOGIN/passwd")
  val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs:///user/LOGIN/spark-output")

hdfs dfs -cat ./spark-output/\*
hdfs dfs -rm -r ./spark-output ./passwd
  • usage of Hive metastore
    (replace LOGIN by user name, presumes database created in #Examples_2)
spark-shell
  sql("FROM LOGIN_test.pokes SELECT *").collect().foreach(println)
  sql("FROM LOGIN_test.pokes SELECT *").show()
  • launching own application

It is necessary to copy /usr/lib/spark/lib/spark-assembly.jar, add your own libraries, copy the result to HDFS, and point Spark at it:

# (replace /user/hawking/lib/spark-assembly.jar with the real path on HDFS)
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.jar=hdfs:///user/hawking/lib/spark-assembly.jar \
  ...