Hadoop documentation

From MetaCentrum

(Czech version)

Basic information

Hadoop is software for the parallel processing of large amounts of data using the MapReduce paradigm. It is designed to distribute data processing across many computational nodes and to collect the results from those nodes.

Currently it is possible to launch Hadoop from within the MetaCentrum cloud infrastructure.

Installed SW

BigTop Distribution 1.5.0.

Hadoop 2.10.1 - distributed storage and processing of very large data sets

HBase 1.5.0 - distributed, scalable, big data store

Hive 2.3.6 - data warehouse software for reading, writing, and managing large datasets

Spark 2.4.5 - fast and general engine for large-scale data processing

Creating Hadoop in the Cloud

The all-in-one single-machine Hadoop requires more than 4 GB of RAM (e.g. flavor standard.large).

Notes (for more see OpenStack documentation at MUNI):

  • security groups: the default security group is used unless specified otherwise
  • ssh key: generate a key pair in OpenStack, or upload the public part of an existing key

OpenStack CLI - single machine

Starting the machine:

# example of parameters
image=debian-9-x86_64_hadoop
network=auto_allocated_network
sshkey=openstack
public_network=public-muni-147-251-124-GROUP
name=hadoop

# machine
openstack server create --image "$image" --flavor standard.large --network "$network" --key-name "$sshkey" "$name"

# public IP
ip=$(openstack floating ip create -c floating_ip_address -f value "$public_network")
openstack server add floating ip "$name" "$ip"

Login to the machine:

ssh debian@$ip

Start the installation (runs for about 5 minutes):

sudo /usr/local/sbin/hadoop-single-setup.sh

OpenStack GUI (Horizon) - single machine

Compute -> Images -> "debian-9-x86_64_hadoop" -> Launch

  • Details -> Instance Name: name
  • Flavor: at least standard.large (> 4GB RAM)
  • Networks: auto_allocated_network
  • Security Groups: at least port 22 (ssh) must be open, but everything can be open to the public (authentication is enabled)
  • Key Pair: required to access the machine
  • Launch instance

Compute -> Instances -> (machine) -> Associate floating IP

Login to the machine:

ssh debian@$ip

Start the installation (runs for about 5 minutes):

sudo /usr/local/sbin/hadoop-single-setup.sh

Terraform - clustered

Download terraform code:

git clone https://gitlab.cesnet.cz/702/HADOOP/terraform

Create mycluster.auto.tfvars file (replace mydomain, PUBLIC_IP and SSH_KEYPAIR_NAME by real values):

domain = "mydomain"
floating_ip = "PUBLIC_IP"
n = 3
security_trusted_cidr = [
    "0.0.0.0/0",
    "::/0",
]
ssh = "SSH_KEYPAIR_NAME"
# local_network = "auto_allocated_network"

Launch terraform:

terraform init
terraform apply

See also terraform#build-cluster.

Usage

How to login

  1. ssh to the public floating IP, using the ssh key (where $ip is the public IP):
     ssh debian@$ip
  2. get a ticket
     kinit < password.txt

A local Kerberos server is running within the cluster:

  • realm: HADOOP
  • principal: debian@HADOOP
  • password: in the file /home/debian/password.txt
  • local domain: terra
  • configuration: /etc/krb5.conf.d/hadoop

Basic test of the installation

HDFS listing:

 hdfs dfs -ls /

Testsuite:

 /opt/hadoop-tests/run-tests.sh

Remote access

Kerberos client

How to set up the local client:

  1. check/add a line includedir /etc/krb5.conf.d/ in /etc/krb5.conf
  2. script (where $ip is the public floating IP, or the server name if specified in /etc/hosts - see #Web accessibility):
    cat > /tmp/hadoop <<EOF
    [realms]
        HADOOP = {
            kdc = $ip
            admin_server = $ip
            default_domain = terra
        }
    
    [domain_realm]
        .terra = HADOOP
        terra = HADOOP
    EOF
    sudo mv /tmp/hadoop /etc/krb5.conf.d/
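
Step 1 above can also be scripted idempotently. A minimal sketch; the helper function and its name are illustrative, and the real target file is /etc/krb5.conf (which requires root to modify):

```shell
# Ensure the krb5.conf file contains the includedir line from step 1,
# appending it only when missing so repeated runs are harmless.
ensure_includedir() {           # $1 = path to krb5.conf
  grep -qx 'includedir /etc/krb5.conf.d/' "$1" \
    || echo 'includedir /etc/krb5.conf.d/' >> "$1"
}
# real usage (needs root): ensure_includedir /etc/krb5.conf
```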

Web accessibility

Data can be accessed through the web (WebHDFS, read-only), and cluster service information is also available. Web access to Hadoop requires the SPNEGO protocol (Kerberos over HTTP(S)).

Hadoop must be accessed using the DNS name corresponding to the locally used name (eg hadoop.terra). You need to add an entry to /etc/hosts (where $ip is the public IP address):

echo "$ip hadoop.terra" >> /etc/hosts

Then enable the SPNEGO protocol in the browser:

  • firefox [1]:
    1. about:config
    2. set network.negotiate-auth.trusted-uris=terra ("terra" is locally used domain)
    3. for Windows:
      1. install Kerberos
      2. set also network.auth.use-sspi=false
  • chrome: [2]
    • parameter --auth-server-whitelist="terra" ("terra" is locally used domain)
    • permanent setting: make file /etc/opt/chrome/policies/managed/hadoop-metacentrum.json ("terra" is locally used domain):
      {
        "AuthServerWhitelist": "terra",
        "AuthNegotiateDelegateWhitelist": "terra"
      }
    • Windows: SPNEGO and Kerberos are probably not supported

Getting a ticket (password is on the cloud machine in ~debian/password.txt):

kinit debian@HADOOP

Note: beware of networks behind NAT; there you must use an address-less ticket (option -A)

Result: the Hadoop web interfaces (e.g. http://hadoop.terra:50070 for HDFS) should now open in the browser (replace "hadoop.terra" with the actual server name).

curl

Access uses SPNEGO via the curl parameters --negotiate -u : .

First you need to get a ticket:

kinit debian@HADOOP

Note: beware of networks behind NAT; there you must use an address-less ticket (option -A)

File information (where $HDFS_PATH is the path to HDFS):

curl -i --negotiate -u : "http://hadoop.terra:50070/webhdfs/v1/$HDFS_PATH?op=LISTSTATUS"

Download file (where $HDFS_PATH is the path to HDFS):

curl -L --negotiate -u : "http://hadoop.terra:50070/webhdfs/v1/$HDFS_PATH?op=OPEN" -o file

Upload the file using two steps (where $HDFS_PATH is the HDFS path, $LOCAL_FILE is the local file):

# 1) create the file; a reference to the data location is returned in the 'Location' header
curl --negotiate -u : -i -X PUT "http://hadoop.terra:50070/webhdfs/v1${HDFS_PATH}?op=CREATE" | tee curl.log
# URL from the 'Location' header
data_url="`grep ^Location: curl.log | cut -c11- | tr -d '\r'`"

# 2) data upload
curl -i -T ${LOCAL_FILE} ${data_url}
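
The header extraction in step 1 depends on the exact capitalization and width of the `Location: ` prefix. A slightly more tolerant variant, wrapped in a helper function for illustration (the sed pattern and function name are assumptions, not part of the original recipe):

```shell
# Extract the redirect URL from the saved response headers (curl.log from
# step 1), accepting "Location:" or "location:" and stripping the trailing CR.
extract_location() {            # $1 = file with response headers
  sed -n 's/^[Ll]ocation: *//p' "$1" | tr -d '\r' | head -n 1
}
# usage: data_url=$(extract_location curl.log)
```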

Hadoop

Information

  • max. job length is limited:
    • max. refresh time of user Kerberos ticket
    • max. lifetime of Hadoop user tokens, 7 days

Examples

On the front-end, access via Kerberos:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10

hdfs dfs -put /etc/hadoop/conf input
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output 'dfs[a-z.]+'
hdfs dfs -cat output/*

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomwriter out/

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 100 gendata
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort gendata sorted
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate gendata reportdata

Getting logs

yarn application -list
yarn logs -applicationId APPLICATION_ID

HBase

See the Hadoop HBase page.

Hive

Information

Access administration:

  • handled at the HDFS level ==> after a database is created, access can be controlled via the rights of the HDFS directory /user/hive/warehouse/${DATABASE}
  • by default, a newly created database is readable and writable by all Hadoop users
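
The HDFS-level control mentioned above can look like the following sketch; the database name "mydb" and the mode 750 are illustrative values, not defaults of this installation (the hdfs command needs a valid Kerberos ticket, so the call is guarded):

```shell
DATABASE=mydb                                   # hypothetical database name
WAREHOUSE_DIR="/user/hive/warehouse/${DATABASE}"

# Restrict the database directory to its owner and group;
# the guard makes this a no-op where no hdfs client is installed.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -chmod -R 750 "$WAREHOUSE_DIR"
fi
```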

Connectivity

  • beeline client ("$HIVE_DB" should be replaced by database name):
 beeline -u "jdbc:hive2://`hostname -f`:10000/$HIVE_DB;principal=hive/`hostname -f`@HADOOP"
  • beeline client - connection with command '!connect' ("HIVE_DB" should be replaced by database name):
 beeline
   !connect jdbc:hive2://hadoop.terra:10000/HIVE_DB;principal=hive/hadoop.terra@HADOOP  
  • java code (requires the hive-jdbc driver on the classpath):
String url = "jdbc:hive2://hadoop.terra:10000/HIVE_DB;principal=hive/hadoop.terra@HADOOP";
Connection con = DriverManager.getConnection(url);
ResultSet rs = con.createStatement().executeQuery("SHOW TABLES");
  • hive client (deprecated; beeline is the supported client):
hive

For details, run hive --help.

Integration with HBase

https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

Integration with Spark

See #Examples.

Examples

HIVE_DB="`id -un`_test"
JDBC_URL="jdbc:hive2://`hostname -f`:10000/$HIVE_DB;principal=hive/`hostname -f`@HADOOP"

# creating database <USER>_test
beeline -u $JDBC_URL -e "CREATE DATABASE $HIVE_DB"

# usage of <USER>_test
beeline -u $JDBC_URL
  CREATE TABLE pokes (foo INT, bar STRING);
  CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
  SHOW TABLES;
  SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
  DESCRIBE invites;
  
  INSERT INTO pokes VALUES (1, 'A'), (2, 'B');
  INSERT INTO invites PARTITION (ds='2015-01-01') SELECT * FROM pokes;
  INSERT INTO invites PARTITION (ds='2015-12-11') SELECT * FROM pokes;

Spark

Information

Supported modes:

  • Spark through YARN (cluster): --master yarn --deploy-mode cluster
  • Spark through YARN (client): --master yarn --deploy-mode client
  • local Spark: --master local

Examples

  • calculating pi
spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn `ls /usr/lib/spark/examples/jars/spark-examples*.jar | head -n 1` 10
# (replace APPLICATION_ID by real ID)
yarn logs -applicationId APPLICATION_ID
  • calling map/reduce
    (replace LOGIN by user name)
hdfs dfs -put /etc/passwd

spark-shell --master yarn
  val file = sc.textFile("hdfs:///user/LOGIN/passwd")
  val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs:///user/LOGIN/spark-output")

hdfs dfs -cat ./spark-output/\*
hdfs dfs -rm -r ./spark-output ./passwd
  • usage of Hive metastore
    (replace LOGIN by user name, presumes database created in #Examples_2)
spark-shell
  sql("FROM LOGIN_test.pokes SELECT *").collect().foreach(println)
  sql("FROM LOGIN_test.pokes SELECT *").show()
  • launching own application

It is necessary to copy /usr/lib/spark/lib/spark-assembly.jar, add your own libraries, copy the result to HDFS, and point Spark at it:

# (replace /user/hawking/lib/spark-assembly.jar with the real path on HDFS)
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.jar=hdfs:///user/hawking/lib/spark-assembly.jar \
  ...