Hadoop documentation
Basic information
Hadoop is software for parallel processing of large amounts of data using the MapReduce algorithm. It is designed to distribute data processing across many computational nodes and to collect the results from them.
Hadoop can currently be launched within the MetaCentrum cloud infrastructure.
- documentation for cloud:
- image (public):
  - id: 6e0bcd58-a1fc-4d57-a6a0-850837bcf9e5
  - name: debian-9-x86_64_hadoop
- test image (community):
  - id: (see openstack image list --community, owner 85c8a74440e94d4b91d0dc067308cb64)
  - name: debian-9-x86_64_hadoop_rc
Installed SW
BigTop Distribution 1.5.0.
Hadoop 2.10.1 - distributed storage and processing of very large data sets
HBase 1.5.0 - distributed, scalable, big data store
Hive 2.3.6 - data warehouse software for reading, writing, and managing large datasets
Spark 2.4.5 - fast and general engine for large-scale data processing
Creating Hadoop in the Cloud
The all-in-one single-machine Hadoop requires more than 4 GB RAM (e.g. the standard.large flavor).
Notes (for more see OpenStack documentation at MUNI):
- security groups: the default security group is used unless specified otherwise
- SSH key: generate a key pair or upload the public part of the key to OpenStack
OpenStack CLI - single machine
Starting the machine:
# example of parameters
image=debian-9-x86_64_hadoop
network=auto_allocated_network
sshkey=openstack
public_network=public-muni-147-251-124-GROUP
name=hadoop

# machine
openstack server create --image "$image" --flavor standard.large --network "$network" --key-name "$sshkey" "$name"

# public IP
ip=$(openstack floating ip create -c floating_ip_address -f value "$public_network")
openstack server add floating ip "$name" "$ip"
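To check that the machine is up and the floating IP is attached, the standard OpenStack client can be used (a quick check, reusing the variables from the example above):

openstack server show "$name" -c status -c addresses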
Login to the machine:
ssh debian@$ip
Start the installation (runs for about 5 minutes):
sudo /usr/local/sbin/hadoop-single-setup.sh
OpenStack GUI (Horizon) - single machine
Compute -> Images -> "debian-9-x86_64_hadoop" -> Launch
- Details -> Instance Name: name
- Flavor: at least standard.large (> 4GB RAM)
- Networks: auto_allocated_network
- Security Groups: at least port 22 (SSH) must be open, but the machine can be fully public (authentication is enabled)
- Key Pair: required to log in to the machine
- Launch instance
Compute -> Instances -> (machine) -> Associate floating IP
Login to the machine:
ssh debian@$ip
Start the installation (runs for about 5 minutes):
sudo /usr/local/sbin/hadoop-single-setup.sh
Terraform - clustered
Download terraform code:
git clone https://gitlab.cesnet.cz/702/HADOOP/terraform
Create mycluster.auto.tfvars file (replace mydomain, PUBLIC_IP and SSH_KEYPAIR_NAME by real values):
domain = "mydomain"
floating_ip = "PUBLIC_IP"
n = 3
security_trusted_cidr = [
  "0.0.0.0/0",
  "::/0",
]
ssh = "SSH_KEYPAIR_NAME"
# local_network = "auto_allocated_network"
Launch terraform:
terraform init
terraform apply
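Optionally, the planned changes can be reviewed beforehand, and the whole cluster can be torn down again when no longer needed (standard Terraform commands):

terraform plan
terraform destroy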
See also terraform#build-cluster.
Usage
How to login
- ssh to the public floating IP, using the ssh key (where $ip is the public IP):
ssh debian@$ip
- get a ticket
kinit < password.txt
A local Kerberos server is running within the cluster:
- realm: HADOOP
- principal: debian@HADOOP
- password: in the file /home/debian/password.txt
- local domain: terra
- configuration: /etc/krb5.conf.d/hadoop
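To verify that the ticket was obtained, list the credential cache (klist is a standard Kerberos tool; it should show the principal debian@HADOOP):

klist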
Basic test of the installation
HDFS listing:
hdfs dfs -ls /
Testsuite:
/opt/hadoop-tests/run-tests.sh
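A minimal manual HDFS read/write check can also be done with standard hdfs dfs commands (the smoke-test directory name below is just an arbitrary example):

hdfs dfs -mkdir -p smoke-test
hdfs dfs -put /etc/hostname smoke-test/
hdfs dfs -cat smoke-test/hostname
hdfs dfs -rm -r smoke-test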
Remote access
Kerberos client
How to set up the local client:
- check/add a line includedir /etc/krb5.conf.d/ in /etc/krb5.conf
- script (where $ip is the public floating IP, or the server name if it is specified in /etc/hosts - see #Web access):
cat > /tmp/hadoop <<EOF
[realms]
HADOOP = {
  kdc = $ip
  admin_server = $ip
  default_domain = terra
}

[domain_realm]
.terra = HADOOP
terra = HADOOP
EOF
sudo mv /tmp/hadoop /etc/krb5.conf.d/
Web access
Data can be accessed via the web (WebHDFS, read-only), and cluster service information is also available. Web access to Hadoop requires the SPNEGO protocol (Kerberos over HTTP(S)).
Hadoop must be accessed using the DNS name corresponding to the locally used name (e.g. hadoop.terra). You need to add an entry to /etc/hosts (where $ip is the public IP address):
echo "$ip hadoop.terra" >> /etc/hosts
Next, enable the SPNEGO protocol in the browser:
- firefox [1]:
- about:config
- set network.negotiate-auth.trusted-uris=terra ("terra" is the locally used domain)
- for Windows:
- install Kerberos
- set also network.auth.use-sspi=false
- chrome: [2]
- parameter --auth-server-whitelist=terra ("terra" is the locally used domain)
- permanent setting: create the file /etc/opt/chrome/policies/managed/hadoop-metacentrum.json ("terra" is the locally used domain):
{
  "AuthServerWhitelist": "terra",
  "AuthNegotiateDelegateWhitelist": "terra"
}
- Windows: SPNEGO and Kerberos are probably not supported
Getting a ticket (password is on the cloud machine in ~debian/password.txt):
kinit debian@HADOOP
Note: on networks behind NAT you must use an address-less ticket (option -A).
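For example (the -A option of MIT Kerberos kinit requests an address-less ticket):

kinit -A debian@HADOOP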
Result (replace "hadoop.terra" with the actual server name):
- HDFS: http://hadoop.terra:50070
- YARN: http://hadoop.terra:8088
- MapRed History: http://hadoop.terra:19888
- Spark: http://hadoop.terra:18088
curl
Access via SPNEGO uses the curl parameters --negotiate -u :.
First you need to get a ticket:
kinit debian@HADOOP
Note: on networks behind NAT you must use an address-less ticket (option -A).
File information (where $HDFS_PATH is the path to HDFS):
curl -i --negotiate -u : "http://hadoop.terra:50070/webhdfs/v1/$HDFS_PATH?op=LISTSTATUS"
Download file (where $HDFS_PATH is the path to HDFS):
curl -L --negotiate -u : "http://hadoop.terra:50070/webhdfs/v1/$HDFS_PATH?op=OPEN" -o file
Upload the file using two steps (where $HDFS_PATH is the HDFS path, $LOCAL_FILE is the local file):
# 1) creates a file; a reference to the data is returned in 'Location'
curl --negotiate -u : -i -X PUT "http://hadoop.terra:50070/webhdfs/v1${HDFS_PATH}?op=CREATE" | tee curl.log

# URL from the 'Location' header
data_url="`grep ^Location: curl.log | cut -c11- | tr -d '\r'`"

# 2) data upload
curl -i -T ${LOCAL_FILE} ${data_url}
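Deleting a file works similarly, using the WebHDFS DELETE operation (where $HDFS_PATH is the path to HDFS):

curl --negotiate -u : -X DELETE "http://hadoop.terra:50070/webhdfs/v1/$HDFS_PATH?op=DELETE"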
Hadoop
Information
- max. job length is limited by:
- max. renewable lifetime of the user's Kerberos ticket
- max. lifetime of Hadoop user tokens (7 days)
Examples
On the front-end, access via Kerberos:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10

hdfs dfs -put /etc/hadoop/conf input
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output 'dfs[a-z.]+'
hdfs dfs -cat output/*

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomwriter out/

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 100 gendata
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort gendata sorted
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate sorted reportdata
Getting logs
yarn application -list
yarn logs -applicationId APPLICATION_ID
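A running application can also be inspected or stopped with the standard yarn commands (replace APPLICATION_ID by the real ID):

yarn application -status APPLICATION_ID
yarn application -kill APPLICATION_ID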
HBase
See the Hadoop HBase page.
Hive
Information
Access administration:
- handled at the HDFS level ==> after a database is created, access can be controlled via the rights on the HDFS directory /user/hive/warehouse/${DATABASE} (see the example below)
- the default rights after database creation allow both reading and writing for all Hadoop users
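For example, access can be restricted with ordinary HDFS permission commands. A minimal sketch, assuming a database named mydb (depending on the Hive version, the warehouse directory may carry a .db suffix):

hdfs dfs -chmod -R 750 /user/hive/warehouse/mydb
hdfs dfs -ls /user/hive/warehouse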
Connectivity
- beeline client ("$HIVE_DB" should be replaced by database name):
beeline -u "jdbc:hive2://`hostname -f`:10000/$HIVE_DB;principal=hive/`hostname -f`@HADOOP"
- beeline client - connection with command '!connect' ("HIVE_DB" should be replaced by database name):
beeline
!connect jdbc:hive2://hadoop.terra:10000/HIVE_DB;principal=hive/hadoop.terra@HADOOP
- java code:
String url = "jdbc:hive2://hadoop.terra:10000/HIVE_DB;principal=hive/hadoop.terra@HADOOP";
Connection con = DriverManager.getConnection(url);
- hive client (deprecated; the supported client is beeline):
hive
For details type hive cli --help.
Integration with HBase
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
Integration with Spark
See #Examples.
Examples
HIVE_DB="`id -un`_test"
JDBC_URL="jdbc:hive2://`hostname -f`:10000/$HIVE_DB;principal=hive/`hostname -f`@HADOOP"

# creating database <USER>_test
beeline -u "$JDBC_URL" -e "CREATE DATABASE $HIVE_DB"

# usage of <USER>_test
beeline -u "$JDBC_URL"

CREATE TABLE pokes (foo INT, bar STRING);
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES;
SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
DESCRIBE invites;
INSERT INTO pokes VALUES (1, 'A'), (2, 'B');
INSERT INTO invites PARTITION (ds='2015-01-01') SELECT * FROM pokes;
INSERT INTO invites PARTITION (ds='2015-12-11') SELECT * FROM pokes;
Spark
Information
Supported modes:
- Spark through YARN (cluster): --master yarn --deploy-mode cluster
- Spark through YARN (client): --master yarn --deploy-mode client
- local Spark: --master local
Examples
- calculating pi
spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn `ls /usr/lib/spark/examples/jars/spark-examples*.jar | head -n 1` 10

# (replace APPLICATION_ID by real ID)
yarn logs --applicationId APPLICATION_ID
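The same example can also be run in local mode without YARN; in that case the result is printed directly to the console (useful as a quick sanity check, not a cluster test):

spark-submit --class org.apache.spark.examples.SparkPi --master local[2] `ls /usr/lib/spark/examples/jars/spark-examples*.jar | head -n 1` 10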
- calling map/reduce
(replace LOGIN by user name)
hdfs dfs -put /etc/passwd

spark-shell --master yarn

val file = sc.textFile("hdfs:///user/LOGIN/passwd")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///user/LOGIN/spark-output")

hdfs dfs -cat ./spark-output/\*
hdfs dfs -rm -r ./spark-output ./passwd
- usage of Hive metastore
(replace LOGIN by user name, presumes database created in #Examples_2)
spark-shell

sql("FROM LOGIN_test.pokes SELECT *").collect().foreach(println)
sql("FROM LOGIN_test.pokes SELECT *").show()
- launching own application
It is necessary to copy /usr/lib/spark/lib/spark-assembly.jar, add your own libraries to it, copy it to HDFS, and reconfigure Spark to use it:
# (replace /user/hawking/lib/spark-assembly.jar with the real path on HDFS)
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.jar=hdfs:///user/hawking/lib/spark-assembly.jar \
  ...