Deploying the architecture
Part of the project consists of tools to easily deploy a Kafka cluster, a Hadoop/Yarn cluster, and a Spark instance. The overall architecture is general and flexible: it can run both on a local machine and on a cluster of machines, without significant differences, and it also allows running the demo.
Overall, the tools allow deploying Kafka and Yarn/Hadoop clusters of arbitrary size. In the Kafka case, Zookeeper instances and Kafka brokers can both run on multiple nodes; in the Hadoop case, the setup is fixed to a single "master" node (NameNode/ResourceManager), while only the "slave" nodes (DataNodes/NodeManagers) can have multiple instances.
This was tested both on AWS EC2 (t2 instances) running Ubuntu 18.04 LTS and on a single local machine (same OS). The focus is on these environments, but since everything runs from Docker images, the changes needed to deploy on other kinds of setup should be negligible.
Some utilities used while developing and testing the project (such as scripts to run SSH commands on multiple hosts) are not reported, to avoid unnecessary clutter.
The following sections describe the procedures to deploy both on a local environment and on a cluster.
Docker setup
Installing Docker is essential for both cluster and local deployment.
If using Ubuntu, execute these commands on each node. If they do not work, refer to the Docker documentation on how to install it.
sudo apt-get update
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get -y install docker-ce docker-ce-cli containerd.io
All subsequent examples use sudo for Docker commands. This can be avoided by adding the user to the docker group and logging out and in again:
sudo usermod -a -G docker ubuntu #ubuntu is the user here
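After logging back in, Docker commands should work without sudo; a quick check (this assumes the public hello-world image can be pulled):
docker run --rm hello-world #should print a welcome message without needing sudo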
Local deployment
A bridge network has to be created with the following command, so that containers can communicate with each other through their name/hostname:
sudo docker network create -d bridge --attachable bridgenet
Every container has to be connected to this network (this is shown in the deployment section for each component).
The IPs associated with the logical names can be seen (once the containers are started) with:
sudo docker network inspect bridgenet
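If only the name-to-IP mapping is of interest, the inspect output can be filtered with a Go template; a minimal sketch (network name as created above):
sudo docker network inspect -f '{{range .Containers}}{{println .Name .IPv4Address}}{{end}}' bridgenet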
Cluster deployment
Using Docker Swarm allows containers on different nodes to communicate freely (through Docker overlay networks). All that is needed from a network setup perspective is to open the ports required by Docker Swarm on all nodes. Another benefit is that containers can address each other by container name (Docker provides an internal DNS for this).
Each machine in the Docker Swarm needs these ports to be reachable:
- 2376/tcp
- 2377/tcp
- 7946/tcp
- 7946/udp
- 4789/udp
See how to do it (for AWS) in the port forwarding section.
On one node, which will be used as the swarm "leader", run:
sudo docker swarm init
This will return a command to execute on other nodes, to make them join the swarm as "workers".
Execute the returned command on each node (except the "leader" one), preceding it with sudo.
Here is an example (mock token):
sudo docker swarm join --token SWMTKN-1-3pu6hszjas19xyp7ghgosyx9k8atbfcr8p2is99znpy26u2lkl-7p73s1dx5in4tatdymyhg9hu2 192.168.99.121:2377
If the token is lost, another one can be obtained by executing this command on the "leader" node:
sudo docker swarm join-token worker
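Once all the workers have joined, membership can be verified on the "leader" node:
sudo docker node ls #lists all swarm members and their status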
On the "leader" node an overlay network has to be created with the following command:
sudo docker network create -d overlay --attachable swarmnet
Every container on each node has to be connected to this network (this is shown in the deployment section for each component).
Running detached containers
There are multiple options when running a Docker container; in particular, containers can be run in a "detached" fashion, which allows closing the shell (and the SSH session, if running on a remote node) without stopping the container.
This can be done with the -d option, e.g.:
sudo docker run --name mycontainer -d --net mynet username/myimage
Access to that container can then be done at any time with:
sudo docker exec -it mycontainer bash
This is useful in situations like the proposed demo, where it would be hard to maintain an open shell for each container at the same time.
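A few other commands that are handy with detached containers (container name as in the example above):
sudo docker logs mycontainer #show the container's console output
sudo docker stop mycontainer #stop the container
sudo docker start mycontainer #start it again, still detached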
Kafka
The Docker image for deploying a Zookeeper/Kafka cluster is available at carmineds/zookafka (Docker Hub).
To build the image (not needed), follow the image build section.
Run Kafka container(s)
Run the Docker container on each node where a Kafka broker and/or a Zookeeper instance is wanted. Also refer to the running detached containers section.
Components on the same cluster can connect to Kafka brokers on port 9093/tcp; connecting to a broker from outside can be done on port 9092/tcp (the network has to be properly configured).
Local deployment
Run:
sudo docker run -it -h kafka1 --name kafka1 --net bridgenet -d carmineds/zookafka
For other instances (optional), run the same command in other shells, changing the container name and hostname, e.g.:
sudo docker run -it -h kafka2 --name kafka2 --net bridgenet -d carmineds/zookafka
Cluster deployment
Run on a node:
sudo docker run -it -h kafka1 --name kafka1 -p 9092:9092 -p 2881:2881 -p 3881:3881 -p 2181:2181 --net swarmnet -d carmineds/zookafka
On each other node where a Zookeeper instance and/or Kafka broker is wanted, run the command with a different container name/hostname, e.g.:
sudo docker run -it -h kafka2 --name kafka2 -p 9092:9092 -p 2881:2881 -p 3881:3881 -p 2181:2181 --net swarmnet -d carmineds/zookafka
With this command, ports are exposed on the container's host network interface, allowing client connections from outside the cluster (optional). In that case, the ports also need to be opened on the host machine (for AWS refer to the port forwarding section).
Inside the container
Once the container(s) are started, follow the instructions printed on the console.
Testing deployment
Run on each node:
jps
This should print something like:
565 Jps
206 Kafka #only nodes designated for Kafka deployment
127 QuorumPeerMain #only nodes designated for Zookeeper deployment
Create a test topic:
kafka-topics.sh --bootstrap-server localhost:9093 --replication-factor 1 --partitions 1 --topic test --create
Populate the topic with some messages:
kafka-console-producer.sh --bootstrap-server localhost:9093 --topic test
Write something and press Enter, then close the process with Ctrl-C.
Consume the topic:
kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic test --from-beginning
This should show messages previously submitted to the producer.
To test a cluster deployment, try running producer and consumer processes on different nodes.
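Optionally, the topic metadata (partition leader, replicas, in-sync replicas) can be inspected with the same tooling:
kafka-topics.sh --bootstrap-server localhost:9093 --topic test --describe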
Hadoop
The ready-to-go image for deploying a Yarn/Hadoop cluster is available at carmineds/hadoop (Docker Hub).
To build the image (not needed), follow the image build section.
Run Hadoop containers
Run the Docker container on each node that should be part of the Yarn/Hadoop cluster. At least two running containers are needed (a "master" and a "slave"). Also refer to the running detached containers section.
Naming
The container name and hostname of the "master" node have to be nn1 (needed by some scripts).
Ports that could be particularly useful to reach from outside are 8088/tcp on the "master" node, for the ResourceManager Web UI (Yarn), and 8042/tcp for the NodeManager(s) Web UI (Yarn).
For more information check the Hadoop/Yarn documentation.
Local deployment
Run (e.g. 3 instances: 1 "master", 2 "slaves"):
sudo docker run -it -h nn1 --name nn1 --net bridgenet -d carmineds/hadoop #master node, name and hostname have to be nn1
sudo docker run -it -h dn1 --name dn1 --net bridgenet -d carmineds/hadoop #slave node; name and hostname have to match each other, but can be anything
sudo docker run -it -h dn2 --name dn2 --net bridgenet -d carmineds/hadoop #another slave node
Cluster deployment
E.g. (1 "master", 2 "slaves", all on different machines).
On machine 1 ("master"):
sudo docker run -it -h nn1 --name nn1 -p 9000:9000 -p 8020:8020 -p 8030-8033:8030-8033 -p 8088:8088 -p 50070:50070 -p 50090:50090 -p 4444:22 -d --net swarmnet carmineds/hadoop
On machine 2 ("slave"):
sudo docker run -it -h dn1 --name dn1 -p 50075:50075 -p 50475:50475 -p 50020:50020 -p 50010:50010 -p 8042:8042 -p 4444:22 -d --net swarmnet carmineds/hadoop
On machine 3 ("slave", with a different container name/hostname):
sudo docker run -it -h dn2 --name dn2 -p 50075:50075 -p 50475:50475 -p 50020:50020 -p 50010:50010 -p 8042:8042 -p 4444:22 -d --net swarmnet carmineds/hadoop
With these commands, ports are exposed on the container's host network interface, allowing client connections from outside the cluster (optional). In that case, the ports also need to be opened on the host machine (for AWS refer to the port forwarding section).
Inside the container
Once the container(s) are started, follow the instructions printed on the console.
Configurations
- Configurations about node resources (memory and CPU cores) can be changed in /hadoop/etc/hadoop/yarn-site.xml. On the "master" node, the scheduler properties set the maximum amount of resources that a Yarn container can be assigned. On "slave" node(s), the nodemanager properties set how much of the node's resources are available to Yarn containers (it is good to leave some memory and a CPU core for the other processes running on the node).
- Configurations about queues (for parallel Yarn applications) can be changed in /hadoop/etc/hadoop/capacity-scheduler.xml on the "master" node. By default there are 3 queues (default, second, third) with corresponding resource proportions of 50%/25%/25% (ready for the demo; a quick check of the queue setup is sketched below). To go back to a single default queue with 100% of the resources, restore the Hadoop distribution's default configuration (saved as a backup):
cp /hadoop/etc/hadoop/capacity.bck /hadoop/etc/hadoop/capacity-scheduler.xml
For more information refer to the Yarn/Hadoop documentation.
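As a quick sanity check of the queue setup, the configured queues can be queried from the "master" node; a minimal sketch (queue names as in the default configuration described above):
yarn queue -status default #prints capacity, maximum capacity and current state of the queue
yarn queue -status second
yarn queue -status third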
Testing deployment
Run on each node:
jps
Results should be something like this on the "master" node:
410 SecondaryNameNode
204 NameNode
572 ResourceManager
863 Jps
and this on "slave" node(s):
180 NodeManager
60 DataNode
303 Jps
To test HDFS, create a mock file, and put it into the HDFS filesystem:
echo "hello world" > test.txt
hdfs dfs -put test.txt /
hdfs dfs -ls / #in the output list, test.txt should be present
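The file can also be read back to confirm it was stored correctly:
hdfs dfs -cat /test.txt #should print "hello world"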
Yarn can be tested after the Spark deployment.
Spark
A Docker image has been built for this too. It is simply a Spark distribution with the configuration to connect to the Yarn "master" node already included.
It can be found at carmineds/spark.
To build the image (not needed), follow the image build section.
Run Spark container
Refer to the running detached containers section.
Run:
sudo docker run -it -d -h spark --name spark --net swarmnet -p 4040-4045:4040-4045 -p 4444:22 carmineds/spark
Port 4040 (and the following ones, incrementally) is used for the Spark Web UI. If access to other services is needed, add the corresponding ports to the command.
Remember that the network on the host machine has to be configured properly for access from outside the cluster (for AWS refer to the port forwarding section).
Testing Spark on Yarn/Hadoop
A previous deployment of Yarn/Hadoop is needed.
On nn1 (the Hadoop/Yarn "master" node) create a simple text file and add it to HDFS:
echo "hello world" > test.txt
hdfs dfs -put test.txt /
hdfs dfs -ls / #in the output list, test.txt should be present
On the Spark node open a PySpark interactive shell:
pyspark --master yarn
Once the shell is ready, run:
df = spark.read.text('/test.txt')
df.count()
The number of lines in the text file should be printed (one line, so 1).
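While the PySpark shell is still open, the corresponding Yarn application can be seen from the "master" node (nn1):
yarn application -list #the PySpark shell should appear as a RUNNING application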
Appendix: port forwarding (for cluster deployment only)
AWS allows opening ports incrementally with Security Groups; similar results can in general be achieved on any cloud provider and also on on-premises clusters. A Security Group contains a set of ports to open towards a range of IPs, and it can be applied to any EC2 machine (node). For example, a Security Group can be made for the ports needed by Docker Swarm (required by each node in a cluster deployment).
Other Security Groups can be made for any service we want to expose (e.g. ResourceManager Web UI).
It is important to have at least one Security Group that includes port 22/tcp (opened towards the local machine), to access the nodes via SSH.
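As an illustration only, such a Security Group could also be created with the AWS CLI, roughly like this (group name and CIDR range are placeholders; the AWS web console achieves the same result):
aws ec2 create-security-group --group-name docker-swarm --description "Docker Swarm ports" #placeholder group name
aws ec2 authorize-security-group-ingress --group-name docker-swarm --protocol tcp --port 2377 --cidr 172.31.0.0/16 #placeholder CIDR; repeat for the other Swarm ports listed above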
Note
Containers can expose their internal SSH port (22/tcp) through an alternative port, since port 22/tcp is already bound by the SSH server on the container's host, e.g. (mapping to port 4444):
sudo docker run ... -p 4444:22 -p 1234:1234 ... username/myimage
The Hadoop and Kafka images in this project provide a running SSH server by default.
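With such a mapping, the container's SSH server can then be reached through the host; a sketch (the user name and the host address are placeholders that depend on how the image and the node are configured):
ssh -p 4444 user@node-public-ip #user and node-public-ip are placeholders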
Appendix: how to build docker images
Looking into the Dockerfile of the image to be built, some tar.gz packages (or similar) are used. As they are heavy files, they have been omitted from the project files. Download them from their respective websites and put them in the same directory as the Dockerfile.
To build the image, run (from that same directory):
sudo docker build -t username/myimage .
For local deployment, this is enough to then run the username/myimage image as many times as needed.
For cluster deployment, repeating this on each node can be avoided by uploading the image to Docker Hub, or to some other Docker registry.
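A minimal sketch of that workflow (username/myimage is a placeholder, as above):
sudo docker login #authenticate against Docker Hub
sudo docker push username/myimage #upload the built image
sudo docker pull username/myimage #run this on every other node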