Mitigating the Spectre Processor Design Flaw: An Alternative to the Public Cloud

In my last blog I discussed the computer processor design flaw "Spectre".

Because of this design flaw in the processor, many big cloud customers now have serious doubts about the security of their applications and data in the public cloud. That concern is valid and worrisome, so the next question is how to mitigate the problem. Since the flaw is in the hardware itself, releasing a hardware fix in a short span of time is very difficult, and it is impractical to rip out the old processors and replace them with properly designed ones.

So what is a possible solution? In my opinion, it lies in looking at a different side of cloud computing.

Let's restate the problem: the processor design flaw can expose sensitive data when an attacker runs his code on the same physical processor. It is a problem of shared-hardware environments like the public cloud, and it has little impact in an environment where the hardware is not shared.

So if we prevent the attacker from having access to the underlying hardware, so that he cannot run his code directly on our processor or memory, then we are safe. How can that be done? By running our applications in our own data center, with the data inside our own company firewall.

That gives us the answer: rethinking our private cloud strategy.

Let's discuss what we can learn from our experience with the public cloud.

Inspiration Drawn From Public Cloud

  • Start Quickly
    • Can we start a project as quickly as the business users need? The answer is: yes we can.
  • Start Small
    • Can we start small, so that we can right-size the infrastructure, matching the infrastructure spend to the business need? The answer is: yes we can.
  • Scale as you grow
    • As the business grows and shrinks, can the infrastructure scale up and down with it? The answer is: yes we can.

AWS, Azure, Google?

Yes, but it is important to answer the following questions before you move your workload to the public cloud.

  • Are all of your business applications designed to run in the cloud?
  • Do you have many predictable workloads?
  • How many elastic workloads do you run in the public cloud?
  • Does the economic reality of the public cloud align with your business objectives?
  • Does public cloud security meet your business requirements?

Hyperconverged infrastructure can bring the advantages of the public cloud along with a lower per-VM cost and the security we need for our enterprise data.

Hyperconverged Infrastructure Advantage


So let's see how we can achieve public cloud functionality with hyperconverged infrastructure in a private cloud.

  1. Start Quickly – You can order one node and start working. One question: do you really start working on a public cloud VM right after purchasing it with a credit card? Of course not; we need time to plan, and a cloud architect or admin has to design and create the infrastructure before anyone uses it. In that same time we can order one hyperconverged node and start using it.
  2. Start Small – Yes, we can start with a single node.
  3. Scale as you grow – We can buy more nodes as the project progresses and demand increases; the existing infrastructure will scale and rebalance itself without much administrative headache.
  4. Shrink or release the infrastructure when you are done: ?

Well, that one is a question mark. I am sure we can't return a VxRail or Nutanix appliance after using it, but we can plan our requirements carefully so that we never have to give back compute power we are already using.

So we get almost all the advantages of the public cloud at a lower price, and with higher data security, because we are doing it all in a private cloud.

Let's also discuss some of the recent innovations in processor, memory, networking and storage technology that helped cloud computing mature.

Server Power, Size and Cost

As the public cloud gained maturity, hardware cost and size have both dropped significantly. To give an example, a Raspberry Pi Zero, with a 1 GHz single-core CPU and 512 MB RAM, costs only $5, and it can run Linux quite well.

Per-CPU core counts have gone up, memory cost has gone down, and SSD cost has gone down. This affects both server cost and power draw: we can now get a very powerful server with a lot of storage at a very low price and in a small footprint.

Here is a short explanation of how one can create a private cloud environment with different underlying hardware and technology. The right-hand side of the picture shows how the per-VM cost in a cloud environment has gone down.


Let's see how much hardware costs have come down. (I am assuming this configuration is not for a mission-critical application, because I believe mission-critical applications still have not completely found their way to the public cloud.)

A 2U rack-mount Supermicro server with the following quick configuration may cost around $14,000:

2 CPUs with 24 cores in total, 256 GB DDR4 RAM, and around 5 TB of usable storage.

Let's assume I get 3 such servers plus a VMware vSphere Essentials Kit ($660 with 3 years of support).

Total cost: $14,000 × 3 + $660 = $42,660 + admin cost.

This configuration has –

24 * 3 = 72 Compute Cores

256 * 3 = 768 GB RAM

5 TB * 3 = 15 TB of Usable Storage.

VMware vSphere with one vCenter Server Instance.

I assume I can run this server configuration for the next 4 years (enterprise hardware is pretty stable and robust now), and that I don't need all that HA, data backup, disaster recovery planning, and so on. I know I am oversimplifying for the sake of cost containment, but do all enterprise workloads really need those features? Development and testing teams, or some low-priority software servers, often do not.

Let’s see how many test Servers I can assign to my Development and Test team.

Assuming I give each server a minimum of 8 GB RAM, I can create 768 / 8 = 96 VMs with this configuration. Since I have only 72 compute cores, I'll reduce that to 72 instances, assigning 1 core to each instance.
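The sizing logic above can be sketched in a few lines of Python. The figures are the ones quoted in the text; this is an illustration, not a capacity-planning tool:

```python
# Sketch of the dev/test sizing: the VM count is bounded by whichever
# resource runs out first, RAM or physical cores.
total_ram_gb = 256 * 3      # 768 GB across the 3 servers
ram_per_vm_gb = 8
physical_cores = 24 * 3     # 72 cores across the 3 servers

ram_bound = total_ram_gb // ram_per_vm_gb    # 96 VMs by memory alone
vm_count = min(ram_bound, physical_cores)    # 72 VMs at 1 core each

print(ram_bound, vm_count)   # 96 72
```

The `min` makes the constraint explicit: memory would allow 96 VMs, but with 1 dedicated core per VM, cores are the binding limit.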

Assuming my developers work 10 hours per day: 8 hours in the office and maybe 2 more in their own private time (which is true; developers work more than 8 hours 🙂).

I know I have to add electricity, cooling, real estate, and human resource costs to this calculation, but considering the size of today's hardware I am sure they are not much: the configuration I am describing will hardly take any space. Any company can carve out a small partition, run the servers there, and secure them without adding any real estate cost.

Let's compare this cost to the same number of instances of the same size on AWS.

It comes to $2,038/month.

So for 48 months: $2,038 × 48 = $97,824 + admin cost.

$42,660 + admin cost vs $97,824 + admin cost over 4 years.
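The 4-year comparison can be reproduced with a quick sketch. All figures are the quotes and estimates used in the text, not vendor list prices:

```python
# Rough sketch of the 4-year cost comparison, using the figures quoted above.
servers = 3
server_cost = 14_000       # assumed quote per 2U Supermicro node
vsphere_kit = 660          # vSphere Essentials Kit with 3-year support
private_cloud = servers * server_cost + vsphere_kit   # one-time spend

aws_monthly = 2_038        # quoted estimate for the comparable AWS instances
months = 48
public_cloud = aws_monthly * months

print(private_cloud, public_cloud)   # 42660 97824
```

Note the comparison deliberately leaves admin, power, and cooling costs out of both sides, as discussed above.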

Let’s consider another configuration of VMs for the Server configuration above.

24 * 3 = 72 Compute Cores

256 * 3 = 768 GB RAM

5 TB * 3 = 15 TB of Usable Storage.

We have 72 physical cores, i.e. 72 × 2 = 144 logical processor threads with hyper-threading.

So I can run 144 VMs with one thread each, which is what a vCPU means in any public cloud.

And if I assign 4 GB RAM to each VM, I'll need 144 × 4 = 576 GB of RAM, which is still less than our available RAM.

So let's calculate the cost of 144 VMs with 1 vCPU and 4 GB RAM on AWS.

Almost the same result.

Actually, the vCPU AWS gives you is not a whole thread dedicated to your VM; it is a shared environment, and one logical CPU can be divided into many vCPUs. So we could actually create even more VMs in our server environment: going by RAM availability, we can create 768 / 4 = 192 VMs with 1 vCPU and 4 GB RAM each. Since users are not going to use the processor continuously, the environment is perfectly valid and can perform as desired.
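A small sketch of this vCPU arithmetic, again using the figures from the text (real oversubscription ratios vary by hypervisor and workload):

```python
# Sketch of the vCPU arithmetic: thread-backed VMs vs RAM-limited VMs once
# vCPUs are oversubscribed, as public clouds do.
physical_cores = 72
threads = physical_cores * 2           # 144 logical processors (hyper-threading)
total_ram_gb = 768
ram_per_vm_gb = 4

one_thread_per_vm = threads                    # 144 VMs, one full thread each
ram_limited = total_ram_gb // ram_per_vm_gb    # 192 VMs once threads are shared

print(one_thread_per_vm, ram_limited)   # 144 192
```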

Let’s see what the cost with AWS is.

$2,718/month × 12 months × 4 years = $130,464.

So now we can compare the cost again:

$42,660 + admin cost vs $130,464 + admin cost over 4 years.

You will need a cloud administrator in either case, so there is no administrator cost to add here; you would incur the same for your private cloud too.

Hyperconverged Infrastructure

Let's consider the cost of hyperconverged infrastructure for your private cloud. This is an old blog from VMware, and I am not comparing hyperconverged infrastructure vendors here, but my cost assumptions are roughly confirmed: in that blog each server costs $27,229, and naturally that server is about twice as powerful as my configuration.

Refer to the following 451 Research report.


And referring to the blog from virtualgeek, the minimum cost of a VxRail appliance from VMware is around $60K.

From 6 cores all the way up to 40 cores per CPU, from 64 GB up to 1536 GB of memory, and from 3.6 TB up to 48 TB of storage.

Let's have a quick comparison of the different VxRail series.

VxRail Node Comparisons

|                            | G Series | E Series | V Series | P Series | S Series |
| -------------------------- | -------- | -------- | -------- | -------- | -------- |
| Form Factor                | 2U4N | 1U1N | 2U1N | 2U1N | 2U1N |
| Cores                      | 8–32 | 6–40 | 16–40 | 8–44 | 6–36 |
| Memory                     | 64 GB–512 GB | 64 GB–1536 GB | 128 GB–1024 GB | 128 GB–1536 GB | 64 GB–1536 GB |
| Hybrid Storage Capacity    | 3.6 TB–10 TB | 1.2 TB–16 TB | 1.2 TB–24 TB | 1.2 TB–24 TB | 4 TB–48 TB |
| All-Flash Storage Capacity | 3.84 TB–19.2 TB | 1.92 TB–30.7 TB | 1.92 TB–46 TB | 1.92 TB–46 TB | N/A |
| Use Cases                  | General-purpose for broad hyperconverged use cases | Basic: remote office, stretch cluster, or entry workloads | Graphics-ready for uses such as high-end 2D/3D visualization | High-performance, optimized for heavy workloads such as databases | Capacity-optimized with expanded storage for collaboration, data, and analytics |

So hyperconverged infrastructure can bring the advantages of the public cloud along with a lower per-VM cost and the security we need for our enterprise data.

2018 Trends shaping IT cloud strategies

Here are some of the trends I believe will follow Meltdown and Spectre.

  • Co-location services are on the rise (they make a multi-cloud strategy easier).
  • Hyperconverge your private cloud (build private clouds that operate like public clouds).
  • Use of containers remains a question mark, since the processor design flaw (Spectre in particular) can allow one container to access data from another container on the same host.
  • Cloud cost containment.
  • Lift and shift those cloud apps (lift-and-shift migration tools will accelerate the rate of cloud migration).
  • Enterprise apps may find their way out of the public cloud to a more secure co-lo or a hyperconverged-infrastructure-based private cloud.
  • OpenStack and open source cloud software adoption will be interesting to watch.

Please leave your comments.



The Trend and Future of Cloud Computing After the Discovery of Meltdown and Spectre

Last week, when the new bugs and design flaws (Meltdown and Spectre) were found in processors (Intel and others), questions arose about the future of public cloud computing. In fact, one of my friends told me the public cloud is dead. I replied that I don't agree; I have been watching cloud computing technology almost from its inception.

I am going to talk about the processor design flaw named Spectre.

"Spectre" is a design flaw in modern computer processors.

Modern processors need to be ever faster to serve today's substantial and growing computing requirements, and to get there, processor designers applied some fundamental optimizations.

"Instead of waiting for a task, like a conditional check, to complete and then proceeding to the next function based on the outcome, the system speculates what the next task may be, based on previous executions, and saves the result in cache memory.

If the outcome of the conditional check is favorable, the system proceeds with the speculated task; otherwise it discards it and proceeds with another task. This allows the processor to work much faster."

Let's analyze this problem with a real-world scenario from everyday life.

"I go to a restaurant for lunch every day. I order the same menu for a week, then change it the next week, and follow the same pattern for that week again before changing it.

On the first day the restaurant owner prepared the dish after I ordered it, but after 3–4 days of learning that I take the same menu every day, he prepared the dish even before I arrived, so that he could serve me, and his other customers, faster.

Then suddenly, on the fifth day, I didn't order the same menu; I ordered a different dish. Obviously the owner had to discard the already-prepared dish, then prepare and serve the new one.

Now, Mr. X had been following me for some time to learn my eating pattern, my habits, and what exactly I prefer to eat. Mr. X simply went to the waste bin, found the discarded food, and noted it down to learn what dish I had ordered. In this way Mr. X got my secret without anyone noticing."

So what the restaurant owner did was "speculative execution," and the waste bin can be compared to the cache memory inside the processor; the owner (the processor) never worried about the discarded food left in the bin.

It’s a good thing to have “speculative execution” which assists in the acceleration of the performance of the system. However, the designers never thought about the security of the data which is saved in the cache memory before the conditional check result comes back. Consequently, if someone can get access to this cache memory before it is discarded then they can access the data – including such things as encryption keys, usernames/passwords, and other security credentials – sensitive data which they are not supposed to have access to. And because this memory cache is inside the processor all of the security designed into the chip is circumvented.

This is the underlying problem which has been named “Spectre”.

Understanding Speculative Execution

Let's take the example of a piece of code:

IF (condition) THEN
    C = C + 1
ELSE
    C = C - 1
END IF



The IF … THEN instruction results in a branch; until it is executed, there is no way to know which instruction will be computed next (the addition or the subtraction). Modern processors take advantage of "speculative execution," a technique where the processor speculates, based on previous experience, which instruction is likely to come next, and starts executing it even before the conditional branch instruction completes and returns its result.

So in this case the processor may start executing both instructions, the addition and the subtraction, at the same time to avoid waiting. When the result of the conditional check comes back, the result of the undesired instruction is simply discarded.

Now, if an attacker can read this speculative state from the cache before it is discarded, the attacker has access to data he is not supposed to be allowed to access.

Branch prediction improves instruction-execution performance and speeds up the processing of branches by using a special small cache called the branch target buffer, or BTB. Whenever the processor executes a branch, it stores information about it in this cache. When the processor next encounters the same branch, it can make a speculation, a "guess," about which path is likely to execute.
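To make the mechanism concrete, here is a deliberately simplified toy model in Python. It is not a real exploit (Python cannot trigger hardware speculation); it only models the key idea that a mis-speculated out-of-bounds read can leave a trace in the cache that survives after the architectural result is discarded. All names and values here are illustrative.

```python
# Toy model of a Spectre-style leak. Python cannot really mis-speculate, so
# the mis-predicted path is modeled explicitly to show the cache side effect.
secret = [42, 7, 99]           # data the attacker must never read directly
array1 = [1, 2, 3, 4]          # in-bounds data the victim legitimately uses
array1_size = len(array1)
cache = set()                  # models which values are "hot" in the cache

def victim(x):
    # A branch predictor trained on many in-bounds calls guesses that
    # "x < array1_size" holds and runs the body speculatively.
    predicted_taken = True
    if predicted_taken:
        value = (array1 + secret)[x]   # out-of-bounds read under speculation
        cache.add(value)               # the load warms the cache
    if not (x < array1_size):
        pass   # architectural result is squashed, but cache state remains

victim(len(array1))            # index 4 reaches secret[0]

# The attacker later checks which values are hot (a timing side channel on
# real hardware) and recovers a secret value without ever reading `secret`.
leaked = sorted(v for v in cache if v not in array1)
print(leaked)                  # [42]
```

On real hardware the "is it in the cache?" probe is done by timing memory accesses; here the set membership stands in for that timing measurement.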

Here is a video that explains the Spectre problem.

Here is a video demonstrating the Meltdown attack.

Now let me share my thoughts on why I don't think the public cloud is dead, though yes, there will be a change in how enterprises use cloud computing.



Learn to use Openstack for Free

Here I am giving a link to my video, in which I explain how to create a VM on OpenStack and learn to use it for free. You can access the VM from your home network using SSH, with PuTTY from your Windows laptop or desktop. It is an OpenStack cloud infrastructure where you can create your VM, use it, and learn for free by logging in with your Facebook profile username and password.

#YesWeCan – Letter to our PM Narendra Modi

Dear @NarendraModi Ji,
Last night I was listening to your #MannKiBaat on YouTube; it was quite interesting and appealing too. I found you have asked everybody to participate and share their thoughts on #MannKiBaat and #YesWeCan.

After listening to your conversation with ordinary citizens of India, along with US President @BarackObama, one thing suddenly struck my mind that I have been thinking about for some time now.

Before sharing my thoughts, let me tell you about one of my own experiences, which changed my thought process and beliefs about our own Indian medicine system and ancient India.

Right now I am based in Phoenix, Arizona, USA, but I am originally from a small town in the Indian state of Odisha. Sometime in 1996–97, while I was studying in Chennai, I suffered from jaundice; after struggling for a month I returned to my home town so that my parents could take care of me.

I went to a government hospital at Dhenkanal, Odisha, and the doctor prescribed medicines for another two months. But to my surprise, when I went to purchase those medicines from a nearby medical store, the owner suggested that I not buy them from him, but instead go to a person who could cure me in less than 24 hours at no cost. I smiled at him and said I was not fool enough to believe his story that someone could cure jaundice in less than 24 hours.

Are you mad? Jaundice can be cured in less than 24 hours?

He said his own brother had been cured of an advanced stage of jaundice: they were planning to admit him to AIIMS, Delhi for dialysis, but before going to Delhi they tried this man as a last option, and he cured him in less than 24 hours. He asked me not to buy jaundice medicines from him; after getting cured, I should come back and buy my other medicines from him in the future.

It was convincing, and I decided to give it a try.

The person he suggested was the then chief priest of the Jagannath temple at Tigiria, Odisha (not the famous Puri temple). Tigiria is a small town near Dhenkanal.


Day 1: around 11 am

The same day I went to Tigiria's Jagannath temple, where I saw a good number of people, all jaundice patients, formed into a line to meet the priest. I stood in the line.

When my turn came, I saw that the man was old, perhaps in his 70s, in typical Indian village attire. I don't think he was highly educated, but he was decent and humble, with great confidence on his face and the satisfaction of curing the disease. There was no chair or table; he sat on the ground with a younger assistant. He asked me my problem and then started preparing his medicine. I wanted to see what he was preparing, as I didn't believe him entirely.

So, in front of me, he took 2–3 ridge gourd seeds and one teaspoon of curd, and rubbed the outer black layer of the seeds with a finger against a flat stone, together with the curd. He then threw away the white portion of the seeds; the white curd had turned black.

He asked me to lie down face up, then administered half a teaspoon of that black curd into each of my nostrils and asked me to inhale the solution so that it went towards my forehead rather than my lungs. That was it.

He didn't accept the payment we offered; he politely declined and said he does not do it for money. If I was interested, I could donate something to the goddess Durga he was worshipping in that small hut.

I came back to my sister's home in Dhenkanal town and found I was developing a severe cold; yellow-colored mucus formed inside my nose and started coming out. The next day I felt better but was not cured of jaundice, so I decided to go back to the priest and ask why it hadn't worked for me as he claimed.

Day 2: around 11 am

I went to Tigiria again and met him at the temple. When he saw that I had come back, he got a bit irritated and said I should have waited 2–3 more days before returning.

But then he prepared the same solution, this time with more ridge gourd seeds, perhaps 4 to 5, and two teaspoons of curd; I understood he was giving me a stronger dose. He gave me a bowl of curd and asked me to eat it after taking a full bath in cold water. He also asked me not to eat raw rice, non-veg, or saag (leafy greens) for one week; anything else I wished, including masala curry, was fine. (In India doctors usually advise against masala, particularly turmeric; he reasoned that if I didn't eat properly I would become weak.)

Around 1 pm:

While returning I developed a severe cold; my head felt very heavy and I was uneasy. After reaching Dhenkanal I took a full bath and had my lunch: some rice and the curd given by the priest. I could not finish eating, as yellow-colored water started running out of my nose. I developed a fever and vomited a little.

As my condition deteriorated, my sister asked my brother-in-law to take me to my home town to be with my parents.

Around 8 pm:

In the evening I reached my home town. By then my condition was really bad: fever, severe cold, heavy head, and a nose running with yellow-colored water. I could eat very little dinner, and my parents started worrying when they saw my condition.

12 midnight:

Around midnight my condition deteriorated further; the flow of water from my nose was nonstop. If I lowered my head, it was like two streams of water coming out of my nose. My thick towel was totally soaked with yellow water; literally so much water came out of my nose that I could wring the towel and more water would come out. The flow of yellow water was continuous.

From midnight to around 4 am:

My condition worsened and I could not tolerate the pain; I felt as if my head would burst at any moment, such was the heaviness and pain. I thought the priest might have given me something bad and that I was going to die that night. I even told my mother I might die, as I could not tolerate the pain in my head and might stop breathing at any moment. My mother didn't sleep the whole night, watching me helplessly, as no emergency clinic or hospital was open at that hour. I even started scolding the old priest (which I regret).

Around 4 am:

Slowly the pain and heaviness in my head reduced; I felt better and better, and the flow of yellow water slowly subsided.

Around 7 am:

There was no more water flowing from my nose, only a little mucus. My eyes were completely normal, there was no sign of yellow anywhere on my body, and there was no pain.

Around 10 am: I felt completely free of jaundice, just as I had been one and a half months earlier.

Yes, I was out of jaundice in less than 24 hours. Was it a miracle? No, I don't think so; it was not a miracle, but it was nothing less than one either.

I told my experience to a doctor in Bangalore. He thought I was making up a story; he didn't believe me and said he would give this man a Nobel prize if he could cure jaundice in less than 24 hours. I am sure many who read this letter, including you, Sir, will not believe it either. But I am the proof: I experienced it, and my parents, my sister, and my brother-in-law are witnesses; we know how it happened.

I believe it was not any supernatural power of his, nor any miracle; this respected, humble, and honest priest simply knew a medicine that the present medical fraternity of the world does not.

Even now, many people in the Sambalpur district of Odisha suffer from jaundice; it has become a matter of concern for the Odisha government and the central government too.

But I am afraid the respected priest is no more, and perhaps the knowledge of the medicine has gone with him. I heard his son is giving the medicine now, but it is not as effective as it used to be when his father gave it.

So what is my intention in telling my experience here?

I appeal to you, Sir: do something for these unsung heroes of India. Let's take steps to research the techniques of these unfamiliar and unconventional medicines that really do work wonders.

I know some will call this "andh vishwas" (blind faith); yes, I too know some blind faith exists in our Indian society. But there are genuine people who know some invaluable things, and we need to have something for them. Maybe we can find something the world has not yet seen, for I believe ancient India was much more advanced than the rest of the world.

Here is my analysis. Why didn't it work for me on the first day?

Because on the first day the priest didn't give me the right dose.

And why did I suffer unbearable pain on the second day and feel like I was dying?

Maybe, on the second day, it was an overdose for me.

So if we, the modern world, can do some research on this drug and find the right formulation and dose, then I am sure we can get rid of this dreaded disease with a new medicine.

I have heard of many unconventional treatments in different parts of India for different diseases. Instead of just calling them "andh vishwas," let's give these people a platform to demonstrate their talent in public, with government support, and do research to refine those methods and drugs, so that we can all be part of such miracles and revive our ancient and advanced treatment methods instead of depending only on the research of the Western world. I am not saying the Western world is not advanced in medicine, but we can have our own ways too.



Panchaleswar Nayak

Readers of this letter, please share it as much as possible. If you know of, or have experienced, any different and unconventional way of treating a disease, please share it and write about your own experience.

Warning: Do not try the method I explained here at home; it can be dangerous, and I am in no way responsible for the outcome if you try this method for curing jaundice. If you want to try it, you can contact the priest (the son of the priest who treated me) of the Lord Jagannath Temple, Tigiria, Odisha, India. ,85.510776,9z/data=!4m2!3m1!1s0x3a18e4ae146a9a9b:0xea1a48eb421ea374

Cloudera's Quickstart VM vs Hortonworks Sandbox comparison – Dec 2014

Last year I wrote a comparative study of the two big Hadoop distributions, Cloudera and Hortonworks, through their learning products: the QuickStart VM from Cloudera and the Sandbox from Hortonworks.

Now let's do the same thing again: after one year, let's see what has changed, and what the differences and similarities are between these two products (or Hadoop distros).

So let me act like a new user who is trying to learn Hadoop and Big Data through Cloudera or Hortonworks; I'll analyze these two products from a new user's perspective.

The picture shows the Cloudera QuickStart VM and starting Cloudera Manager.

Cloudera Quickstart VM-1

The first thing that comes to mind is creating a working Hadoop cluster, so let's create one by adding more hosts to this QuickStart VM using Cloudera Manager.

I created another CentOS VM on my desktop.



Open the browser and access Cloudera Manager.

Add the newly created VM to the cluster by adding it as a host in Cloudera Manager.

Here is the output


and the details of the error


So maybe we are getting advanced in terms of adding new features and functionalities, but we are still lacking the basic functionality a new user needs to learn Hadoop. I tried to install the Cloudera Manager agent manually on the datanode host with the command

"sudo yum install cloudera-manager-agent"

and it installed correctly. However, the download of the Cloudera Manager agent packages was quite slow: around 416 MB took almost 30 minutes to download and install, which should be much faster over a 50 Mbps pipe. I doubt there were many simultaneous downloads of the agent, yet the download was still that slow.

cloudera-manager-agent download and install

Now let's see if I can add the host to the Hadoop cluster.


No, it is still not going through, and I'll not try any further: any new user would rather spend time learning the cluster and Hadoop than waste it digging into what's going wrong when adding hosts through Cloudera Manager.

Now let's go and try the same thing with Hortonworks.

The functionality similar to Cloudera Manager is Ambari from Hortonworks, with which one can manage the Hadoop cluster.

Hortonworks Sandbox VM

So let's check the Ambari functionality on the Hortonworks Sandbox. I have logged in to the Sandbox and enabled Ambari.

Now let's log in to Ambari and try to add a host to the Sandbox.

Hortonworks Sandbox, Ambari enabled

I created a new VM, datanode2.localdomain, and tried to add it to the Sandbox as a new cluster host. But before I could add the host, it asked me for an SSH private key.

Ambari: add new host

My comment:

Why should I create an SSH private key and provide it here? I have a VM ready and I need to add it to the cluster; it's simple, just let the tool create the key. This is a simple environment for learning; why should I have to do the basic admin work of creating an SSH private key?

Well, let me create an SSH private key and see if I can succeed.

It immediately came back as failed to add; it seems it didn't connect to the VM at all.

Ambari: add new host, output 1

Let me try without SSH, by manually installing the agent on the host. But is there any way to find out how to install the Hortonworks agent? I had to search Google, because there is no information about it on the page 😦

Ambari: add new host 2

Well, I found the following info on the Hortonworks site:

” Chapter 5. Appendix: Installing Ambari Agents Manually

In some situations you may decide you do not want to have the Ambari Install Wizard install and configure the Agent software on your cluster hosts automatically. In this case you can install the software manually.

Before you begin: on every host in your cluster download the HDP repository as described in Set Up the Bits. ”

My comments:

This tells me I have to install the HDP repository first. Come on, if I have to do everything manually, why should I use your tool at all? I could do everything manually while learning Hadoop anyway.

Should I install the HDP repository now? I don't think I am in the mood to do that, so I'll not proceed further.

So we came to know that both learning VMs lack the basic functionality of creating a Hadoop cluster; maybe they are good only as a one-VM show. This is not good for people who want to learn the admin side of Hadoop. Sorry guys, you are better off creating your own cluster from the beginning by following the documentation from Apache, or from Cloudera and Hortonworks individually.

These VMs are not good for learning Hadoop administration; you are better off creating your own Hadoop cluster manually. To create a cluster, please follow my earlier blogs.

Tomorrow I'll look at the development aspect of these products (VMs); let's see how mature they are and how helpful they are for a Hadoop developer.


High Availability in the OpenStack Cloud

Openstack is based on a modular architectural design where services can be co-resident on a single host or, more commonly, on multiple hosts.

Before discussing HA, let's look briefly at the OpenStack architecture.

Openstack conceptual architecture:


Q: What do we try to achieve with High Availability (HA)?

Ans: We try to minimize two things.

1. System downtime

This occurs when a user-facing service is unavailable beyond a specified maximum amount of time.

2. Data loss

This occurs when a machine goes down abruptly and data is lost, accidentally deleted, or destroyed.

Most high availability systems guarantee protection against system downtime and data loss only in the event of a single failure. However, they are also expected to protect against cascading failures, where a single failure deteriorates into a series of consequential failures.

The main aspect of high availability is the elimination of single points of failure (SPOFs). A SPOF is an individual piece of equipment or software that will cause system downtime or data loss if it fails. In order to eliminate SPOFs, check that mechanisms exist for redundancy of:

  1. Network components, such as switches and routers

  2. Applications and automatic service migration

  3. Storage components

  4. Facility services such as power, air conditioning, and fire protection

OpenStack currently satisfies HA requirements only for its own infrastructure services. That means it does not guarantee HA of the guest VMs, only of its own services.

Now let's try to understand which OpenStack services need to be highly available.

1. Identity Service, i.e. Keystone-API

2. Messaging Service, i.e. RabbitMQ

3. Database Service, i.e. the MySQL service

4. Image Service, i.e. Glance-API

5. Network Service, i.e. Neutron-API

6. Compute Services, i.e. the Nova APIs:

  1. Nova-API

  2. Nova-Conductor

  3. Nova-Scheduler

There are two types of services here: stateless and stateful.

A stateless service is one that provides a response to your request and then requires no further attention. The stateless services are as follows.

Keystone-api :

Keystone is an OpenStack project that provides Identity, Token, Catalog and Policy services for use specifically by projects in the OpenStack family. It implements OpenStack’s Identity API.


Glance-api :

Project Glance in OpenStack is the  Image Service which offers retrieval, storage, and metadata assignment for your images that you want to run in your OpenStack cloud. 


Neutron-api :

Neutron is an OpenStack project to provide “networking as a service” between interface devices (e.g., vNICs) managed by other Openstack services

Nova-api :

The nova-api service is the front end of Compute; it accepts and responds to end-user compute API calls.

Nova-Conductor : Nova-conductor is an RPC server. It is stateless and horizontally scalable, meaning that you can start as many instances on as many servers as you want. Note that most of what nova-conductor does is database operations on behalf of compute nodes. The majority of its APIs are database proxy calls; some are proxy calls to other RPC servers such as nova-api and nova-network.

The client side of the RPC call is inside nova-compute. For example, if there is a need to update certain state of a VM instance, nova-compute, instead of connecting to the database directly, makes an RPC call to nova-conductor, which connects to the database and performs the actual update.

Nova-Scheduler : Handles compute resource scheduling; it is the DRS of the OpenStack environment. People who know DRS (Distributed Resource Scheduler) in a VMware environment can understand its functionality from that analogy.


OpenStack stateful services.

A stateful service is one where subsequent requests to the service depend on the results of the first request. Stateful services are more difficult to manage because a single action typically involves more than one request, so simply providing additional instances and load balancing will not solve the problem.

  • OpenStack Database 
  • Message Queue

The important thing for achieving HA is to make sure these services are redundant and available. Apart from these, some networking services also need to be highly available; how you achieve that for all these services is up to you.
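As an illustration of one common pattern for the stateless API services, a load balancer can be placed in front of two or more identical service instances. A minimal HAProxy fragment for keystone-api might look like this (the host names, IPs, and virtual IP are hypothetical; this is a sketch, not a complete configuration):

```
listen keystone-api
    bind 10.0.0.10:5000
    balance roundrobin
    option tcpka
    server controller1 10.0.0.11:5000 check inter 2000 rise 2 fall 5
    server controller2 10.0.0.12:5000 check inter 2000 rise 2 fall 5
```

Because keystone-api is stateless, any backend can serve any request, so simple round-robin balancing with health checks is enough; the stateful services (database, message queue) need cluster-aware solutions instead.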

In my next blog I will write about some common ways to achieve high availability for these OpenStack services.

Top Threats In The Public Cloud

When a newcomer to the cloud world wants to adopt cloud in his/her environment, the first question that arises in his/her mind is whether to host the IT and software applications in a public cloud or a private cloud.

Today I want to write about some concerns normally raised by those who are in favor of private cloud.

Public Cloud or Private Cloud?

The first thing they say is that the public cloud is not secure because it uses the internet, so all the security issues of the internet apply to the public cloud too. It is a very generic, standard phrase and assumption made by supporters of the private cloud.

Many people and IT departments raise concerns over adopting and moving to the public cloud. Some concerns are definitely valid, but with the advancement of cloud technology and the availability of software tools, these concerns have been addressed by today's public cloud providers to a large extent.

First, let's see what the top threats in the public cloud are. For the definition of cloud, please refer to my previous blog. When that definition is followed to create a public cloud environment, the following threats are bound to appear.

The Cloud Security Alliance (CSA) and the European Network and Information Security Agency (ENISA) outlined the following top threats in the public cloud.

Top Threats In The Public Cloud

1. Loss of Governance

2. Vendor Lock-in

3. Isolation Failure

4. Compliance Risk

5. Management Interface Compromise

6. Data Protection

7. Insecure or Incomplete Data Deletion

8. Malicious Insider

First I will explain what these threats are, and then I will give the solutions adopted by the public cloud service providers.

1. Loss of Governance

By using public cloud infrastructures, the client or user necessarily cedes control to the Public Cloud Provider (CP) on a number of issues which may affect security. At the same time, SLAs may not offer a commitment to provide such services on the part of the cloud provider, thus leaving a gap in security defenses.

Q. In which way does the public cloud service provider address this issue or threat?

A. I will take the example of Amazon Web Services (AWS) to see how they address it.

Amazon Virtual Private Cloud (Amazon VPC)


Amazon Virtual Private Cloud (Amazon VPC) lets you provision a logically isolated section of the Amazon Web Services (AWS) Cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

You can move corporate applications to the cloud, launch additional web servers, or add more compute capacity to your network by connecting your VPC to your corporate network. Because your VPC can be hosted behind your corporate firewall, you can seamlessly move your IT resources into the cloud without changing how your users access these applications.

You can select “VPC with a Private Subnet Only and Hardware VPN Access” from the Amazon VPC console wizard to create a VPC that supports this use case.

So with this solution the issue of "Loss of Governance" is somewhat addressed, if not completely.

2. Vendor Lock-in

There is little on offer in the way of tools, procedures, standard data formats, or service interfaces that could guarantee data, application, and service portability. This can make it difficult for the customer to migrate from one provider to another, or to migrate data and services back to an in-house IT environment.

This introduces a dependency on a particular Cloud Service Provider for service provision, especially if data portability, as the most fundamental aspect, is not enabled.

Q : How does AWS try to resolve the lock-in issue?

Ans : The VM Import/Export tool.

VM Import/Export enables you to easily import virtual machine images from your existing environment to Amazon EC2 instances and export them back to your on-premises environment.


  • This offering allows you to leverage your existing investments in the virtual machines that you have built to meet your IT security, configuration management, and compliance requirements by bringing those virtual machines into Amazon EC2 as ready-to-use instances.
  • You can also export imported instances back to your on-premises virtualization infrastructure, allowing you to deploy workloads across your IT infrastructure.

Q : How does VMware try to address this issue of lock-in?

 Ans : vCloud Connector

 VMware vCloud Connector links your internal private cloud with public clouds, so you can manage them as a single hybrid environment and transfer workloads back and forth.


 VMware vCloud Connector links vSphere-based private and public clouds via a single interface, allowing you to manage them as a single hybrid environment. You can transfer virtual machines (VMs), vApps (collections of VMs with related resource controls and network settings), and templates from one vSphere-based cloud to another. You can also set up a content library to distribute and synchronize templates across clouds and extend a single Layer 2 network from your private data center to a vSphere-based public cloud.

3: Isolation Failure

 Failure of the mechanisms separating storage, memory, routing, and even reputation between different tenants.

  • Multi-tenancy and shared resources are defining characteristics of cloud computing.
  • This risk category covers the failure of mechanisms separating storage, memory, routing and even reputation between different tenants (e.g., so-called guest-hopping attacks).
  • However, it should be considered that attacks on resource isolation mechanisms (e.g., against hypervisors) are still less numerous and much more difficult for an attacker to put into practice compared to attacks on traditional OSs.

Q : How does AWS address this issue?

 Ans : Amazon EC2 Dedicated Instances.

 Dedicated Instances are Amazon EC2 instances launched within your Amazon Virtual Private Cloud (Amazon VPC) that run on hardware dedicated to a single customer. Dedicated Instances let you take full advantage of the benefits of Amazon VPC and the AWS cloud (on-demand elastic provisioning, paying only for what you use, and a private, isolated virtual network) while ensuring that your Amazon EC2 compute instances are isolated at the hardware level.


  • You can easily create a VPC that contains dedicated instances only, providing physical isolation for all Amazon EC2 compute instances launched into that VPC, or you can choose to mix both dedicated instances and non-dedicated instances within the same VPC based on application-specific requirements.


4: Compliance Risk

Investment in achieving certification may be put at risk by migrating to the cloud.

 Investment in achieving certification (e.g., industry standard or regulatory requirements) may be put at risk by migration to the cloud:

  • If the CP cannot provide evidence of their own compliance with the relevant requirements.
  • If the CP does not permit audit by the cloud customer (CC).
  • In certain cases, it also means that using a public cloud infrastructure implies that certain kinds of compliance cannot be achieved.

Q : How does AWS address this compliance risk?

Ans : AWS Risk and Compliance Program

  • AWS provides information about its risk and compliance program to enable customers to incorporate AWS controls into their governance framework.
  • This information can assist customers in documenting a complete control and governance framework with AWS included as an important part of that framework.

5: Management Interface Compromise

Remote Access and web browser vulnerability

  • Customer management interfaces of a public cloud provider are accessible through the Internet and mediate access to larger sets of resources (than traditional hosting providers) and therefore pose an increased risk, especially when combined with remote access and web browser vulnerabilities.

Q : How does AWS address this issue?

Ans : AWS ElasticWolf Client Console

  • AWS ElasticWolf Client Console is a client-side application for managing Amazon Web Services (AWS) cloud resources with an easy-to-use graphical user interface.
  • ElasticWolf is packaged with all necessary tools and utilities to generate and deal with private and public keys and certificates, as well as a Windows ssh client for accessing Linux EC2 instances. In addition, it integrates easily with the AWS command line interface (CLI) tools so that you can use it and the CLI together.
  • Another advantage of ElasticWolf is that it logs all API calls that it makes to your local disk. You can view these logs by selecting File | Show Access Log from the ElasticWolf menu.


6 : Data Protection

 It may be difficult for the cloud customer to effectively check the data handling practices of the cloud provider.

 Cloud computing poses several data protection risks for cloud customers and providers. In some cases, it may be difficult for the cloud customer (in its role as data controller) to effectively check the data handling practices of the cloud provider and thus to be sure that the data is handled in a lawful way.

  • This problem is exacerbated in cases of multiple transfers of data, e.g., between federated clouds.
  • Some cloud providers do provide information on their data handling practices.
  • Some also offer certification summaries on their data processing and data security activities and the data controls they have in place.

 Q : How does AWS handle data protection?

 Ans : The AWS SDK for Java provides an easy-to-use Amazon S3 client.

  • Client-Side Data Encryption

–      In client-side encryption, your client application manages encryption of your data, the encryption keys, and related tools. You can upload data to an Amazon S3 bucket using client-side encryption.

  • Server Side Data Encryption

–      Server-side encryption is data encryption at rest; that is, Amazon S3 encrypts your data as it uploads it and decrypts it for you when you access it.

  • The AWS SDK for Java provides an easy-to-use Amazon S3 client that allows you to securely store your sensitive data in Amazon S3.
  • The SDK automatically encrypts data on the client side when uploading to Amazon S3, and automatically decrypts it for the client when data is retrieved.
  • Your data is stored encrypted in Amazon S3 – no one can decrypt it without your private encryption key.


Also, on the client side, the user can use many available security software products, such as:

 Symantec Data Loss Prevention Enforce Platform.

 Manage Data Loss Prevention policies centrally and deploy globally.

Symantec DLP

Symantec Data Loss Prevention Enforce Platform automatically enforces universal Data Loss Prevention policies with a centralized platform for detection, incident remediation workflow and automation, reporting, system management and security.

  • Data Loss Prevention delivers a unified solution to discover, monitor, and protect confidential data wherever it is stored or used.

7: Insecure or Incomplete Data Deletion

In the case of multiple tenancies and the reuse of hardware resources, this represents a higher risk to the customer than with dedicated hardware.

When a request to delete a cloud resource is made, as with most operating systems, this may not result in true wiping of the data.

  • Adequate or timely data deletion may also be impossible (or undesirable from a customer perspective), either because extra copies of data are stored but are not available, or because the disk to be destroyed also stores data from other clients.
  • In the case of multiple tenancies and the reuse of hardware resources, this represents a higher risk to the customer than with dedicated hardware.

Q : How does AWS address this issue?

Amazon Simple Storage Service (Amazon S3) Security.

  • Amazon S3 APIs provide both bucket- and object-level access controls, with defaults that only permit authenticated access by the bucket and/or object creator. Write and Delete permission is controlled by an Access Control List (ACL) associated with the bucket. Permission to modify the bucket ACLs is itself controlled by an ACL, and it defaults to creator-only access. Therefore, the customer maintains full control over who has access to their data. Amazon S3 access can be granted based on AWS Account ID, DevPay Product ID, or open to everyone.
  • Amazon S3 is accessible via SSL encrypted endpoints. The encrypted endpoints are accessible from both the Internet and from within Amazon EC2, ensuring that data is transferred securely both within AWS and to and from sources outside of AWS.
  • When an object is deleted from Amazon S3, removal of the mapping from the public name to the object starts immediately, and is generally processed across the distributed system within several seconds. Once the mapping is removed, there is no external access to the deleted object. That storage area is then made available only for write operations and the data is overwritten by newly stored data.
  • DragonDisk S3 Client

8: Malicious Insider

The damage which may be caused by a malicious insider is often far greater.

While usually less likely, the damage which may be caused by malicious insiders is often far greater.

  • Cloud architectures necessitate certain roles which are extremely high-risk.
  • Examples include CP system administrators and managed security service providers.

Q : How to address this issue ?

A strategy of data monitoring addresses the gray area left by static access controls.

  • Monitoring provides the capability to forensically identify misuse and proactively detect a data breach when valid permissions are abused.
  • By closely examining user actions, security pros can more effectively determine whether users’ intentions are malicious. Monitoring is how we sniff out suspicious—yet technically approved—access.

* Disclaimer : I have taken some of the words, phrases, and diagrams from different product documents and blogs available on the internet. If any match is found, then they belong to the respective owners; I don't claim they are my own. I just tried to put my views and thoughts together by reproducing some text freely available on the internet.

The solutions I have given to the various threats may not be exact; they are my personal view and judgment, and I am in no way responsible for any outcome of a decision taken to adopt private or public cloud.

-Panchaleswar Nayak

Cloud Computing Explained

The Cloud Approach.

Nowadays everyone is talking about cloud, and it's not a new technology anymore; many organizations have adopted and implemented it successfully. Many who are yet to implement it are thinking seriously of adopting the cloud model. Maybe some think they don't want to be left out in the race of this cloud age, and some have really understood its necessity in today's technological advancements.

Well I have been fortunate enough to work in some of the pioneer companies of this cloud technology.

One of them is Sun Microsystems and another is VMware.

I remember when I joined Sun almost 10 years back, in 2004, people internally used to talk about cloud computing; it was not hyped at that time, but inside Sun we used to talk about cloud technology. Of course Sun was not into x86 virtualization, but they were definitely very much into virtualizing their SPARC processors. With the help of the Solaris 10 OS, Sun could virtualize SPARC, and in the Solaris OS we could create Solaris Containers. They also had LDOMs (Logical Domains), which Oracle has since renamed Oracle VM. I am not going to talk about LDOMs or Solaris Containers here, but I want to point out that virtualization was there before VMware started it; well before Sun, there was virtualization on IBM mainframes.

My point is that virtualization is not a new technology, but the invention of x86 virtualization by VMware was definitely the kick-off point for cloud technology. Ordinary IT people could not get access to the virtualization technology of those IBM or Sun servers, where they could have understood or implemented virtualization; in fact it was too complicated to understand and implement at that time.

But when VMware virtualized the x86 processor, virtualization became a household name among ordinary IT people. I remember when VMware had its Workstation software, where you could run multiple OSes on a single desktop or laptop; it looked like a college student's project made for fun, but suddenly people started realizing the potential of this technology and adopting it to improve the so-called compute resource utilization factor.

People and enterprises started using different virtualization technologies, and suddenly cloud computing became the reality that had been envisioned by many technical people.

Today I am going to write something about the way of implementing this cloud.

Well, before discussing the way it is implemented, let's first understand what exactly cloud computing means.

Cloud Computing Explained :

Here is the definition of cloud computing given by the National Institute of Standards and Technology (NIST).

“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that  can be rapidly provisioned and released with minimal management effort or service provider interaction. “

Let's try to understand this definition. We'll ask ourselves some questions and then try to find the answers in the definition.

Q : What is Cloud Computing?

Ans : It’s a Model

Q : Model For what  ?

Ans : For enabling something.

Q : Enabling what  ?

Ans : For enabling on-demand network access to some types of computing resources.

Q : What is on-demand network access ?

Ans : On-demand means whenever it is desired or required by a user to access and use, and it should be accessed over the network.

In other words, the "configurable computing resources" should be accessible over the network whenever required by a user.

Q : What are all these computing resource types ?

Ans : Ubiquitous, convenient, configurable computing resources.

Q : What is ubiquitous type ?

Ans : The meaning of ubiquitous is present, appearing, or found everywhere.

So here, in the context of cloud computing, it is the type of computing resource that is commonly present, available, and found everywhere; no special type of hardware, network, or computing resource is required for cloud.

Q : What is convenient type ?

Ans : It means fitting in well with a person's needs, activities, and plans.

In the context of cloud and computing resources, it means hardware or computing resources that fit everybody's needs and that can be shared, rather than something meant only for a special requirement that can't fit into normal everyday use.

Q : What is Configurable ?

Ans : The dictionary meaning of "configure" is "to design or adapt to form a specific configuration or for some specific purpose". So "configurable", in the context of cloud and computing resources, means the computing resources used for cloud should be configurable at any point in time, not set once and then put to use.

Q : And what are these computing resources ?

Ans : In the definition it is provided in brackets as (e.g., networks, servers, storage, applications, and services). It means the computing resources can be networks, servers, storage, applications, or any other services. You can see it is not just limited to hardware resources; applications and services are also termed computing resources.

Q : There is a term called "shared pool" of configurable computing resources. What is a shared pool ?

Ans : A shared pool is a pool where things are accumulated or consolidated so that they can be shared and consumed.

Here we are talking about a shared pool of configurable computing resources. Note that it is not just computing resources but "configurable computing resources". So this is a pool of computing resources that can be configured when there is a need and shared among users when required.

Q : The next line of the definition says "….. that can be rapidly provisioned and released with minimal management effort or service provider interaction". What does this mean ?

Ans : Again, let's divide it into two phrases: "rapidly provisioned and released" and "minimal management effort or service provider interaction".

First, let's discuss "rapidly provisioned and released". This is nothing but rapid elasticity.

Any computing resource in the shared pool should be rapidly provisioned and released. Whenever a user needs these computing resources they should be available immediately, or at least with minimal processing and network delay; likewise, the consumed resources should be released immediately whenever the user finishes his work, so that they can be returned to the pool and be available for others to use. This is the property called "rapid elasticity" in cloud: it should be elastic in nature.

The next phrase, "minimal management effort or service provider interaction", is the characteristic of an on-demand self-service portal.

Here the definition says that the shared pool of configurable computing resources should be usable by individuals with minimal management effort and minimal effort from the cloud service provider. So the cloud provider should offer a self-service portal where users can register themselves and then provision the computing resources themselves, rather than having someone from the service provider do it for them.

So from these explanations we find that the cloud model is composed of five essential characteristics.

Essential Characteristics:

On-demand self-service :  A user can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.

Broad network access : Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops, and workstations).

Resource pooling : The service provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, and network bandwidth.

Rapid elasticity : Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.

Measured service :  Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g. storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

In my next blog I will discuss the ways cloud computing can be implemented.

Creating Eclipse Plugin for Hadoop-2.2.0

Today I am going to create the Hadoop plugin for Eclipse.

1. Download and install the Eclipse IDE
2. Download the source code of the Eclipse plugin for Hadoop
3. Create the Eclipse plugin for Hadoop
4. Compile and create the jar
5. Install the plugin into Eclipse

1. First, download and install Eclipse.


2. Then install git on your CentOS virtual machine. Run the following command as root.

#sudo yum install git

If Ant is not installed, download the apache-ant-1.9.3 binary tarball into /usr/local and extract it:

# cd /usr/local

#tar -zxvf apache-ant-1.9.3-bin.tar.gz

Set the PATH in the /etc/profile file, then log out and log back in as the normal hduser.

As a normal hduser run the following command.

3. Download the source code of the Hadoop plugin for Eclipse from git:

$ git clone

4.Compile and create jar

$cd /home/hduser/hadoop2x-eclipse-plugin/src/contrib/eclipse-plugin

$ant jar  -Dversion=2.2.0 -Declipse.home=/usr/local/eclipse -Dhadoop.home=/usr/local/hadoop-2.2.0

The build output can be found at



Learning Hadoop Video Series : Creating Single Node Hadoop 2.2.0 Cluster

Here I am going to write out the commands and show snapshots taken from my virtual machine. Why am I writing out the whole process instead of showing it in a video? So that one can copy and paste the commands and run them in his VM as needed.

First, let's understand and summarize the whole process of creating and testing our Hadoop cluster.

So here are the steps to create a single-node Hadoop cluster on your laptop.

Creating the basic infrastructure for the deployment.

1. Download and install VMware Player.
2. Download and install CentOS as a VM.
3. Download and install Java 6.
4. Create a dedicated hadoop group: hadoop
5. Create a dedicated hadoop user: hduser, and add it to the hadoop group
6. Configure ssh to use public/private keys for authentication.
7. Download Hadoop and extract it to a defined directory.

Setup of OS and Hadoop Software

8. Set up environment variables in /etc/profile
9. Create Hadoop data directories for the namenode and datanode.
10. Configure the Hadoop cluster
11. Format the namenode
12. Start the HDFS and MapReduce processes
13. Verify the installation.

Run Java Application

14. Run a Java Hadoop application on the installed single-node cluster
15. Check the output in the web interface.

Now let me explain the whole process in detail.

1. Download and install VMware Player.

2. Download the CentOS ISO and install it on VMware Player.

For this, please see my previous post.

3.Download and Install Java 6.

We need to install Java. As Java 6 is fully supported by Hadoop, I am going to use it, but one can try Java 7 if needed.

See my previous post on installing Java on CentOS.

4.Create a Dedicated Hadoop group : hadoop

#groupadd hadoop

5. Create a dedicated hadoop user, hduser, and add it to the hadoop group

If creating a new user, then use

# useradd hduser -g hadoop

If changing an existing user, then use

# usermod -g hadoop hduser

6. Configure ssh to use public/private keys for authentication.

#ssh-keygen -t rsa

This will create two files in your (hidden) ~/.ssh directory: id_rsa and id_rsa.pub.
The first, id_rsa, is your private key, and the other, id_rsa.pub, is your public key.

Now set permissions on your private key:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/id_rsa

Copy the public key (id_rsa.pub) to the server and install it in the authorized_keys list:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

and finally set file permissions on the server:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys

The above permissions are required if StrictModes is set to yes in /etc/ssh/sshd_config (the default).

Ensure the correct SELinux contexts are set:

$ restorecon -Rv ~/.ssh
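The key-generation and authorization steps above can be sketched as one non-interactive sequence (an illustration, assuming a single-node setup where the client and the "server" are the same machine; `-N ''` sets an empty passphrase):

```shell
# create ~/.ssh with strict permissions if it does not exist yet
mkdir -p ~/.ssh && chmod 700 ~/.ssh
# generate an RSA key pair non-interactively (skip if one already exists)
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa -q
# authorize the public key for login and tighten file permissions
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/id_rsa ~/.ssh/authorized_keys
```

After this, `ssh localhost` should log in without prompting for a password, which is what the Hadoop start scripts expect.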

7.Download Hadoop and extract it to defined directory.

$ wget

Become user root and do the following.

# mv /home/hduser/hadoop-2.2.0.tar.gz /usr/local

# cd /usr/local

# tar -xvzf hadoop-2.2.0.tar.gz

8.Setup Environment Variables in /etc/profile

The following environment variables have to be set. Set them in /etc/profile so that they will be available to all users.

# vi /etc/profile and add the following lines to it.

export JAVA_HOME=/usr/local/jdk1.6.0_45
export ANT_HOME=/usr/local/apache-ant-1.9.2
export HADOOP_HOME=/usr/local/hadoop-2.2.0

export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

$ sudo chown -R hduser:hadoop /usr/local/hadoop-2.2.0
$ sudo chmod -R 755 /usr/local/hadoop-2.2.0

Now logout of root and relogin to the VM as hduser.

9.Create Hadoop Data Directories for namenode and datanode.

Check whether the environment variables have been set, e.g., with echo $HADOOP_HOME.


Now create two Directories for name node and datanode.

$ mkdir -p $HOME/myhadoop-data/hdfs/namenode
$ mkdir -p $HOME/myhadoop-data/hdfs/datanode

10.Configure Hadoop Cluster

$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following inside the configuration tag.
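For reference, a minimal yarn-site.xml for a single-node Hadoop 2.2.0 setup commonly contains the properties below (a sketch of the usual values; verify the property names against your distribution):

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```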


Now edit the core-site.xml file.

$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following contents inside the configuration tag.
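For reference, a minimal core-site.xml for a single-node setup points the default filesystem at a local HDFS; port 9000 is the conventional choice, an assumption here:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```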


Now edit the hdfs-site.xml file

$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following contents inside the configuration tag.
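For reference, a minimal hdfs-site.xml for a single node sets the replication factor to 1 and points at the data directories created in step 9 (paths below assume the hduser home directory; a sketch, adjust to your layout):

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/myhadoop-data/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/myhadoop-data/hdfs/datanode</value>
</property>
```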




Now edit mapred-site.xml file

$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following contents inside the configuration tag.
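For reference, a minimal mapred-site.xml just tells MapReduce to run on YARN (note: a fresh 2.2.0 install ships only mapred-site.xml.template, which you may need to copy to mapred-site.xml first):

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```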

11.Format a new Hadoop distributed filesystem:

$HADOOP_HOME/bin/hdfs namenode -format <cluster_name>

# Command for formatting Name node.

$ /usr/local/hadoop-2.2.0/bin/hdfs namenode -format

12.To start a Hadoop cluster you will need to start both the HDFS and YARN cluster.

Start HDFS with the following command, run on the designated NameNode:

$ $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode

Run the script to start DataNodes on all slaves (on a single node, the same machine):

$ $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode

Then start YARN (ResourceManager, NodeManager) and the MapReduce Job History Server. First the ResourceManager, run on the designated ResourceManager host:

$ $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager

Run the script to start NodeManagers on all slaves:

$ $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager

Start the MapReduce JobHistory Server with the following command, run on the designated server:

$ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

13. Verifying Installation

$ jps

If everything started correctly, jps should list NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer, along with Jps itself.


14.Running Java Hadoop Application on the installed Single node cluster

$ mkdir input
$ cat > input/textfile

Type the following contents and press Ctrl-D to finish:

word count example using hadoop 2.2.0.
Here we count the number of words this file has.
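The same input file can also be created non-interactively with a here-document, and checked locally with wc before submitting the job (a sketch; wc -w prints the total number of words, a quick sanity check on the input):

```shell
# create the input directory and file without an interactive cat session
mkdir -p input
cat > input/textfile <<'EOF'
word count example using hadoop 2.2.0.
Here we count the number of words this file has.
EOF
# local sanity check: total number of words in the input (prints 16)
wc -w < input/textfile
```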

Add the input directory to HDFS:

$ $HADOOP_HOME/bin/hdfs dfs -copyFromLocal input /input

Run the wordcount example jar provided in HADOOP_HOME:

$ $HADOOP_HOME/bin/hadoop jar /usr/local/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output

15. Check the output in the web interface

Browse the HDFS /output folder in the NameNode web UI (by default at http://localhost:50070).