Cloudera’s Quickstart VM vs Hortonworks Sandbox Comparison - Dec 2014

Last year I wrote a comparative study of the two big Hadoop distributions, Cloudera and Hortonworks, through their learning products: the Quick Start VM from Cloudera and the Sandbox from Hortonworks.

Now let's do the same thing again. After one year, let's see what has changed and what the differences and similarities are between these two products and Hadoop distros.

As before, I will act like a new user who is trying to learn Hadoop and Big Data through Cloudera or Hortonworks, and analyze these two products from that new user's perspective.

Here is a screenshot of the Cloudera Quickstart VM with Cloudera Manager started.

Cloudera Quickstart VM-1

Well, the first thing that comes to my mind is creating a working Hadoop cluster. So let's create one by adding more hosts to this Quickstart VM using Cloudera Manager.

I created another CentOS VM on my desktop.

datanode1-vm

 

Open the browser and access Cloudera Manager.

Add the newly created VM to the cluster by adding it as a host in Cloudera Manager.

Here is the output

addhost-output

and the details of the error

addhost-error-details

So maybe we are getting advanced in terms of adding new features and functionality, but we are still lacking on the basics that a new user needs to learn Hadoop. I tried to install the cloudera-manager-agent manually on the datanode host with the command

“sudo yum install cloudera-manager-agent”, and it installed correctly. The download of the Cloudera Manager agent packages (around 416 MB) was quite slow; on a 50 Mbps pipe it should be pretty fast, and there cannot be many simultaneous downloads of the agent, yet I had to wait almost 30 minutes for the packages to download and install.

cloudera-manager-agent download and install
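For reference, here is roughly what the manual agent setup looks like on the datanode VM (a sketch, assuming the Cloudera Manager 5.x yum repository is already configured on the host; the server hostname is specific to my setup):

$sudo yum install cloudera-manager-agent cloudera-manager-daemons
$sudo vi /etc/cloudera-scm-agent/config.ini    # set server_host= to the Quickstart VM running Cloudera Manager
$sudo service cloudera-scm-agent start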

Now let's see if I can still add the host to the Hadoop cluster.

addhost-output

No, it is still not going through. Well, I will not try any more; any new user would rather spend time learning the cluster and Hadoop than waste it digging into what is going wrong when adding hosts through Cloudera Manager.

Now let's go and try the same thing with Hortonworks.

The equivalent of Cloudera Manager on the Hortonworks side is Ambari, with which one can manage the Hadoop cluster.

Hortonworks Sandbox VM

So let's check the Ambari functionality on the Hortonworks Sandbox. I have logged in to the Sandbox and enabled Ambari.

Now let's log in to Ambari and try to add a host to the Sandbox.

Hortonworks Sandbox Ambari enabled

I created a new VM, i.e. datanode2.localdomain, and tried to add it to the Sandbox as a new host in the cluster. Well, before I could add the host it asked me to provide an SSH private key.

Ambari add-newhost

My comment :

Why should I create a private SSH key and provide it here? I have a VM ready and I need to add it to the cluster; simple, let the tool create the key. This is supposed to be a simple environment for learning, so why do I need to do the basic admin work of creating a private SSH key?

Well, let me create a private SSH key and see if I can succeed.
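The key setup itself is only a couple of commands (a sketch; datanode2.localdomain is the new VM I created, and Ambari wants the private key of the user it will SSH in as, here root):

$ssh-keygen -t rsa
$ssh-copy-id root@datanode2.localdomain
$cat ~/.ssh/id_rsa    # paste this private key into the Ambari add-host wizard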

It immediately returned as failed to add. It seems it didn't connect to the VM at all.

Ambari add-newhost-output-1

Let me try without SSH by manually installing the agent on the host. But is there any way to know how to do the Hortonworks agent installation? I have to search on Google because there is no info about it on the page 😦.

Ambari add-newhost-2

Well, I found the following info on the Hortonworks site:

” Chapter 5. Appendix: Installing Ambari Agents Manually

In some situations you may decide you do not want to have the Ambari Install Wizard install and configure the Agent software on your cluster hosts automatically. In this case you can install the software manually.

Before you begin: on every host in your cluster download the HDP repository as described in Set Up the Bits. ”

My Comments :

This tells me I have to install the HDP repository first. Come on, do I have to install everything manually? Then why should I use your tool at all? I can do everything manually while learning Hadoop.

Should I install the HDP repository now? I don't think I am in the mood to do that, so I will not proceed further on this.
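For anyone who does want to go down that road, the manual agent setup is roughly the following (a sketch, assuming the Ambari/HDP yum repository is already set up on the new host and the Ambari server is the Sandbox):

$sudo yum install ambari-agent
$sudo vi /etc/ambari-agent/conf/ambari-agent.ini    # set hostname= to the Ambari server, i.e. the Sandbox
$sudo ambari-agent start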

So we have learned that both learning VMs lack the basic functionality of growing into a Hadoop cluster. Maybe they are good as a one-VM show, but that is no good for people who want to learn the admin side of a Hadoop cluster. Sorry guys, you are better off creating your own cluster from scratch by following the documentation provided by Apache, or individually by Cloudera and Hortonworks.

These VMs are no good for learning Hadoop administration; you are better off creating your own Hadoop cluster manually. To create a cluster, please follow my earlier blogs.

Tomorrow I will look at the development aspect of these products (VMs).

Let's see how mature they are and how helpful they are for a Hadoop developer.

Cheers

Top Threats In The Public Cloud

When a newcomer to the cloud world wants to adopt the cloud in their environment, the first question that arises is whether they should host their IT and software applications in a public cloud or a private cloud.

Today in my blog I want to write about some concerns normally raised by people who favor the private cloud.

Public Cloud or Private Cloud?

The first thing they say is that the public cloud is not secure because it uses the internet, so all the security issues of the internet apply to the public cloud too. It is a very generic, standard phrase and assumption made by supporters of the private cloud.

Many people and IT departments raise concerns over adopting and moving to the public cloud. Some concerns are definitely valid, but with the advancement of cloud technology and the availability of software tools, today's public cloud providers address these concerns to a large extent.

First, let's see what the top threats in the public cloud are. For the definition of cloud, please refer to my previous blog; when that definition is followed to create a public cloud environment, the following threats are bound to appear.

The Cloud Security Alliance (CSA) and the European Network and Information Security Agency (ENISA) outlined the following top threats in the public cloud:

Top Threats In The Public Cloud

1. Loss of Governance

2. Vendor Lock-in

3. Isolation Failure

4. Compliance Risk

5. Management Interface Compromise

6. Data Protection

7. Insecure or Incomplete Data Deletion

8. Malicious Insider

First I will try to explain what these threats are, and then I will give the solutions adopted by public cloud service providers.

1. Loss of Governance

By using public cloud infrastructures, the client or user necessarily cedes control to the Public Cloud Provider (CP) on a number of issues which may affect security. At the same time, SLAs may not offer a commitment to provide such services on the part of the cloud provider, thus leaving a gap in security defenses.

Q: How does the public cloud service provider address this issue or threat?

A: I will take the example of Amazon Web Services (AWS) to see how they address the issue.

Amazon Virtual Private Cloud (Amazon VPC)

Branch_Offices

Amazon Virtual Private Cloud (Amazon VPC) lets you provision a logically isolated section of the Amazon Web Services (AWS) Cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

You can move corporate applications to the cloud, launch additional web servers, or add more compute capacity to your network by connecting your VPC to your corporate network. Because your VPC can be hosted behind your corporate firewall, you can seamlessly move your IT resources into the cloud without changing how your users access these applications.

You can select “VPC with a Private Subnet Only and Hardware VPN Access” from the Amazon VPC console wizard to create a VPC that supports this use case.
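The console wizard is the easy route, but the same setup can be sketched with the AWS CLI (the CIDR blocks, office IP and ASN below are placeholders, and the resource IDs returned by each call have to be substituted into the later commands):

$aws ec2 create-vpc --cidr-block 10.0.0.0/16
$aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.0.0/24
$aws ec2 create-vpn-gateway --type ipsec.1
$aws ec2 attach-vpn-gateway --vpn-gateway-id vgw-xxxxxxxx --vpc-id vpc-xxxxxxxx
$aws ec2 create-customer-gateway --type ipsec.1 --public-ip 203.0.113.12 --bgp-asn 65000
$aws ec2 create-vpn-connection --type ipsec.1 --customer-gateway-id cgw-xxxxxxxx --vpn-gateway-id vgw-xxxxxxxx --options StaticRoutesOnly=true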

So with this solution the issue of “Loss of Governance” is somewhat, if not completely, addressed.

2. Vendor Lock-in

There is little on offer in the way of tools, procedures, standard data formats or service interfaces that could guarantee data, application and service portability. This can make it difficult for the customer to migrate from one provider to another, or to migrate data and services back to an in-house IT environment.

This introduces a dependency on a particular Cloud Service Provider for service provision, especially if data portability, as the most fundamental aspect, is not enabled.

Q: How does AWS try to resolve the lock-in issue?

Ans: The VM Import/Export tool.

 VM Import/Export enables you to easily import virtual machine images from your existing environment to Amazon EC2 instances and export them back to your on-premise environment.

vsphere_vm_import_1

  • This offering allows you to leverage your existing investments in the virtual machines that you have built to meet your IT security, configuration management, and compliance requirements by bringing those virtual machines into Amazon EC2 as ready-to-use instances.
  • You can also export imported instances back to your on-premise virtualization infrastructure, allowing you to deploy workloads across your IT infrastructure.
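As a quick sketch, the import and export can also be driven from the aws CLI (the bucket, VMDK file name and instance ID are placeholders, and the import additionally needs the vmimport service role set up):

$aws ec2 import-image --description "My imported VM" --disk-containers "Format=vmdk,UserBucket={S3Bucket=my-import-bucket,S3Key=myvm.vmdk}"
$aws ec2 create-instance-export-task --instance-id i-xxxxxxxx --target-environment vmware --export-to-s3-task "DiskImageFormat=VMDK,ContainerFormat=ova,S3Bucket=my-export-bucket"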

Q: How does VMware try to address this lock-in issue?

Ans: vCloud Connector.

 VMware vCloud Connector links your internal private cloud with public clouds, so you can manage them as a single hybrid environment and transfer workloads back and forth.

vmw-dgrm-vcloud-connector-one-cloud-lg

 VMware vCloud Connector links vSphere-based private and public clouds via a single interface, allowing you to manage them as a single hybrid environment. You can transfer virtual machines (VMs), vApps (collections of VMs with related resource controls and network settings), and templates from one vSphere-based cloud to another. You can also set up a content library to distribute and synchronize templates across clouds and extend a single Layer 2 network from your private data center to a vSphere-based public cloud.

3. Isolation Failure

Failure of the mechanisms separating storage, memory, routing and even reputation between different tenants.

  • Multi-tenancy and shared resources are defining characteristics of cloud computing.
  • This risk category covers the failure of mechanisms separating storage, memory, routing and even reputation between different tenants (e.g., so-called guest-hopping attacks).
  • However, it should be considered that attacks on resource isolation mechanisms (e.g., against hypervisors) are still less numerous and much more difficult for an attacker to put into practice compared to attacks on traditional OSs.

Q: How does AWS address this issue?

Ans: Amazon EC2 Dedicated Instances.

Dedicated Instances are Amazon EC2 instances launched within your Amazon Virtual Private Cloud (Amazon VPC) that run on hardware dedicated to a single customer. Dedicated Instances let you take full advantage of the benefits of Amazon VPC and the AWS cloud – on-demand elastic provisioning, pay only for what you use, and a private, isolated virtual network, all while ensuring that your Amazon EC2 compute instances will be isolated at the hardware level.

CreateVPC

  • You can easily create a VPC that contains dedicated instances only, providing physical isolation for all Amazon EC2 compute instances launched into that VPC, or you can choose to mix both dedicated instances and non-dedicated instances within the same VPC based on application-specific requirements.

dedicatedVPC
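As a sketch, the same can be done from the AWS CLI, either by making the whole VPC dedicated or per instance (the CIDR, AMI and subnet values are placeholders):

$aws ec2 create-vpc --cidr-block 10.0.0.0/16 --instance-tenancy dedicated
$aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m3.medium --subnet-id subnet-xxxxxxxx --placement Tenancy=dedicated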

4. Compliance Risk

Investment in achieving certification may be put at risk by migrating to the cloud.

 Investment in achieving certification (e.g., industry standard or regulatory requirements) may be put at risk by migration to the cloud:

  • If the CP cannot provide evidence of their own compliance with the relevant requirements .
  • If the CP does not permit audit by the cloud customer (CC).
  • In certain cases, it also means that using a public cloud infrastructure implies that certain kinds of compliance cannot be achieved.

Q: How does AWS address this compliance risk?

Ans: The AWS Risk and Compliance Program.

  • AWS provides information about its risk and compliance program to enable customers to incorporate AWS controls into their governance framework.
  • This information can assist customers in documenting a complete control and governance framework with AWS included as an important part of that framework.

5. Management Interface Compromise

Remote access and web browser vulnerabilities.

  • Customer management interfaces of a public cloud provider are accessible through the Internet and mediate access to larger sets of resources (than traditional hosting providers) and therefore pose an increased risk, especially when combined with remote access and web browser vulnerabilities.

Q: How does AWS address this issue?

Ans: The AWS ElasticWolf Client Console.

  • AWS ElasticWolf Client Console is a client-side application for managing Amazon Web Services (AWS) cloud resources with an easy-to-use graphical user interface
  • ElasticWolf is packaged with all necessary tools and utilities to generate and deal with private and public keys and certificates, as well as a Windows ssh client for accessing Linux EC2 instances. In addition, it integrates easily with the AWS command line interface (CLI) tools so that you can use it and the CLI together.
  • Another advantage of ElasticWolf is that it logs all API calls that it makes to your local disk. You can view these logs by selecting File | Show Access Log from the ElasticWolf menu.

ElasticWolf

6. Data Protection

It may be difficult for the cloud customer to effectively check the data handling practices of the cloud provider.

 Cloud computing poses several data protection risks for cloud customers and providers. In some cases, it may be difficult for the cloud customer (in its role as data controller) to effectively check the data handling practices of the cloud provider and thus to be sure that the data is handled in a lawful way.

  • This problem is exacerbated in cases of multiple transfers of data, e.g., between federated clouds.
  • Some cloud providers do provide information on their data handling practices.
  • Some also offer certification summaries on their data processing and data security activities and the data controls they have in place.

Q: How does AWS handle data protection?

Ans: The AWS SDK for Java provides an easy-to-use Amazon S3 client.

Client-Side Data Encryption

– In client-side encryption, your client application manages encryption of your data, the encryption keys, and related tools. You can upload data to an Amazon S3 bucket using client-side encryption.

Server-Side Data Encryption

– Server-side encryption is data encryption at rest; that is, Amazon S3 encrypts your data as it uploads it and decrypts it for you when you access it.

  • The AWS SDK for Java provides an easy-to-use Amazon S3 client that allows you to securely store your sensitive data in Amazon S3.
  • The SDK automatically encrypts data on the client side when uploading to Amazon S3, and automatically decrypts it for the client when data is retrieved.
  • Your data is stored encrypted in Amazon S3 – no one can decrypt it without your private encryption key.

aws_java_sdk
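The client-side flow above lives in your application code via the SDK; the server-side option can also be exercised directly, for example from the AWS CLI (a sketch; the bucket and key names are placeholders):

$aws s3api put-object --bucket my-secure-bucket --key report.csv --body report.csv --server-side-encryption AES256
$aws s3api head-object --bucket my-secure-bucket --key report.csv    # the response should show "ServerSideEncryption": "AES256"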

Also, at the client side, the user can use many available security software products, such as

 Symantec Data Loss Prevention Enforce Platform.

 Manage Data Loss Prevention policies centrally and deploy globally.

Symantec DLP

Symantec Data Loss Prevention Enforce Platform automatically enforces universal Data Loss Prevention policies with a centralized platform for detection, incident remediation workflow and automation, reporting, system management and security.

  • Data Loss Prevention delivers a unified solution to discover, monitor, and protect confidential data wherever it is stored or used.

7. Insecure or Incomplete Data Deletion

In the case of multiple tenancies and the reuse of hardware resources, this represents a higher risk to the customer than with dedicated hardware.

When a request to delete a cloud resource is made, as with most operating systems, this may not result in true wiping of the data.

  • Adequate or timely data deletion may also be impossible (or undesirable from a customer perspective), either because extra copies of data are stored but are not available, or because the disk to be destroyed also stores data from other clients.
  • In the case of multiple tenancies and the reuse of hardware resources, this represents a higher risk to the customer than with dedicated hardware.

Q: How does AWS address this issue?

Ans: Amazon Simple Storage Service (Amazon S3) security.

  • Amazon S3 APIs provide both bucket- and object-level access controls, with defaults that only permit authenticated access by the bucket and/or object creator. Write and Delete permission is controlled by an Access Control List (ACL) associated with the bucket. Permission to modify the bucket ACLs is itself controlled by an ACL, and it defaults to creator-only access. Therefore, the customer maintains full control over who has access to their data. Amazon S3 access can be granted based on AWS Account ID, DevPay Product ID, or open to everyone.
  • Amazon S3 is accessible via SSL encrypted endpoints. The encrypted endpoints are accessible from both the Internet and from within Amazon EC2, ensuring that data is transferred securely both within AWS and to and from sources outside of AWS.
  • When an object is deleted from Amazon S3, removal of the mapping from the public name to the object starts immediately, and is generally processed across the distributed system within several seconds. Once the mapping is removed, there is no external access to the deleted object. That storage area is then made available only for write operations and the data is overwritten by newly stored data.
DragonDiskS3Client
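To illustrate the bucket- and object-level controls, here is a small AWS CLI sketch (the bucket, key and grantee canonical user ID are placeholders):

$aws s3api put-bucket-acl --bucket my-secure-bucket --acl private
$aws s3api put-object-acl --bucket my-secure-bucket --key report.csv --grant-read id=CANONICAL-USER-ID
$aws s3api delete-object --bucket my-secure-bucket --key report.csv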

8. Malicious Insider

The damage which may be caused by a malicious insider is often far greater.

While usually less likely, the damage which may be caused by malicious insiders is often far greater.

  • Cloud architectures necessitate certain roles which are extremely high-risk.
  • Examples include CP system administrators and managed security service providers.

Q: How can this issue be addressed?

Ans: A strategy of data monitoring, which addresses the gray area left by static access controls.

  • Monitoring provides the capability to forensically identify misuse and proactively detect a data breach when valid permissions are abused.
  • By closely examining user actions, security pros can more effectively determine whether users’ intentions are malicious. Monitoring is how we sniff out suspicious—yet technically approved—access.

* Disclaimer: I have taken some of the words, phrases and diagrams from different product documents and blogs available on the internet. If you find any match, they belong to the respective owners; I do not claim they are my own. I have just tried to put across my views and thoughts by reproducing some texts freely available on the internet.

The solutions I have given to the various threats may not be exhaustive; they are my personal view and judgment, and I am in no way responsible for any outcome of a decision to adopt a private or public cloud.

-Panchaleswar Nayak

OpenStack Cloud Platform Installation

Today I am going to write about the OpenStack Implementation on a Desktop.

I have seen many earlier blogs and posts on OpenStack implementation, but I did not find any comprehensive details about how I could implement OpenStack on my own. Of course there is some official documentation on the OpenStack Getting Started page, but it is still not clear how to start learning on a desktop.

So I started my own implementation on my desktop. Here is my desktop configuration.

I have an i7 processor (4 cores and 8 threads), 16 GB of RAM, Windows 7 as the OS, and VMware Workstation installed on it. Of course you do not need such a high configuration to install OpenStack,

but if you want to implement a multi-node OpenStack configuration then having such a configuration will probably help.

I will try to make it as simple and as detailed as possible so that a new learner can implement it.

First I will create the DevStack implementation, which is a simple way of implementing and learning the different aspects of OpenStack.

Here is the complete architecture of the single-node OpenStack implementation on a desktop.

Desktop-Arch

1.1 Create a Virtual Machine on your Desktop.

Create a new VM on VMware Workstation with 1 vCPU and 2 GB RAM; I assume you know how to create a VM on Workstation. Make sure that while creating the VM you enable the “Virtualize Intel VT-x/EPT or AMD-V/RVI” option for the processor. Also, do not select the OS while creating the VM, so that you can install it after the VM is created.

CreateVM

1.2 Make sure you have a network connection so that you can reach the internet. Set the network adapter to “Bridged” so that you can use your existing router network and access the internet. In my case my router network is 192.168.1.0.

network

1.3 Download the minimal Ubuntu OS from https://help.ubuntu.com/community/Installation/MinimalCD; download the 64-bit Ubuntu 12.04 “Precise Pangolin” Minimal CD. Now install it on the newly created VM; it will download and install all the minimum packages required for the 64-bit Ubuntu 12.04 Linux OS.

2. VM Network and DevStack Install Local Configuration

Set the network configuration by editing the file /etc/network/interfaces as follows. Make sure you set dns-nameservers 8.8.8.8, or you may face problems during the DevStack installation. In particular, your DevStack installation will hang at the following message:

———————————–

Requirement already satisfied (use --upgrade to upgrade): prettytable in /usr/lib/python2.7/dist-packages (from -r python_keystoneclient.egg-info/requires.txt

Downloading/unpacking httplib2>=0.7 (from -r python_keystoneclient.egg-info/requires.txt (line 1))

————————————-

auto eth0
iface eth0 inet static
address 192.168.1.110
netmask 255.255.255.0
network 192.168.1.0
gateway 192.168.1.1
dns-nameservers 8.8.8.8

auto eth1
iface eth1 inet manual
up ifconfig $IFACE 0.0.0.0 up
up ip link set $IFACE promisc on
down ip link set $IFACE promisc off
down ifconfig $IFACE down

Create a localrc file in your home directory and edit it as follows

localrc
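In case the screenshot is hard to read, here is a minimal example of what the localrc can look like (the values are illustrative; HOST_IP must match your VM's address, and stack.sh reads localrc from the devstack directory, so make sure it ends up there before you run it):

HOST_IP=192.168.1.110
FLAT_INTERFACE=eth1
ADMIN_PASSWORD=password
MYSQL_PASSWORD=password
RABBIT_PASSWORD=password
SERVICE_PASSWORD=password
SERVICE_TOKEN=tokentoken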

Download and Install Devstack as follows.

$sudo apt-get install git -y
$git clone git://github.com/openstack-dev/devstack.git

Start the installation:

$cd devstack
$./stack.sh

This will install and configure the complete DevStack.

Now you can access the DevStack console from a web browser at

http://192.168.1.110/ (my DevStack IP address is 192.168.1.110). You can access this from your host desktop. Log in with admin/password.

devstack-login
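You can also sanity-check the installation from the VM's shell (a sketch; the openrc credentials file is created by stack.sh inside the devstack directory):

$cd devstack
$source openrc admin admin
$nova list
$glance image-list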

Tomorrow I will write about how to install and configure multi-node OpenStack.