Big Data is a term that is making waves around the world. Just as cloud computing is changing the way computing is done, people say Big Data is going to change the way business is done today.
One of the Big Data technologies, Hadoop, is now well accepted for managing and processing this so-called Big Data. I am not going to discuss Big Data or Hadoop in depth here; instead, I am going to compare some learning tools for Big Data, Hadoop, and related technologies.
As far as I know, there are two startup companies (there may be more) doing a lot of development on Hadoop technology: http://www.cloudera.com and http://hortonworks.com/. These two companies develop many Hadoop-related software tools to make Hadoop easier to use and to build applications on, and they write much of the Hadoop code and donate it to the Hadoop open source project.
Both companies provide tools that can be downloaded and used for Hadoop educational purposes.
From Cloudera it is the “Cloudera QuickStart VM”, and from Hortonworks it is the “Hortonworks Sandbox”. These tools are virtual machines in which Hadoop is installed and configured along with the tools these companies provide and support; they can be downloaded and run on any of your preferred hypervisors.
Though I am not new to these tools or to Hadoop, I will pretend here that I am, and write up my experience for others to follow or comment on, because this post is intended for learning Hadoop from the basics.
What I am going to do is set up a running environment for these VMs, then download and run them on my laptop to get a quick view of these tools and learn Hadoop and Big Data.
My Setup Details
My laptop runs Windows 7 on an Intel i5 processor with 8 GB of RAM.
I downloaded VMware Player, which is free for non-commercial use, and installed it on my laptop. I then downloaded the Cloudera QuickStart VM 4.4.0-1 and the Hortonworks Sandbox 2.0 and extracted the files to two different folders. Then I simply double-clicked the file with the .vmx extension in each folder to start these VMs in VMware Player.
Here are my observations of these fast Hadoop-learning VMs from two major companies that develop and donate code to the Hadoop open source community.
Cloudera’s QuickStart VM vs Hortonworks Sandbox
I am not going to compare which one is good or bad; this is a “who offers what” comparison from a new learner's perspective.
OS: Both are based on CentOS.
Cloudera: full desktop GUI with Eclipse pre-installed.
As soon as I started the Cloudera VM, Firefox opened with a page telling me what is available as a Hadoop user and what is available as an administrator.
With just one click I can open Cloudera Manager or the Hadoop user front end, the Hue GUI.
The browser also has bookmarks for the UIs of various roles and services such as MapReduce JobTracker, HDFS NameNode, HBase Master, Cloudera Manager, Hue, and Solr.
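For orientation, those bookmarks map to the well-known default web-UI ports of that generation of Hadoop/CDH. Here is a small sketch that assembles the URLs; the port numbers are the usual defaults and may differ on your VM, and the IP address is a placeholder:

```python
# Typical default web-UI ports for that era's Hadoop/CDH services.
# Your VM's bookmarks may use different ports; treat these as assumptions.
DEFAULT_UI_PORTS = {
    "HDFS NameNode": 50070,
    "MapReduce JobTracker": 50030,
    "HBase Master": 60010,
    "Hue": 8888,
    "Cloudera Manager": 7180,
    "Solr": 8983,
}

def ui_url(service, host="localhost"):
    """Build the web-UI URL for a service from its default port."""
    return "http://%s:%d/" % (host, DEFAULT_UI_PORTS[service])

# 192.168.56.101 is a placeholder; substitute your VM's actual IP.
for name in sorted(DEFAULT_UI_PORTS):
    print(name, "->", ui_url(name, "192.168.56.101"))
```

This is handy when the in-VM browser is sluggish and you would rather open the same UIs from the host machine.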
As a Java developer, I have seen that many developers around the world use a Windows-based desktop or laptop as their development environment, so they are used to a GUI rather than a text-mode command-line interface. Linux and Unix developers may prefer the command line.
I clicked the Use Hadoop link, but no username and password were given for login. I clicked back to return to the page and then clicked Administer Hadoop; here too, no username and password were provided. I did not know how to log in, so I had to search the documentation or Google for the credentials.
Well, this is a learning VM, and I wish I could have logged in right away with a given username and password. I searched on Google and found that the username and password are cloudera and cloudera.
The first page says “Potential misconfiguration detected. Fix and restart Hue.”
I don't know yet what configuration this VM needs; I am a Hadoop developer, not an expert in Hadoop VM configuration, so I will just ignore it and try to move forward. I feel this misconfiguration could have been fixed by Cloudera before releasing the VM.
Hortonworks: text mode only, i.e., you can log in only at the text console.
All other user and admin work has to be done from another computer, or from my base Windows 7 OS, by pointing the browser to the IP address of the VM. After opening the browser and pointing it to this VM, I found the following page.
When I clicked the Go To Sandbox link, I found the following page.
It is quite obvious why I got this page. The link points to localhost, i.e., 127.0.0.1:8000; in this case that is not the VM but the machine where the browser is running, so the page will not open. Instead, you should point the browser at the VM's IP address rather than 127.0.0.1.
I changed 127.0.0.1 to the VM's IP address, which is where the link should point, and then I found the following page.
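The fix amounts to a tiny URL rewrite. Here is a minimal sketch, assuming the VM's IP is 192.168.56.101 (yours will be whatever the Sandbox console reports):

```python
from urllib.parse import urlsplit, urlunsplit

def retarget(url, vm_ip):
    """Replace the host in a Sandbox link (127.0.0.1) with the VM's IP,
    keeping the port, path, and query intact."""
    parts = urlsplit(url)
    port = ":%d" % parts.port if parts.port else ""
    return urlunsplit((parts.scheme, vm_ip + port, parts.path,
                       parts.query, parts.fragment))

# 192.168.56.101 is a placeholder IP; use the one shown on your VM's console.
print(retarget("http://127.0.0.1:8000/about/", "192.168.56.101"))
# -> http://192.168.56.101:8000/about/
```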
Here I found the username and password given for login: hue and 1111.
So I was good to go: I logged in right away as a Hadoop user, and I found a good number of Hadoop examples to learn Hadoop concepts from.
I clicked back to go to the homepage and then I clicked on the Start tutorial button.
Let's look at the contents of the homepages and what you see after logging in as a user.
As a Hadoop User or Developer
It is a Hue application front end with the following extra Cloudera-developed applications.
Cloudera Impala: Real-Time Queries in Apache Hadoop
With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.
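To give a feel for the query shapes Impala accepts (the SELECT, JOIN, and aggregate functions mentioned above), here is a sketch that uses Python's built-in sqlite3 purely as a stand-in; on the VM the same style of query would run over tables backed by HDFS or HBase, and the table names here are made up:

```python
import sqlite3

# sqlite3 stands in here only to illustrate the SQL shapes Impala supports;
# Impala itself runs these over HDFS/HBase-backed tables, not SQLite files.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# SELECT + JOIN + aggregate, the combination called out above.
rows = con.execute("""
    SELECT c.name, COUNT(*) AS n, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('alice', 2, 40.0), ('bob', 1, 40.0)]
```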
When I tried to access Impala from outside the Cloudera VM, I got an error page because it connects to localhost only, so I had to go into the VM and log in to the Hue page to reach the Impala page.
I allocated 4 GB of RAM to the VM, but when I clicked the Impala Query UI it just hung there. It still had not come up after 5 minutes of waiting; the VM became unresponsive for a while, and it is still not up after 10 minutes.
So I can't show you the Impala GUI now; maybe later, when I get it working.
Sqoop: Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
1. Imports individual tables or entire databases to files in HDFS
2. Generates Java classes to allow you to interact with your imported data
3. Provides the ability to import from SQL databases straight into your Hive data warehouse
4. After setting up an import job in Sqoop, you can get started working with SQL database-backed data from your Hadoop MapReduce cluster in minutes.
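The table import in capability 1 boils down to a command line like the one assembled below. This is only a sketch: the JDBC URL, username, table, and target directory are hypothetical, and the command is built but not executed here.

```python
# Hypothetical sketch of the command line behind Sqoop's table import
# (capability 1 above). The JDBC URL, user, table, and target directory
# are made up; the list is assembled for inspection, not run.
def sqoop_import_cmd(jdbc_url, user, table, target_dir):
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC connection string to the SQL DB
        "--username", user,
        "--table", table,            # the table to pull into HDFS
        "--target-dir", target_dir,  # files land in HDFS under this path
    ]

cmd = sqoop_import_cmd("jdbc:mysql://dbhost/sales", "hue",
                       "orders", "/user/hue/orders")
print(" ".join(cmd))
```

On the VM you would run the printed command in a shell (or pass the list to a subprocess) once the source database is reachable.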
Solr Search: Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world’s largest internet sites.
Hbase Browser : The Web UI for HBase – HBase Browser
Hue provides a tool to visualize HBase data: the HBase Browser. Given HBase's structure, it is difficult to explore, understand, and search data; Hue's smart view enables intelligent browsing of the data in HBase. It also makes adding, removing, and mutating cells easier. Users can limit the number of rows retrieved and choose which column families to show.
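To picture what the browser is navigating: HBase stores cells under a row key and a column family:qualifier pair. Here is a toy in-memory sketch of that layout and of the "limit rows / pick column families" operations; all names are made up:

```python
# A toy in-memory picture of what HBase Browser navigates:
# row key -> "family:qualifier" -> cell value. Names are made up.
table = {
    "user1": {"info:name": "alice", "info:email": "a@example.com"},
    "user2": {"info:name": "bob",   "stats:logins": "17"},
}

def scan(table, limit=None, family=None):
    """Mimic limiting the rows retrieved and filtering to one column family."""
    rows = sorted(table)[:limit]
    out = {}
    for key in rows:
        out[key] = {col: val for col, val in table[key].items()
                    if family is None or col.startswith(family + ":")}
    return out

print(scan(table, limit=1))        # only the first row
print(scan(table, family="info"))  # only the 'info' column family
```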
The Hortonworks Sandbox is a Hue application front end with one extra application, Apache HCatalog.
Apache™ HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid.
HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view.
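The "relational view over files" idea can be illustrated with a toy reader that lays a schema over a tab-delimited file, which is roughly what HCatalog's table abstraction does for text files sitting in HDFS; the column names here are made up:

```python
import io

# Toy illustration of HCatalog's idea: a schema (column names) laid over a
# plain delimited file yields a tabular, relational-looking view. HCatalog
# does this for text, RCFile, and sequence files in HDFS; this sketch only
# mimics the concept on a local in-memory "file".
schema = ["id", "name", "amount"]
data = io.StringIO("1\talice\t25.0\n2\tbob\t40.0\n")

def tabular_view(fileobj, schema, sep="\t"):
    """Yield each line of a delimited file as a schema-keyed row."""
    for line in fileobj:
        yield dict(zip(schema, line.rstrip("\n").split(sep)))

rows = list(tabular_view(data, schema))
print(rows[0])  # {'id': '1', 'name': 'alice', 'amount': '25.0'}
```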
There are a good number of tutorials available for learning Hadoop, and within the next hour I got a good feel for the Hadoop development environment.
As a Hadoop Administrator:
Cloudera:
From the open Firefox page I can click and open the administrator page; it then pops up a username and password prompt, where I log in with cloudera and cloudera.
After logging in to the admin page with username cloudera and password cloudera, here is the page.
Hortonworks: after running the Sandbox, I had no idea how to administer this Hadoop VM from the web browser, so I had to do all admin work from the VM's console using the command line.
I am going to write more on the usability of these tools as a normal user and as an administrator tomorrow.