Tuesday, 2 August 2016

HADOOP - BIG DATA

HADOOP - BIG DATA BASICS:
Before diving into Hadoop, it helps to know some basic information about Big Data first.

1. What is Big Data?
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.
According to IBM, 80% of data captured today is unstructured, from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data.
Definition: Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
2. What does Hadoop solve?
Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data.
However, since 80% of this data is "unstructured", it must be formatted (or structured) in a way that makes it suitable for data mining and subsequent analysis.
Hadoop is the core platform for structuring Big Data, and it solves the problem of making that data useful for analytics.
Why are we moving towards the Hadoop Distributed File System?
In the old days of distributed computing, failure was treated as an exception, and hardware errors were not tolerated well. Distributed systems therefore made sure their hardware seldom failed, which was achieved by using high-quality components and keeping backup systems. This line of thinking created hardware that is impressive, but EXPENSIVE!
The Hadoop Distributed File System (HDFS), in contrast, is designed to store bulk amounts of data (terabytes or petabytes) and to support a high-throughput mechanism for accessing that large amount of information.
In HDFS, files are stored redundantly across multiple machines, which guarantees the following:
->Durability in the face of failure
->High availability to parallel applications
HDFS has many similarities with other distributed file systems, but is different in several respects. One noticeable difference is HDFS's write-once-read-many model that relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access.
Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space.
HDFS is a file system designed around some key characteristics. It works well for:
->Very large files
->Commodity hardware
->Streaming data access
->Moving computation to the data, which is cheaper than moving the data
It is not well suited for:
->Low-latency data access
->Lots of small files
->Multiple writers or arbitrary file modifications

3. HDFS Architecture:

Below are the key words we will come across HDFS:
Name Node: The master node in the Hadoop architecture is called the Name Node. The Name Node is responsible for maintaining the metadata of the Hadoop file system.
This metadata covers all files and directories, and it is stored persistently on the Name Node's local disk.
Data Node: A Data Node is where the data itself lives. Data Nodes store and retrieve blocks of data and report this information back to the Name Node.
JobTracker: The JobTracker is responsible for scheduling and rescheduling tasks in the form of MapReduce jobs. Generally the JobTracker resides on the same machine as the Name Node.
TaskTracker: A TaskTracker is responsible for instantiating and monitoring individual map and reduce tasks, and it primarily executes the tasks assigned to it by the JobTracker. Generally TaskTrackers reside on the Data Nodes.

Once the architecture is understood, one might be interested in how the data is actually stored on HDFS. This blog covers the basics of that next.
4. Data Storage on HDFS:
As the name HDFS suggests, it is a file system: data is stored in the form of files on the Slave (Data Node) machines, in their local directories. Whenever a very large file is to be stored on HDFS, the Master (Name Node) splits this input file into a series of chunks of 64 MB or 128 MB each. These chunks are replicated for fault tolerance. Both the block size and the replication factor can be configured; the defaults are a chunk size of 64 MB and a replication factor of 3.
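The splitting described above can be sketched in a few lines of Python. This is only a conceptual illustration of how a file is divided into fixed-size chunks; the function name and layout are my own, not any real HDFS API:

```python
# Illustrative sketch: divide a file into fixed-size blocks the way an
# HDFS client conceptually does. Names here are assumptions for the demo.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) chunks covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be smaller
        blocks.append((offset, length))
        offset += length
    return blocks


# A 200 MB file becomes three full 64 MB blocks plus one 8 MB tail block.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))    # number of blocks
print(blocks[-1][1])  # size of the last (partial) block
```

Note that the last block only occupies as much space as it needs, which is why HDFS handles a few very large files far better than many small ones.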
HDFS allows files to be written once and read many times.
HDFS follows a Shared Nothing Architecture (i.e. No Sharing among its Data Nodes).
When the Name Node receives the first chunk, it places that chunk on one of the Data Nodes. The chunk placed first is called the primary replica. It is then the responsibility of this primary replica to copy itself to other Data Nodes until the replication factor is met: it is copied to another node, which in turn replicates it to yet another node. An illustration can be seen below.
1) The Name Node places the first block it receives on Data Node 1. This block is now called the primary replica.
2) The primary replica copies itself to another Data Node.
3) That chunk in turn copies itself to yet another Data Node. The process in steps 2 & 3 is often referred to as Data Pipelining.
4) After the replication is done, the Data Nodes send a heartbeat signal to the Name Node reporting the completion of the job.
5) Note that the Master does not store any data itself; it only holds the metadata about the location of the chunks and their replicas.
6) Since the Name Node could be a single point of failure, a Secondary Name Node is maintained which takes a checkpoint copy of the metadata from the Name Node at regular intervals, so that whenever there is a failure of the Name Node, the cluster can be restored with minimal loss of operational time.
5. FOUR V’s in BIG DATA
Volume, variety, velocity, veracity

Volume
Big Data implies enormous volumes of data. It used to be that employees created data; now data is generated by machines, networks, and human interaction on systems like social media, so the volume of data to be analysed is massive. Yet, Inderpal states that the volume of data is not as much of a problem as other V's like veracity.
Variety
Variety refers to the many sources and types of data, both structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storing, mining and analysing data. Jeff Veis, VP Solutions at HP Autonomy, presented how HP is helping organizations deal with big data challenges, including data variety.
Velocity
Big Data Velocity deals with the pace at which data flows in from sources like business processes, machines, networks and human interaction with things like social media sites, mobile devices, etc. The flow of data is massive and continuous. This real-time data can help researchers and businesses make valuable decisions that provide strategic competitive advantages and ROI, if you are able to handle the velocity. Inderpal suggests that sampling data can help deal with issues like volume and velocity.

Veracity
Big Data Veracity refers to the biases, noise and abnormality in data. Is the data that is being stored and mined meaningful to the problem being analysed? Inderpal feels that veracity is the biggest challenge in data analysis when compared to things like volume and velocity. In scoping out your big data strategy, you need your team and partners to work to help keep your data clean, and processes in place to keep 'dirty data' from accumulating in your systems.

Before going in depth into HADOOP - BIG DATA, it is good to know the components and their benefits. From my side, I am just helping out folks who are interested in Hadoop and eager to learn about its components.
Hadoop is a distributed storage and processing framework that runs on large clusters of commodity machines.

6. STORAGE(HDFS) + PROCESSING(MAPREDUCE) = HADOOP
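The processing half of that equation, MapReduce, can be sketched as a tiny word count in Python. This is a single-process imitation of the map and reduce phases, not the real Hadoop Java API; the function names are my own:

```python
# Minimal sketch of the MapReduce model: a map phase emits (word, 1)
# pairs and a reduce phase sums the counts per word. The shuffle step
# that groups pairs by key is imitated here by keying into a dict.
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def reduce_phase(pairs):
    """Reduce: sum the 1s for each word to get its total count."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


data = ["Hadoop stores data", "Hadoop processes data"]
print(reduce_phase(map_phase(data)))  # counts per word across both lines
```

In a real cluster the map tasks run on the Data Nodes holding the input blocks (moving computation to the data), and the shuffle sends each word's pairs to the reducer responsible for it.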
Below are the core components that run on top of HDFS and MapReduce:

Pig : Pig is a data flow language and execution environment for exploring very large data sets. Pig runs on HDFS and MapReduce clusters.
        Pig is for processing huge data sets according to business logic, with the help of built-in functions.
        The language used to express data flows in Pig is called PIG LATIN.
        ex: LOAD, STORE, FOREACH ... GENERATE, JOIN, DUMP, GROUP BY, ORDER BY, TOKENIZE, etc.

HIVE : Hive is a distributed data warehouse; it manages data stored in HDFS and provides a query language based on SQL.
           Hive is for data summarization, querying of data, and advanced queries.
           All the data is organized into tables, either managed tables or external tables.
         ex: hive > CREATE TABLE name (...) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'; etc.

Hbase : Hadoop Database
            HBase is a distributed, column-oriented database.
            HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries.
            To get random access to data, we use HBase tables.

Zookeeper : A distributed, highly available coordination service.
                    The ZooKeeper primitives are a rich set of building blocks that can be used to build a large class of coordination data structures and protocols.
            ex: distributed queues, distributed locks, and leader election among a group of peers.
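The leader-election example can be sketched in Python. In the usual ZooKeeper recipe, each peer creates an ephemeral sequential znode and the peer with the smallest sequence number becomes the leader; here the "znodes" are simulated with a plain dict, so everything below is an illustration, not the ZooKeeper client API:

```python
# Sketch of ZooKeeper-style leader election: each peer holds a sequence
# number (as if it had created an ephemeral sequential znode), and the
# peer with the smallest number is the leader. If the leader dies, its
# entry vanishes and the next-smallest peer takes over automatically.


def elect_leader(znodes):
    """znodes maps peer name -> sequence number; the lowest number leads."""
    return min(znodes, key=znodes.get)


znodes = {"peer-a": 3, "peer-b": 1, "peer-c": 2}
print(elect_leader(znodes))  # the peer with the smallest sequence number

del znodes["peer-b"]         # simulate the leader failing
print(elect_leader(znodes))  # the next peer in line is promoted
```

The appeal of the recipe is that failover needs no extra protocol: the ephemeral node of a crashed peer disappears on its own, and re-running the same election picks the successor.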

SQOOP : SQL + Hadoop
            It is a tool for efficiently moving data between relational databases and HDFS.
            It is meant for interacting with a target RDBMS from Hadoop, i.e. importing data from an RDBMS into HDFS OR exporting results from HDFS to an RDBMS table.
   SQOOP does not communicate with the local file system; imports and exports happen only with respect to HDFS.
           ex (an illustrative incremental import; the JDBC URL, table and directory are placeholders): sqoop import --connect jdbc:mysql://host/db --table emp -m 1 --target-dir /user/data/emp --incremental append --check-column empid --last-value 0

FLUME : Apache Flume is used to bring data in and put it into HDFS (Hadoop) -> live data ingestion.
              It is not for processing data. Flume is for bringing live data into the system; data that changes all the time and never stays stable.

            ex: server log data, social media data (Twitter feeds), stock exchange data
