HADOOP - BIGDATA BASICS:
Before going into Hadoop, let us first look at some basic information about Big Data.
1. What is Big Data?
Every
day, we create 2.5 quintillion bytes of data — so much that 90% of the data in
the world today has been created in the last two years alone.
According
to IBM, 80% of data captured today is unstructured, from sensors used to gather
climate information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals, to name a few. All of
this unstructured data is Big Data.
Objectives:
Big Data is defined as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
2. What does Hadoop solve?
Organizations
are discovering that important predictions can be made by sorting through and
analyzing Big Data.
However,
since 80% of this data is "unstructured", it must be formatted (or
structured) in a way that makes it suitable for data mining and subsequent
analysis.
Hadoop
is the core platform for structuring Big
Data, and solves the problem of making it useful for analytics purposes.
Why are we moving towards the Hadoop Distributed File System?
In the old days of distributed computing, failure was treated as an exception, and hardware errors were not tolerated well. So distributed systems were built to make sure their hardware seldom failed. This was achieved by using high-quality components and by having backup systems. This line of thinking created hardware that is impressive, but EXPENSIVE!
The Hadoop Distributed File System, in contrast, is used to store bulk amounts of data, such as terabytes or petabytes, and also supports a high-throughput mechanism for accessing this large amount of information.
In HDFS, files are stored in a redundant manner across multiple machines, and this guarantees the following:
->Durability against failure
->High availability to parallel applications
HDFS
has many similarities with other distributed file systems, but is different in
several respects. One noticeable difference is HDFS's write-once-read-many
model that relaxes concurrency control requirements, simplifies data coherency,
and enables high-throughput access.
Another
unique attribute of HDFS is the viewpoint that it is usually better to locate
processing logic near the data rather than moving the data to the application space.
HDFS is a file system designed around some key characteristics. It works well for:
->Very large files
->Commodity hardware
->Streaming data access
It is not designed for:
->Low-latency data access
->Lots of small files
->Multiple writers and arbitrary file modifications
Its guiding principle is that moving computation is cheaper than moving data.
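To make this streaming, write-once-read-many access concrete, here is a minimal sketch using the standard Java FileSystem API (a sketch only: it assumes the cluster configuration files are on the classpath, and the path /user/demo/notes.txt is just an example):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // handle to the configured file system

        Path file = new Path("/user/demo/notes.txt"); // example path
        try (FSDataOutputStream out = fs.create(file)) {          // write once
            out.write("hello hdfs\n".getBytes("UTF-8"));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file)))) {          // read many times, as a stream
            System.out.println(in.readLine());
        }
        fs.close();
    }
}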
3. HDFS Architecture:
Below are the key terms we will come across in HDFS:
Name Node: The master node in the Hadoop architecture is called the Name Node. The Name Node is responsible for maintaining the metadata of the Hadoop file system: it holds metadata for all the files and directories, and this information is stored persistently on its local disk.
Data Node: A Data Node is the actual holder of data. Data Nodes store and retrieve blocks of data, and report this information back to the Name Node.
JobTracker: The Job Tracker is responsible for scheduling and rescheduling tasks in the form of MapReduce jobs. Generally the Job Tracker resides on the Name Node.
Task Tracker: The Task Tracker is responsible for instantiating and monitoring individual map and reduce tasks, and is primarily responsible for executing the tasks assigned by the Job Tracker. Generally a Task Tracker resides on each Data Node.
Once the architecture is understood, one might be interested to know how data is stored on HDFS. This blog covers the basics of that below.
4. Data Storage on HDFS:
As the name HDFS suggests
us it is a file system, Data is stored in the form of files on Slave (Data
Node) system in its local directories. Whenever a very large file is to be
stored on HDFS, The Master (Name Node) splits this input file into a series of
chunks each of 64 MB or 128 MB. These chunks are replicated for tolerance. The
block size and the replication factor can be configured. Default specifications
are chunk size of 64 MB and replication factor of 3.
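For illustration, a minimal Java sketch of overriding these defaults for a single file, using the FileSystem.create(...) overload that takes an explicit replication factor and block size (the path and the chosen values are just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path big = new Path("/user/demo/bigfile.dat"); // example path
        short replication = 3;                         // replication factor (default is 3)
        long blockSize = 128L * 1024 * 1024;           // 128 MB chunks
        int bufferSize = 4096;
        try (FSDataOutputStream out =
                 fs.create(big, true, bufferSize, replication, blockSize)) {
            out.write(new byte[1024]);                 // placeholder payload
        }
        fs.close();
    }
}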
HDFS allows files to be
written once and read many times.
HDFS follows a Shared
Nothing Architecture (i.e. No
Sharing among its Data Nodes).
When the Name Node receives the first chunk, it places this chunk on one of the Data Nodes. The chunk which is placed first is called the primary replica. It is now the responsibility of this primary replica to copy itself to other Data Nodes to meet the replication factor: it gets copied to another node, and that copy in turn is replicated to yet another node. An illustration can be seen below.
1) The Name Node places the first block it receives on Data Node 1. This block is now called the primary replica.
2) The
Primary replica copies itself to another Data Node
3) This chunk in turn copies itself to another Data Node. This process in steps 2 & 3 is often referred to as Data Pipelining.
4) After the replication is done, the Data Nodes send a heartbeat signal to the Name Node reporting the completion of the job.
5) It can be seen that the Master does not store any data, but contains the metadata about the location of the chunks and their replicas.
6) Since the Name Node could be a single point of failure, a Secondary Name Node is maintained which takes a copy of the metadata from the Name Node at regular intervals. So whenever there is a failure in the Name Node, the Secondary Name Node's copy can be used to bring the cluster back with minimal loss of operational time.
5. FOUR V's in BIG DATA
Volume, variety, velocity, veracity
Volume
Big data implies enormous volumes of data. It used to be that employees created data. Now that data is generated by machines, networks and human interaction on systems like social media, the volume of data to be analysed is massive. Yet, Inderpal states that the volume of data is not as much the problem as other V's like veracity.
Variety
Variety refers to the many sources and types of data both
structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. This variety of unstructured data creates
problems for storage, mining and analysing data. Jeff Veis, VP Solutions at HP
Autonomy presented how HP is helping organizations deal with big challenges
including data variety.
Velocity
Big Data Velocity deals with the pace at which data flows in
from sources like business processes, machines, networks and human interaction
with things like social media sites, mobile devices, etc. The flow of data is
massive and continuous. This real-time data can help researchers and businesses
make valuable decisions that provide strategic competitive advantages and ROI
if you are able to handle the velocity. Inderpal suggests that sampling data can help deal with issues like volume and velocity.
Veracity
Big Data Veracity refers to the biases, noise and abnormality in data. Is the data that is being stored and mined meaningful to the problem being analysed? Inderpal feels that veracity in data analysis is the biggest challenge when compared to things like volume and velocity. In scoping out your big data strategy, you need to have your team and partners work to help keep your data clean, and put processes in place to keep ‘dirty data’ from accumulating in your systems.
Before going in depth into HADOOP - BIGDATA, it is good to know the components and their benefits. From my side, I am just helping out folks who are interested in learning about Hadoop and who are eager to know the components.
HADOOP is a distributed file system and processing framework that runs on large clusters of commodity machines.
6. STORAGE(HDFS) + PROCESSING(MAPREDUCE) = HADOOP
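To illustrate the PROCESSING (MapReduce) half of this equation, here is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API (the class names and the input/output paths passed on the command line are just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);                  // emit (word, 1) for every token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));      // total occurrences per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}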
Below are the core components which we run on top of MapReduce.
Pig : Pig is a data flow language and execution environment for exploring very large data sets. Pig runs on HDFS and MapReduce clusters.
Pig is for processing huge data sets according to business logic, with the help of built-in functions and operators.
The language used to express data flows in Pig is called PIG LATIN.
ex: LOAD, FOREACH ... GENERATE, JOIN, GROUP, ORDER, DUMP, TOKENIZE, etc.
HIVE : Hive is a distributed data warehouse; it manages data stored in HDFS and provides a query language based on SQL (HiveQL).
Hive is for data summarization, querying of data and advanced ad hoc analysis.
All the data is organized into tables only, either managed tables or external tables.
ex (table and column names are just an example):
hive> CREATE TABLE emp (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Hbase : Hadoop Database
HBase is a distributed, column-oriented database.
HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries.
To get random access to data, we use an HBase table.
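For illustration, a minimal Java sketch of such a point query with the HBase client API (the table name employee, column family info and row key are just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HbasePointQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // write one cell: row key "emp1001", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("emp1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ravi"));
            table.put(put);

            // random access by row key: a point query, no MapReduce scan needed
            Result result = table.get(new Get(Bytes.toBytes("emp1001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}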
Zookeeper : A distributed, highly available coordination service.
The ZooKeeper primitives are a rich set of building blocks that can be used to build a large class of coordination data structures and protocols.
ex: distributed queues, distributed locks, and leader election among a group of peers.
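For illustration, a minimal Java sketch of the ephemeral-sequential znode pattern that underlies such locks and leader election (the connect string is just an example, and the /election parent znode is assumed to exist already):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // each candidate creates an ephemeral, sequential child under /election
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // the candidate holding the lowest sequence number is the leader;
        // if it dies, its ephemeral node disappears and the next one takes over
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean leader = me.endsWith(children.get(0));
        System.out.println(leader ? "I am the leader" : "I am a follower");

        zk.close();
    }
}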
SQOOP : SQL + Hadoop
It is a tool for efficiently moving data between relational databases and HDFS.
It is meant for interacting with a target RDBMS from Hadoop, i.e. importing data from an RDBMS into HDFS or exporting results from HDFS into an RDBMS table.
SQOOP will not communicate with the local file system (LFS); import and export happen only with respect to HDFS.
ex (an incremental import job; the connection string, table and column names are just an example):
sqoop job --create emp-import -- import --connect jdbc:mysql://dbhost/company --table emp -m 1 --target-dir /user/hadoop/emp --incremental append --check-column empid --last-value 0
FLUME : Apache Flume is used to bring data in and put it into HDFS (Hadoop) -> live data ingestion.
It is not for processing data. Flume is for bringing live data into the system, data which changes continuously and is not static.
ex : server log data, social media data (Twitter counts), stock exchange data