Sunday, May 31, 2015


3:36 AM - By ajay desai

                               1. Support for very large files

Hadoop can process a small number of large files efficiently, but it cannot process a large number of small files efficiently.

In Hadoop, every file, whatever its size, is divided into fixed-size blocks of 64 MB by default; the block size can be configured to a multiple of 64 MB (128 MB, 192 MB, and so on) as per the client's requirements.

For example, consider a file DeviceCount.docx of size 248 MB. As per the Hadoop file system, this file is divided into fixed-size blocks of 64 MB each. Memory for this file is then allocated as follows:

Dividing the 248 MB file into 64 MB blocks gives 4 blocks: 3 blocks of 64 MB each, and a last block holding 56 MB of data. The remaining 8 MB of the last block's 64 MB is wasted, since a single block can hold data belonging to one file only. This wastage can be tolerated because it is very small.
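The block split above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, using the article's example file size and the 64 MB default block size; it is not Hadoop's actual allocation code.

```python
# Sketch: how a file is split into fixed-size HDFS blocks (64 MB default).
BLOCK_SIZE_MB = 64

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the list of block sizes (in MB) a file would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # partially filled last block
    return blocks

# DeviceCount.docx from the example: 248 MB
print(split_into_blocks(248))  # [64, 64, 64, 56] -> 4 blocks, last one 56 MB
```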

Now let us consider that we have 4 files:
  (i) A.txt - 40 MB (block allocated: 64 MB)
  (ii) B.txt - 20 MB (block allocated: 64 MB)
  (iii) C.txt - 30 MB (block allocated: 64 MB)
  (iv) ABC.docx - 10 MB (block allocated: 64 MB)

 So, in total, 24 + 44 + 34 + 54 = 156 MB of memory is wasted.
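The wastage across the four small files can be checked the same way. This sketch follows the article's assumption that each block holds data from a single file only; the file names and sizes come from the example above.

```python
# Sketch: unused space in the last (partially filled) block of each file,
# assuming one file per block as described in the article.
BLOCK_SIZE_MB = 64

files_mb = {"A.txt": 40, "B.txt": 20, "C.txt": 30, "ABC.docx": 10}

def wasted_mb(size_mb, block_size_mb=BLOCK_SIZE_MB):
    remainder = size_mb % block_size_mb
    return block_size_mb - remainder if remainder else 0

total_waste = sum(wasted_mb(size) for size in files_mb.values())
print(total_waste)  # 24 + 44 + 34 + 54 = 156 MB wasted
```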

                              2. Commodity Hardware

Hadoop requires only low-end commodity hardware and inexpensive (open-source) software for cluster setup.

                 3. High Latency

Because data has to be read from several blocks stored on various data nodes across the cluster, a lot of time is consumed compared to OLTP (Online Transaction Processing) systems. The reasons for high latency are:

(i) Sequential file processing
(ii) Hadoop stores structured, semi-structured and unstructured data. 
(iii) A huge volume of data is distributed across several data nodes in a cluster.
(iv) The RPCs (remote procedure calls) that take place between the name node and data nodes over the network consume time and bandwidth.

                            4. Streaming data access

 HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

About the Author

I am Azeheruddin Khan, with more than 6 years of experience in C# and MS SQL. My work comprises medium and enterprise-level projects using Microsoft .NET technologies. Please feel free to contact me with any queries by posting comments on my blog; I will try to reply as early as possible. Follow me @fresher2programmer

© 2014 Fresher2Programmer.