Hadoop has HDFS, which is the default built-in FileSystem, written in Java. Cloudera and HortonWorks both use this built-in default Java implementation. MapR has taken a different approach. What approach has MapR taken in its FileSystem implementation, and what may be the advantages and disadvantages of MapR's approach versus other vendors? If there are disadvantages, how can they be addressed? Look at the advantages and disadvantages from the user, developer, administrator, and risk perspectives.
Approaches MapR has taken in its FileSystem implementation:-
The MapR Data Platform, which is the foundation of the MapR Distribution including Apache Hadoop, delivers a true file system that is POSIX-compliant with full random read-write capability. Instead of setting up Linux with EXT4 and then installing HDFS on top of that, you set up Linux with MapR XD. Because this architecture has fewer layers, it delivers significant speed benefits.
Let’s take a look at the different parts of the MapR Distribution that benefit from a read-write capable file system.
1) NFS
HDFS NFS support requires utilization of the local file system to
temporarily write data before it lands in HDFS. There are two major
problems with this. First, the data can potentially be copied out
of order. Second, this means space must be reserved in the local
file system to allow NFS enough space to land data before it can
get copied into HDFS.
MapR NFS support, on the other hand, is true NFS. It is accessed like any other storage device. Any application you have that can read and write to an NFS mount can read and write to MapR XD. You don’t need to reserve local storage for it to work.
In addition to MapR NFS, MapR also supports the HDFS API, giving you even more options for integrating the MapR Distribution in your environment.
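As a minimal sketch of what "read and write like any other storage device" means in practice, the snippet below overwrites a few bytes in the middle of a file using only ordinary file I/O. The temporary directory is a stand-in for a directory under the NFS mount, and the file name and helper function are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def overwrite_in_place(path, offset, data):
    """Overwrite bytes at `offset` without rewriting the rest of the file."""
    # "r+b" opens for reading and writing without truncating the file.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


# Stand-in for a directory on the NFS mount (assumption: on a real
# cluster this would be a path under the mounted file system).
mount_dir = tempfile.mkdtemp()
path = os.path.join(mount_dir, "events.log")  # hypothetical file name

with open(path, "wb") as f:
    f.write(b"status=PENDING;id=42\n")

# Random write: replace the 7-byte value after "status=" in place.
overwrite_in_place(path, len(b"status="), b"DONE___")

with open(path, "rb") as f:
    result = f.read()
print(result)
```

The point of the sketch is the `seek` followed by `write`: a write-once file system cannot service that pattern directly, while a random read-write file system exposed over NFS can.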
2) NameNode
The NameNode in Apache Hadoop is a single point of failure and a
choke point for the platform. It limits the cluster to around
50-100 million total files in the system.
MapR doesn’t have a NameNode. The MapR distributed metadata architecture enables a single MapR cluster to support one trillion files and database tables. This is directly enabled by a random read-write file system. The MapR no-NameNode architecture means fewer hassles and less administrative overhead. Friends don’t let friends run NameNodes.
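A rough back-of-the-envelope sketch of why a single NameNode becomes a ceiling: it keeps the whole namespace in its own heap. The figure of about 150 bytes per namespace object is a commonly quoted rule of thumb and is an assumption here, as are the function name and the one-block-per-file default; real heap usage depends on the Hadoop version and the file layout.

```python
# Assumed rule of thumb, not a measured value: each file and each block
# costs roughly 150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150


def namenode_heap_gib(num_files, blocks_per_file=1,
                      bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap (GiB) needed to hold the namespace."""
    # One object for the file itself plus one per block it occupies.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object / 2**30


low = namenode_heap_gib(50_000_000)    # lower end of the range quoted above
high = namenode_heap_gib(100_000_000)  # upper end of the range quoted above
print(f"{low:.1f} GiB to {high:.1f} GiB of heap on one machine")
```

Under these assumptions the estimate grows linearly with the file count, and all of it has to fit in one JVM, which is the constraint a distributed metadata design avoids.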
3) Real-time Hadoop
Apache HBase had to implement concepts like tombstones and
compactions in order to be able to run on HDFS. They are
workarounds for a write-once, read-many file system. Automatic
compactions and region splits can make the platform unstable under
heavy production load, and the usual recommendation is to disable
them in production environments.
MapR Database implements the same API as HBase, but because it is implemented on a random read-write-capable file system, it doesn’t need tombstones or compactions. This enables high performance (an average of 2-7x faster than standard Apache HBase) and consistent low latency for your operational applications using MapR Database.
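To make the tombstone and compaction idea concrete, here is a toy model of a key-value store layered on a write-once log. It is only an illustration of the concept, not how HBase or MapR Database is implemented, and the class and method names are hypothetical.

```python
class AppendOnlyStore:
    """Toy key-value store over a write-once log."""

    TOMBSTONE = object()  # marker record meaning "this key was deleted"

    def __init__(self):
        self.log = []  # (key, value) records; existing records never change

    def put(self, key, value):
        # An update cannot modify the old record, so it appends a new one.
        self.log.append((key, value))

    def delete(self, key):
        # A delete cannot remove the old record, so it appends a tombstone.
        self.log.append((key, self.TOMBSTONE))

    def get(self, key):
        # The newest record for a key wins, so scan from the end.
        for k, v in reversed(self.log):
            if k == key:
                return None if v is self.TOMBSTONE else v
        return None

    def compact(self):
        # Rewrite the log, keeping only the latest live value per key.
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = [(k, v) for k, v in latest.items()
                    if v is not self.TOMBSTONE]


store = AppendOnlyStore()
store.put("row1", "a")
store.put("row1", "b")   # supersedes the first record
store.put("row2", "c")
store.delete("row2")     # adds a tombstone
size_before = len(store.log)
store.compact()
size_after = len(store.log)
print(size_before, size_after)
```

Until `compact` runs, stale records and tombstones keep accumulating and every read has to skip over them. A store that can update and delete records in place has no such backlog to rewrite, which is the advantage claimed above.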
Advantages and disadvantages of MapR's approach versus other vendors considering user, developer, administrator and risk perspective:-
MapR is generally considered more expensive than free, but to be clear, you can still use MapR Community Edition for free. Software licensing is usually considered the biggest cost driver, which is why free Apache Hadoop looks attractive, when in fact licensing isn't even close to the biggest cost. Most people ignore details like the number of hours needed to administer the platform and how much hardware is needed to run it, both of which cost a lot of money. MapR has customers running well over 1,000 nodes with only one administrator for the entire MapR cluster. MapR was built to be as close to zero-administration as possible in every respect.
Regarding the Community Edition: it is free, but it doesn't give you the HA features. It still delivers a faster and better user experience than the competition because it still runs the MapR File System, which has no NameNode (read that as no single point of failure and no bottlenecks under heavy file load), and it still supports the HDFS API. It also delivers true NFS (the others don't offer this). Think about this: if you have a 10-node cluster with Apache Hadoop, you lose 2 nodes to the NameNode and Secondary NameNode. With MapR, all 10 nodes do actual work. That is 20% of the cluster back right off the top.
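The 10-node arithmetic can be checked with a short sketch. The function name is hypothetical, and the assumption that both the NameNode and the Secondary NameNode occupy dedicated machines comes from the example above rather than from any fixed Hadoop requirement.

```python
def worker_share(total_nodes, metadata_nodes):
    """Return (worker node count, fraction of the cluster doing work)."""
    workers = total_nodes - metadata_nodes
    return workers, workers / total_nodes


# Apache Hadoop example: 2 of 10 nodes reserved for the NameNode and
# Secondary NameNode (assumption taken from the text).
hadoop_workers, hadoop_share = worker_share(10, 2)

# MapR example: no dedicated metadata nodes.
mapr_workers, mapr_share = worker_share(10, 0)

# Extra worker capacity relative to the Hadoop layout.
gain = (mapr_workers - hadoop_workers) / hadoop_workers
print(hadoop_workers, mapr_workers, hadoop_share, mapr_share, gain)
```

Note the two ways of stating the same difference: 20% of the cluster is freed up, which amounts to 25% more worker nodes than the 8 available in the Hadoop layout. The effect shrinks as the cluster grows, since the number of metadata nodes stays fixed.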
MapR even supports multiple versions of open source software running on the same cluster; the other vendors do not. MapR also stays out of the politics and supports more open source software than the other vendors.
To clarify Edward's point on MapR-DB, it supports the HBase API and now the Open JSON Application Interface (OJAI™, currently in developer preview). MapR-DB is truly a zero-administration database, unlike HBase, which requires considerable care to make it operate properly.
The MapR platform is considerably faster. There is a case in India where MapR displaced the competition: "Architecting the World’s Largest Biometric Identity System: The Aadhaar Experience" ... In this case the government was able to handle the same workload with better service levels on one third of the hardware used by our competitor.
Any code you write to work with Apache Hadoop or Apache HBase
works just fine with MapR's distribution because it uses the same
API binaries as the Apache Distributions.
MapR basically rewrote HDFS and HBase to be more performant, but
some companies prefer the Apache code base, which is open source and
used in all the other distributions. It can make integration with
other tools easier, as there is more documentation and support
available from a broader community.
One final point: remember that free open source still has a cost, and people have a difficult time calculating it. When it comes down to it, you have to figure out whether you want to use the technology to solve problems and focus on your company's core competency, or solve the problems within the technology to make it accomplish the task you want to complete. Hardware = Money and Time = Money.