While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found for which MapReduce is not well suited, along with some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the sum of the previous two values: f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, it is better to do it as a single reduce(map(data)) operation rather than going through the entire MapReduce process. [2]
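To make the dependency concrete, here is a minimal Python sketch (my own illustration, not taken from [2]): the Fibonacci loop shows why each term needs the result of the previous step, and the last lines show a small data set handled with a plain in-process reduce(map(...)) instead of a full MapReduce job.

```python
from functools import reduce

# Fibonacci: each term needs the two terms before it, so the terms cannot
# be computed by independent map tasks; the work is inherently sequential.
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# For a data set that fits on one machine, a plain in-process
# reduce(map(data)) is simpler and cheaper than launching a MapReduce job.
data = [1, 2, 3, 4, 5]
sum_of_squares = reduce(lambda acc, x: acc + x, map(lambda x: x * x, data))

print(fibonacci(10), sum_of_squares)  # 55 55
```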
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one-dimensional, and the Reduce step must not generate a large amount of data, or there will be a serious performance degradation. For example, CouchDB's MapReduce may not be a good fit for full-text indexing or ad hoc searching. That problem is better suited to a tool such as Lucene. [1]
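As a rough illustration (not CouchDB's actual view API, just a plain-Python sketch of the idea), a naive full-text map function emits one key per token, so the data flowing into the Reduce step grows with every word of every document:

```python
import re

# Naive "full-text indexing" written as plain Python generators; this is
# illustrative only and not how CouchDB views are actually defined.
def map_fulltext(doc_id, text):
    # One emitted pair per token: the intermediate data grows with every
    # word of every document, the kind of large output the text above
    # warns about.
    for word in re.findall(r"\w+", text.lower()):
        yield word, doc_id

def reduce_fulltext(word, doc_ids):
    # Posting list for one term. A dedicated engine such as Lucene
    # maintains and queries this structure incrementally instead of
    # rebuilding it for ad hoc searches.
    return word, sorted(set(doc_ids))

print(list(map_fulltext("doc1", "MapReduce is not a search engine")))
```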
3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing
do not require global synchronization. As a result, they can be expressed naturally
in MapReduce, since map and reduce tasks run independently and in isolation. However,
there are many examples of algorithms that depend crucially on the existence of
shared global state during processing, making them difficult to implement in
MapReduce (since the single opportunity for global synchronization in MapReduce
is the barrier between the map and reduce phases of processing). [3]
3.1. Online learning
One example is online learning. Chapter 6 of Data-Intensive Text Processing with MapReduce introduces the concept of learning as the setting of parameters in a statistical model. Both EM and the gradient-based learning algorithms described there are instances of what are known as batch learning algorithms. This simply means that the full "batch" of training data is processed before any updates to the model parameters are made. On one hand, this is quite reasonable: updates are not made until the full evidence of the training data has been weighed against the model. An earlier update would seem, in some sense, to be hasty. However, it is generally the case that more frequent updates can lead to more rapid convergence of the model (in terms of number of training instances processed), even if those updates are made by considering less data. Thinking in terms of gradient optimization (see Section 6.5 of the book), online learning algorithms can be understood as computing an approximation of the true gradient, using only a few training instances. Although only an approximation, the gradient computed from a small subset of training instances is often quite reasonable, and the aggregate behavior of multiple updates tends to even out the errors that are made. In the limit, updates can be made after every training instance.

Unfortunately, implementing online learning algorithms in MapReduce is problematic. The model parameters in a learning algorithm can be viewed as shared global state, which must be updated as the model is evaluated against training data. All processes performing the evaluation (presumably the mappers) must have access to this state. In a batch learner, where updates occur in one or more reducers (or, alternatively, in the driver code), synchronization of this resource is enforced by the MapReduce framework. However, with online learning, these updates must occur after processing smaller numbers of instances. This means that the framework must be altered to support faster processing of smaller datasets, which goes against the design choices of most existing MapReduce implementations. Since MapReduce was specifically optimized for batch operations over large amounts of data, such a style of computation would likely result in inefficient use of resources. In Hadoop, for example, map and reduce tasks have considerable startup costs. This is acceptable because in most circumstances the cost is amortized over the processing of many key-value pairs. However, for small datasets, these high startup costs become intolerable.

An alternative is to abandon shared global state and run independent instances of the training algorithm in parallel (on different portions of the data). A final solution is then arrived at by merging the individual results. Experiments, however, show that the merged solution is inferior to the output of running the training algorithm on the entire dataset. [3]
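The contrast can be sketched in a few lines of Python. This is a toy one-dimensional least-squares example of my own, not code from [3]: the batch version maps per-instance gradients and reduces them into one update per pass, while the online version rewrites the shared parameter after every instance, which is exactly the global state MapReduce has no good place for.

```python
# Toy 1-D least-squares model y ~ w * x (an assumed example, not from [3]).

def gradient(w, x, y):
    # Derivative of 0.5 * (w*x - y)^2 with respect to w, for one instance.
    return (w * x - y) * x

def batch_epoch(w, data, lr=0.1):
    # MapReduce-friendly: mappers compute per-instance gradients against a
    # fixed w, a reducer sums them, and w is updated once per pass.
    grads = [gradient(w, x, y) for x, y in data]   # "map"
    total = sum(grads)                             # "reduce"
    return w - lr * total / len(data)

def online_pass(w, data, lr=0.1):
    # Online learning: the shared global state w changes after every
    # instance, so each gradient depends on all earlier updates and there
    # is no clean map/shuffle/reduce separation.
    for x, y in data:
        w -= lr * gradient(w, x, y)
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # noise-free samples of y = 2x
w = 0.0
for _ in range(50):
    w = batch_epoch(w, data)

print(round(w, 3))                             # approaches 2.0
print(round(online_pass(0.0, data * 50), 3))   # also approaches 2.0
```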
3.2. Monte Carlo simulations
A related difficulty occurs when running what are called Monte
Carlo simulations, which are used to perform inference in probabilistic models
where evaluating or representing the model exactly is impossible. The basic
idea is quite simple: samples are drawn from the random variables in the model
to simulate its behavior, and then simple frequency statistics are computed
over the samples. This sort of inference is particularly useful when dealing
with so-called nonparametric models, which are models whose structure is not
specified in advance, but is rather inferred from training data. For an
illustration, imagine learning a hidden Markov model, but inferring the number of
states, rather than having them specified. Being able to parallelize Monte
Carlo simulations would be tremendously valuable, particularly for unsupervised
learning applications where they have been found to be far more effective than
EM-based learning (which requires specifying the model). Although recent work [4]
has shown that the delays in synchronizing sample statistics due to parallel
implementations do not necessarily damage the inference, MapReduce offers no
natural mechanism for managing the global shared state that would be required
for such an implementation. [3]
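For contrast, here is a toy sketch (my own example, unrelated to the models in [3] or [4]) of the easy case, where samples are fully independent: each "mapper" draws from its own seed and a "reducer" merges frequency counts. The hard case described above differs because every new draw must condition on statistics of all samples drawn so far.

```python
import random

# Independent sampling fits MapReduce: each "mapper" draws its own samples
# from its own seed, and a "reducer" merges the frequency counts.
def map_draw_samples(seed, n_samples):
    rng = random.Random(seed)
    # Toy question: how often does the sum of two dice exceed 8?
    hits = sum(1 for _ in range(n_samples)
               if rng.randint(1, 6) + rng.randint(1, 6) > 8)
    return hits, n_samples

def reduce_merge(partials):
    hits = sum(h for h, _ in partials)
    total = sum(n for _, n in partials)
    return hits / total

partials = [map_draw_samples(seed, 100_000) for seed in range(4)]  # "mappers"
print(round(reduce_merge(partials), 3))  # close to 10/36, about 0.278

# The samplers discussed above are harder: a Gibbs-style sampler for a
# nonparametric model conditions each new draw on statistics of all samples
# drawn so far, which is shared global state that MapReduce gives no
# natural mechanism to maintain.
```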
4. Join algorithms for log processing
The paper "A Comparison of Join Algorithms for Log Processing in MapReduce" by Spyros Blanas and Jignesh Patel reveals the limitations of the MapReduce programming model for implementing joins. The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins, and MapReduce programmers often use simple but inefficient algorithms to perform them. The paper describes crucial implementation details of a number of well-known join strategies in MapReduce and presents a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. The results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform. [7]
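For context, the "simple but inefficient" approach most programmers reach for is the repartition (reduce-side) join, sketched below in plain Python with a made-up record layout (my own illustration, not code from [7]). Both inputs are tagged and grouped by the join key, so the entire reference table is shuffled across the network on every run, which is part of why the paper's comparison of smarter variants matters.

```python
from collections import defaultdict

# Repartition (reduce-side) join, simulated in-process. The record layout
# below is invented for illustration; it is not taken from the paper.
log = [("u1", "click:/home"), ("u2", "click:/buy"), ("u1", "click:/buy")]
users = [("u1", "Alice"), ("u2", "Bob")]

def mapper(records, tag):
    # Tag every record with its source so the reducer can tell them apart.
    for key, value in records:
        yield key, (tag, value)

def shuffle(pairs):
    # Stand-in for the framework's shuffle: group all tagged values by key.
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reducer(key, tagged_values):
    names = [v for t, v in tagged_values if t == "users"]
    events = [v for t, v in tagged_values if t == "log"]
    for name in names:                # cross product of matching rows
        for event in events:
            yield key, name, event

shuffled = shuffle(list(mapper(log, "log")) + list(mapper(users, "users")))
for key, tagged_values in shuffled.items():
    for row in reducer(key, tagged_values):
        print(row)   # e.g. ('u1', 'Alice', 'click:/home')
```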
5. Paper - The Limitation of MapReduce: A Probing Case and a Lightweight Solution, Zhiqiang Ma and Lin Gu
This paper describes how the MapReduce design allows a program to scale up to handle extremely large data sets, but constrains a program's ability to process smaller data items and to exploit variable degrees of parallelization, which are likely to be the common case in general applications. The authors analyze the limitations of MapReduce and present the design and implementation of a new lightweight parallelization framework, MRlite. [5]
6. Paper - Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, Foto N. Afrati and Anish Das Sarma
A significant amount of recent research work has addressed
the problem of solving various data management problems in the cloud. The major
algorithmic challenges in map-reduce computations involve balancing a multitude
of factors such as the number of machines available for mappers/reducers, their
memory requirements, and communication cost (total amount of data sent from
mappers to reducers). Most past work provides custom solutions to specific
problems, e.g., performing fuzzy joins in map-reduce, clustering, graph
analyses, and so on. While some problems are amenable to very efficient
map-reduce algorithms, some other problems do not lend themselves to a natural
distribution, and have provable lower bounds. Clearly, the ease of
"map-reducibility" is closely related to whether the problem can be
partitioned into independent pieces, which are distributed across
mappers/reducers. What makes a problem distributable? Can we characterize
general properties of problems that determine how easy or hard it is to find
efficient map-reduce algorithms? This
is a vision paper that attempts to answer the questions described above.
[6]
References:
1. Writing and Querying MapReduce Views in CouchDB, O'Reilly Media.
3. Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer.
4. Advances in Neural Information Processing Systems 21 (NIPS 2008).
5. The Limitation of MapReduce: A Probing Case and a Lightweight Solution, Zhiqiang Ma and Lin Gu.
6. Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, Foto N. Afrati and Anish Das Sarma.
7. A Comparison of Join Algorithms for Log Processing in MapReduce, Spyros Blanas and Jignesh Patel.
Comments:
Nice post.

Hi Yanin,
Nice compilation. Here are the generalizations from the examples:
1) MapReduce can't be used unless the operation can be parallelized (the Fibonacci example and gradient-descent computation; shared-global-state problems may also come into this category).
2) MapReduce can't perform well unless the data is partitioned in an optimized way (otherwise, a lot of communication is involved).
I didn't get the 2nd example. Does the reduce task generate more data in text indexing?
-Balu