Wednesday, October 10, 2012

Limitations of MapReduce – where not to use MapReduce


While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found for which MapReduce is not well suited, and some papers that address the limitations of MapReduce.

1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the sum of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, it is better to do it as a single reduce(map(data)) operation rather than going through the entire MapReduce process. [2]
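As a rough illustration, here is a minimal Python sketch (my own, not from the cited sources) of both points: the Fibonacci recurrence forces each step to wait for the previous one, and a data set that fits on one machine is handled perfectly well by a plain reduce(map(...)).

```python
from functools import reduce

# Each Fibonacci value depends on the two values before it, so the
# computation is inherently sequential: f(k+2) cannot be started until
# f(k+1) and f(k) are known, leaving nothing independent to hand out
# to parallel map tasks.
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# For a data set small enough for one machine, a plain reduce(map(...))
# does the job without any MapReduce job overhead.
data = [1, 2, 3, 4, 5]
sum_of_squares = reduce(lambda x, y: x + y, map(lambda x: x * x, data))

print(fib(10), sum_of_squares)
```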

2. Full-text indexing or ad hoc searching
The index generated in the map step is one-dimensional, and the reduce step must not generate a large amount of data or there will be serious performance degradation. For example, CouchDB’s MapReduce may not be a good fit for full-text indexing or ad hoc searching; that is a problem better suited to a tool such as Lucene. [1]
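To make the concern concrete, here is a toy inverted-index sketch in plain Python (an illustration of mine, not CouchDB's actual JavaScript view API): for common terms, the posting list that the reduce step must materialize per key grows with the corpus, which is exactly the kind of large reduce output the warning is about.

```python
from collections import defaultdict

# Toy inverted index written MapReduce-style. The map step emits
# (term, doc_id) pairs; the reduce step must materialize a full posting
# list per term, and for common terms that list grows with the corpus,
# producing the kind of large reduce output that degrades performance.
docs = {1: "hadoop map reduce", 2: "map reduce joins", 3: "lucene full text search"}

def map_step(doc_id, text):
    for term in text.split():
        yield term, doc_id

def reduce_step(pairs):
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)  # posting lists can become very large
    return index

pairs = [kv for doc_id, text in docs.items() for kv in map_step(doc_id, text)]
print(dict(reduce_step(pairs)))
```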

3. Algorithms that depend on shared global state
Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing). [3]

3.1. Online learning
One example is online learning. Chapter 6 of Data-Intensive Text Processing with MapReduce frames learning as the setting of parameters in a statistical model. Both EM and the gradient-based learning algorithms described there are instances of what are known as batch learning algorithms. This simply means that the full "batch" of training data is processed before any updates to the model parameters are made. On one hand, this is quite reasonable: updates are not made until the full evidence of the training data has been weighed against the model. An earlier update would seem, in some sense, to be hasty. However, it is generally the case that more frequent updates can lead to more rapid convergence of the model (in terms of the number of training instances processed), even if those updates are made by considering less data. Thinking in terms of gradient optimization (see Section 6.5 of the book), online learning algorithms can be understood as computing an approximation of the true gradient using only a few training instances. Although only an approximation, the gradient computed from a small subset of training instances is often quite reasonable, and the aggregate behavior of multiple updates tends to even out the errors that are made. In the limit, updates can be made after every training instance.

Unfortunately, implementing online learning algorithms in MapReduce is problematic. The model parameters in a learning algorithm can be viewed as shared global state, which must be updated as the model is evaluated against training data. All processes performing the evaluation (presumably the mappers) must have access to this state. In a batch learner, where updates occur in one or more reducers (or, alternatively, in the driver code), synchronization of this resource is enforced by the MapReduce framework. However, with online learning, these updates must occur after processing smaller numbers of instances. This means that the framework must be altered to support faster processing of smaller datasets, which goes against the design choices of most existing MapReduce implementations. Since MapReduce was specifically optimized for batch operations over large amounts of data, such a style of computation would likely result in inefficient use of resources. In Hadoop, for example, map and reduce tasks have considerable startup costs. This is acceptable because in most circumstances this cost is amortized over the processing of many key-value pairs. However, for small datasets, these high startup costs become intolerable.

An alternative is to abandon shared global state and run independent instances of the training algorithm in parallel (on different portions of the data); a final solution is then arrived at by merging the individual results. Experiments, however, show that the merged solution is inferior to the output of running the training algorithm on the entire dataset. [3]
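A minimal Python sketch (my own toy example, not from the book) of the distinction: a batch learner can sum per-instance gradients independently and apply one update at the end, which maps cleanly onto mappers plus a reducer, while an online learner mutates the parameters after every instance, which is exactly the shared global state MapReduce has no natural place for.

```python
import random

# Toy 1-D linear model y = w * x with squared loss. The data and learning
# rate are made up for this sketch.
data = [(x, 2.0 * x) for x in range(1, 6)]  # true w = 2.0
lr = 0.01

def batch_update(w, data):
    # Batch learning: accumulate the gradient over the whole data set
    # (mappers could each handle a slice, a reducer sums the pieces),
    # then apply a single update.
    grad = sum((w * x - y) * x for x, y in data)
    return w - lr * grad

def online_updates(w, data):
    # Online learning: update w after every instance. Each step must read
    # the latest w, i.e. shared global state that the single map/reduce
    # barrier gives no natural way to keep synchronized across workers.
    for x, y in data:
        grad = (w * x - y) * x
        w = w - lr * grad
    return w

w0 = random.random()
print(batch_update(w0, data), online_updates(w0, data))
```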

3.2. Monte Carlo simulations
A related difficulty occurs when running what are called Monte Carlo simulations, which are used to perform inference in probabilistic models where evaluating or representing the model exactly is impossible. The basic idea is quite simple: samples are drawn from the random variables in the model to simulate its behavior, and then simple frequency statistics are computed over the samples. This sort of inference is particularly useful when dealing with so-called nonparametric models, which are models whose structure is not specified in advance, but is rather inferred from training data. For an illustration, imagine learning a hidden Markov model, but inferring the number of states, rather than having them specified. Being able to parallelize Monte Carlo simulations would be tremendously valuable, particularly for unsupervised learning applications where they have been found to be far more effective than EM-based learning (which requires specifying the model). Although recent work [4] has shown that the delays in synchronizing sample statistics due to parallel implementations do not necessarily damage the inference, MapReduce offers no natural mechanism for managing the global shared state that would be required for such an implementation. [3]
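For a feel of the basic idea, here is a minimal Monte Carlo sketch in Python (a made-up two-coin model of mine, not the nonparametric HMM example): draw samples from the model and read off frequency statistics. In a parallel implementation, each sampler would need reasonably fresh statistics from the others, which is again the shared global state that MapReduce does not provide.

```python
import random

# Minimal Monte Carlo sketch: simulate the model by sampling and read off
# frequency statistics. The model here is a made-up mixture of two biased
# coins; a latent choice picks a coin, then the coin is flipped.
def sample():
    coin_bias = random.choice([0.3, 0.8])  # latent variable: which coin
    return 1 if random.random() < coin_bias else 0

samples = [sample() for _ in range(100000)]

# Frequency statistic: estimated P(heads). The exact value is
# 0.5 * 0.3 + 0.5 * 0.8 = 0.55, so the estimate should land close to it.
print(sum(samples) / len(samples))
```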

4. Join algorithms for log processing
The paper "A Comparison of Join Algorithms for Log Processing in MapReduce" by Spyros Blanas and Jignesh Patel reveals the limitations of the MapReduce programming model for implementing joins. The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. The paper describes crucial implementation details of a number of well-known join strategies in MapReduce and presents a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. The results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform. [7]
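For reference, the simple strategy programmers usually reach for is the repartition (reduce-side) join. The sketch below is my own minimal Python rendering of the idea (the record layouts are made up, and a real implementation would be a Hadoop job in Java): the map step tags each record with its source table, and the reducer buffers both sides of a key before emitting joined rows, which is where much of the inefficiency the paper studies comes from.

```python
from collections import defaultdict

# Repartition (reduce-side) join sketched in plain Python.
log = [("u1", "click"), ("u2", "view"), ("u1", "view")]  # event log
users = [("u1", "alice"), ("u2", "bob")]                 # reference data

def map_step(log, users):
    # Tag each record with its source so the reducer can tell the two
    # tables apart after they are shuffled together by user id.
    for user_id, event in log:
        yield user_id, ("LOG", event)
    for user_id, name in users:
        yield user_id, ("USER", name)

def reduce_step(pairs):
    grouped = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)
    # The reducer buffers both sides of a key before joining, which is
    # one source of the inefficiency the paper measures.
    for key, tagged in grouped.items():
        events = [v for tag, v in tagged if tag == "LOG"]
        names = [v for tag, v in tagged if tag == "USER"]
        for name in names:
            for event in events:
                yield key, name, event

print(list(reduce_step(map_step(log, users))))
```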


5. Paper - The Limitation of MapReduce: A Probing Case and a Lightweight Solution
Zhiqiang Ma, Lin Gu
This paper shows that the MapReduce design allows a program to scale up to handle extremely large data sets, but constrains a program’s ability to process smaller data items and to exploit variable degrees of parallelism, which are likely to be the common case in general applications. The authors analyze the limitations of MapReduce and present the design and implementation of a new lightweight parallelization framework, MRlite. [5]


6. Paper - Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, Foto N. Afrati, Anish Das Sarma
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total amount of data sent from mappers to reducers). Most past work provides custom solutions to specific problems, e.g., performing fuzzy joins in map-reduce, clustering, graph analyses, and so on. While some problems are amenable to very efficient map-reduce algorithms, others do not lend themselves to a natural distribution and have provable lower bounds. Clearly, the ease of "map-reducibility" is closely related to whether the problem can be partitioned into independent pieces, which are distributed across mappers/reducers. What makes a problem distributable? Can we characterize general properties of problems that determine how easy or hard it is to find efficient map-reduce algorithms? This is a vision paper that attempts to answer these questions. [6]

References:
1. Writing and Querying MapReduce Views in CouchDB, Bradley Holt, O’Reilly Media
3. Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer
4. Advances in Neural Information Processing Systems 21 (NIPS 2008)
5. The Limitation of MapReduce: A Probing Case and a Lightweight Solution, Zhiqiang Ma and Lin Gu
6. Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, Foto N. Afrati and Anish Das Sarma
7. A Comparison of Join Algorithms for Log Processing in MapReduce, Spyros Blanas and Jignesh Patel

2 comments:

  1. Hi Yanin,

    Nice compilation.

    Here are the generalizations from the examples:
    1) MapReduce can't be used unless the operation can be parallelized (Fibonacci example, gradient descent computation; shared global state problems may also come into this category)
    2) MapReduce can't perform well unless the data is partitioned in an optimized way (otherwise, a lot of communication is involved)

    I didn't get the 2nd example. Does the reduce task generate more data in text indexing?

    -Balu
