Thursday, November 15, 2012

Article on Facebook big data

Recent article on Facebook's big data. Inside Facebook, Hive is used heavily. Hive handles 60k+ queries daily.

Monday, November 12, 2012

Confusing terms

You are not confused if you are able to answer following questions without any doubts

Q) Can we build a system which is highly fault-tolerant but not highly available?
Q) Can we build a system with low reliability and High availability?
Q) is reliability
defined as MTBF or MTTR?

A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.

Elasticity often refers to a system's ability to allocate additional resources in an autonomic manner

In other words, a scalable system allows you to add resources in order to handle more load, while an elastic system will add resources itself when the load increases.


Fault tolerance refers to a system's ability to continue operating, perhaps gracefully degrading in performance when components of the system fail. There is no exact measure to measure fault-tolerance of a system


Availability is a percentage of time that a system is actually operational and providing its intended service.

A = Uptime/(Uptime + Downtime)

Where there are no single points of failure might be considered system as fault tolerant, but if application-level data migrations, software upgrades, or configuration changes take an hour or more of downtime to complete, then the system is not highly available.

In simple words, how long can a system stay up continuously?

More concrete definition is “reliability is the ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances”

Reliability is often defined in terms of mean time between failures (MTBF). We can build a system with low-quality, not-so-reliable components and subsystems, and still achieve HA.

Durability of a system guarantees that stored data
can't be lost.





Sunday, November 11, 2012

EC2 elastic ?

From the discussion in our last class, I thought it would be interesting to point out what elasticity means in regard to Amazon AWS.

"Elastic – Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously. Of course, because this is all controlled with web service APIs, your application can automatically scale itself up and down depending on its needs."

This clearly means that AWS as a service provides well defined API for the applications to scale up and down, but does not do this itself. 

FYI: Nevertheless, there is an AutoScaleUp functionality provided by AWS for Web Applications hosted on EC2 which increase the number of resources deployed when there is increase in web traffic. But I doubt if this is possible for other applications too. 

Tuesday, November 6, 2012

NoSQL databases

Dynamo is one of NoSQL database, Bigtable and HBase are other examples, crurrently there are 150 NoSQL database systems.  has a list of all types of nosql systems.

Thursday, October 25, 2012

A good article describing the evolution of GFS

It has been almost a decade since the GFS paper has been published. Since then, there are many changes that have been made. This interview here summarizes a few of them and the circumstances in which they are made

Tuesday, October 16, 2012

'The Future of Cloud Computing' event live stream

There is a Cloud Computing & Big Data event/conference going on in Amsterdam. They have invited many speakers from Amazon, Facebook, Salesforce, etc. including mostly CTO, VP, CEO, Scientists.

Looking at their schedule, there are several interesting panel talks / speaker sessions related to our class discussions such as: "Debate: Hadoop vs. The World", Facebook's "Data Center Strategies", "Infrastructure Challenges with Big Data", etc. Check out the schedule link for more detailed descriptions of these topics.

The event lasts only for two days, Oct 16 and 17 and there is free live stream on their website. Being in Europe, the time can be quite inconvenient, but recorded videos are available immediately.

Wednesday, October 10, 2012

Limitations of Mapreduce – where not to use Mapreduce

While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found where MapReudce is not suited and some papers that address the limitations of MapReuce.

1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series where each value is summation of the previous two values. i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire map reduce process. [2]

2. Full-text indexing or ad hoc searching
The index generated in the Map step is one dimensional, and the Reduce step must not generate a large amount of data or there will be a serious performance degradation. For example, CouchDB’s MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better suited for a tool such as Lucene.[1]

3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing). [3]

3.1. Online learning
One example is online learning. In data intensive text processing with mapreduce Chapter 6 the concept of learning as the setting of parameters in a statistical model. Both EM and the gradient-based learning algorithms we described are instances of what are known as batch learning algorithms. This simply means that the full \batch" of training data is processed before any updates to the model parameters are made. On one hand, this is quite reasonable updates are not made until the full evidence of the training data has been weighed against the model. An earlier update would seem, in some sense, to be hasty. However, it is generally the case that more frequent updates can lead to more rapid convergence of the model (in terms of number of training instances processed), even if those updates are made by considering less data. Thinking in terms of gradient optimization (see Section 6.5), online learning algorithms can be understood as computing an approximation of the true gradient, using only a few training instances. Although only an approximation, the gradient computed from a small subset of training instances is often quite reasonable, and the aggregate behavior of multiple updates tends to even out errors that are made. In the limit, updates can be made after every training instance. Unfortunately, implementing online learning algorithms in MapReduce is problematic. The model parameters in a learning algorithm can be viewed as shared global state, which must be updated as the model is evaluated against training data. All processes performing the evaluation (presumably the mappers) must have access to this state. In a batch learner, where updates occur in one or more reducers (or, alternatively, in the driver code), synchronization of this resource is enforced by the MapReduce framework. However, with online learning, these updates must occur after processing smaller numbers of instances. This means that the framework must be altered to support fasterprocessing of smaller datasets, which goes against the design choices of most existingMapReduce implementations. Since MapReduce was specially optimized for batch operations over large amounts of data, such a style of computation would likely result in inefficient use of resources. In Hadoop, for example, map and reduce tasks have considerable startup costs. This is acceptable because in most circumstances, this cost is amortized over the processing of many key-value pairs. However, for small datasets, these high startup costs become intolerable. An alternative is to abandon shared global state and run independent instances of the training algorithm in parallel (on different portions of the data). A final solution is then arrived at by merging individual results. Experiments, however, show that the merged solution is inferior to the output of running the training algorithm on the entire dataset. [3]

3.2. Monte Carlo simulations
A related difficulty occurs when running what are called Monte Carlo simulations, which are used to perform inference in probabilistic models where evaluating or representing the model exactly is impossible. The basic idea is quite simple: samples are drawn from the random variables in the model to simulate its behavior, and then simple frequency statistics are computed over the samples. This sort of inference is particularly useful when dealing with so-called nonparametric models, which are models whose structure is not specified in advance, but is rather inferred from training data. For an illustration, imagine learning a hidden Markov model, but inferring the number of states, rather than having them specified. Being able to parallelize Monte Carlo simulations would be tremendously valuable, particularly for unsupervised learning applications where they have been found to be far more effective than EM-based learning (which requires specifying the model). Although recent work [4] has shown that the delays in synchronizing sample statistics due to parallel implementations do not necessarily damage the inference, MapReduce offers no natural mechanism for managing the global shared state that would be required for such an implementation. [3]

4. Join algorithm for log processing
Paper- A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh Patel reveals the limitations of the MapReduce programming model for implementing joins.The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an evnt log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform. [7]

5. Paper - The Limitation of MapReduce: A Probing Case and a Lightweight Solution
Zhiqiang Ma, Lin Gu
This paper describes that the MapReduce design allows a program to scale up to handle extremely large data sets, but constrains a program’s ability to process smaller data items and exploit variable-degrees of parallelization opportunities which are likely to be the common case in general application. In this paper, the author analyze the limitations of MapReduce and present the design and implementation of a new lightweight parallelization framework, MRlite. [5]

6. Paper - Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, Foto N. Afrati, Anish Das Sarma
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total amount of data sent from mappers to reducers). Most past work provides custom solutions to specific problems, e.g., performing fuzzy joins in map-reduce, clustering, graph analyses, and so on. While some problems are amenable to very efficient map-reduce algorithms, some other problems do not lend themselves to a natural distribution, and have provable lower bounds. Clearly, the ease of "map-reducability" is closely related to whether the problem can be partitioned into independent pieces, which are distributed across mappers/reducers. What makes a problem distributable? Can we characterize general properties of problems that determine how easy or hard it is to find efficient map-reduce algorithms? This is a vision paper that attempts to answer the questions described above. [6]

1. Writing and querying mapreduce views in couchdb, O’Reilly Media
3. Data intensive text processing with mapreduce, Jimmy Lin and Chris Dyer
4. In Advances in Neural Information Processing Systems21 (NIPS 2008)

Monday, October 8, 2012

Hi all,

I will be covering chapter 4 of NIST publication: security and privacy issues in public cloud computing.Please read it.

Friday, October 5, 2012

MapReduce Implementations

Hi everyone, here are all MapReduce implementations I found online, all of which are either an independent Mapreduce Framework or coupled to a database or file system, a single programming language or a single domain, or made as a library or implement only a subset of mature MapReduce features. Hope this list might be helpful. 


Hadoop mapreduce
The most commonly used open source Mapreduce implementation by Apache
Other Mapreduce Implementations

Mapreduce implementation in Erlang developed by Nokia

Cloud MapReduce
Python implementation of the MapReduce distributed computing framework
simple mapreduce framework using redis and resque
Fast Mapreduce framework for running iterative jobs across partitions of data. 
Mapreduce engine written in python using a redis backend.
Fast-n-easy MapReduce implementation for Python
Implementation of MapReduce in the bash shell
Open source ruby implimentatoin of mapreduce
open source Ruby implementation mapreduce created at Geni

system for applying Unix-style file processing tool and provides full map-reduce functionality

simple javascript Map/Reduce implemention
lightweight C++ implementation of the MapReduce
Mobile Mapreduce framework

Galago's TupleFlow
search engine toolkit which distribute execution across many processors 
Multi-core GPU mapreduce on GPU clusters
Hadoop Pipes
Hadoop C++ API for mapreduce
Mapreduce for storage and file system

distributed data storage use Mapreduce

Map/reduce database for batch processing of data and aggregation operations
Next-gen data warehousing and large-scale analytics processing database Supporting SQL and MapReduce parallel processing        

Plasma MapReduce
Distributed filesystem, key/value db, and map/reduce system
open-source, distributed database using mapreduce for Searching and Aggregating Data
Grid Grain
map reduce open source data grid

open-source free analytical (columnar) DB  implements MPP in a MapReduce-like fashion
Database use Json for ducuments and javascirpt for mapreduce queries
Mapreduce librarires

Boost C++ library for mapreduce
MapReduce library for  Haskell

Mapreduce-MPI library