Thursday, November 15, 2012
Article on Facebook big data
http://techcrunch.com/2012/11/08/a-riddle-wrapped-in-a-mystery-inside-an-enigma/
A recent article on Facebook's big data: inside Facebook, Hive is used heavily and handles 60k+ queries daily.
Monday, November 12, 2012
Confusing terms
You are not confused if you can answer the following questions without any doubt:
Q) Can we build a system which is highly fault-tolerant but not highly available?
Q) Can we build a system with low reliability and high availability?
Q) Is reliability defined as MTBF or MTTR?
Scalability
A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.
Elasticity
Elasticity often refers to a system's ability to allocate additional resources in an autonomic manner.
In other words, a scalable system allows you to add resources in order to handle more load, while an elastic system will add resources itself when the load increases.
Fault-tolerance
Fault tolerance refers to a system's ability to continue operating, perhaps with gracefully degraded performance, when components of the system fail. There is no exact metric for the fault tolerance of a system.
Availability
Availability is the percentage of time that a system is actually operational and providing its intended service.
A = Uptime/(Uptime + Downtime)
A_i (inherent availability) = MTBF/(MTBF + MTTR)
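As a quick worked example (the numbers below are hypothetical, just to show how the formula is applied):

```python
# Hypothetical numbers: a component that fails on average once every
# 1000 hours (MTBF) and takes 2 hours to repair (MTTR).
mtbf = 1000.0   # mean time between failures, in hours
mttr = 2.0      # mean time to repair, in hours

availability = mtbf / (mtbf + mttr)
print(f"Availability: {availability:.4%}")   # ~99.80%
```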
A system with no single points of failure might be considered fault tolerant, but if application-level data migrations, software upgrades, or configuration changes take an hour or more of downtime to complete, then the system is not highly available.
Reliability
In simple words, how long can a system stay up continuously?
A more concrete definition is: "reliability is the ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances."
Reliability is often defined in terms of mean time between failures (MTBF). We can build a system with low-quality, not-so-reliable components and subsystems, and still achieve HA.
Durability
The durability of a system is its guarantee that stored data will not be lost once it has been written.
References:
1) http://www.quora.com/Distributed-Systems/What-is-the-difference-between-the-terms-scalable-and-elastic
2) http://www.ibm.com/developerworks/library/pa-bigiron2/
3) http://www.quora.com/Distributed-Systems/What-is-the-difference-between-a-highly-fault-tolerant-and-a-highly-available-system#
4) http://www.wikipedia.org/
Sunday, November 11, 2012
EC2 elastic ?
From the discussion in our last class, I thought it would be interesting to point out what elasticity means in regard to Amazon AWS.
"Elastic – Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously. Of course, because this is all controlled with web service APIs, your application can automatically scale itself up and down depending on its needs."
This suggests that EC2 as a service provides well-defined APIs that applications can use to scale themselves up and down, but it does not do this by itself.
FYI: That said, AWS does provide an Auto Scaling feature for web applications hosted on EC2, which increases the number of instances deployed when web traffic increases. I am not sure whether this works for other kinds of applications as well.
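A rough sketch of what configuring that Auto Scaling feature can look like, using the boto3 SDK (my own choice here, not something from the quoted AWS text) and entirely made-up group, launch-configuration, and policy names:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep between 1 and 10 instances of a (hypothetical) launch configuration.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="my-web-app-asg",            # placeholder name
    LaunchConfigurationName="my-web-app-launch-cfg",  # assumed to exist already
    MinSize=1,
    MaxSize=10,
    AvailabilityZones=["us-east-1a"],
)

# A simple scaling policy: add one instance whenever the policy is triggered
# (for example, by a CloudWatch alarm on request count or CPU utilization).
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-web-app-asg",
    PolicyName="scale-out-on-traffic",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)
```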
Tuesday, November 6, 2012
NoSQL databases
Dynamo is one example of a NoSQL database; Bigtable and HBase are others. There are currently about 150 NoSQL database systems. http://nosql-database.org/ has a list of all types of NoSQL systems.
Thursday, October 25, 2012
A good article describing the evolution of GFS
It has been almost a decade since the GFS paper was published, and many changes have been made since then. This interview summarizes a few of them and the circumstances in which they were made:
http://delivery.acm.org/10.1145/1600000/1594206/p10-case_study.pdf?ip=134.84.46.146&acc=OPEN&CFID=132372171&CFTOKEN=23979261&__acm__=1351194189_d2b32e9f2b0897f11d07caa688ed10c0
Tuesday, October 16, 2012
'The Future of Cloud Computing' event live stream
There is a Cloud Computing & Big Data event/conference going on in Amsterdam. They have invited many speakers from Amazon, Facebook, Salesforce, etc., mostly CTOs, VPs, CEOs, and scientists.
Looking at their schedule, there are several interesting panel talks / speaker sessions related to our class discussions such as: "Debate: Hadoop vs. The World", Facebook's "Data Center Strategies", "Infrastructure Challenges with Big Data", etc. Check out the schedule link for more detailed descriptions of these topics.
The event lasts only two days, Oct 16 and 17, and there is a free live stream on their website. Since the event is in Europe, the times can be quite inconvenient, but recorded videos are available immediately.
Wednesday, October 10, 2012
Limitations of MapReduce – where not to use MapReduce
While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found where MapReduce is not well suited, and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the sum of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire MapReduce process. [2]
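A minimal illustration of that sequential dependency (plain Python, just to make the point concrete):

```python
# Each step needs the two previously computed values, so the computation is
# inherently sequential; there is no way to hand independent chunks to mappers.
def fibonacci(n):
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr  # f(k+2) depends on f(k+1) and f(k)
    return prev

print([fibonacci(k) for k in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```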
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one-dimensional, and the Reduce step must not generate a large amount of data or there will be serious performance degradation. For example, CouchDB's MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better suited to a tool such as Lucene. [1]
3. Algorithms that depend on shared global state
Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing). [3]
3.1. Online learning
One example is online learning. Chapter 6 of Data-Intensive Text Processing with MapReduce frames learning as the setting of parameters in a statistical model. Both EM and the gradient-based learning algorithms described there are instances of what are known as batch learning algorithms: the full "batch" of training data is processed before any updates to the model parameters are made. On one hand, this is quite reasonable; updates are not made until the full evidence of the training data has been weighed against the model, and an earlier update would seem, in some sense, to be hasty. However, it is generally the case that more frequent updates can lead to more rapid convergence of the model (in terms of the number of training instances processed), even if those updates are made by considering less data. Thinking in terms of gradient optimization (see Section 6.5 of the book), online learning algorithms can be understood as computing an approximation of the true gradient using only a few training instances. Although only an approximation, the gradient computed from a small subset of training instances is often quite reasonable, and the aggregate behavior of multiple updates tends to even out the errors that are made. In the limit, updates can be made after every training instance.

Unfortunately, implementing online learning algorithms in MapReduce is problematic. The model parameters in a learning algorithm can be viewed as shared global state, which must be updated as the model is evaluated against training data. All processes performing the evaluation (presumably the mappers) must have access to this state. In a batch learner, where updates occur in one or more reducers (or, alternatively, in the driver code), synchronization of this resource is enforced by the MapReduce framework. With online learning, however, these updates must occur after processing smaller numbers of instances. This means that the framework must be altered to support faster processing of smaller datasets, which goes against the design choices of most existing MapReduce implementations. Since MapReduce was specifically optimized for batch operations over large amounts of data, such a style of computation would likely result in inefficient use of resources. In Hadoop, for example, map and reduce tasks have considerable startup costs. This is acceptable because in most circumstances this cost is amortized over the processing of many key-value pairs. However, for small datasets, these high startup costs become intolerable. An alternative is to abandon shared global state and run independent instances of the training algorithm in parallel (on different portions of the data), arriving at a final solution by merging the individual results. Experiments, however, show that the merged solution is inferior to the output of running the training algorithm on the entire dataset. [3]
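To make the notion of shared global state concrete, here is a minimal, illustrative sketch (my own example, not from the book) of an online learner: logistic regression trained with per-instance stochastic gradient updates. The weight vector is the shared state; every update reads and writes it, which is exactly what isolated map tasks cannot do.

```python
import math

def sgd_online(stream, num_features, lr=0.1):
    w = [0.0] * num_features          # shared global state (model parameters)
    for x, y in stream:               # one update per training instance
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
        for i in range(num_features):
            w[i] += lr * (y - p) * x[i]     # update depends on current w
    return w

# Toy, made-up data: label 1 when the first feature dominates.
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.1, 0.9], 0), ([0.0, 1.0], 0)] * 50
print(sgd_online(data, num_features=2))
```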
3.2. Monte Carlo simulations
A related difficulty occurs when running Monte Carlo simulations, which are used to perform inference in probabilistic models where evaluating or representing the model exactly is impossible. The basic idea is quite simple: samples are drawn from the random variables in the model to simulate its behavior, and then simple frequency statistics are computed over the samples. This sort of inference is particularly useful when dealing with so-called nonparametric models, whose structure is not specified in advance but is rather inferred from training data. For an illustration, imagine learning a hidden Markov model, but inferring the number of states rather than having them specified. Being able to parallelize Monte Carlo simulations would be tremendously valuable, particularly for unsupervised learning applications where they have been found to be far more effective than EM-based learning (which requires specifying the model). Although recent work [4] has shown that the delays in synchronizing sample statistics due to parallel implementations do not necessarily damage the inference, MapReduce offers no natural mechanism for managing the global shared state that would be required for such an implementation. [3]
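For intuition only, here is a toy Monte Carlo estimate in Python; it is not the nonparametric inference described above, just the "draw samples, then compute frequency statistics" idea in its simplest form:

```python
import random

random.seed(0)
# Draw samples from a standard normal random variable.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Estimate P(X > 1) by simple frequency counting over the samples.
estimate = sum(1 for x in samples if x > 1.0) / len(samples)
print(f"Monte Carlo estimate of P(X > 1): {estimate:.4f}  (exact is about 0.1587)")
```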
4. Join algorithms for log processing
The paper "A Comparison of Join Algorithms for Log Processing in MapReduce" by Spyros Blanas and Jignesh Patel reveals the limitations of the MapReduce programming model for implementing joins. The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins, and MapReduce programmers often use simple but inefficient algorithms to perform them. The paper describes crucial implementation details of a number of well-known join strategies in MapReduce and presents a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Its results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform. [7]
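As an illustration of the kind of "simple but inefficient" approach the paper refers to, here is a minimal sketch of a standard repartition (reduce-side) join, with the shuffle simulated in plain Python and all record/field names made up:

```python
from collections import defaultdict

# Made-up inputs: an event log keyed by user id, and user reference data.
log_records = [("u1", "click:/home"), ("u2", "click:/buy"), ("u1", "click:/buy")]
user_info   = [("u1", "alice"), ("u2", "bob")]

def map_phase():
    for key, event in log_records:
        yield key, ("L", event)        # tag log records with their source
    for key, name in user_info:
        yield key, ("R", name)         # tag reference data

def reduce_phase(shuffled):
    for key, values in shuffled.items():
        logs = [v for tag, v in values if tag == "L"]
        refs = [v for tag, v in values if tag == "R"]
        for event in logs:             # cross product per join key
            for name in refs:
                yield key, name, event

shuffled = defaultdict(list)
for key, value in map_phase():         # the framework's shuffle, simulated here
    shuffled[key].append(value)

print(list(reduce_phase(shuffled)))
```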
5. Paper - The Limitation of MapReduce: A Probing Case and a Lightweight Solution, Zhiqiang Ma and Lin Gu
This paper describes how the MapReduce design allows a program to scale up to handle extremely large data sets, but constrains a program's ability to process smaller data items and to exploit variable degrees of parallelization, which are likely to be the common case in general applications. The authors analyze the limitations of MapReduce and present the design and implementation of a new lightweight parallelization framework, MRlite. [5]
6. Paper - Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, Foto N. Afrati and Anish Das Sarma
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (the total amount of data sent from mappers to reducers). Most past work provides custom solutions to specific problems, e.g., performing fuzzy joins in map-reduce, clustering, graph analyses, and so on. While some problems are amenable to very efficient map-reduce algorithms, others do not lend themselves to a natural distribution and have provable lower bounds. Clearly, the ease of "map-reducibility" is closely related to whether the problem can be partitioned into independent pieces that are distributed across mappers/reducers. What makes a problem distributable? Can we characterize general properties of problems that determine how easy or hard it is to find efficient map-reduce algorithms? This vision paper attempts to answer these questions. [6]
References:
1. Writing and Querying MapReduce Views in CouchDB, O'Reilly Media
3. Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer
4. Advances in Neural Information Processing Systems 21 (NIPS 2008)
5. The Limitation of MapReduce: A Probing Case and a Lightweight Solution, Zhiqiang Ma and Lin Gu
6. Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, Foto N. Afrati and Anish Das Sarma
7. A Comparison of Join Algorithms for Log Processing in MapReduce, Spyros Blanas and Jignesh Patel
Monday, October 8, 2012
Hi all,
I will be covering Chapter 4 of the NIST publication on security and privacy issues in public cloud computing. Please read it.
Friday, October 5, 2012
MapReduce Implementations
Hi everyone, here are all the MapReduce implementations I found online. Each of them is either an independent MapReduce framework, or is coupled to a database or file system, a single programming language, or a single domain, or is built as a library, or implements only a subset of mature MapReduce features. Hope this list is helpful.
Hadoop
- Hadoop MapReduce: the most commonly used open source MapReduce implementation, from Apache

Other MapReduce implementations
- Disco: MapReduce implementation in Erlang, developed by Nokia
- Cloud MapReduce
- Mincemeat
- Mapredus: simple MapReduce framework using Redis and Resque
- Qizmt
- Peregrine: fast MapReduce framework for running iterative jobs across partitions of data
- HTTPMR
- R3: MapReduce engine written in Python using a Redis backend
- Octopy
- An implementation of MapReduce in the bash shell
- Starfish: open source Ruby implementation of MapReduce
- Skynet: open source Ruby implementation of MapReduce created at Geni
- FileMap: system for applying Unix-style file processing tools; provides full map-reduce functionality
- Meguro: simple JavaScript Map/Reduce implementation
- Mapreduce-lite: lightweight C++ implementation of MapReduce
- Misco: mobile MapReduce framework
- Galago's TupleFlow: search engine toolkit that distributes execution across many processors
- GPMR: multi-core GPU MapReduce for GPU clusters
- Hadoop Pipes: Hadoop C++ API for MapReduce
- HaLoop

MapReduce for storage and file systems
- Sphere: distributed data storage that uses MapReduce
- MongoDB: map/reduce database for batch processing of data and aggregation operations
- Greenplum: next-gen data warehousing and large-scale analytics database supporting SQL and MapReduce parallel processing
- Plasma MapReduce: distributed filesystem, key/value DB, and map/reduce system
- Riak: open-source, distributed database using MapReduce for searching and aggregating data
- GridGain: open source map/reduce data grid
- InfiniDB: open-source, free analytical (columnar) DB that implements MPP in a MapReduce-like fashion
- CouchDB: database that uses JSON for documents and JavaScript for MapReduce queries

MapReduce libraries
- Boost: Boost C++ library for MapReduce
- MapReduce library for Haskell
- MPI-MR: MapReduce-MPI library