# Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

## Introduction

### Abstract
RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.
To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations.
Coarse-grained materials or systems have fewer, larger discrete components than fine-grained materials or systems.
- A coarse-grained description of a system regards large subcomponents.
- A fine-grained description regards smaller components of which the larger ones are composed.
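To see why coarse-grained transformations make fault tolerance cheap, here is a toy sketch (plain Python, not Spark code): a coarse-grained transformation can be recorded as a single lineage entry for the whole dataset, while fine-grained updates would have to log every individual write. All names below are illustrative.

```python
# Coarse-grained: one operation applied to all records -> ONE log entry.
def coarse_grained_map(dataset, fn):
    log = [("map", fn.__name__)]          # a single lineage record
    return [fn(x) for x in dataset], log

# Fine-grained: per-cell writes -> the log grows with every update.
def fine_grained_updates(dataset, updates):
    data, log = list(dataset), []
    for index, value in updates:
        data[index] = value
        log.append(("write", index, value))  # one record per write
    return data, log

def double(x):
    return x * 2

_, coarse_log = coarse_grained_map([1, 2, 3, 4], double)
_, fine_log = fine_grained_updates([1, 2, 3, 4], [(0, 9), (2, 7), (3, 1)])
# coarse_log stays at one entry; fine_log has one entry per write
```

This cost asymmetry is what lets RDDs recover from failures by replaying a short lineage log instead of replicating data.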
### Motivation
Cluster computing frameworks, like MapReduce, lack abstractions for leveraging distributed memory.
The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently.
## Resilient Distributed Datasets (RDDs)

### RDD Abstraction

- An RDD is a read-only, partitioned collection of records.
- RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.
- RDDs do not need to be materialized at all times.
- An RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage.
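The lineage idea above can be sketched in a few lines of plain Python (a toy model, not Spark's implementation): each RDD remembers its parent and the function that derived it, so a lost partition can be recomputed from stable storage rather than restored from a replica.

```python
# Fabricated "stable storage": partition id -> records on disk.
stable_storage = {0: [1, 2], 1: [3, 4]}

class ToyRDD:
    def __init__(self, parent=None, fn=None):
        self.parent, self.fn = parent, fn   # lineage: parent + derivation

    def compute(self, pid):
        """Recompute one partition by walking the lineage chain."""
        if self.parent is None:             # base RDD: read stable storage
            return list(stable_storage[pid])
        return [self.fn(x) for x in self.parent.compute(pid)]

base = ToyRDD()
mapped = ToyRDD(parent=base, fn=lambda x: x * 10)
# Suppose partition 1 of `mapped` is lost; lineage rebuilds it from disk:
recovered = mapped.compute(1)   # -> [30, 40]
```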
### Spark Programming Interface
Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations.
Programmers can call a persist method to indicate which RDDs they want to reuse in future operations.
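The interplay of lazy transformations, actions, and `persist` can be sketched as follows (a minimal plain-Python model; `LazyRDD` and its methods are illustrative, not Spark's API):

```python
class LazyRDD:
    def __init__(self, source, pending=()):
        self.source = source
        self.pending = list(pending)   # recorded transformations, not yet run
        self.cached = None

    def map(self, fn):                 # transformation: only record it
        return LazyRDD(self.source, self.pending + [fn])

    def persist(self):                 # ask for caching on first computation
        self.cached = "requested"
        return self

    def collect(self):                 # action: the pipeline runs now
        if isinstance(self.cached, list):
            return self.cached         # reuse the cached result
        out = list(self.source)
        for fn in self.pending:        # apply the recorded transformations
            out = [fn(x) for x in out]
        if self.cached == "requested":
            self.cached = out
        return out

rdd = LazyRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2).persist()
result = rdd.collect()   # -> [4, 6, 8]; subsequent collects hit the cache
```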
### Advantages of the RDD Model

- The immutable nature of RDDs lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks, as in MapReduce [10].
### Applications Not Suitable for RDDs
## Representing RDDs

We propose representing each RDD through a common interface that exposes five pieces of information:
- a set of partitions, which are atomic pieces of the dataset;
- a set of dependencies on parent RDDs;
- a function for computing the dataset based on its parents;
- metadata about its partitioning scheme;
- data placement hints (preferred locations). For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on.
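The five pieces of information can be written down as an abstract interface. A sketch in plain Python (method names loosely mirror the paper's interface table; this is not Spark source code):

```python
class AbstractRDD:
    """Toy version of the common RDD interface from the paper."""

    def partitions(self):
        """Atomic pieces of the dataset."""
        raise NotImplementedError

    def dependencies(self):
        """Parent RDDs this RDD was derived from."""
        raise NotImplementedError

    def iterator(self, partition, parent_iters):
        """Compute this partition's records from the parents."""
        raise NotImplementedError

    def partitioner(self):
        """Partitioning-scheme metadata; None if not hash/range partitioned."""
        return None

    def preferred_locations(self, partition):
        """Data-placement hints: nodes where this partition is local."""
        return []
```

Concrete RDDs (an HDFS file, a mapped dataset, a join result) only need to fill in these five methods, which is what the implementation examples below illustrate.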
### Narrow and Wide dependencies

The most interesting question in designing this interface is how to represent dependencies between RDDs. We found it both sufficient and useful to classify dependencies into two types: narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD, and wide dependencies, where multiple child partitions may depend on one parent partition.
This distinction is useful for two reasons. First, narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis. In contrast, wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce- like operation. Second, recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.
This distinction is useful for two reasons:

- Pipelining: narrow dependencies allow pipelined execution on a single node.
- Recovery: with narrow dependencies, only the lost partitions need to be recomputed after a failure.
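The definition above reduces to a one-line check. A toy classifier (illustrative, not Spark internals), where a dependency is modeled as a map from each parent partition to the set of child partitions that read it:

```python
def classify(dep):
    """Narrow iff every parent partition feeds at most one child partition."""
    if all(len(children) <= 1 for children in dep.values()):
        return "narrow"
    return "wide"

# map/filter: one-to-one parent -> child, so narrow (pipelinable).
map_dep = {0: {0}, 1: {1}, 2: {2}}

# groupByKey-style shuffle: every parent feeds every child, so wide.
shuffle_dep = {0: {0, 1}, 1: {0, 1}}
```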
### RDD implementation examples
We sketch some RDD implementations below.
#### HDFS files
The input RDDs in our examples have been files in HDFS. For these RDDs, partitions returns one partition for each block of the file (with the block's offset stored in each Partition object), preferredLocations gives the nodes the block is on, and iterator reads the block.
FEDB files could implement an FEDBRDD in the same way.
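A block-per-partition file RDD can be sketched like this (plain Python; the block layout and host names are fabricated for illustration):

```python
# Fabricated file layout: (byte offset, records, host) per block.
fake_blocks = [
    (0,  ["a", "b"], "node1"),
    (64, ["c", "d"], "node2"),
]

class FileRDD:
    """Toy file-backed RDD: one partition per block of the file."""

    def __init__(self, blocks):
        self.blocks = blocks

    def partitions(self):
        return list(range(len(self.blocks)))       # one id per block

    def preferred_locations(self, pid):
        return [self.blocks[pid][2]]               # host holding the block

    def iterator(self, pid):
        return iter(self.blocks[pid][1])           # "read" the block

rdd = FileRDD(fake_blocks)
```

Scheduling tasks at `preferred_locations` is what keeps block reads local instead of going over the network.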
#### map
- Calling *map* on any RDD returns a MappedRDD object.
- It has the same partitions and preferred locations as its parent.
- It applies the function to the parent's records in its *iterator* method.
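In the same spirit, a toy MappedRDD reuses the parent's partitions and placement and only overrides the iterator (plain Python; `ListRDD` is a hypothetical in-memory parent added so the sketch is self-contained):

```python
class ListRDD:
    """Hypothetical base RDD over in-memory partitions."""
    def __init__(self, parts):
        self.parts = parts
    def partitions(self):
        return list(range(len(self.parts)))
    def preferred_locations(self, pid):
        return []
    def iterator(self, pid):
        return iter(self.parts[pid])

class MappedRDD:
    """Result of map: same partitions/placement, transformed records."""
    def __init__(self, parent, fn):
        self.parent, self.fn = parent, fn
    def partitions(self):
        return self.parent.partitions()              # inherited unchanged
    def preferred_locations(self, pid):
        return self.parent.preferred_locations(pid)  # inherited unchanged
    def iterator(self, pid):
        # Apply fn lazily to the parent's records for this partition.
        return (self.fn(x) for x in self.parent.iterator(pid))

base = ListRDD([[1, 2], [3, 4]])
doubled = MappedRDD(base, lambda x: x * 2)
```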
#### join
Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In each case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).
A join may lead to:

- narrow dependencies if both parents have the same partitioner
- wide dependencies if the parents have different partitioners
- a mix if one parent has a partitioner and one does not

Either way, the output RDD has a partitioner.
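This rule can be sketched as a small decision function (plain Python; partitioners are modeled as strings and `"default-hash"` stands in for the default hash partitioner, both illustrative):

```python
def join_dependencies(left_p, right_p):
    """Return (left dependency, right dependency, output partitioner).

    A side is narrow when its data is already partitioned the way the
    join output will be, so no shuffle is needed for that side.
    """
    if left_p is not None and left_p == right_p:
        return ("narrow", "narrow", left_p)        # co-partitioned parents
    if (left_p is None) != (right_p is None):      # exactly one partitioned
        out = left_p or right_p                    # inherit that partitioner
        return (("narrow" if left_p else "wide"),
                ("narrow" if right_p else "wide"), out)
    # Neither parent usable: shuffle both sides with a default partitioner.
    return ("wide", "wide", "default-hash")
```

For example, joining two RDDs hash-partitioned the same way needs no shuffle at all, while joining a partitioned RDD with an unpartitioned one shuffles only the unpartitioned side.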