William's blog: MapReduce

MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). The concepts of "Map" and "Reduce", and the main ideas behind them, are borrowed from functional programming languages, along with some features borrowed from vector programming languages. The model makes it much easier for programmers with no experience in distributed parallel programming to run their own programs on a distributed system. In current implementations, the programmer specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a Reduce function, which is applied to all of the mapped key-value pairs that share the same key.
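
To make the two functions concrete, here is a minimal single-process sketch of the model, using word counting as the job. The names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for this post and are not part of any real MapReduce API; a real system would run the map and reduce calls on many machines.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: turn one (document name, text) pair into (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce: combine every value that shares the same key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each input pair is processed independently of the others.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Grouping phase: collect the intermediate values under their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())

documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(documents, map_fn, reduce_fn)
print(counts)
```

The grouping step in the middle is what guarantees that each Reduce call sees all of the values for one key and nothing else.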

Map and Reduce

In simple terms, a map function applies a specified operation to every element of a conceptual list of independent elements (for example, a list of test scores). Suppose it turns out that every student's score was overstated by one point: one can define a "minus one" map function to correct the error. Each element is operated on independently, and the original list is not modified, because a new list is created to hold the new answers. This means the Map operation can be highly parallelized, which is very useful for applications with high performance requirements and for parallel computing in general. A reduce operation, in turn, combines the elements of a list in an appropriate way. Continuing the previous example, what if someone wants to know the class average? One can define a reduce function that halves the list by adding each element to its neighbor, repeats this recursively until only one element remains, and then divides that sum by the number of elements to get the average. Although a reduce function does not parallelize as well as a map function, a reduce always produces a simple answer and large-scale operations are relatively independent, so reduce functions are also very useful in highly parallel environments.
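
The test-score example can be sketched in a few lines of Python. The scores are invented, and the thread pool is only there to show that the map step has no dependencies between elements; it is an illustration, not a claim about how a real cluster runs the work.

```python
from concurrent.futures import ThreadPoolExecutor

def minus_one(score):
    """Map function: correct a score that was overstated by one point."""
    return score - 1

def pairwise_sum(values):
    """Reduce: add each element to its neighbor, roughly halving the list
    on every pass, until a single element (the total) remains."""
    values = list(values)
    if not values:
        raise ValueError("cannot reduce an empty list")
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            paired.append(values[-1])  # odd element out is carried forward
        values = paired
    return values[0]

scores = [91, 76, 88, 61, 100]  # made-up data

# Each element is mapped independently, and the results go into a new list.
with ThreadPoolExecutor() as pool:
    corrected = list(pool.map(minus_one, scores))

average = pairwise_sum(corrected) / len(corrected)
print(corrected, average)
```

Within one pass of `pairwise_sum` the additions are independent of each other, which is why a reduce can still benefit from parallel hardware even though the passes themselves must run in order.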

Distribution and reliability

MapReduce achieves reliability by distributing the large-scale operations on a data set to the nodes of the network; each node periodically reports back its completed work and status updates. If a node stays silent for longer than a preset time interval, the master node (similar to the master server in the Google File System) records that node as dead and sends the data that was assigned to it to other nodes. Each operation uses atomic operations on named files to ensure that parallel threads do not conflict; when files are renamed, the system may also copy them to a name other than the task name (to avoid side effects).
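
The failure-detection idea can be sketched as a tiny in-memory master. Everything here is hypothetical: the `Master` class, its method names, the `timeout` value and the policy of handing orphaned tasks to the least-loaded live worker are assumptions made for illustration, and time is passed in explicitly so the example stays deterministic.

```python
class Master:
    """Toy master node: tracks status updates and reassigns work."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # worker -> time of its last status update
        self.assigned = {}   # worker -> list of task ids
        self.dead = set()
        self.pending = []    # tasks waiting for a live worker

    def assign(self, worker, task, now):
        self.last_seen.setdefault(worker, now)
        self.assigned.setdefault(worker, []).append(task)

    def heartbeat(self, worker, now):
        # Status updates from a worker already declared dead are ignored.
        if worker not in self.dead:
            self.last_seen[worker] = now

    def check(self, now):
        """Declare silent workers dead and hand their tasks to live ones."""
        for worker, seen in self.last_seen.items():
            if worker not in self.dead and now - seen > self.timeout:
                self.dead.add(worker)
                self.pending.extend(self.assigned.pop(worker, []))

        live = sorted(w for w in self.last_seen if w not in self.dead)
        while self.pending and live:
            # Assumed policy: pick the live worker with the fewest tasks.
            target = min(live, key=lambda w: len(self.assigned.get(w, [])))
            self.assigned.setdefault(target, []).append(self.pending.pop(0))

master = Master(timeout=10)
master.assign("a", "split-0", now=0)
master.assign("b", "split-1", now=0)
master.assign("c", "split-2", now=0)
master.heartbeat("a", now=8)
master.heartbeat("c", now=8)
master.check(now=12)  # "b" has been silent for 12 > 10 time units
print(master.dead, master.assigned)
```

If no live worker is left, the orphaned tasks simply stay in `pending` until one appears.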

Reduce operations work in much the same way, but because reduce operations parallelize less well, the master node tries to schedule each reduce operation on the same node as the data it operates on, or on a node as close to that data as possible. This property suits Google's needs, because they have enough bandwidth and their internal network does not have that many machines.
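
The scheduling preference described above can be written as a small selection function. This is only a sketch: the function name `pick_reduce_node`, the `rack_of` table and the use of "same rack" as the measure of closeness are assumptions of this example, since the text only says "as close as possible".

```python
def pick_reduce_node(data_node, free_nodes, rack_of):
    """Choose where to run a reduce task.

    Preference order: the node that holds the data, then a free node in
    the same rack (assumed notion of "close"), then any free node.
    """
    if data_node in free_nodes:
        return data_node

    same_rack = [n for n in free_nodes if rack_of[n] == rack_of[data_node]]
    if same_rack:
        return sorted(same_rack)[0]

    # Fall back to any free node, or None if the cluster is fully busy.
    return sorted(free_nodes)[0] if free_nodes else None

# Hypothetical three-node cluster spread over two racks.
rack_of = {"n1": "rack-1", "n2": "rack-1", "n3": "rack-2"}

print(pick_reduce_node("n1", {"n1", "n3"}, rack_of))  # data node itself is free
print(pick_reduce_node("n1", {"n2", "n3"}, rack_of))  # falls back to same rack
print(pick_reduce_node("n1", {"n3"}, rack_of))        # falls back to any node
```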

Use

At Google, MapReduce is used in a very wide range of applications, including "distributed grep, distributed sort, web link-graph reversal, term vector per host, web access log analysis, inverted index construction, document clustering, machine learning, statistical machine translation ..." Notably, MapReduce has been used to regenerate Google's index, replacing the old ad hoc programs that updated the index. MapReduce generates a large number of temporary files; to improve efficiency, it uses the Google File System to manage and access these files.
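
As one example from that list, inverted index construction fits the model naturally: the map step emits a (word, document id) pair for each word, and the reduce step collects the document ids for each word. The sketch below runs in a single process with invented documents and function names; it says nothing about how Google's own indexing jobs are written.

```python
from collections import defaultdict

def map_postings(doc_id, text):
    """Map: emit (word, doc_id) once for each distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_postings(word, doc_ids):
    """Reduce: turn all document ids for one word into a sorted posting list."""
    return word, sorted(doc_ids)

def build_inverted_index(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, posting in map_postings(doc_id, text):
            groups[word].append(posting)
    return dict(reduce_postings(word, ids) for word, ids in groups.items())

index = build_inverted_index({
    "d1": "map and reduce",
    "d2": "reduce the list",
})
print(index["reduce"])
```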

Friday, December 23, 2011
