MapReduce

A cluster in this chapter is a group of tightly coupled computers that work together closely. This sense of the word is different from the use of cluster as a group of documents that are semantically similar in Chapters 16-18.

P1: KRU/IRP ir book CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8.

4.4 Distributed indexing

Figure 4.5 An example of distributed indexing with MapReduce: the master assigns splits to parsers (map phase), which write segment files partitioned by key range (a-f, g-p, q-z); inverters (reduce phase) produce the postings. Adapted from Dean and Ghemawat (2004).

. . . complex than in single-machine indexing. A simple solution is to maintain a (perhaps precomputed) mapping for frequent terms that is copied to all nodes and to use terms directly (instead of termIDs) for infrequent terms. We do not address this problem here and assume that all nodes share a consistent term-termID mapping.
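The frequent/infrequent split just described can be sketched in a few lines. This is a minimal illustration, not the book's code: FREQUENT, term_key, and the sample mapping are hypothetical names and values.

```python
# Hypothetical sketch: frequent terms use a small precomputed mapping
# that is copied to every node; infrequent terms are keyed by the term
# string itself, so no shared mapping is needed for them.
FREQUENT = {"the": 1, "of": 2, "and": 3}  # invented sample mapping

def term_key(term):
    """Return a stable key: a termID for frequent terms,
    the term string itself otherwise."""
    return FREQUENT.get(term, term)
```

The trade-off is that keys for infrequent terms are longer (strings rather than integers), but because those terms are rare, the extra cost is small.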

The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. This is the same parsing task we also encountered in BSBI and SPIMI, and we therefore call the machines that execute the map phase parsers. Each parser writes its output to local intermediate files, the segment files (shown as a-f g-p q-z in Figure 4.5).

For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by partitioning the keys into j term partitions and having the parsers write key-value pairs for each term partition into a separate segment file. In Figure 4.5, the term partitions are according to first letter: a-f, g-p, q-z, and j = 3. (We chose these key ranges for ease of exposition. In general, key ranges need not correspond to contiguous terms or termIDs.) The term partitions are defined by the person who operates the indexing system (Exercise 4.10). The parsers then write corresponding segment files, one for each term partition. Each term partition thus corresponds to r segment files, where r is the number of parsers. For instance, Figure 4.5 shows three a-f segment files of the a-f partition, corresponding to the three parsers shown in the figure.

Collecting all values (here: docIDs) for a given key (here: termID) into one list is the task of the inverters in the reduce phase. The master assigns each term partition to a different inverter and, as in the case of parsers, reassigns term partitions in case of failing or slow inverters. Each term partition
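The parser's job of emitting key-value pairs into per-partition segment lists can be sketched as follows. This is an illustrative single-process sketch, not the book's code; term_partition and parse_split are our names, and the three key ranges mirror Figure 4.5.

```python
def term_partition(term):
    """Assign a term to one of j = 3 partitions by first letter,
    mirroring the a-f, g-p, q-z ranges of Figure 4.5."""
    c = term[0].lower()
    if c <= 'f':
        return 'a-f'
    elif c <= 'p':
        return 'g-p'
    return 'q-z'

def parse_split(split):
    """Map one split (a list of (docID, text) pairs) to per-partition
    lists of (term, docID) pairs -- the in-memory analogue of the
    segment files a parser writes to its local disk."""
    segments = {'a-f': [], 'g-p': [], 'q-z': []}
    for doc_id, text in split:
        for term in text.split():
            segments[term_partition(term)].append((term, doc_id))
    return segments
```

With r parsers each producing one such dictionary, every partition (say a-f) ends up spread over r segment files, one per parser, exactly as in the figure.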

Schema of map and reduce functions:

    map:    input → list(k, v)
    reduce: (k, list(v)) → output

Instantiation of the schema for index construction:

    map:    web collection → list(termID, docID)
    reduce: (⟨termID1, list(docID)⟩, ⟨termID2, list(docID)⟩, . . .) → (postings list1, postings list2, . . .)

Example for index construction:

    map:    d2: C died. d1: C came, C c'ed. → (⟨C, d2⟩, ⟨died, d2⟩, ⟨C, d1⟩, ⟨came, d1⟩, ⟨C, d1⟩, ⟨c'ed, d1⟩)
    reduce: (⟨C, (d2, d1, d1)⟩, ⟨died, (d2)⟩, ⟨came, (d1)⟩, ⟨c'ed, (d1)⟩) → (⟨C, (d1:2, d2:1)⟩, ⟨died, (d2:1)⟩, ⟨came, (d1:1)⟩, ⟨c'ed, (d1:1)⟩)

Figure 4.6 Map and reduce functions in MapReduce. In general, the map function produces a list of key-value pairs. All values for a key are collected into one list in the reduce phase. This list is then processed further. The instantiations of the two functions and an example are shown for index construction. Because the map phase processes documents in a distributed fashion, termID-docID pairs need not be ordered correctly initially as in this example. The example shows terms instead of termIDs for better readability. We abbreviate Caesar as C and conquered as c'ed.

(corresponding to r segment files, one on each parser) is processed by one inverter. We assume here that segment files are of a size that a single machine can handle (Exercise 4.9).

Finally, the list of values is sorted for each key and written to the final sorted postings list (postings in the figure). (Note that postings in Figure 4.6 include term frequencies, whereas each posting in the other sections of this chapter is simply a docID without term frequency information.) The data flow is shown for a-f in Figure 4.5. This completes the construction of the inverted index.

Parsers and inverters are not separate sets of machines. The master identifies idle machines and assigns tasks to them. The same machine can be a parser in the map phase and an inverter in the reduce phase.

And there are often other jobs that run in parallel with index construction, so in between being a parser and an inverter a machine might do some crawling or another unrelated task. To minimize write times before inverters reduce the data, each parser writes its segment files to its local disk. In the reduce phase, the master communicates to an inverter the locations of the relevant segment files (e.g., of the r segment files of the a-f partition). Each segment file only requires one sequential read because all data relevant to a particular inverter were written to a single segment file by the parser.
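An inverter's input handling can be sketched as follows. This is a hypothetical sketch (read_segments is our name, and the one-JSON-pair-per-line file layout is an assumption): the master hands the inverter the r segment-file paths for its partition, and each file is consumed in a single sequential pass.

```python
import json

def read_segments(paths):
    """Read the r segment files of one term partition, one sequential
    pass per file, and concatenate their (term, docID) pairs."""
    pairs = []
    for path in paths:  # one sequential read per segment file
        with open(path) as f:
            for line in f:
                term, doc_id = json.loads(line)  # assumed layout
                pairs.append((term, doc_id))
    return pairs
```

Because a parser wrote everything destined for this inverter into a single segment file, no seeking within files and no further shuffling of pairs between machines is needed.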

This setup minimizes the amount of network traffic needed during indexing. Figure 4.6 shows the general schema of the MapReduce functions.

Input and output are often lists of key-value pairs themselves, so that several MapReduce jobs can run in sequence. In fact, this was the design of the Google indexing system in 2004. What we describe in this section corresponds to only one of five to ten MapReduce operations in that indexing system.

Another MapReduce operation transforms the term-partitioned index we just created into a document-partitioned one.