The purpose of this course is to understand the system models, algorithms and protocols that allow computers to communicate and coordinate their actions to solve a problem. This course helps the learner to understand the distributed computation model and various concepts like global state, termination detection, mutual exclusion, deadlock detection, shared memory, failure recovery, consensus, file system. It helps the learners to develop solutions to problems in distributed computing environment.
Prerequisite: Basic knowledge in data structures and operating systems.
Course Outcomes: After the completion of the course the student will be able to
CO1 – Summarize various aspects of distributed computation model and logical time.
CO2 – Illustrate election algorithm, global snapshot algorithm and termination detection algorithm.
CO3 – Compare token based, non-token based and quorum based mutual exclusion algorithms.
CO4 – Recognize the significance of deadlock detection and shared memory in distributed systems.
CO5 – Explain the concepts of failure recovery and consensus.
CO6 – Illustrate distributed file system architectures.
Syllabus
Module – 1 (Distributed systems basics and Computation model)
Distributed System – Definition, Relation to computer system components, Motivation, Primitives for distributed communication, Design issues, Challenges and applications. A model of distributed computations – Distributed program, Model of distributed executions, Models of communication networks, Global state of a distributed system, Cuts of a distributed computation, Past and future cones of an event, Models of process communications.
Module – 2 (Election algorithm, Global state and Termination detection)
Logical time – A framework for a system of logical clocks, Scalar time, Vector time. Leader election algorithm – Bully algorithm, Ring algorithm. Global state and snapshot recording algorithms – System model and definitions, Snapshot algorithm for FIFO channels – Chandy Lamport algorithm. Termination detection – System model of a distributed computation, Termination detection using distributed snapshots, Termination detection by weight throwing, Spanning-tree-based algorithm.
Module – 3 (Mutual exclusion and Deadlock detection)
Distributed mutual exclusion algorithms – System model, Requirements of mutual exclusion algorithm. Lamport’s algorithm, Ricart–Agrawala algorithm, Quorum-based mutual exclusion algorithms – Maekawa’s algorithm. Token-based algorithm – Suzuki–Kasami’s broadcast algorithm. Deadlock detection in distributed systems – System model, Deadlock handling strategies, Issues in deadlock detection, Models of deadlocks.
Module – 4 (Distributed shared memory and Failure recovery)
Distributed shared memory – Abstraction and advantages. Shared memory mutual exclusion –Lamport’s bakery algorithm. Check pointing and rollback recovery – System model, consistent and inconsistent states, different types of messages, Issues in failure recovery, checkpoint based recovery, log based roll back recovery.
Module – 5 (Consensus and Distributed file system)
Consensus and agreement algorithms – Assumptions, The Byzantine agreement and other problems, Agreement in (message-passing) synchronous systems with failures – Consensus algorithm for crash failures. Distributed file system – File service architecture, Case studies: Sun Network File System, Andrew File System, Google File System.
Text Books
1. Ajay D. Kshemkalyani and Mukesh Singhal, Distributed Computing: Principles, Algorithms, and Systems, Cambridge University Press, 2011.
Reference Books
1. George Coulouris, Jean Dollimore, Tim Kindberg and Gordon Blair. Distributed System Concepts and Design, Addison Wesley, Fifth edition.
2. Kai Hwang, Geoffrey C Fox, Jack J Dongarra, Distributed and Cloud Computing – From Parallel Processing to the Internet of Things, Morgan Kaufmann Publishers, 2012.
3. Sukumar Ghosh, Distributed Systems: An Algorithmic Approach, CRC Press, Second edition, 2015.
4. Maarten Van Steen, Andrew S. Tanenbaum, Distributed Systems, Prentice Hall of India,Third edition, 2017.
5. Randy Chow and Theodore Johnson, Distributed Operating Systems and Algorithm Analysis, Pearson Education India, First edition, 2009.
6. Valmir C. Barbosa, An Introduction to Distributed Algorithms, MIT Press, 2003.
Curriculum
- 5 Sections
- 24 Lessons
- 12 Weeks
- Module 01 - Distributed systems basics and Computation modelThis module provides an in-depth understanding of distributed systems, covering their definition, relationship to computer system components, and motivation for their use. It explores primitives for distributed communication, key design issues, challenges, and real-world applications. The course also delves into models of distributed computations, including distributed programs, execution models, communication networks, and the global state of distributed systems. Concepts such as computational cuts, past and future event cones, and process communication models are analyzed to provide a comprehensive framework for understanding distributed computing.5
- Module 02 - Election algorithm, Global state and Termination detectionThis module covers key concepts in distributed systems related to logical time, leader election, global state recording, and termination detection. It introduces logical clocks, including scalar time and vector time, as frameworks for ordering events in distributed environments. The Bully and Ring algorithms for leader election are explored, enabling coordination in distributed networks. The module also delves into global state and snapshot recording, with a focus on the Chandy-Lamport algorithm for FIFO channels. Finally, it examines termination detection, discussing models of distributed computation, snapshot-based detection, weight-throwing techniques, and spanning-tree-based approaches for ensuring correct process completion in distributed systems.5
- Module 03 - Mutual exclusion and Deadlock detectionThis module explores algorithms and strategies for ensuring mutual exclusion and detecting deadlocks in distributed systems. It covers fundamental system models and requirements for mutual exclusion, including classical algorithms such as Lamport’s and Ricart–Agrawala’s timestamp-based approaches, as well as quorum-based methods like Maekawa’s algorithm. Token-based solutions, such as Suzuki–Kasami’s broadcast algorithm, are also discussed. Additionally, the module delves into deadlock detection, examining system models, handling strategies, challenges, and different deadlock models in distributed environments.6
- 3.1Lesson 01 – Distributed Mutual Exclusion15 Minutes
- 3.2Lesson 02 – Lamport’s Algorithm15 Minutes
- 3.3Lesson 03 – Ricart Agrawala Algorithm20 Minutes
- 3.4Lesson 04 – Quorum-based mutual exclusion algorithms – Maekawa’s algorithm15 Minutes
- 3.5Lesson 05 – Token-based algorithm – Suzuki–Kasami’s broadcast algorithm25 Minutes
- 3.6Lesson 06 – Deadlock detection in distributed systems40 Minutes
- Module 04 - Distributed shared memory and Failure recoveryThis module covers the concept of Distributed Shared Memory (DSM) as an abstraction for interprocess communication, along with its advantages. It explores mutual exclusion in shared memory systems, focusing on Lamport’s Bakery Algorithm for ensuring process synchronization. The module also delves into checkpointing and rollback recovery, explaining system models, consistent vs. inconsistent states, message types, and key challenges in failure recovery. Various recovery techniques, including checkpoint-based and log-based rollback recovery, are discussed, addressing critical issues in distributed fault tolerance.3
- Module 05 - Consensus and Distributed file systemThis module explores fundamental concepts in achieving consensus in distributed systems. It covers key assumptions, the Byzantine agreement problem, and other challenges in reaching agreement among distributed nodes. It delves into consensus algorithms for message-passing synchronous systems, particularly in the presence of failures, including crash failure scenarios. Additionally, the module examines the architecture and design of distributed file systems, focusing on file service models and key case studies such as Sun Network File System (NFS), Andrew File System (AFS), and Google File System (GFS). It provides insights into how these systems ensure reliability, consistency, and efficiency in distributed environments.5
Requirements
- Basic knowledge in data structures and operating systems.
Target audiences
- Final year students of B. Tech Computer Science and Engineering
