Module 01 - Distributed systems basics and Computation model
This module introduces distributed systems: what they are, how they relate to the components of a computer system, and why they are used. It covers primitives for distributed communication, key design issues and challenges, and real-world applications. The module also presents a model of distributed computations, including distributed programs, models of distributed executions, models of communication networks, and the global state of a distributed system. Cuts of a distributed computation, past and future cones of an event, and models of process communication complete the framework for reasoning about distributed computing.
Module 02 - Election algorithm, Global state and Termination detection
This module covers logical time, leader election, global state recording, and termination detection in distributed systems. It introduces logical clocks, including scalar time and vector time, as frameworks for ordering events in a distributed environment. The Bully and Ring algorithms for leader election are explored as ways to choose a coordinator among the processes. The module also covers global state and snapshot recording, with a focus on the Chandy–Lamport algorithm for FIFO channels. Finally, it examines termination detection: the system model of a distributed computation, detection using distributed snapshots, detection by weight throwing, and spanning-tree-based detection.
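As an illustration of how vector time orders events, the following is a minimal sketch and not part of the course material. The names `VectorClock`, `happened_before`, and `concurrent` are hypothetical and chosen for this example only. It assumes a fixed number of processes, each identified by an index.

```python
from typing import List


class VectorClock:
    """Vector clock for one process in a system of n processes (illustrative)."""

    def __init__(self, n: int, pid: int) -> None:
        self.pid = pid          # index of the owning process
        self.time = [0] * n     # one counter per process

    def tick(self) -> List[int]:
        """Local event: advance this process's own counter."""
        self.time[self.pid] += 1
        return list(self.time)

    def send(self) -> List[int]:
        """Send event: tick, then return the timestamp to attach to the message."""
        return self.tick()

    def receive(self, stamp: List[int]) -> List[int]:
        """Receive event: take the component-wise maximum, then tick."""
        self.time = [max(a, b) for a, b in zip(self.time, stamp)]
        return self.tick()


def happened_before(a: List[int], b: List[int]) -> bool:
    """True if timestamp a causally precedes timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a: List[int], b: List[int]) -> bool:
    """True if neither timestamp causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Two processes: p0 does a local event and sends a message; p1 does a local
# event and then receives that message.
p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
e1 = p0.tick()          # local event on p0
m = p0.send()           # send event on p0
e2 = p1.tick()          # local event on p1
e3 = p1.receive(m)      # receive event on p1
```

In this run the send event precedes the receive event, while the two local events `e1` and `e2` are concurrent. Scalar time cannot express that second relation, which is the main reason vector time is introduced.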
Module 03 - Mutual exclusion and Deadlock detection
This module explores algorithms and strategies for ensuring mutual exclusion and detecting deadlocks in distributed systems. It covers fundamental system models and requirements for mutual exclusion, including classical algorithms such as Lamport’s and Ricart–Agrawala’s timestamp-based approaches, as well as quorum-based methods like Maekawa’s algorithm. Token-based solutions, such as Suzuki–Kasami’s broadcast algorithm, are also discussed. Additionally, the module delves into deadlock detection, examining system models, handling strategies, challenges, and different deadlock models in distributed environments.
Module 04 - Distributed shared memory and Failure recovery
This module covers the concept of Distributed Shared Memory (DSM) as an abstraction for interprocess communication, along with its advantages. It explores mutual exclusion in shared memory systems, focusing on Lamport’s Bakery Algorithm for ensuring process synchronization. The module also delves into checkpointing and rollback recovery, explaining system models, consistent vs. inconsistent states, message types, and key challenges in failure recovery. Various recovery techniques, including checkpoint-based and log-based rollback recovery, are discussed, addressing critical issues in distributed fault tolerance.
Module 05 - Consensus and Distributed file systems
This module explores fundamental concepts in achieving consensus in distributed systems. It covers the key assumptions, the Byzantine agreement problem, and other agreement problems among distributed nodes. It then examines consensus algorithms for synchronous message-passing systems with failures, including the crash failure case. Additionally, the module examines the architecture and design of distributed file systems, focusing on file service models and case studies of the Sun Network File System (NFS), Andrew File System (AFS), and Google File System (GFS). It shows how these systems provide reliability, consistency, and efficiency in distributed environments.