Big Data Infrastructure - free course from the School of Data Analysis, 4 semesters, Date: December 5, 2023.
Miscellaneous / / December 08, 2023
For those who love algorithms, working with data and enjoy programming, but would not like to connect their lives with machine learning.
Algorithms, programming, design of file systems, disks, networks and processors, as well as distributed systems.
In the creation and support of efficient and reliable distributed systems for storing and processing big data.
Each student must successfully complete at least three courses during the semester. For example, if there are two of them in the main program, then you need to choose one of the special courses.
Knowledge is tested primarily through homework - exams and tests are conducted only in some subjects.
First semester
Mandatory
Algorithms and data structures, part 1
01 Complexity and computational models. Analysis of accounting values (beginning)
02 Analysis of accounting values (end)
03 Merge-Sort and Quick-Sort algorithms
04 Ordinal statistics. Heaps (beginning)
05 Heaps (end)
06 Hashing
07 Search Trees (beginning)
08 Search Trees (continued)
09 Search trees (end). System of disjoint sets
10 Objectives of RMQ and LCA
11 Data structures for geometric search
12 Problem of dynamic connectivity in an undirected graph
Computer architecture and operating systems
01 UNIX and programming in C: command line, process control, channels, signals. Implementation of a command line shell.
02 x86 assembler: arithmetic, transitions, conditions and function calls. Stack, moving up the stack.
03 Linking programs and the ELF format. Dynamic linking.
04 The concept of context and execution flow. Implementing lightweight threads.
05 Preemptive multitasking: support from the x86 processor and implementation of processes in the UNIX kernel.
06 Multi-core architecture: cache coherence and memory models. Synchronization primitives in multithreaded programs.
07 Scheduling processes on one core and on many cores.
08 External memory: hard drives and solid state drives. Principles of operation of file systems.
09 Virtualization: hardware and software. Binary broadcast.
C++ language training, part 1
C++ is a powerful language with a rich heritage. For those who have just set out on the path of mastering this language, it is very easy to get lost in the abundance of techniques and techniques created over the past 30 years. The course teaches "Modern C++" - a modern subset of the language (standards 11, 14 and 17). A lot of attention is paid to tools and libraries - things that are not part of the language, but without which it will not be possible to build a large and complex project.
01 Introduction to C++.
02 Constants. Pointers and links. Passing arguments to a function.
03 Classes.
04 Dynamic memory management.
05 Variables, pointers and references.
06 Memory management, smart pointers, RAII.
07 Standard template library.
08 Inheritance and virtual functions.
09 Error handling.
10 Design patterns.
11 Namespaces Move semantics Perfect forwarding.
12 Representation of structures and classes in memory. Data alignment. Pointers to class members/methods. Variadic templates.
Second term
Mandatory
Algorithms and data structures, part 2
01 Bypass in width. Depth First Traversal (start)
02 Depth Traversal (continued)
03 Traversal in depth (end). 2-cuts
04 Finding shortest paths (beginning)
05 Finding shortest paths (continued)
06 Minimum spanning trees
07 Minimal cuts. Search for substrings (start)
08 Search for substrings (continued)
09 Search for substrings (end)
10 Suffix trees (beginning)
11 Suffix trees (ending). Suffix arrays (start)
12 Suffix arrays (ending)
13 Longest common substrings. Approximate substring search.
C++ language training, part 2
The second part of the C++ course, which covers advanced topics and language capabilities.
01 Multi-threaded programming. Synchronizing threads using mutexes and condition variables.
02 Atomic variables. C++ memory model. Examples of lock-free data structures.
03 Advanced meta-programming techniques in C++. Metafunctions, SFINAE, concepts.
04 Competitive programming, interaction with the network.
05 llvm architecture. Working with the C++ parse tree. Development of tools for analyzing C++ code.
To choose from
Theory and Practice of Concurrency
The course is devoted to competitive systems and tasks in the broadest sense: from the level of competition between processor cores for writing to one cell memory to distributed systems that want to replicate their state across multiple servers in a fault-tolerant and consistent manner.
01 https://gitlab.com/Lipovsky/shad-tpcc-course-2019/blob/master/lectures/syllabus.md
or
Go language
01 Introduction. Course program. Reporting on the course, evaluation criteria. Design philosophy. if, switch, for. Hello, world. Command line arguments. Word count. Animated gif. Fetching URL. Fetching URL concurrently. Web server. Tour of go. Local IDE setup. gofmt. goimports. linting
02 Basic language structures. names, declarations, variables, assignments. type declarations. packages and files. scope. Zero value. Memory allocation. Stack vs heap. Basic data types. Constants. Composite data types. Arrays. Slices. Maps. Structs. JSON. text/template. string and []byte. Working with unicode. Unicode replacement character. Functions. Functions with a variable number of arguments. Anonymous functions. Errors.
03 Methods. Value receiver vs pointer receiver. Embedding. Method value. Encapsulation. Interfaces. Interfaces as contracts. io. Writer, io. Reader and their implementations. sort. Interface. error. http. Handler. Interfaces as enumerations. Type assertion. Type switch. The bigger the interface, the weaker the abstraction. Error processing. panic, defer, recover. errors.{Unwrap, Is, As}. fmt. Errorf. %w.
04 Goroutines and channels. clock server. echo server. Channel size. Blocking and non-blocking read. select statement. Channel axioms. time. After. time. NewTicker. Pipeline pattern. Cancellation. Parallel loop. sync. WaitGroup. Error handling in parallel code. errgroup. Group. Concurrent web crawler. Concurrent directory traversal.
05 Advanced testing. Subtests. testing. B. (T).Logf. (T).Skipf. (T).FailNow. testing. Short(), testing flags. Generation of mocks. testify/{require, assert}. testify/suite. Test fixture. Integration tests. Goroutine leak detector. TestingMain. Coverage. Comparison of benchmarks.
06 Advanced testing. Subtests. testing. B. (T).Logf. (T).Skipf. (T).FailNow. testing. Short(), testing flags. Generation of mocks. testify/{require, assert}. testify/suite. Test fixture. Integration tests. Goroutine leak detector. TestingMain. Coverage. Comparison of benchmarks.
07 Package context. Passing request-scoped data. http middleware. chi. Router. Request cancellation. Advanced concurrency patterns. Async cache. Graceful server shutdown. context. WithTimeout. Batching and cancellation.
08 database/sql, sqlx, working with databases, redis.
09 Reflection. reflect. Type and reflect. Value. struct tags. net/rpc. encoding/gob. sync. Map. reflect. DeepEqual.
10 Package io, Reader and Writer implementations from the standard library. Low-level programming. unsafe. Package binary. bytes. Buffer. cgo, syscall.
11 GC architecture. Write barrier. Stack growth. GC pause. GOGC. sync. Pool. Goroutine scheduler. GOMACPROCS. Leaked threads.
12 Go tooling. pprof. CPU and Memory profiling. Cross compilation. GOOS, GOARCH. CGO_ENABLED=0. Build tags. go modules. godoc. x/analysis. Code generation.
13 Useful libraries. CLI applications with cobra. Protobuf and GRPC. zap logging.
Third semester
Mandatory
Algorithms in external memory
The course introduces students to the basic principles of constructing algorithms for working with data that does not fit into the computer's RAM.
01 Algorithms in external memory.
02 Cache-oblivious algorithms.
03 Algorithms for stream data processing.
Distributed systems
Recommended special courses
Strength of cryptographic systems
01 Basic approaches and principles of modern cryptography. The adversary model, formalization of the concept of strength, the problem of assessing strength and related problems, division into primitives and protocols, stages of the “life” of a cryptographic system.
02 Confidentiality. Everyday definitions of confidentiality, approaches to formalization (information-theoretic model of the enemy, models KR, PR, LOR, ROR, IND, CPA, CCA), symmetric encryption system, application of complexity-theoretic information to determine the relationship between models. Relationships between basic adversary models for assessing the strength of encryption systems.
03 Approaches to building encryption systems. Building from scratch. Constructions based on block ciphers, definition of a block cipher, main characteristics, approaches to construction and properties. Models PRP and PRF. The paradox of the birthday problem. Lemma on the relationship between resistance in the PRF and PRP models.
04 Encryption modes. Basic encryption modes: ECB, CBC, CFB, OFB, CTR. Basic performance properties. Durability of CTR in LOR-CPA, instability of ECB in LOR-CPA. Instability of basic modes in CCA models.
05 Integrity. Definition of the concept of integrity. Approaches to formalization (UF-CMA model, models based on the discrimination task, PRF model). Message authentication codes and functions for generating imitated inserts. Designs based on block ciphers: CBC-MAC, XCBC, TMAC, OMAC. Vulnerable modes.
06 Hash functions. Definition, basic properties, approaches to construction, formalization and related problems. Examples of using hash functions: password hashing, entropy extraction. Constructing collisions and preimages from sets of low cardinality.
07 HMAC, KDF, PRF, DRNG circuits. HMAC diagram, basic steps for obtaining resistance rating. Key diversification and the principle of key separation, KDF and PRF schemes. Pseudorandom generator, DRNG circuits.
08 Key load. Key load problem. The main methods for reducing the load on a key are external and internal key conversions. Parallel and serial re-keying schemes, basic properties. Key tree. Internal key change and CTR-ACPKM mode.
09 Encryption with imitation protection. Problem formulation. General structures (EtA, AtE, A&E) and their properties. Examples of vulnerable modes for ensuring confidentiality and integrity using a single key. AEAD encryption modes: GCM, MGM.
10 Secure communication channel. The concept of a secure communication channel: types of channels, basic properties (integrity and confidentiality of the data flow). Examples of vulnerable protocols. Record TLS 1.3 protocol.
Fourth semester
To choose from
Theory and Practice of Concurrency
The course is devoted to competitive systems and tasks in the broadest sense: from the level of competition between processor cores for writing to one cell memory to distributed systems that want to replicate their state across multiple servers in a fault-tolerant and consistent manner.
01 https://gitlab.com/Lipovsky/shad-tpcc-course-2019/blob/master/lectures/syllabus.md
or
Go language
01 Introduction. Course program. Reporting on the course, evaluation criteria. Design philosophy. if, switch, for. Hello, world. Command line arguments. Word count. Animated gif. Fetching URL. Fetching URL concurrently. Web server. Tour of go. Local IDE setup. gofmt. goimports. linting
02 Basic language structures. names, declarations, variables, assignments. type declarations. packages and files. scope. Zero value. Memory allocation. Stack vs heap. Basic data types. Constants. Composite data types. Arrays. Slices. Maps. Structs. JSON. text/template. string and []byte. Working with unicode. Unicode replacement character. Functions. Functions with a variable number of arguments. Anonymous functions. Errors.
03 Methods. Value receiver vs pointer receiver. Embedding. Method value. Encapsulation. Interfaces. Interfaces as contracts. io. Writer, io. Reader and their implementations. sort. Interface. error. http. Handler. Interfaces as enumerations. Type assertion. Type switch. The bigger the interface, the weaker the abstraction. Error processing. panic, defer, recover. errors.{Unwrap, Is, As}. fmt. Errorf. %w.
04 Goroutines and channels. clock server. echo server. Channel size. Blocking and non-blocking read. select statement. Channel axioms. time. After. time. NewTicker. Pipeline pattern. Cancellation. Parallel loop. sync. WaitGroup. Error handling in parallel code. errgroup. Group. Concurrent web crawler. Concurrent directory traversal.
05 Advanced testing. Subtests. testing. B. (T).Logf. (T).Skipf. (T).FailNow. testing. Short(), testing flags. Generation of mocks. testify/{require, assert}. testify/suite. Test fixture. Integration tests. Goroutine leak detector. TestingMain. Coverage. Comparison of benchmarks.
06 Concurrency with shared memory. sync. Mutex. sync. RWMutex. sync. Cond. atomic sync. Once. Race detector. Async cache. Working with the database. database/sql. sqlx.
07 Package context. Passing request-scoped data. http middleware. chi. Router. Request cancellation. Advanced concurrency patterns. Async cache. Graceful server shutdown. context. WithTimeout. Batching and cancellation.
08 database/sql, sqlx, working with databases, redis.
09 Reflection. reflect. Type and reflect. Value. struct tags. net/rpc. encoding/gob. sync. Map. reflect. DeepEqual.
10 Package io, Reader and Writer implementations from the standard library. Low-level programming. unsafe. Package binary. bytes. Buffer. cgo, syscall.
11 GC architecture. Write barrier. Stack growth. GC pause. GOGC. sync. Pool. Goroutine scheduler. GOMACPROCS. Leaked threads.
12 Go tooling. pprof. CPU and Memory profiling. Cross compilation. GOOS, GOARCH. CGO_ENABLED=0. Build tags. go modules. godoc. x/analysis. Code generation.
13 Useful libraries. CLI applications with cobra. Protobuf and GRPC. zap logging.
or
Database
01 Interfaces of modern databases: relational, key-value, document, graph. Relational algebra and SQL language.
02 Working with disk in classic relational DBMS: pages, page pool, eviction from the pool.
03 Executing SQL queries: expression parsing, planning, execution. Interpretation and code generation using LLVM.
04 Indexes in relational DBMS: types of indexes, storage methods, use in queries.
05 Transactions: ACID acronym, isolation levels, implementing transactions through locks and MVCC.
06 Disaster recovery: log, checkpoints, ARIES algorithm.
07 Data storage using the Log-Structured Merge Tree method.
08 Column-based DBMS: advantages, features, data compression algorithms.
09 Distributed DBMS: sharding, transactions, query execution.
10 DBMS located in main memory. Data structures for in-memory indexes.
or
Computer networks
01 Introduction to network technologies. The history of networks, network protocols, organization of network interaction in a peer-to-peer network and the connection of peer-to-peer networks with each other.
02 Transport. OSI/ISO network model. TCP, network connection establishment, comparison of TCP and UDP. Tcpdump analysis – bytes in fly, retransmits graphs. Methods for controlling data flow in a TCP session. Different types of TCP sessions and bandwidth management of transmitted data in packet networks.
03 Routing. The concept of routing in networks. Static and dynamic routing. Basics of dynamic routing. Dynamic routing protocol - OSPF. Distance vector routing protocols. Overview of the BGP routing protocol - message types, BGP attributes, choosing the optimal route in BGP.
04 How the Internet works: BGP and DNS. Internet routing. Overview of the DNS protocol.
05 Networks in large data centers. Features of the architecture of data center networks. Requirements for data center networks. CLOS architecture for data center networks.
06 Delays in networks. Features of building large backbone networks. Reasons for delays in data transmission over backbone networks.
07 Scaling and availability of Internet services. Load balancing technologies and service architecture.
08 MPLS and SR, Network Programmability. MPLS and Segment Routing technologies for building backbone networks. Purpose of MPLS technology, protocols used for label exchange.
09 Principles of operation of network devices. Router architecture, features of processing network traffic inside network devices.
10 Clouds. Software Defined Networking Fundamentals - Protocols used to build software defined networks. Integration of virtualization platforms and network infrastructure.
or
Cryptographic protocols
01 Basic ideas of asymmetric cryptography. The main difference between asymmetric cryptography and symmetric cryptography. Main ideas: protocol for generating a shared key, public key encryption, electronic signature (problems to be solved, intuitive understanding of security properties). Specific cryptographic schemes: Diffie-Hellman protocol, ElGamal and RSA encryption schemes, ElGamal and RSA signatures. The fundamental problem with asymmetric schemes is trust in the public key.
02 Strength of basic asymmetric cryptography schemes. Formal definition of resistance: models UF-CMA, IND-CPA, DLP, CDH, DDH. Relationships between them. Strength of the ElGamal encryption scheme. The instability of the RSA signature scheme without using a hash function.
03 Learn more about asymmetric cryptography. Lampart's signature, Merkle's diagram. DSKS attack.
04 Algebraic and number-theoretic foundations of asymmetric cryptography. Finite groups, cyclic groups, order of group element. Discrete logarithm problem (DLP). Multiplicative groups of finite fields. Basic information about elliptic curves.
05 Elliptic curves. Hasse's theorem. Addition of points on an elliptic curve. Group of points on an elliptic curve. Signature scheme GOST R 34.10-2012.
06 Discrete logarithm. Discrete logarithm algorithms (Pollard's Rho method, matching method, Polig-Hellman method, index calculation method).
07 PKI technology. Basic principles and concepts of public key infrastructure (PKI). Certificate, CA, CRL, OCSP, trust space.
08 TLS protocol. History of the TLS protocol. Protocol structure, basic operating principles. TLS protocol cryptographic suites based on Russian cryptographic algorithms.
09 Basics of building AKE protocols. The concept of the AKE protocol. Target properties. Basic approaches to construction.
10 Secure key storage. The problem of secure use of private keys. Key media, non-removable keys. The problem of the presence of an adversary in the channel, protocols of the PAKE family.
11 Basic concepts of blockchain technology. The task of coordinated decentralized interaction. Basic concepts of the concept of security. Security approaches.
12 Basic principles of quantum technologies and their applications in cryptography