- Admission : /en/education/bachelor/computer-science-and-communication-systems/admission/
- Study program : /en/education/bachelor/computer-science-and-communication-systems/study-program/
- Structure of studies : /en/education/bachelor/computer-science-and-communication-systems/structure-of-studies/
- Career perspectives : /en/education/bachelor/computer-science-and-communication-systems/career-perspectives/
- Exchange programs : /en/education/bachelor/computer-science-and-communication-systems/exchange-programs/
- People : /en/education/bachelor/computer-science-and-communication-systems/people/
Study program
Course description
Objectives
At the end of the course, students should have a firm grasp of the structure and implementation of a data pipeline, allowing them to design their own. More precisely, the objectives are:
- Define the motivations, opportunities and challenges of distributed data storage and processing
- Explain the base concepts of distributed data systems: CAP, replication, partitioning, consensus
- Describe the structure of a data pipeline through the Data Engineering Lifecycle
- Distinguish the storage models (object, file, block, stream), explain in detail how HDFS and Kafka work, and outline the basic concepts of object storage (S3)
- Compare data formats based on their level of structure and differentiate structured storage formats: row/record, column, table
- Qualify data for ingestion in terms of source system type and characteristics to propose a model for data ingestion appropriate to a given situation
- Explain the basic concepts of stream processing (processing/event time, triggers, watermarks, windowing, correctness) and illustrate them using the Dataflow model
- Compare storage/transformation/serving abstractions such as Data Lake, Data Warehouse, Data Lakehouse, Streaming Kappa/Lambda, Data Integration Platform, Modern Data stack
- Implement distributed ingestion, storage, transformation and serving of data in "batch" and "streaming" modes using current technology such as Spark, Beam, Airbyte, Kafka, HDFS, Avro, Parquet, Iceberg, Trino, ...
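To make the stream-processing objective more concrete, the following pure-Python sketch shows the idea of fixed event-time windows: events are grouped by when they happened, not when they arrive. This is only a conceptual illustration under simplified assumptions (no watermarks or triggers), not how Beam or the Dataflow model are implemented.

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size_s=60):
    """Group (event_time_s, value) pairs into fixed, non-overlapping
    windows keyed by the window's start time (event time, not arrival time)."""
    windows = defaultdict(list)
    for event_time_s, value in events:
        window_start = (event_time_s // window_size_s) * window_size_s
        windows[window_start].append(value)
    return dict(windows)

# Events may arrive out of order; windowing by event time still groups them correctly.
events = [(5, "a"), (70, "b"), (30, "c"), (65, "d")]
print(assign_fixed_windows(events))
# {0: ['a', 'c'], 60: ['b', 'd']}
```

In a real streaming engine, watermarks decide when a window is considered complete and triggers decide when (possibly partial) results are emitted; those aspects are deliberately omitted here.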
Content
This course explores one of the main tasks of a data engineer: building data pipelines. Creating scalable systems to manage sizeable data requires distributed storage and processing.
We focus on the following stages: distributed data generation, ingestion, storage, transformation and serving. The course is composed of theory, discussion and exercise sessions, practical labs, and a mini-project common to the module.
The course is divided into the following chapters:
- Motivation and concepts of distributed data systems
- Data Engineering and the Data Engineering Lifecycle ("data pipeline")
- Distributed storage and batch processing 1: Distributed file/object/block storage, MapReduce, Data Lake, ...
- Distributed storage and batch processing 2: Structured formats, Spark, ...
- Serving: SQL engines (Hive, Trino, ...) and "table" formats (Iceberg, ...)
- Streaming storage (Kafka ecosystem)
- Generation (source systems) and ingestion in "batch" and "streaming" modes
- Stream processing: Concepts, Dataflow model, Apache Beam
- Mini-project integrated with the other courses of the module to build an end-to-end pipeline
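To give a flavour of the batch-processing chapters, here is a minimal pure-Python sketch of the MapReduce pattern (map, shuffle by key, reduce) applied to word counting. It is a single-machine illustration only; engines such as Hadoop MapReduce or Spark distribute these same phases across a cluster.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for each word in each input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {key: sum(values) for key, values in groups}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The three functions correspond directly to the phases a distributed framework schedules on different machines; the shuffle is the expensive network step that the course's storage and partitioning chapters help explain.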
Type of teaching and workload
Course specification
Evaluation methods
- Continuous assessment: written work, practical exercises / evaluated reports, and a mini-project common to the module (graded independently per course)
Course grade calculation method
The continuous assessment mark corresponds to the weighted average of all of the semester's exams.
Reference work
Will be distributed through the Cyberlearn course platform.
Instructor(s) and/or coordinator(s)
Philippe Joye