CZECH TECHNICAL UNIVERSITY IN PRAGUE
STUDY PLANS
2019/2020

Distributed Data Mining

The course is not on the list. Without time-table.
Code: MI-DDM
Completion: KZ
Credits: 4
Range: 3C
Language:
Lecturer:
Tomáš Borovička (guarantor)
Tutor:
Tomáš Borovička (guarantor)
Supervisor:
Department of Applied Mathematics
Synopsis:

The course focuses on state-of-the-art approaches to distributed data mining and the parallelization of machine learning algorithms. Students will gain hands-on experience with the large-scale data processing framework Apache Spark and with existing distributed data mining (DM) and machine learning (ML) algorithms. They will learn the principles behind the parallel implementations of these algorithms and will be able to propose approaches for parallelizing other algorithms.

Requirements:

Knowledge of at least one of the following programming languages: Python, Java, or Scala. Knowledge of the fundamentals of machine learning algorithms.

Syllabus of lectures:
Syllabus of tutorials:

1) Introduction to MapReduce, Apache Spark and cluster infrastructure

2) Data structures of the Apache Spark framework: RDDs, DataFrames, Datasets

3) Apache Spark ML pipelines, MLlib

4) Distributed data, data exploration, basic statistics

5) Distributed data preprocessing (feature extraction and transformation, feature selection, dimensionality reduction)

6) Association rule mining, collaborative filtering, alternating least squares

7) Distributed classification and regression algorithms

8) Distributed clustering algorithms

9) Distributed ensemble algorithms

10) Algorithms for information retrieval and text mining

11) Deep learning and artificial neural networks

12) Stream processing, online algorithms

Study Objective:
Study materials:

Pentreath, Nick. Machine Learning with Spark. Packt Publishing Ltd, 2015.

Note:
Further information:
No time-table has been prepared for this course
The course is a part of the following study plans:
Data valid to 2019-10-18
For updated information see http://bilakniha.cvut.cz/en/predmet5463206.html