Data Warehouse Systems, Big Data Processing
Code | Completion | Credits | Range | Language |
---|---|---|---|---|
18DWH | ZK | 4 | 2+2 | Czech |
- Course guarantor:
- Lecturer:
- Tutor:
- Supervisor:
- Department of Software Engineering
- Synopsis:
-
The data warehouse architecture, implementation, methods for extraction of the data from various sources, transformation procedures and loading as well as data processing, archiving, data searching and evaluating are deeply studied in these lectures.
- Requirements:
- Syllabus of lectures:
-
1.What the data warehouse is, system architecture, requirements on sytem software and hardware, high availability architecture
2.data heterogeneity
- binary formats particularities and their interpretation
- extraction of textual data for full-text search and meta-data creation (raw texts from formats like office, pdf, exif(jpeg) etc.)
- data acquisition methods: CSV, TSV, XML, unstructured data, database connectors
- the use of the file system and meta-data
- versioning principles
- meta-data sources, generation of meta-data during imports to DWH and during operation of DWH, audit trails
3.heterogeneous data sources, acquisition methods (manual/automatic)
- period of data acquisition (regular, irregular, one-time)
- data input forms, database connectors, data source files (local file system, URL , FTP, SCP, connected network volumes)
4.Importing data to DWH, ETL methodology
- extraction, transformation, loading
- landing-base-core architecture, processes for preparing heterogeneous data to be imported to unified system
- reference control, data type control, structure control, data type unification methods
- utilization of file system and database in ETL process
5.ETL methodology part II
- automation of ETL process
- validation of individual ETL steps
- audit trail records for each step of the process
- data versioning
6.data archiving
- long term data archiving of DWH data
- methods for data reconstruction from the long term data archives
- backups (live/continuous data replication, data exports)
- non-sql archive
7.Analysis and solution proposal of building a DWH
- list of functional requirements (what functions should the DWH offer?)
- relevant sample data for all data elements in the data domain
- anticipations of progressive data growth (new date, actualization, versioning, batches)
- business model of the application (functions for the customer, the end user)
- object model of the application (architecture for developers)
8.Analysis and solution proposal of building a DWH (part II)
- architecture of the operating environment (hardware, software, services, configuration)
- E-R database model (database, schemes, tables, references, indexes)
- directory structures on the file volumes
- data marts
- ORM - Object relational mapping
9.Top layered applications
- service application
- executive summaries and reports
- Advanced data visualizations (over maps etc.)
10.Communication with external systems
- web services
- XML and JSON formats, SOAP and RESTfull communication
- CSV exports
- automation of exports (periodical / on demand)
11.Presentations of the results in building a DWH in teams
12.Recapitulation
- Syllabus of tutorials:
-
1.Seminar intro - Setting up the development environment
- Apache server
- Programming language (PHP / JAVA / C# / C++)
- Database (MySQL / PostgreSQL)
- Creating a database and some tables to work with
2.Extraction and import of data from different formats
- Explore tools for data extraction from different formats
- Extraction and import data to prepared db schemas
3. Extraction an import of data from different sources
- Intro to DB connectors
- Data import using DB connectors or FTP connection
4.ETL
- Preparing sachems: loading, base, core
- Writing simple code for automatic data extraction from at least dwo different sources and two different formats
- Preparing for recording audit trails and archiving
5.ETL
- Writing simple code for automatic transformation and loading of the data
- Recording the audit trail
6.Archiving
- Backups using DB exports
- Non-sql backups
7.New data input processing, analysis of the system, documenting
- Adding a new data source and integrating it into the system
- description of the system architecture
- E-R Model
8.ORM and datamarts
- Exploring ORM tools
- Preparing DB VIEWS (datamarts) for exposing data to the management views
9. Top layer applications
- Presenting data
- Write simple web page with graph / table
10.Support for external systems
- Implementation of simple web service for accessing data from external applications (REST / SOAP)
- Writing a simple external application to test the web service
11.Tuning, testing, consulting
12.Getting credits
- Study Objective:
-
The lecture provides a complex view on methods and approaches of processing, management and analytical retrieval from a mass of structured and non-structured data taken from heterogeneous data sources. Students will deeply use their previous knowledge of the graph theory, database systems, algorithms, software architecture as well as process optimizations.
- Study materials:
-
Compulsory literature:
[1] W. H. Inmon: Building the Data Warehouse. 4th Ed. Indianapolis: Wiley Publishing Inc. 2005. ISBN 978-0764599446
Optional literature:
[1] Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2nd Ed. New York: John Wiley & Sons, Inc.
ISBN: 978-0-471-20024-6
[2] Google search engine http://google.com
- Note:
- Further information:
- No time-table has been prepared for this course
- The course is a part of the following study plans: