Scripting and Data Analysis in R Language
- Lubomír Štěpánek
- Department of Biomedical Informatics
The course is aimed at students interested in the R programming language and environment and in the field of data science, where R is widely used. R is not only a programming language designed for statistical computing and graphics, but also a Turing-complete general-purpose programming language suitable for solving complex tasks.
R has several advantages over commercial systems such as MATLAB: (i) it is distributed as open source, so it is free both in the sense of costing no money ("free as in beer") and in the sense that anyone may modify its source code and use it commercially ("free as in speech"); (ii) a large online community has formed around R, ready to help and answer users' questions; (iii) R makes it easy to develop web applications; and (iv) TeX documents can be typeset directly from R code. The syntax of the R language is simple, intuitive and quite similar to that of MATLAB. According to recent kaggle.com worldwide statistics, R has become the most popular programming language for data analysis, data science and machine learning. R can be regarded as the lingua franca of data science.
The class is practice-based and focused on problem-solving, number-crunching exercises and real-data analyses carried out through hands-on R programming and scripting; the assigned tasks progress from easy to difficult.
The course has no formal prerequisites. No prior experience with R is necessary, although some familiarity with a procedural or scripting language such as MATLAB, Octave or Python would be helpful.
- Syllabus of lectures:
- Syllabus of tutorials:
1st: Introduction, installation, overview of R data types and structures; basic operations, numbers, vectors and simple manipulation.
2nd: More on data types in R, data structures and their manipulation. Matrices, data frames, lists.
3rd: Loading external data into R. Saving data from R to a file. Data (pre)processing.
4th: Functions in R. Useful built-in functions. User-defined functions in R.
5th: R as a programming language. Scoping. Conditional statements (if, else). Loops: for, while and repeat. Warnings and errors. Flow control. The R apply() function.
6th: Elements of statistics and data analysis in R. Probability distributions. Measures of central tendency and variability. Hypothesis testing in R.
7th: Advanced statistics and data analysis in R. Linear models, including generalized linear models (GLM). Linear regression. Logistic regression. Survival analysis.
8th: Selected advanced statistical methods in R, both linear and nonlinear. Cluster analysis. Discriminant analysis. Time series. Jackknife. Bootstrap.
9th: Selected methods of machine learning in R. Naïve Bayes classifier. Support vector machines (SVM). Cross-validation (CV). Principal component analysis (PCA). Decision trees. Random forests. Neural networks. Association rules.
10th: Graphical outputs in R. Low-level and high-level graphical commands. Displaying multivariate data. Parameters of plots and diagrams.
11th: Overview of plots and diagrams in R and how to save a plot to a file. Choosing the most appropriate type of chart.
12th: Text processing in R. Handling and processing strings. Regular expressions. Tokenization, n-grams. Embedding TeX code within R code. How to include R code, results of data analysis and plots produced by R in TeX code and typeset a PDF.
14th: Review of the topics covered in the course. End-of-course summary.
- Study Objective:
- Study materials:
- Time-table for winter semester 2018/2019:
- Time-table for summer semester 2018/2019:
- Time-table is not available yet
- The course is a part of the following study plans: