Note: This unit version is currently being edited and is subject to change!

DATA3404: Data Science Platforms (2019 - Semester 1)

Unit: DATA3404: Data Science Platforms (6 CP)
Mode: Normal-Day
On Offer: Yes
Level: Senior
Faculty/School: School of Computer Science
Unit Coordinator/s: A/Prof Roehm, Uwe
Session options: Semester 1
Campus: Camperdown/Darlington
Pre-Requisites: DATA2001 OR DATA2901 OR ISYS2120 OR INFO2120 OR INFO2820.
Prohibitions: INFO3404 OR INFO3504.
Brief Handbook Description: This unit of study provides a comprehensive overview of the internal mechanisms of data science platforms and of systems that manage large data collections. These skills are needed for successful performance tuning and to understand the scalability challenges faced when processing Big Data. This unit builds upon the second year DATA2001 – 'Data Science - Big Data and Data Diversity' and correspondingly assumes a sound understanding of SQL and data analysis tasks.

The first part of this subject focuses on mechanisms for large-scale data management. It provides a deep understanding of the internal components of a data management platform. Topics include: physical data organization and disk-based index structures, query processing and optimisation, and database tuning.

The second part focuses on the large-scale management of big data in a distributed architecture. Topics include: distributed and replicated databases, information retrieval, data stream processing, and web-scale data processing.

The unit will be of interest to students seeking an introduction to data management tuning, disk-based data structures and algorithms, and information retrieval. It will be valuable to those pursuing such careers as Software Engineers, Data Engineers, Database Administrators, and Big Data Platform specialists.
Assumed Knowledge: This unit of study assumes that students have previous knowledge of database structures and of SQL. The prerequisite material is covered in DATA2001, INFO2120, or ISYS2120. Familiarity with a programming language (e.g. Java or C) is also expected.
Lecturer/s: A/Prof Roehm, Uwe
Tutor/s: Harshana Randeni, William Zhang
Timetable: DATA3404 Timetable
Time Commitment:
# | Activity Name | Hours per Week | Sessions per Week | Weeks per Semester
1 | Lecture | 2.00 | 1 | 13
2 | Tutorial | 1.00 | 1 | 12
3 | Independent Study | 3.00 | 1 | 13
4 | Practical assignment | 6.00 | 1 | 4
T&L Activities:
  • Students are expected to attend all scheduled lectures and laboratory classes. You should expect to spend a minimum of twelve hours per week on this unit, including scheduled lectures and laboratory times.
  • Students are expected to undertake any prescribed reading, to carry out exercises and laboratory tasks, and to submit selected work for assessment as directed. Note that some laboratory exercises can take longer than the time scheduled for classes.
  • Students are expected to be able to work independently and to make effective use of a range of resources including the library, eLearning, the Internet and relevant on-line help facilities.
  • Students are expected to check their progressive results regularly. Results will be published through USyd eLearning. Any errors or omissions must be reported to the unit coordinator, with appropriate evidence, as soon as possible. Please note: Marks are considered to have been confirmed ten days after being published and will not subsequently be altered.

Learning outcomes are the key abilities and knowledge that will be assessed in this unit. They are listed according to the course goal supported by each. See the Assessment tab for details of how each outcome is assessed.

(4) Design (Level 3)
1. Ability to make effective physical data design decisions.
2. Ability to identify a performance problem and be able to effectively tune the performance of a (distributed) data processing system.
(2) Engineering/ IT Specialisation (Level 3)
3. Experience with using/tuning data science platforms.
4. Understanding of disk-based indexing structures such as B-Trees, extensible hashing and bitmap indexes.
5. Understanding of the principles of query processing and query optimization.
6. Understanding of data sharing algorithms and data replication protocols.
(1) Maths/ Science Methods and Tools (Level 3)
7. Understanding of different physical data organisations including data partitioning and data replication.
8. Understanding of the principles of (distributed) data science platforms.
Assessment Methods:
# | Name | Group | Weight | Due Week | Outcomes
1 | DB Programming (PASTA) | No | 10.00 | Multiple Weeks | 1, 4, 5
2 | Mid-Semester Quiz | No | 10.00 | Week 8 | 3, 4, 5, 7
3 | Assignment | Yes | 20.00 | Week 12 | 1, 2, 3
4 | Final Exam | No | 60.00 | Exam Period | 2, 4, 5, 6, 7, 8
Assessment Description: DB Programming Exercises (PASTA): weekly short programming exercises to implement selected database algorithms and data structures, submitted and auto-tested via PASTA (practical)

Mid-Semester Quiz: online quiz to be completed by students in Week 9 and marked by tutors; includes electronic review questions on the concepts taught in this unit (Data Storage and Indexing, Query Processing and Optimization, Concurrency Control and Crash Recovery, Information Retrieval)

Assignment: Practical Programming/Tuning Assignment

Final Exam: Written examination (two hours)
Assessment Feedback: The Mid-Semester Quiz will be auto-graded and returned in subsequent weeks.
Programming exercises will be assessed by unit tests, with guidance provided on failing tests. Multiple submissions can be made.
The Assignment will receive written feedback according to the marking rubric.
Grading:
Grade Type | Description
Standards Based Assessment | Final grades in this unit are awarded at levels of HD for High Distinction, DI (previously D) for Distinction, CR for Credit, PS (previously P) for Pass and FA (previously F) for Fail as defined by University of Sydney Assessment Policy. Details of the Assessment Policy are available on the Policies website at http://sydney.edu.au/policies. Standards for grades in individual assessment tasks and the summative method for obtaining a final mark in the unit will be set out in a marking guide supplied by the unit coordinator.
Minimum Pass Requirement | It is a policy of the School of Computer Science that in order to pass this unit, a student must achieve at least 40% in the written examination. For subjects without a final exam, the 40% minimum requirement applies to the corresponding major assessment component specified by the lecturer. A student must also achieve an overall final mark of 50 or more. Any student not meeting these requirements may be given a maximum final mark of no more than 45 regardless of their average.
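As a rough illustration only, the interaction between the assessment weights and the minimum pass requirement can be sketched as below. The function name `final_mark`, the dictionary keys, and the reading that a student who misses either requirement has their mark capped at 45 are assumptions made for this example; the policy says such a student "may be given" that maximum, and the marking guide supplied by the unit coordinator remains authoritative.

```python
# Illustrative sketch only: names and the strict cap are assumptions,
# not an official calculator for this unit.

# Assessment weights (percent of the final mark) from the Assessment Methods table.
WEIGHTS = {
    "db_programming": 10.0,
    "quiz": 10.0,
    "assignment": 20.0,
    "final_exam": 60.0,
}


def final_mark(scores):
    """Return an indicative final mark.

    `scores` maps each assessment name in WEIGHTS to a percentage (0-100).
    """
    # Weighted average of all assessment components.
    overall = sum(WEIGHTS[name] * scores[name] / 100.0 for name in WEIGHTS)

    # Minimum pass requirement: at least 40% in the written examination
    # and an overall final mark of 50 or more.
    meets_exam_minimum = scores["final_exam"] >= 40.0
    meets_overall_minimum = overall >= 50.0

    if meets_exam_minimum and meets_overall_minimum:
        return overall
    # Assumed reading of the policy: the final mark is capped at 45.
    return min(overall, 45.0)
```

For example, under these assumptions a student with full marks in every component except 35% in the final exam would have a weighted average of 61 but an indicative final mark of 45.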
Policies & Procedures: IMPORTANT: School policy relating to Academic Dishonesty and Plagiarism.

In assessing a piece of submitted work, the School of Computer Science may reproduce it entirely, may provide a copy to another member of faculty, and/or to an external plagiarism checking service or in-house computer program and may also maintain a copy of the assignment for future checking purposes and/or allow an external service to do so.

Other policies

See the policies page of the faculty website at http://sydney.edu.au/engineering/student-policies/ for information regarding university policies and local provisions and procedures within the Faculty of Engineering and Information Technologies.
Recommended Reference/s: Note: References are provided for guidance purposes only. Students are advised to consult these books in the university library. Purchase is not required.
  • Architecture of a Database System
  • Database Management Systems
Online Course Content: USyd e-Learning
Note on Resources: Other material and prescribed readings may be specified through the unit of study web page.

Note that the "Weeks" referred to in this Schedule are those of the official university semester calendar https://web.timetable.usyd.edu.au/calendar.jsp

Week | Description
Week 1 | Architecture of Database Systems; Organisation and Administrative Matters
Week 2 | Storage Layer: Physical Data Organisation
Week 3 | Tree-based Index Structures
Week 4 | Hash and Bitmap Indexes
Week 5 | Introduction to Query Processing and External Sorting
Week 6 | Query Execution and Join Algorithms
Week 7 | Query Optimization
Week 8 | Distributed Data Management (Assessment Due: Mid-Semester Quiz)
Week 9 | Distributed Computation and Data Processing
Week 10 | Dataflow Platforms
Week 11 | Data Stream Processing
Week 12 | NoSQL (Assessment Due: Assignment)
Week 13 | UoS Review
Exam Period | Assessment Due: Final Exam

Course Relations

The following is a list of courses which have added this Unit to their structure.

Course | Year(s) Offered
Bachelor of Advanced Computing (Computational Data Science) | 2018, 2019, 2020
Bachelor of Advanced Computing/Bachelor of Commerce | 2018, 2019, 2020
Bachelor of Advanced Computing/Bachelor of Science | 2018, 2019, 2020
Bachelor of Advanced Computing/Bachelor of Science (Health) | 2018, 2019, 2020
Bachelor of Advanced Computing/Bachelor of Science (Medical Science) | 2018, 2019, 2020
Bachelor of Advanced Computing (Computer Science Major) | 2018, 2019, 2020
Bachelor of Advanced Computing (Information Systems Major) | 2018, 2019, 2020
Bachelor of Advanced Computing (Software Development) | 2018, 2019, 2020
Bachelor of Computer Science and Technology | 2016, 2017
Bachelor of Computer Science and Technology (Advanced) | 2016, 2017
Bachelor of Computer Science & Tech. Mid-Year | 2016, 2017
Biomedical Mid-Year | 2016, 2017, 2018, 2019, 2020
Biomedical | 2016, 2017, 2018, 2019, 2020
Software | 2020
Bachelor of Information Technology | 2015, 2016, 2017
Bachelor of Information Technology/Bachelor of Arts | 2015, 2016, 2017
Bachelor of Information Technology/Bachelor of Commerce | 2015, 2016, 2017
Bachelor of Information Technology/Bachelor of Medical Science | 2016, 2017
Bachelor of Information Technology/Bachelor of Science | 2015, 2016, 2017
Bachelor of Information Technology/Bachelor of Laws | 2015, 2016, 2017

Course Goals

This unit contributes to the achievement of the following course goals:

Attribute | Practiced | Assessed
(5) Interdisciplinary, Inclusiveness, Influence (Level 3) | No | 0%
(6) Communication and Inquiry/ Research (Level 3) | No | 0%
(4) Design (Level 3) | No | 25.33%
(3) Problem Solving and Inventiveness (Level 3) | No | 0%
(2) Engineering/ IT Specialisation (Level 3) | No | 52.67%
(1) Maths/ Science Methods and Tools (Level 3) | No | 22%

These goals are selected from Engineering & IT Graduate Outcomes Table 2018 which defines overall goals for courses where this unit is primarily offered. See Engineering & IT Graduate Outcomes Table 2018 for details of the attributes and levels to be developed in the course as a whole. Percentage figures alongside each course goal provide a rough indication of their relative weighting in assessment for this unit. Note that not all goals are necessarily part of assessment. Some may be more about practice activity. See Learning outcomes for details of what is assessed in relation to each goal and Assessment for details of how the outcome is assessed. See Attributes for details of practice provided for each goal.