Project Specification

Group formation

The project group may consist of up to 3 students.

Project grading

The project is worth 35% of the class grade, with the following details:

Component                 Due date        Proportion (%)
Proposal                  May 5           10
Check-up                  May 19          10
Presentation              June 2 and 4    25
Report                    June 10         45
Peer experience summary   June 10         10
Total                                     100

Project proposal

Develop a proposal that includes your topic, your selection of relevant dataset(s), and a plan to answer your questions of interest. Keep in mind the timeline of the quarter and set achievable goals.

Guidelines for proposal

Review the guidelines below carefully.

Overarching theme: Design a data process that involves ingesting data, preprocessing, analysis, and service. Aim for a process that can scale and can be rerun when the data are updated.
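
As a rough illustration of the overarching theme, the four stages can be written as separate functions chained by a single entry point, so that the whole process can be rerun whenever the data change. This is only a minimal sketch using the Python standard library: the sample data, column names, and function names are all hypothetical, and a real project would swap in its own sources and tools.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a real source; one value is missing.
SAMPLE_CSV = "id,value\n1,10\n2,\n3,30\n"


def ingest(text):
    """Ingest: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(rows):
    """Preprocess: convert types; empty strings become None."""
    return [
        {"id": int(r["id"]), "value": float(r["value"]) if r["value"] else None}
        for r in rows
    ]


def analyze(rows):
    """Analyze: summarize the observed (non-missing) values."""
    observed = [r["value"] for r in rows if r["value"] is not None]
    return {
        "n_rows": len(rows),
        "n_missing": len(rows) - len(observed),
        "mean": statistics.mean(observed),
    }


def serve(summary):
    """Serve: format the result for a consumer (here, a plain string)."""
    return (
        f"{summary['n_rows']} rows, {summary['n_missing']} missing, "
        f"mean={summary['mean']:.1f}"
    )


def run_pipeline(text):
    """Run every stage in order; updated data only changes the input."""
    return serve(analyze(preprocess(ingest(text))))


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_CSV))
```

Because each stage takes the previous stage's output and nothing else, calling `run_pipeline` again on a refreshed file repeats the full process without code changes.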

The proposal should follow this format (5 points):

  • Include full names of all teammates.
  • Between 1 and 2 pages, single-spaced, 11-point type, 1-inch margins.
  • Do not include graphics.
  • Decide on the name of your group. A pdf named [GroupName]_proposal.pdf should be submitted.

The proposal should cover the following components (but in a narrative format, not Q/A):

  • Introduction (12 points): Introduce your project and motivation, e.g.,
    • What is the main issue you are interested in?
    • Why is this topic important?
    • In what way does this project provide a solution?
  • Data source(s) (10 points): Describe the data sources you have chosen, e.g.,
    • How can the data be retrieved?
    • How are the data related to the topic?
    • State the amount of data you will be working with.
  • Goal definitions (18 points): State two to three goals you are interested in achieving with the selected data sources.
    As you tackle the goals, you should use at least one of the data engineering techniques covered in class:
    • Missing data imputation
    • Database system (relational or non-relational)
    • Distributed computing framework (e.g., PySpark)
    For each goal, answer the following:
    • How do you intend to achieve your goal?
    • How do you intend to use the above techniques?
    • What can you answer (and not answer) with the proposed datasets?
  • Reference (5 points): Provide valid references for your data source(s). (Reference does not count toward the page limit.)
    • For example, using APA guidelines, the MIMIC-III database can be cited as: Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.
    • or, for the original article: Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., … & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1), 1-9.
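
To make the techniques listed under goal definitions more concrete, the sketch below combines two of them: missing data imputation and a relational database. It is only an illustration under stated assumptions. Mean imputation is used because it is the simplest option, not because it is the method expected in class; SQLite (from the Python standard library) stands in for whichever database system you choose; and the records, table name, and function names are hypothetical.

```python
import sqlite3
import statistics

# Hypothetical records; None marks a missing measurement.
records = [("a", 4.0), ("b", None), ("c", 8.0)]


def impute_mean(rows):
    """Replace missing values with the mean of the observed values."""
    observed = [v for _, v in rows if v is not None]
    fill = statistics.mean(observed)
    return [(k, fill if v is None else v) for k, v in rows]


def load_into_db(rows):
    """Store rows in an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurements (label TEXT PRIMARY KEY, value REAL)"
    )
    conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = load_into_db(impute_mean(records))
    total = conn.execute("SELECT SUM(value) FROM measurements").fetchone()[0]
    print(total)
    conn.close()
```

In a proposal, the corresponding narrative would explain why the chosen imputation method suits the data and what the database makes possible that flat files do not.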

Potential data sources

Below are some possible avenues for finding data sources:

Project check-up

Two elements are required for the project check-up:

  1. Group, submitted on Canvas as a pdf. (30 points)
    • For each goal set in your proposal, provide a brief summary of progress. If you have not yet worked on a particular goal, simply say so.
    • Include any questions or roadblocks for which you need assistance.
    • If you would find it helpful to set up a time to meet with Prof. Chan, say so.
  2. Individual. (20 points)

Note: While the check-up form will not necessarily change your individual project grade, adjustments will be considered based on the combined feedback from the check-up and the peer experience summary.

Presentation

A presentation will be due on the last day of class. The presentation should follow these guidelines:

  • Up to fifteen minutes.
  • Tip: Distribute your content and do a quick dry run in advance. This will help your presentation go smoothly.

The presentation should include, but is not limited to, the following:

  • Motivation of your project.
  • Introduction to your dataset(s) and its relevance.
  • Goals, method of analyses, and findings.
  • Summary.

The presentation is evaluated on the following criteria (20 points each):

  • Organization of content
  • Appropriate use of language
  • Delivery
  • Use of supporting materials (visuals, statistics, etc.)
  • Clarity of central idea

Note that while the evaluation is primarily for the entire group, individual evaluations may differ if a significant discrepancy among students is observed. You may refer to the sample reference rubric for details.

Report

Prepare a project report to clearly outline your project choice and its importance, your approach to achieving the set of goals, your results, and a summary.

The report should follow these guidelines:

  • Include full names of all teammates.
    • This is a reminder that it is an academic violation to include your name on any work to which you have not contributed.
  • Up to 15 pages (not including Appendix), single-spaced, 11-point type, 1-inch margins.

Submission: A single pdf named [GroupName]_report.pdf should be submitted.

Include all of your code in a zipped folder in your submission. Make sure your code and its organization are understandable.


The report should include the following components in a narrative format:

  • Introduction (5 points): Introduce your project and motivation, e.g.,
    • What is the main issue you are interested in?
    • Why is this topic important?
    • In what way does this project provide a solution?
  • Data source(s) (5 points): Describe the data sources you have chosen, e.g.,
    • How can the data be retrieved?
    • How are the data related to the topic?
    • State the amount of data you will be working with.
  • Goals and approaches (60 points): State the specific goals you are attempting to achieve with the selected data sources. For each goal, describe your approach to achieving it, its implementation, and your results. Use graphs, tables, or other visualizations to support your narrative.
  • Summary (10 points): Provide a brief summary of your project, and describe any loose ends or future opportunities.
  • Generative AI statement and reference (10 points):
    • Generative AI statement: If you have used generative AI tools in any way, report all usage according to the syllabus. This statement does not count toward the page limit.
    • Reference: Provide valid references for your data source(s). (Does not count toward the page limit.)
      • For example, using APA guidelines, the MIMIC-III database can be cited as: Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.
      • or, for the original article: Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., … & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1), 1-9.
  • Code legibility and organization (10 points): Give a brief summary or provide a README explaining how to use and evaluate your code. (Does not count toward the page limit.)