Capstone projects

PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz



  • come to class next time prepared to show a draft of your vignette

  • after class today look for abstracts on the course site

  • office hours in place of section meetings this Wednesday

Capstone projects


Amgen is an international biotechnology company.

Project: evaluation of natural language processing algorithms used for knowledge graph generation

  • NLP algorithms are used to recognize and link entities and identify relations based on text data

  • outputs can be used to extract networks (‘knowledge graphs’) from text corpora

Goals: generate a benchmarking dataset from PubMed database and evaluate performance of NLP-based methods for knowledge graph construction

See this example of related work.


Appfolio is a local property management software company. Among other things, clients use their software for accounting purposes.

Project: anomaly detection from property management transaction histories

  • clients would like to flag smaller/higher transactions than typical without having to inspect full transaction records each month

Goals: determine applicable anomaly/outlier detection methods, evaluate performance, and develop dashboard based on best method(s)


CalCOFI stands for California Cooperative Oceanic Fisheries Investigations. They run a long-term monitoring program of the California current ecosystem.

Project: an eDNA window into larval fish habitat, ecosystem structure, and function

  • CalCOFI collects physical data and environmental DNA (eDNA) across depth in the water column at multiple monitoring sites longitudinally

  • a major interest is on impacts of environmental conditions on fisheries

Goals: develop dashboard for exploration of eDNA data and develop a model for prediction of fish larvae based on eDNA and physical data (or derived variables)

Carpe Data

Carpe Data is a local company focusing on data-driven solutions for insurance carriers.

Project: business characteristics classification models

  • insurance carriers use risk categorization as a factor in determining premiums for property loss and general liability insurance for businesses

Goals: classify businesses according to characteristics and/or risk levels based on basic business information (name, description, hours, images, etc.) and develop software pipelines for preprocessing and prediction

Caves visual ecology lab

Example closeup of bee eye.

The Caves lab studies visual acuity and its evolutionary and ecological drivers in animals. Bees are great model organisms for studying the relationship between ecology and acuity due to wide variation in lifestyles and ecologies.

  • Project: measuring visual acuity in bees from high-resolution images

    • acuity quantification is derived from physical measurements that could potentially be inferred from photographs rather than measured directly

    • specifically, radius of curvature of the eye and width of ommatidia (compound eye facets)

  • Goals: utilize computer vision techniques to infer acuity measurements from photographs and merge with ecological data to explore correlates


The Cheadle Center for Biodiversity and Ecological Restoration (CCBER) contributes to the Big Bee Project, aimed at creating >1M 2D and 3D high-resolution images of bees for the study of anatomical variation.

Project: constructing three-dimensional bee models from high-resolution images

  • several software tools for constructing 3D models from 2D images are available, but performance with bees specifically is not well understood

  • several image sets of ~100 photographs each are available for construction of models

Goals: after receiving training on 3D modeling tools, students will experiment with parameter tuning for optimal rendering of bee models; students will then derive physical measurements from the models.

Climate Hazards Center

The Climate Hazards Center housed in the geography department is a multidisciplinary research center focusing on climate risk analysis and response.

Project 1: identifying the drivers of food insecurity in the developing world

  • precipitation and potential evapotranspiration together can help identify deficits in available water and risk of food shortages

  • explore the relationship between precipitation and potential evapotranspiration globally over time and identify drivers after accounting for relationship

Project 2: evaluating and validating station- and satellite-based daily precipitation datasets

  • CHC supports precipitation datasets based on interpolating measurements from satellites and ground stations

  • new data releases go through an evaluation/validation process with benchmarking data prior to release

  • students will carry out evaluation/validation of a new release based on prior strategies

EEMB/Patrick Green

Patrick Green is a research scientist in EEMB studying animal behavior and competition, and is developing a pilot project studying contests among mantis shrimp.

  • Project: how do mantis shrimp fight in a community of competitors?

    • feasible to carry out continuous video monitoring of small populations in tanks to capture interactions and other behavior

    • time-consuming to review footage

    • not obvious how to define/code interactions

  • Goal: develop heuristic methodology for detecting time intervals in which interactions occur; generate tracking data summarizing movements of each individual.

Evidation Health

Evidation is a California-based company focusing on health data analytics for individuals and for researchers.

Project: Impact of case definition on early detection systems for COVID-19

  • model-based early detection systems for infectious disease use inconsistent criteria for case onset

  • the definition of when a case starts may impact the efficacy and other features of early detection systems

Goals: assess the impact of case onset definition on existing early detection models and find optimal case onset points.


Inogen is a medical device company that builds portable oxygen concentrators.

Project: analysis of portable oxygen concentrator patient use data

  • Inogen collects a variety of data on patient use of their POC devices and is interested in identifying areas of potential improvement

Goals: identify patterns of device use from patient data with particular focus on adherence.

MOVE lab

The MOVE lab at UCSB focuses on movement data science in general and in particular human mobility in response to disruptions.

Project: detecting changes in human mobility and movement patterns associated with wildfires in California

  • movement and mobility are often markers of behavior and changes in movement patterns may capture information about behavioral responses to events

  • natural disaster in general and wildfire in particular are likely to produce shifts in movement patterns

Goals: assess suitability of several candidate datasets for studying behavioral responses to wildfires; identify movement patterns and explore the hypothesis that change points occur in connection with wildfire events; potentially explore demographic covariates.

Peak Performance Project (P3)

P3 is a local company focusing on applied sports science and technology for biomechanical analysis of athletic performance.

Project: understanding links between biomechanical data and on-court NBA production

  • P3 collects biomechanical data – force plate and motion capture – on professional athletes and maintains a large proprietary database on NBA athletes

  • historically, have focused on injury risk, but interested in finding biomechanical correlates of real performance

Goals: scrape publicly available on-court data, merge with biomechanical data, and identify correlates of on-court production; develop visualization tools.

SLAC National Accelerator Lab

Stanford Synchrotron Ratiation Lightsource (SSRL) is a DOE facility at Stanford supporting a wide range of fundamental research involving bright X-rays.

Project: diffraction image selector

  • X-ray diffraction data provides insight into the atomic and molecular structure of crystals; serial crystallography involves merging diffraction patterns from several crystals in order to analyze the material structure

  • experimenters tend to select diffraction images manually from serial experiments; this selection has an impact on experimental outputs

Goals: build a regression model to label images during data collection in real time; develop model from simulated data and validate on real data.

Project Preferences

Please read abstracts first and then fill out the preference form by Friday 12/2.

  • 3 top lab choices

  • 3 top industry choices

We’ll try to accommodate preferences, but we can’t guarantee you’ll get your top choices.

Target date for assignments: Friday 12/9.