Kafka Summit 2019 talk

I gave a talk titled “Building a newsfeed from the Universe: Data streams in astronomy” at the Kafka Summit 2019 conference, held in San Francisco on October 1st.

The talk is a case study on using Kafka to build a streaming data processing framework that scientists can use to apply analytics to real-time data.

You can watch the recording below and also view the slides on Confluent’s website.

Summary:

The field of astronomy is rapidly moving away from the traditional notion of a lone astronomer pointing a telescope at a single object in a static sky. Initiatives such as the Sloan Digital Sky Survey have ushered in a collaborative, big-data era of wide-field sky surveys, in which telescopes collect observations continuously while sweeping across the visible night sky. This method of data collection not only enables very deep imaging of distant, faint objects but is also optimal for finding objects that are changing or moving. By analyzing the differences in astronomical image data from one night to the next, astronomers can detect “transient” objects, such as variable stars, supernovae, and near-Earth asteroids.

New sky surveys provide a wealth of scientific value for astronomers, but not without technical challenges. Survey data need to be processed automatically and the results distributed immediately to the scientific community to enable rapid follow-up observations, since transient astronomy can be highly time-sensitive. Detection alert distribution mechanisms need to be robust and reliable, maintaining scientific integrity without data loss. Alerting systems also need to scale to a data volume unprecedented in astronomy: detection rates have grown to the point that a single night of transients can exceed the total of all historical records. A streaming architecture is ideal for automatically distributing and processing transient data in real time, as it is being collected.

In this talk, we will discuss how Kafka and Avro are being used in wide-field astronomical sky survey pipelines to serialize and distribute transient data, the design choices behind this system, and how this alert stream system has been successfully deployed in production, distributing transient detection alerts to the scientific research community at rates in excess of 1 million events per night.
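
To make the serialization and distribution step concrete, here is a minimal sketch of an Avro-encoded alert producer in Python. It assumes the confluent-kafka and fastavro packages, a broker at localhost:9092, and a hypothetical, much-simplified alert schema and topic name; it is not the actual survey schema or pipeline code.

```python
import io

import fastavro
from confluent_kafka import Producer

# Hypothetical, simplified alert schema for illustration only;
# real survey alert schemas carry far more fields.
ALERT_SCHEMA = fastavro.parse_schema({
    "name": "Alert",
    "type": "record",
    "fields": [
        {"name": "alertId", "type": "long"},
        {"name": "ra", "type": "double"},       # right ascension, degrees
        {"name": "dec", "type": "double"},      # declination, degrees
        {"name": "magnitude", "type": "float"},
    ],
})


def serialize(alert: dict) -> bytes:
    """Encode one alert record as schemaless Avro bytes."""
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, ALERT_SCHEMA, alert)
    return buf.getvalue()


producer = Producer({"bootstrap.servers": "localhost:9092"})

alert = {"alertId": 1, "ra": 150.1, "dec": 2.2, "magnitude": 19.7}
producer.produce("transient-alerts", value=serialize(alert))
producer.flush()  # block until the broker acknowledges delivery
```

Because schemaless Avro bytes carry no schema on the wire, consumers need the matching schema version to decode them; alert systems typically version their schemas and distribute them alongside (or embedded with) the data so downstream science pipelines can always interpret what they receive.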

Speaker bio:

Since earning a PhD in astronomy, Maria Patterson has worked to help scientists optimize their data analysis pipelines. She worked with NASA on cloud-based pipelines for automated processing of daily satellite imagery. During the initial stages of NOAA’s Big Data Project, she led the technical development of a science “data commons” to ensure programmatic data access for researchers. Maria then returned to astronomy to architect a system for distributing alert streams of real-time object detections from large-scale telescopes. Currently, Maria is a Data Scientist with the venture studio High Alpha, helping new cloud companies augment their products with data science.
