We present Brut, an algorithm to identify bubbles in infrared images of the Galactic midplane. Brut is based on the Random Forest algorithm, and uses bubbles identified by >35,000 citizen scientists from the Milky Way Project to discover the identifying characteristics of bubbles in images from the Spitzer Space Telescope. We demonstrate that Brut's ability to identify bubbles is comparable to that of expert astronomers. We use Brut to re-assess the bubbles in the Milky Way Project catalog, and find that 10-30% of the objects in this catalog are non-bubble interlopers. Relative to these interlopers, high-reliability bubbles are more confined to the midplane, and display a stronger excess of Young Stellar Objects along and within bubble rims. Furthermore, Brut is able to discover bubbles missed by previous searches -- particularly bubbles near bright sources, which have low contrast relative to their surroundings. Brut demonstrates the synergies that exist between citizen scientists, professional scientists, and machine learning techniques. In cases where "untrained" citizens can identify patterns that machines cannot detect without training, machine learning algorithms like Brut can use the output of citizen science projects as input training sets, offering tremendous opportunities to speed the pace of scientific discovery. A hybrid model of machine learning combined with crowdsourced training data from citizen scientists can not only classify large quantities of data, but also address the weaknesses of each approach if deployed alone.
The Milky Way Project enlists the public to analyze star-forming regions of our galaxy in infrared images taken by the Spitzer Space Telescope. Stars form in dense regions of gas and dust called molecular clouds. As new stars burst into life, these young stellar objects send out a shockwave of photons and particles - the stellar wind - that pushes away the surrounding cloud. These swept-out cavities appear as bubbles in images of the clouds.
Without help, computers have a hard time recognizing the bubbles. That’s where the public comes in. The more than 35,000 people participating in the Milky Way Project produced a catalog of about 5,000 bubbles - almost 10 times more than the largest professional catalog.
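The core idea behind Brut can be sketched in miniature: bootstrap-sample the citizen-labeled examples, fit a simple decision rule to each sample, and classify new candidates by majority vote across the ensemble - the Random Forest recipe. The toy sketch below uses pure-Python decision stumps; the two feature names (ring contrast, ring completeness) and all numbers are illustrative assumptions, not the actual features Brut extracts from Spitzer images.

```python
import random
from collections import Counter

def train_stump(X, y):
    """Find the one-feature threshold that best splits the labels.

    Returns (error, feature_index, threshold, label_if_below, label_if_above),
    or None if no valid split exists in this sample.
    """
    best = None
    n_features = len(X[0])
    # Random-subspace step: each stump only considers a random subset of features.
    for f in random.sample(range(n_features), max(1, n_features // 2)):
        for t in sorted({x[f] for x in X}):
            left = [yi for x, yi in zip(X, y) if x[f] <= t]
            right = [yi for x, yi in zip(X, y) if x[f] > t]
            if not left or not right:
                continue
            ll = Counter(left).most_common(1)[0][0]
            rl = Counter(right).most_common(1)[0][0]
            err = sum(l != ll for l in left) + sum(r != rl for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, ll, rl)
    return best

def train_forest(X, y, n_trees=25):
    """Bootstrap-sample the labeled examples and train one stump per sample."""
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]
        stump = train_stump([X[i] for i in idx], [y[i] for i in idx])
        if stump is not None:
            forest.append(stump)
    return forest

def predict(forest, x):
    """Majority vote across the ensemble - analogous to a bubble/non-bubble score."""
    votes = [(ll if x[f] <= t else rl) for _, f, t, ll, rl in forest]
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical citizen-labeled training set: per candidate,
    # two made-up features (ring contrast, ring completeness) in [0, 1].
    X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.85], [0.2, 0.1], [0.3, 0.2], [0.1, 0.3]]
    y = ["bubble", "bubble", "bubble", "other", "other", "other"]
    forest = train_forest(X, y, n_trees=25)
    print(predict(forest, [0.85, 0.9]))  # votes "bubble" on this toy data
```

The real Brut works the same way at scale: the crowd supplies the labels, the forest learns which image properties those labels correlate with, and the vote fraction gives a calibrated reliability score for each candidate.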
The project’s blog has a high-level explanation of what the algorithm does - and how the public, the professionals, and the machine intelligences work together. But the paper itself is very readable, especially for the background information on molecular clouds and crowdsourced astronomy. Some highlights:
- Humans still outperform computers in pattern recognition
- Computers outperform humans when faced with huge datasets
- The Milky Way Project analyzed 45GB of images, but next-generation observatories will produce terabyte and petabyte data sets.
- Citizen scientists make their biggest impact on science when they teach the computers
Go to the blog post and the arXiv preprint to get the full details.