# Abstract Details

## (2020) Digging into Deep Time & Deep Cover

__Williams M__, Klump J, Barnes S & Huang F

https://doi.org/10.46427/gold2020.2870

06h: Room 5, Wednesday 24th June 05:33 - 05:36


Listed below are questions submitted by the community that the author will try to cover in their presentation. Authors or session conveners approve questions before they are displayed here.

*Submitted by Nicholas Barber on Wednesday 24th June 14:00*

This has been one of my favorite presentations so far at Goldschmidt. I'm an avid user of pyrolite, and I'm so excited to read such a clear distillation of the need for more robust computational workflows in geochemistry. Thank you for leading the way! My question pertains to the classifier notebook. What would be your advice to newcomers who want to get started with machine learning classification? For instance, is a radial SVC always an advisable model to apply to geochemical data? And why is it not necessary to transform isotope data for your classifier?

Thanks for the question, and nice to hear that pyrolite is getting some use! In general terms, I'd suggest starting with the simplest models (often those derived from the statistics world, such as logistic regressions and support vector classifiers) - these are typically easier to link logically to what's happening geologically, easier to visualize and communicate, and have properties which can make them more robust when applied to new data (including relatively well-defined continuous class boundaries - unlike e.g. random forests). The radial basis function SVC has some tradeoffs, but will often be more performant when you have multiple slightly intermingled classes next to one another - if you can train it with a diverse and representative training set, it can work quite well. But you may as well start with the linear equivalent, as the difference is typically one line of code, and it will give you a feel for the subtleties and limitations of your model. In the end, what we're after often isn't the 'most accurate model possible' - it's something we can practically put to use to solve research problems, and the ability to interpret the model and apply it to new data will be more useful than an additional few % accuracy.

As for the transformations - what we want to use as inputs to our model are log-ratios, as the covariance structure of these variables is most appropriate given the normality assumptions of the classifier models. The isotope data are typically ratios already, so I log-transform them to turn them into log-isotope-ratios and adjust the distribution towards something closer to 'normal'. Note that we can't input them directly to a compositional log-transform, as we'd then be taking ratios of ratios - e.g. something like 143Nd/144Nd/MgO, which doesn't make much sense.

The elemental and oxide abundance data require a log-ratio transformation (CLR or ILR both work well), but the covariance of these isn't affected by the isotope ratios themselves - just the abundances of the respective elements. You could transform the isotope ratios to bulk isotope abundances (e.g. use the Nd abundance and 143Nd/144Nd ratio to give x ppm 143Nd and y ppm 144Nd, although these would be highly correlated) and use a single log-ratio transform for everything, including components which are individual isotopes, but given that we're scaling everything after this step (to [-1, 1]) you aren't likely to see much of a difference.
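The "one line of code" point about starting linear can be sketched with scikit-learn (not the presentation's notebook - the data here are synthetic and all parameters are illustrative):

```python
# Sketch: comparing a linear SVC with an RBF SVC - swapping kernels
# is a one-line change. Synthetic data stands in for geochemical
# features; nothing here is taken from the talk's notebook.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative three-class dataset with six features.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_classes=3,
    n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):  # the one-line difference
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(kernel, round(scores[kernel], 3))
```

Comparing the two held-out scores gives a quick feel for whether the extra flexibility of the RBF kernel is actually buying anything on a given dataset.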

*Submitted by Julin Zhang on Wednesday 24th June 15:45*

Hi, I am Julin Zhang, a PhD candidate with a geochemistry and data science background at Rice University. It is a great project and I particularly like the examples you gave in the Jupyter notebook, but I have a question on the classification example. I noticed that you used compositional log-ratio transforms for the major elements. A log-ratio transform is necessary when we aim to explore the correlation between major oxides, since the weight percentages of all the major oxides sum to 1 and this may introduce spurious correlation. But for this classification problem (or other regression problems), the goal is to accurately discriminate (or predict) the data, not to explore the correlation between different features. So are log-ratio transforms for the major oxides still necessary in this case, or do they introduce other advantages I am missing? Thanks for your attention! -Julin

Thanks for the question Julin! The classifier models typically exploit the covariance between different features, perform best with normally distributed data, and themselves know nothing about compositional data. The log-ratio transforms (especially for the major oxides) will remove some distortion due to closure, and produce a dataset with properties more amenable to input into standard ML models. This includes the removal of spurious correlation (which would otherwise be leveraged by your model as 'information'), transformation of log-normally distributed data towards something more normal, the 'straightening' of linear mixing trends, and a parameter space in which both distance metrics and measures of central tendency (e.g. a mean) are meaningful.

The inclusion of a compositional transform in your classifier might also enable you to better deal with dataset shift and generalisability to new data - as it takes care of both the closure effect (e.g. if you have new data with features which sum to 0.9) and adjusts for the non-linearity inherent in a compositional space (the effects of which are amplified away from your training set). Do you need compositional transforms to get an accurate classifier? Probably not, but they will often improve performance. And the limitations on where you can use compositional transforms are similar to those for ML models anyway - missing data is the killer for both.
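The closure point can be illustrated with a minimal centred log-ratio (CLR) transform written directly in NumPy (a sketch, not pyrolite's API): a composition rescaled to sum to 0.9 instead of 1.0 maps to exactly the same CLR coordinates, so the transform absorbs that kind of dataset shift.

```python
# Minimal CLR sketch in NumPy (illustrative, not pyrolite's API).
import numpy as np

def clr(comp):
    """Centred log-ratio: log(x_i) minus the mean of the logs."""
    logx = np.log(comp)
    return logx - logx.mean(axis=-1, keepdims=True)

full = np.array([[0.50, 0.30, 0.15, 0.05]])  # illustrative composition, sums to 1.0
partial = 0.9 * full                         # same composition, sums to 0.9

# Closure removed: both map to identical CLR coordinates.
print(np.allclose(clr(full), clr(partial)))  # True
```

CLR coordinates also sum to zero across each row, which is why multiplicative rescaling of the raw composition drops out entirely.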
