
What do data scientists do: The data science diaries

Definitions of what constitutes ‘data science’ are apt to differ according to the source you consult. IFoA President John Taylor has described the discipline as “a dynamic field that as soon as you try and pin down what it is today, it’s changed tomorrow”. This being the case, how does this apparently elusive definition reflect the actual job of data scientists? What goes into their daily roles and responsibilities, and how do data scientists differ from organisation to organisation?

First-hand accounts are the best way to gain insight into what any professional remit entails, and so the IFoA invited three data scientists who work in contrasting sectors to outline the basics of a typical working week. Their accounts reveal that, as well as being core members of multidisciplinary teams, data scientists have to be multidisciplinary individuals, adept at a range of data analysis, software and business skills.


Gousto: Irene Iriarte Carretero


Irene Iriarte Carretero is a data scientist at the UK’s largest recipe box company, Gousto*. Her work has focused on the development and implementation of different data science products, such as a menu recommendation engine and a forecasting algorithm that allows Gousto to predict the recipes that customers will order, and to ensure that the company minimises the food waste in its supply chain.

Monday

AM: I’m working from home today. It’s a great way to start the week, as I’m really able to focus on tasks that require more concentration. After a quick call with the team to finalise what we are working on this sprint, I do some research on how top companies are applying personalisation. There are some really interesting blog posts and papers – I summarise my thoughts so that I can share them later in the week. 

PM: I do some analysis to understand how our customers are interacting with their recommendations – we are working on the implementation of a new collaborative filtering method, and these findings will feed into how we end up implementing the algorithm.

Tuesday

AM: My calendar looks pretty clear today – I have time to focus on starting the implementation of the new algorithm. Luckily, the data we need is straightforward to obtain and is already pretty clean, so I can get straight to using it. Given that, as data scientists, we take ownership of the entire process, from ideation of the product to deployment, I need to ensure that my code is production-ready. I send my code to one of the Machine Learning Engineers in the team, who gives me some suggestions to make the code more efficient. Once I make these changes, I check that everything is working as expected in our testing environments before pushing it to production.

Wednesday

AM-PM: A bit of a change of pace today. We are spending the day at an offsite location where we are going to have several workshops to brainstorm on the long-term vision for our menu page. Our team is cross-functional and includes members from the Design, Food, Software Engineering and Proposition teams, as well as Data Science, which ensures that we think about our ideas from different perspectives. As data scientists, I think it’s easy to get too caught up on improving a model’s accuracy, so I always find it useful to have opportunities like this where we can really think about how our products slot into the wider picture. 

Thursday

AM-PM: It’s really important that we understand the impact of our products, so today I am focusing on analysing the results of an experiment we ran on the website, in which 50 per cent of our customers saw a slightly different experience of how recommendations were presented to them. We make sure that, as well as internal algorithm metrics, we deep-dive into more commercial metrics that we can easily communicate to stakeholders across the business, and that are more useful for understanding how our products impact our customers.
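As an illustration of the kind of analysis described above, here is a minimal sketch of a two-proportion z-test comparing conversion rates between the two halves of an experiment. It is not Gousto's actual method: the function name `two_proportion_z_test`, the choice of conversion rate as the metric and the example counts are all assumptions made for illustration.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and customers in the control group (assumed metric).
    conv_b / n_b: conversions and customers in the variant group.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, written with erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Hypothetical counts, not real experiment data.
z, p = two_proportion_z_test(conv_a=480, n_a=5000, conv_b=540, n_b=5000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A test like this only covers a single rate-based metric; commercial metrics such as revenue per customer would usually need a different test or a bootstrap.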

Friday

AM: After a much-needed coffee (it’s been a long week!), I have one-to-one meetings with the team and catch up with some outstanding emails.

PM: We have a team retrospective to discuss what has worked well and what improvements we can make to ensure we are working more effectively. Finally, I work on preparing a short presentation for our monthly Tech Showcase – it’s a great opportunity to share our work with the whole company over some drinks and nibbles.


*Based on volumes and revenues, based on third party data from Reward Insight.


Centre For Environmental Data Analysis: Graham Parton


As a Senior Environmental Data Scientist at the Centre for Environmental Data Analysis (CEDA), Graham Parton curates observational data from its atmospheric community, and ensures they are accessible and future-proofed for CEDA’s users. His role also scopes development of the content and structure of CEDA’s data cataloguing service, a data discovery tool and link to supporting materials. CEDA services are provided on behalf of the Natural Environment Research Council via the National Centre for Atmospheric Science and the National Centre for Earth Observation. CEDA is based within the RAL Space department of the Science and Technology Facilities Council.

Monday

AM: Start the week with a quick catch-up on the helpdesk where I can see our regular batch of users struggling to use the surface weather observations data from the National Met Agency. It’s brilliant data – it’s just that they need to cross reference the station metadata with the data itself before they can get on with their research. The open version of the data collection tool resolves most of those usability issues, which is helping new users to access these data… I’ll remind our Met Agency partner about this when we catch up about the upcoming new release of those data, as he’ll be pleased to see the rewards of his efforts there! The rest of the day I’ll get on with coding to improve our catalogue service. There are a few niggly bugs I reckon I can fix this afternoon.

Tuesday

AM: Focused on weekly ‘Data Management Plan’ (DMP) related tasks. Checking our internal DMP tool, I see I’ve got a couple of new projects that have come in from the latest Research Council funding round for me to get in touch with. Will look at their project details and their outline DMPs to figure out what data they may want to archive, and then make contact later; but, for now, I’ve got some new sample files from another project to look at. Hopefully, these will be an improvement on the last ones they sent, where the internal metadata was, well, a bit sparse to say the least!

Thankfully, though, I can refer them to our help documentation to steer them in the right direction to resolve those issues. Just hope they can find their notes about the instrument’s deployment last year – without the instrument calibration details the data’s reusability could be questionable.

PM: Still not managed to look over those new projects that came in yesterday, as one of my ingest scripts encountered issues with the storage system; so I’ve had to spend most of the morning checking over the issues and getting the ingest restarted. But my diary is clear later this afternoon, so I should be able to fire off those introductory emails at last.

Wednesday

AM: Day of meetings. For starters, our developer group catch-up – usually tough for me, as I’m not a seasoned code developer, so some of the stuff is a bit difficult to follow. After that a Google Hangout to catch up with my line manager to review the outstanding developer task lists for the data catalogue. Hopefully with our new archive access database in place later this month I’ll get the go-ahead to develop the catalogue service: this will ensure up-to-date access and licence information can be fed through from our new tool. Then we can finally stop having to record this information in more than just one place.

After that I’ve a 16:30 meet with a colleague to see if she can help review our 70+ data licences to classify them with our new scheme. That would really help users to filter our 6,500+ datasets to find ones which they can use for work purposes (e.g., commercial or personal use), and not just assume that it’s just for academic use. This classification stuff is quite exciting, though, as it’s a new approach that is getting lots of interest across the international research data community, and not just for environmental data either! If it gets adopted more widely users will be able to do something akin to Google image searches, which allows searches to be limited by permitted uses.

Thursday

AM: Monthly group meeting. Hear about the wide range of work that our group is engaged with, from work on our cloud portals and development of our high-performance data analysis system, through to involvement with international metadata standards.

PM: Crack on with data management tasks. Before that, though, I’ll spend a bit of time checking our Elasticsearch index of our entire 200 million files to check for occurrences of some site names used in filenames to aid a project’s file naming scheme. It would be like looking for the tiniest of needles in the mother of all haystacks if it wasn’t for our Big Data tools to scan and index everything – but that’s what’s needed these days to manage such vast and diverse archives.

Friday

AM-PM: Today was pretty full-on. One minute I was setting up new data extractions to pull forecast model data into the archive (a quick task, but needs regular checks, as not all extractions run smoothly), the next covering the helpdesk, aiding users to find relevant data and sorting out their account issues. Then a brief Google Hangout with a developer to check on the adjustments to the new access control system, before doing a handful of data catalogue record reviews and DOI (Digital Object Identifier) minting, and getting a new dataset finally published after the last few weeks spent persuading the provider to actually follow the metadata guidelines. Thankfully I’ve had a quiet last hour of this week to catch up on my colleague’s blog post about data citation. It’s important stuff, essential for researchers to follow – otherwise, how can we rely on the science if we can’t find the underlying data that supports the results?



Bloom & Wild: Dave Marshall

Dave Marshall has been Lead Data Scientist at online florist Bloom & Wild for more than three years. The focus of his work is to use data to drive decision making in the company, and automate this wherever possible. This feeds into the main company objective of maximising the lifetime value of customers.

Monday

Make coffee first thing, to get the day going, and then look into an experimental project we are running with a new product recommendation system, which personalises the search for our customers to recommend the perfect bouquet for them. I discuss potential improvements to the experiment with my data analysis colleague, and whether to roll out to all customers with another colleague, Bloom & Wild’s Retention Product Manager. We decide to keep A/B testing, where 50 per cent of our customers see the recommended products and the other 50 per cent do not, so that we can quantify the effect.

Tuesday

I catch up with my data analysis colleague on our priorities for the week. He’s working on an exciting project to improve our product metadata – the properties of the bouquets and plants that we sell. This will allow us to automatically calculate a similarity score for different items which could feed future improvements for our product recommendations. He has worked closely with backend developers to get data in place. We agree the project is nearly complete, and that he’ll present the work at next week’s company-wide meeting. We think we’ll get lots of ideas for other areas of the business where this new dataset can also be implemented.
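To make the idea of a metadata-based similarity score concrete, here is a minimal sketch using Jaccard similarity over sets of product attributes. This is an assumed technique chosen for illustration, not Bloom & Wild's actual scoring; the product names, attribute tags and the helper names `jaccard_similarity` and `most_similar` are all hypothetical.

```python
def jaccard_similarity(a, b):
    """Share of attributes two products have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # Convention chosen here: two empty attribute sets score 0.
    return len(a & b) / len(a | b)


def most_similar(name, catalogue):
    """Rank every other product in the catalogue by similarity to `name`."""
    target = catalogue[name]
    scores = [
        (other, jaccard_similarity(target, attrs))
        for other, attrs in catalogue.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical product metadata, for illustration only.
products = {
    "rose_bouquet": {"roses", "pink", "letterbox"},
    "peony_bouquet": {"peonies", "pink", "letterbox"},
    "potted_fern": {"fern", "green", "potted"},
}

for other, score in most_similar("rose_bouquet", products):
    print(f"{other}: {score:.2f}")
```

Set overlap treats every attribute as equally important; a production score would probably weight attributes differently (for example, flower type over colour).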

Wednesday

We have a weekly company-wide meeting at midday. In 20 minutes we hear how we are performing against our key metrics, and also get an update on any career opportunities currently open across the company. It’s traditional that the presenter sprinkles in fun facts on a theme. After this we head off for lunch.

Thursday

Busy day. Kicked off a project to improve our website speed with two frontend developers and our VP of Product. A faster-loading and more responsive web shop should make customers (even) happier and purchases more likely. My role is to crunch site performance data from our UK & Ireland, German and French websites: this is to identify where we’ve negatively impacted the experience, or whether certain pages suffer on specific web browsers. I have also created a dashboard for developers to check they are making progress!

Friday

AM: Headphones on, and time to work on my own for a bit. I write some code to help our operations team better understand our stock position at any given time, and to predict which bouquets will be left unsold at the end of the day. Dealing in perishable goods means that we need to forecast very accurately to avoid wastage. Meanwhile, this week it’s Bloom & Wild’s 6th anniversary, so at 18:00 the whole company heads off to eat some birthday cake and to do a flower arranging workshop!
 
