My First Prototype

project
computers
Author

James Thompson

Published

March 10, 2024

I should preface this by saying that this blog post should have been published around the end of 2023. But due to me letting it slip through the gaps, it is here now; better late than never, as I have heard somewhere before.

This is a story about my first prototype: my first real piece of software that went through a proper development cycle (whatever that is) with real stakeholders and money involved. I have done various university assignments and built some large systems within Google Apps Script and Google Drive, however this one felt unique. It is an ongoing story of a project that started relatively small but has now carried on for almost a year.

Start

As part of my studies at Massey University I had to complete a final year project. This ran from July to October and made up a third of my workload for that semester. In choosing my project I wanted something that was both novel and hard. This made me reluctant to accept any of the professors' "prepared projects", as they were nothing new, just larger assignments. There was an opportunity to do a project on some education software, which seemed somewhat interesting. But before committing to this I was speaking to my father, who worked at the Transport Accident Investigation Commission. He was looking to build a tool that uses some of the new AI technology (namely ChatGPT) to make the investigation process easier and smarter. This sounded like a challenge that ticked the boxes of what I was looking for. Therefore a proposal was written up and accepted by my advisor, and work commenced.

Project

Goal

The end goal was to have a tool that would help agencies like the New Zealand Transport Accident Investigation Commission in two ways: supporting them in conducting inquiries and writing the reports, and increasing the insights and effectiveness of each published report. The more direct goal of the project was to use an LLM from OpenAI to create summaries and metrics for each of the reports. This would constitute a proof of concept to show me, my father and potentially TAIC that it was possible and even useful. To keep the data size manageable we limited the scope to only marine reports from 2010-2020.

Progress

I was completing this project as a requirement of a paper I was taking. This meant that I had to devote a third of my study time to it, with the rest taken up by my other courses in mathematics and programming. However, this project also took on the role of my part-time job, meaning I could devote another 5-10 hours a week to it. Lastly, I was living in a campervan in Wanaka, NZ so I could train for my snowboarding competitions. Thus, with three days a week up the mountain I was having very busy evenings, and four days "at home" (if a campervan counts as a home).

To get some software that was actually useful I needed to complete a few basic features:

  - Data collection
  - Basic analysis
  - Displaying the analysis (as discussed later, this requirement was missed in initial planning)

Firstly, data scraping and cleaning had to be done. This involved web scraping the TAIC website to find all the PDF reports. There was nothing really new or interesting here. But cleaning the PDFs into plain text files was more frustrating than it first appears. Even though everyone and their dog uses PDFs, actually interacting with them programmatically is quite another thing.
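To give a flavour of this step, here is a minimal sketch of the scrape-and-clean idea using the requests, BeautifulSoup and pypdf libraries. The listing URL, CSS selector and function names are my illustrative assumptions, not the project's actual code:

```python
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

# Hypothetical listing page; the real TAIC site layout differs.
LISTING_URL = "https://www.taic.org.nz/inquiries"

def find_report_links(page_url: str) -> list[str]:
    """Collect hrefs that point at PDF reports on a listing page."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a[href$='.pdf']")]

def pdf_to_text(pdf_path: str) -> str:
    """Extract raw text from a downloaded PDF, page by page."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

The scraping half really is that mundane; it is the text extraction that hides the frustration, since the output arrives with broken line wraps, headers and footers that all need cleaning up.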

I quickly learnt that you can't just give a whole report to an LLM, because a report is about 50 pages, which was way above the context limit of the LLMs at the time (GPT-3.5 etc.). This meant that I was going to need to extract the "useful" stuff from the reports. I followed guidance from an investigator that only the analysis and findings sections of a report hold the important bits of information. Fortunately for me, all the reports followed the same structure with these two sections.
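A hedged sketch of what that extraction can look like: since the reports share a structure, a regular expression over the section headings is enough to pull out just the analysis and findings. The heading pattern below is an assumption about the layout, not the project's actual parser:

```python
import re

# Match a numbered heading line ("4. Analysis" / "5 Findings"), capture the
# body lazily, and stop at the next numbered heading or the end of the text.
SECTION_RE = re.compile(
    r"^\s*\d+\.?\s*(Analysis|Findings)\s*$"
    r"(.*?)"
    r"(?=^\s*\d+\.?\s*[A-Z]|\Z)",
    re.MULTILINE | re.DOTALL,
)

def extract_useful_sections(report_text: str) -> str:
    """Return only the analysis and findings text from a full report."""
    return "\n\n".join(body.strip() for _, body in SECTION_RE.findall(report_text))
```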

Lastly was doing something useful with the engine (some might say the important bit). My first idea was to have the engine use its smartness to summarize the reports and create a dataset from which further analysis would really extract the insights. In particular, I wanted the model to assign a percentage for how much a given safety theme was involved in the accident. This worked surprisingly well. However, the immediate question was what these predefined themes should be. This made me want to go back to GPT-3.5 and ask it, after it had read all the reports, what it thought the main themes were. To do this I asked GPT-3.5 to summarize each report's safety themes, then gave it all the summaries and asked it to generate global themes. This produced what one could call something useful. Yet as I came to learn there wasn't time for post-hoc analysis, so the created dataset had to be as informative as possible. This meant, as discovered below, that a good UI for searching and reading was needed.
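As a sketch of the weighting step, the call per report looked roughly like the following. The prompt wording, theme list and client usage here are illustrative assumptions (the modern OpenAI Python client is shown, not necessarily what I used at the time):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical global themes; the real ones were generated by the model
# from the per-report summaries as described above.
THEMES = ["fatigue", "communication", "maintenance"]

def weight_themes(report_sections: str) -> str:
    """Ask the model to score each safety theme's involvement (0-100%)."""
    prompt = (
        "For each safety theme below, give a percentage (0-100) for how much "
        "it was involved in this accident, one 'theme: percent' per line.\n"
        f"Themes: {', '.join(THEMES)}\n\nReport:\n{report_sections}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```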

Now with all of this done, it was time to show my father and see what he thought about it. He had heard through emails what was going on, but this was the first time I was seeing him in person to show him the state of affairs. So I sat down, opened up my laptop ready to go with VS Code, and started showing some tabs of code and the output folders, and it was going great for maybe 3-5 seconds. Then I realized that I was speaking to a marine accident investigator, not a fellow programmer. So I let the demo be "postponed" and rapidly started building a web interface that gave a user-friendly experience for using the engine. Thus the Viewer was born and my project turned from just Python script files into a complete app! This webapp was developed later and is available live here: https://taic-viewer-72e8675c1c03.herokuapp.com.

This was more or less the end of my university project work. There is a much longer report that can be found here.

TAIC extension

Once the work was completed it was time to have a meeting with TAIC. As we were only dealing with a rough proof of concept, this was a small meeting with some researchers at TAIC rather than an official demo to leadership. Even though I didn't check anything beforehand, fate was on my side and the proof-of-concept software worked exactly as I needed when giving a mini demo; the stars must have been aligned that day. TAIC liked what they saw and wanted further work.

This extension was about a month, taking me from mid-November to the 22nd of December, which was going to be demo day. There were a few items they wanted to accomplish in this sprint. These were:

  1. Expand support to aviation and rail
  2. Upgrade the web viewer to be more user friendly
  3. Strengthen the analysis by adding quotes and reasoning to the weightings.

Firstly, the challenge of adding in aviation and rail. Due to my attempts at following good programming practice, adding in these reports was akin to just adding two extra input pipes into the data pipeline. It was satisfying to enjoy the fruits of my past self's diligence. For a great blog post I would probably put in a link to the commit that shows all the changes, but alas I had bad discipline and squashed a merge with too many changes.
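To show the shape of what made this cheap, here is a hedged sketch of the idea: each transport mode is just another registered source feeding the same downstream steps. All names here are hypothetical, not the project's actual structure:

```python
from collections.abc import Callable, Iterable

# Mode name -> function yielding report URLs for that mode.
SOURCES: dict[str, Callable[[], Iterable[str]]] = {}

def register_source(mode: str, fetcher: Callable[[], Iterable[str]]) -> None:
    SOURCES[mode] = fetcher

def process_report(url: str, mode: str) -> None:
    ...  # the unchanged download/clean/analyse steps

def run_pipeline() -> None:
    for mode, fetch in SOURCES.items():
        for url in fetch():
            process_report(url, mode)

# Supporting two new modes is then just two more registrations.
register_source("marine", lambda: ["https://example.org/marine-report.pdf"])
register_source("aviation", lambda: ["https://example.org/aviation-report.pdf"])
register_source("rail", lambda: ["https://example.org/rail-report.pdf"])
```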

With the dataset getting larger it was even more important to make the webapp useful and clean. Now this was a difficult part, not because it required a great feat of software engineering, but simply because I am not a web developer. This meant that the work here involved a fight between me and various Python web app frameworks. The good news is I won and, with the help of an LLM, upgraded the webapp to be pretty and professional looking. But even though the webapp works, it is certainly not the most elegant on the back end and will probably require a revamp as it moves forward.

The last big-ticket item was strengthening the engine's analysis. This was due to the ever-present question of "how can we trust what this model says", which was becoming ever more pressing as the engine said more and more. I saw two directions to tackle this from. The first is in the generation phase; that is, I can make something that is not only correct but verifiable. The second is actually verifying the output from the model. This idea is the bread and butter of statistics, where you check your model's output against what it should output. I had done this during the initial university project and it showed the outputs were reasonable. However, as I changed the output of the model the validation set would have to be recreated, and due to time constraints this was not possible. Going into this TAIC extension I wanted to focus on the first method, making each of the summaries, rankings etc. self-contained and human verifiable.

Making it human verifiable meant that I needed to have reasoning and, more importantly, references within any of the text outputs. Changing the GPT-4 prompts so that it explained itself was not hard. The problem was making sure that it wasn't just making things up. This part is what took the time. Fortunately, checking a quote is possible as we have access to the original reports. Therefore, in theory, we just need to check the reference material at the given location and see if we can find the reference. Yet as with all of life's theories, reality is a bit harder. The main problem was really just getting GPT to output the quotes and references in a format that was consistent and machine readable. However, after a good battle with the prompts and a bit of prompt engineering, I got somewhere that works well enough. It can't verify and find all quotes, but it handles a vast majority of them and can even repair the ones that have problems! The large outputs of the model, like theme summaries and safety theme weightings for each report, now have an explanation supported by direct references to the accident report.
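The verify-and-repair step can be sketched with nothing more than the standard library: check a quote exactly, and if that fails, fuzzy-match it against same-length windows of the source report. The threshold and function name below are illustrative choices, not the project's exact implementation:

```python
import difflib

def verify_quote(quote: str, source_text: str, threshold: float = 0.85) -> str | None:
    """Return the verified (possibly repaired) quote, or None if not found."""
    if quote in source_text:
        return quote  # exact match: verified as-is
    # Slide a window of the quote's word length across the source and keep
    # the most similar passage; accept it as a repair if it is close enough.
    words = source_text.split()
    span = len(quote.split())
    best_ratio, best_candidate = 0.0, None
    for i in range(max(1, len(words) - span + 1)):
        candidate = " ".join(words[i : i + span])
        ratio = difflib.SequenceMatcher(None, quote, candidate).ratio()
        if ratio > best_ratio:
            best_ratio, best_candidate = ratio, candidate
    return best_candidate if best_ratio >= threshold else None
```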

Further work

At the end of this work there was a meeting organized for the 22nd of December with the leadership of TAIC (CEO, chief investigator etc.). Unlike my other meetings, this was a proper product demo. There were diagrams, PowerPoints, the whole lot. However, one of the craziest things about it was that I would say things in the presentation and then the attendees would scribble down notes on what I said! The presentation went well from a technical perspective, as almost everything worked (except a filter to only include reports from a certain year range in search results; I'm not sure what happened, as I couldn't for the life of me recreate the error at home later). From a "business perspective" it also went well, as the CEO was very keen on it and wanted to sign me up again with another contract.

However, I had a two-month trip around Europe booked, so it would have to wait until I got back. This is where I am right now: just returning from Europe and about to get started on Monday with further work.

This further work is going to focus its efforts on expanding the analyzed reports to include agencies other than the New Zealand Transport Accident Investigation Commission, most likely the Australian Transport Safety Bureau, as well as deepening the analysis in other areas.

Where the future of this project will lead is unknown, however I am excited to still be working on my university project and making it better with every sprint.

Conclusion

This project has had some unique aspects that I have not had to deal with before. I learnt and experienced a much more sophisticated development cycle, which involved lots of writing proposals, reports etc. But it has also given me an opportunity to really focus on a project with my full attention for some non-trivial time.

If I started this again there are things I would do differently. I wish I had built a better testing suite, because I have basically no unit tests. Furthermore, I am not comparing anything to a benchmark or validation set, so most of the output is only judged on whether it "looks reasonable", which is a shame as I can't mount a stronger defense.

I am also disappointed that I haven't been able to incorporate true learning into this model, as rather than training my own model on a corpus of data I am instead using a pre-trained model. There is very short-term learning in the engine's process, yet it is not of the same scale and meaning as a "real" machine learning project. However, I have no idea what is or isn't a real machine learning project. Luckily life moves forward, so I have opportunities to incorporate this and refine what is already there. It shouldn't matter what an ideal project would look like, as long as it provides real usefulness.

If you have managed to power through all of this you may be interested in all the documents that have been written in relation to this project. These can be found inside the GitHub repo here: https://github.com/1jamesthompson1/TAIC-report-summary/tree/main/Project%20documents.