Leveraging LLMs for Data Analysis

In today’s data-driven world, the ability to extract valuable insights from large datasets is essential across a wide range of industries - from energy and climate to mobility and gaming. However, traditional manual approaches to data analysis are often costly, time-consuming, and struggle to scale effectively. That’s where Large Language Models (LLMs) come in, offering an automated, efficient, and cost-effective solution for analysing data.

The AI Workflow: Leveraging LLMs for Data Analysis

At the core of our approach is an AI workflow specifically designed to harness the power of LLMs for data analysis. This workflow consists of several key components, including crafting effective system prompts that clearly define the LLM’s role, desired tone, primary goal and how to interpret the data. For example, our system prompts would instruct the LLM on who/what it is, the tone to use, the end goal of the analysis, what information should be avoided and specific guidance on how to process the dataset.

Data preprocessing played a crucial role in optimising performance and costs. One approach we found effective was reducing the data sampling frequency of datasets. By validating that useful insights could still be extracted even with lower granularity data, we could significantly reduce token lengths and associated costs. Our experiments showed comparable analysis quality when reducing sampling by up to 6 times.

Additionally, we experimented with providing preprocessed data summaries alongside the raw data, highlighting important columns, minimums, maximums, and key differences. This steered the LLM’s focus towards the most relevant data points enhancing the quality of the insights generated.

Perhaps the most impactful technique was the discretisation of data before feeding it to the LLM. Through this process, we identify high and low values, and only include those critical data points in the final prompt. The LLM could still understand the data as points in a line, enabling relevant analysis and comparisons. This approach allowed us to reduce token lengths by up to 33 times compared to processing the full raw dataset, making it viable to use smaller open-source models locally for certain analysis tasks.

Building this AI workflow was an iterative process through continuous refinement and optimisation. We followed an approach based on Proof of Concepts (PoCs), allowing us to test and validate each component individually. By breaking down the workflow into elements, we could rapidly experiment with different techniques, evaluate their effectiveness and identify the most promising strategies. This iterative approach was crucial in shaping the final AI workflow and ensuring it delivered reliable and impactful results.

Benefits of the LLM Approach

Adopting an LLM-based solution for data analysis offers a lot of advantages compared to traditional manual methods. First off, it enables much faster analysis, allowing the processing of large datasets and real-time data streams. This speed not only saves time but also opens up new business opportunities that were previously impractical or just didn’t make sense financially.

Beyond efficiency and cost savings, LLMs empower organisations to generate insights, recommendations, and detect anomalies with greater accuracy and consistency. LLMs also bring a level of scalability that manual analysis can’t match. As data volumes grow, an LLM-based solution can easily handle the increasing workload without compromising on quality or timeliness. This scalability ensures the data analysis capabilities keep pace with business growth.

Implementation and Integration

Integrating an LLM-based data analysis solution into existing workflows requires addressing certain challenges. One major challenge we faced was handling large datasets as the token lengths could quickly exceed the context window limits of certain models. For instance, a dataset with 17 columns and 700 rows could consume nearly 66k tokens, immediately ruling out models like GPT-3.5 with its 16k token limitation.

To overcome this, we explored models like GPT-4-turbo with larger context windows of 128k tokens. Additionally, we experimented with open-source models fine-tuned for extended context lengths, though this came with the trade-off of higher RAM requirements for processing such large context windows.

Another consideration was the cost implications of using GPT-4-turbo for data analysis. With an average cost of €19 per 1 million tokens, even a modest 66k token dataset would incur €0.66 in processing costs. As the need for analysing multiple datasets arose, this AI automation would not be cost-effective with scale.

To address the cost factor, we experimented with reducing the data sampling frequency of datasets and making data discrete, greatly reducing the total input length and therefore the cost per analysis while maintaining comparable analysis quality.

Use Cases and Applications

The potential of using LLMs for efficient data analysis can be applied across several industries such as energy, climate, mobility, and gaming. Example applications include live analysis of data feeds for anomaly detection, generating insights and recommendations based on raw data and enabling comparisons between datasets. While we did not implement full-scale solutions, the results demonstrated the feasibility of leveraging LLMs for data analysis in sectors such as energy, climate, mobility, and gaming.

As the field progresses, we anticipate the emergence of increasingly specialised LLMs tailored for specific domains or data types, further enhancing the accuracy and relevance of the analysis.

Best Practices and Lessons Learned

Throughout our journey of developing and implementing LLM-based data analysis solutions, we have gathered valuable lessons and best practices that extend beyond technical aspects. One crucial lesson has been the importance of collaboration and interdisciplinary work. Effective data analysis with LLMs requires a synergy between subject matter experts, data scientists and AI practitioners, ensuring that domain knowledge, data understanding and AI expertise are all brought together.

Mitigating the potential for hallucinations, or incorrect outputs generated by the LLM, is an ongoing challenge that requires continuous monitoring and adjustment of the AI workflow. As new models and training approaches emerge, we anticipate many improvements in this area.

Another key best practice is the iterative and agile approach we adopted. By embracing a cycle of continuous refinement and testing through Proof of Concepts, we could rapidly adapt and optimise our AI workflows based on feedback and results.

We are excited about the future developments and trends of using LLMs for data analysis. One notable trend is the rise of specialised LLMs tailored for specific domains or data types, such as genomics, customer interactions, or financial analysis. Additionally, we anticipate advancements in model architectures and training approaches that will further reduce the occurrence of hallucinations and improve overall performance, as exemplified by the latest Claude models from Anthropic, which come in three sizes, each optimised for different tasks like research + analysis, search + retrieval + coding, or customer interactions.

As these new models get released, we expect even better opportunities for leveraging LLMs to unlock valuable insights from data, drive innovation, and inform strategic decision-making across a wide range of industries and domains.