Top 10 Data Science and AI Trends to Watch in 2025

1. We live in a world without reason (Tomasz)

Just three years into AI, we’re starting to see businesses create value in some of the expected areas — but not all. According to Tomasz, the current state of AI can be divided into three categories:
1. Prediction: AI assistants that can complete sentences, fix errors in code, and so on.
2. Search: Tools that use a corpus of data to answer questions.
3. Reasoning: A multi-step workflow that can perform complex tasks.

While AI assistants and search tools (especially the early ones) have found moderate success among enterprises, reasoning models are still lagging. And according to Tomasz, the reason is obvious:

the accuracy of the models.

As Tomasz explains, current models have a hard time effectively breaking down tasks into steps unless they’ve seen a similar pattern many times before. But that’s not always the case with the work these models might be tasked with.

“Today… if you ask a big model to create a planning and analysis chart, it can do it. But if there’s a significant difference — like moving from software-based billing to usage-based billing — it’s lost.”

So for now, we can rely on AI assistants and partially accurate search results.

2. Process > Tools (Barr)

A new tool is only useful if it has the right process to support it. Over time, as the “modern data stack” has matured, data teams have sometimes found themselves in a constant state of testing. They’ve focused on the tool itself, without paying enough attention to how it will be used.
But as the enterprise moves toward production-level AI, the question of how to implement all these new tools has become increasingly pressing.

Take data quality, for example. As the data that powers AI has taken center stage in 2024, data quality has also become a higher priority. Faced with the real possibility of production-level AI, enterprise data leaders don’t have time to try out different data quality options — a small test here, a few point solutions there. They need to deliver value now, and that requires proven solutions that can be deployed quickly and used effectively.

If you can’t get your organization up and running quickly with a new tool, it will remain just a budget expense and a new shortcut on your desktop.

Over the next 12 months, we expect data teams to rely on proven end-to-end solutions instead of patchworks of disparate tools, freeing them to focus on more critical work: data quality management, incident management, and building long-term domain knowledge.

And the solution that can deliver on these priorities will be the winner in AI.

3. AI Drives ROI – Not Revenue (Tomasz)

Like any other product, the value of generative AI comes in two forms: cost reduction or revenue growth.
On the revenue side, this could mean tools like AI sales assistants, data enrichment systems, or recommendation engines. According to Tomasz, these tools can generate plenty of sales pipeline… but it won’t be healthy pipeline. So if AI isn’t generating revenue, it should be cutting costs, and that is where this emerging technology has really found its niche.

“Not many companies are closing deals because of it. It’s mostly cost reduction. Klarna cut two-thirds of their staff. Microsoft and ServiceNow increased the productivity of their engineering teams by 50–75%.”

According to Tomasz, the use of AI can help reduce costs if one of three criteria is met:

  • Repetitive tasks
  • A tight labor market
  • Urgent hiring needs

Tomasz cited the example of EvenUp, a legal-tech company that automates the drafting of demand letters. Organizations like EvenUp, which deliver templated but highly specialized services, may be particularly well positioned to see significant impact from AI in its current form.

4. AI adoption is slower than expected – but leaders are biding their time (Tomasz)

In contrast to the excitement around “AI strategies” a year ago, today’s leaders seem to have almost universally taken a step back from the technology.
“There was a wave last year where people were trying all kinds of software just to see what it could do. Their boards were asking about their AI strategy. But now there’s a significant churn from that first wave.”

Some organizations simply didn’t see the value in their early experiments, while others were struggling with the rapid advancement of the underlying technology. Tomasz believes this is one of the biggest challenges facing those investing in AI companies. It’s not that the technology isn’t valuable in theory, but that organizations haven’t yet figured out how to use it effectively in practice.

Tomasz believes the next wave of adoption will be different from the first because leaders will be better informed about what they need and where to find it.

Like a dress rehearsal for a big show, teams now know exactly what they’re looking for. They have addressed some of the legal and procurement challenges — especially around data loss and security — and are ready to act when the opportunity arises.

The biggest challenge tomorrow? “How do I find and deliver value faster?”

5. Small Data – the Future of AI (Tomasz)

The debate between open source and managed solutions is as old as time. But when it comes to AI, the question becomes far more complicated.
At the enterprise level, it’s not just a question of control or compatibility (although both matter), it’s primarily a question of operating cost.

Tomasz believes the largest B2C companies will use off-the-shelf models, while B2B companies will tend to build their own proprietary and open source models.

“In B2B, you’ll see smaller models and more open source. That’s because it’s much cheaper to run a small, open model.”

But it’s not just about cost. Small models can also improve performance. Like Google’s search engine, large models are designed to serve a wide variety of use cases. Users can ask a large model about almost anything, so it must be trained on an enormous corpus of data to give a relevant answer, whether about water polo, Chinese history, or French toast.

Unfortunately, the more topics a model learns, the more likely it is to conflate distinct concepts, and the more errors creep into its output over time.

“You can take a model like Llama 3 with 8 billion parameters, fine-tune it on 10,000 support tickets, and it will perform significantly better,” says Tomasz.
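
To make that concrete, here is a minimal sketch of what such a fine-tune might look like with the Hugging Face stack. The model name, the support_tickets.jsonl file, and every hyperparameter are illustrative assumptions, not a recipe from Tomasz:

```python
# Hypothetical sketch: adapt a small open-weight model to a corpus of support
# tickets using LoRA, so only a small fraction of the weights are trained.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"  # any small open-weight model would do
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with low-rank adapters (the cheap part of fine-tuning).
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Assumed input: one resolved ticket per line, e.g. {"text": "Q: ... A: ..."}.
tickets = load_dataset("json", data_files="support_tickets.jsonl")["train"]
tickets = tickets.map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=tickets.column_names,
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="ticket-model", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=tickets,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```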

Additionally, ChatGPT and other managed solutions are often the target of lawsuits over claims that their creators did not have legal rights to the data on which their models were trained.

And in many cases, that’s probably true.

This, along with cost and performance, will likely have an impact on the long-term adoption of proprietary models — especially in heavily regulated industries — but the extent of that impact is still unclear.

Of course, proprietary models aren’t going away. Especially given Sam Altman’s stance on the issue. (And if Twitter has taught us anything, it’s that Sam Altman is always willing to speak out.)

Proprietary models are already aggressively cutting prices to drive demand. Models like ChatGPT have already cut prices by about 50% and plan to cut them by another 50% in the next six months. That kind of cost reduction could be a much-needed boon for B2C companies hoping to compete in the AI race.

6. The lines between analysts and data engineers are blurring (Barr)

When it comes to scaling data production, data teams face two main challenges: analysts who lack technical expertise and data engineers who lack time.
That sounds like a job for AI.

We predict that the next evolution of data science teams could see engineering and analytics responsibilities converge in 2025, thanks to two important shifts:

  • Growing Demand – As business leaders become more interested in data-driven products and AI, data teams will be forced to do more with fewer resources. In an effort to minimize process bottlenecks, leaders will empower previously specialized teams with greater authority to manage their processes and stakeholders.
  • Improvements in automation – New demand always drives new innovation. (In this case, that means AI-enabled pipelines.) As technology becomes more automated, engineers will be able to do more with less, while analysts will be able to do more tasks themselves.

The argument is simple — as demand increases, pipeline automation will naturally evolve to meet that need. As automation advances, the barriers to creating and managing pipelines are lowered. The skills gap narrows, and the opportunity to add new value increases.

The move to self-serve, AI-enabled pipelines means automating the most labor-intensive parts of the job and expanding the opportunity to create and deliver new value. That sounds like a good future.
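
As a hedged illustration of what “automating the labor-intensive parts” can mean in practice, here is a toy self-serve pipeline check that an analyst could own without an engineer. The table name, thresholds, and usage values are invented for the example:

```python
# Toy example: declarative freshness/volume checks that run after every load,
# so analysts can own pipeline health without bespoke engineering work.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class TableCheck:
    table: str
    min_rows: int             # alert if the load is suspiciously small
    max_staleness: timedelta  # alert if the table has not refreshed in time

def evaluate(check: TableCheck, row_count: int, last_loaded: datetime) -> list[str]:
    """Return a list of human-readable incidents (empty means healthy)."""
    incidents = []
    if row_count < check.min_rows:
        incidents.append(f"{check.table}: only {row_count} rows (floor {check.min_rows})")
    if datetime.now(timezone.utc) - last_loaded > check.max_staleness:
        incidents.append(f"{check.table}: stale since {last_loaded:%Y-%m-%d %H:%M}")
    return incidents

# Hypothetical usage: values would come from your warehouse's metadata tables.
check = TableCheck("orders_daily", min_rows=10_000, max_staleness=timedelta(hours=6))
print(evaluate(check, row_count=9_200,
               last_loaded=datetime(2025, 1, 1, tzinfo=timezone.utc)))
```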

7. Synthetic data matters – but it comes at a cost (Tomasz)

You’ve probably seen the image of a snake eating its own tail. Look closely, and it bears a striking resemblance to modern AI.
There are an estimated 21–25 trillion tokens (roughly, words) on the internet today, and modern AI models have already used them all. For models to keep improving, they need a much larger corpus of training data: the more data a model sees, the more context it has for generating results, and the more accurate those results become.

So what do AI researchers do when they run out of training data? They create their own.

As training data becomes increasingly scarce, companies believe that synthetic data will be an important part of the training process for their models in the future. An entire industry has grown to support this vision in the last 24 months — including companies that create synthetic structured data and enable the creation of relevant data for regulated industries like finance and healthcare.

But is synthetic data a long-term solution? Probably not.

Synthetic data is produced by models that generate artificial data sets mirroring what might be found organically (in some alternate reality where more data actually exists); that new data is then used to train other models. At a small scale, this makes sense. But as the saying goes, too much of a good thing…

Think of it as contextual malnutrition. Just as with food, if fresh organic data is the most nutritious diet for training a model, then data derived from existing data sets is inherently less nutritious than the original.

A little artificial “flavor enhancer” is fine – but if the diet of synthetic training data continues indefinitely without the addition of new organic data, the model will eventually fail (or at least produce inferior results).
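
A toy simulation makes the “nutrition” point concrete: fit a distribution, sample synthetic data from the fit, refit on the samples, and repeat. With no fresh organic data, estimation noise compounds generation after generation (the numbers below are illustrative, not a claim about any production model):

```python
# Toy illustration of recursive training on synthetic data: each "model" is a
# Gaussian fit, and each generation trains only on the previous model's output.
import numpy as np

rng = np.random.default_rng(seed=7)
data = rng.normal(loc=0.0, scale=1.0, size=300)  # generation 0: organic data

for generation in range(12):
    mu, sigma = data.mean(), data.std()          # "train" on the current data
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    data = rng.normal(mu, sigma, size=300)       # next gen sees only synthetic

# With no fresh organic data, the estimated spread drifts with every refit,
# a small-scale analogue of the degradation described above.
```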

It’s not a question of “if,” it’s a question of “when.”

Tomasz believes we are still far from model collapse. But as AI research continues to push models to their functional limits, it’s not hard to imagine a world where AI reaches a functional plateau, perhaps sooner rather than later.

8. The unstructured data stack will emerge (Barr)

The idea of using unstructured data in production is nothing new — but in the age of AI, unstructured data has taken on a whole new role.
According to a report from IDC, only about half of organizations’ unstructured data is currently analyzed.

But that’s about to change.

For generative AI, success depends heavily on the diversity of the unstructured data used to train, tune, and amplify it. As more organizations look to incorporate AI into enterprise processes, interest in unstructured data — and the emerging “unstructured data stack” — will continue to grow.

Some teams are even exploring how complementary large language models (LLMs) can add structure to unstructured data, making it easier to scale its usefulness across additional training and analytics use cases.
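
A minimal sketch of that structuring step might look like the following, using an LLM API to pull typed fields out of free text. The model name, field schema, and input email are all assumptions for illustration:

```python
# Hypothetical sketch: use an LLM to turn an unstructured support email into a
# structured record that downstream analytics and training jobs can consume.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

email = "Hi, order #4521 arrived two weeks late and the box was crushed."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; any capable model works
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": "Extract JSON with keys: order_id, issue_type, severity (1-5)."},
        {"role": "user", "content": email},
    ],
)

record = json.loads(response.choices[0].message.content)
print(record)  # e.g. {"order_id": "4521", "issue_type": "late_delivery", ...}
```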

Identifying the first-party unstructured data that exists within an organization — and potential ways to activate it for stakeholders — presents an opportunity for data leaders to demonstrate the business value of their data platform (and hopefully secure additional budget for priority initiatives).

If 2024 was about exploring the potential of unstructured data, 2025 will be about realizing its value. The question is… which tools will come to the fore?

9. Agentic AI is good for conversation – but not yet for deployment (Tomasz)

If you’re anywhere near a venture capital firm these days, you’ve probably heard the terms “copilot” and “agent” a lot. A “copilot” is an AI used to perform a single step (e.g. “fix my bad code”), while an “agent” is a multi-step workflow that can collect information and use it to perform a task (e.g. “write a blog post about my bad code and publish it to WordPress”).
We’ve certainly seen plenty of successful use cases for AI copilots in 2024, as companies like GitHub, Snowflake, and Microsoft can attest. But what about AI agents?

While “AI agents” have been dangled in front of overworked customer support teams, that is likely where their role will end for the foreseeable future. These early AI agents are an important step forward, but their accuracy is still far from adequate.

To put that in perspective, 75–90% accuracy per step is the current state of the art for AI; most models are roughly comparable to a high school student. But chain together three steps at 75–90% accuracy each and the errors compound: 0.75³ ≈ 42% and 0.90³ ≈ 73%, so end-to-end accuracy lands somewhere around 50%.

We trained elephants to draw with higher accuracy.
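
The arithmetic behind that punchline is easy to check. A few lines of Python compound per-step accuracy over a multi-step task (the step counts and accuracies are illustrative):

```python
# Per-step accuracy compounds multiplicatively across an agent's workflow
# (assuming steps fail independently, a simplification for illustration).
for per_step in (0.75, 0.80, 0.90):
    for steps in (1, 3, 5):
        end_to_end = per_step ** steps
        print(f"{per_step:.0%} per step x {steps} steps -> {end_to_end:.0%}")
```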

Far from being a source of revenue for organizations, most AI agents would actively do harm if released into production at their current level of performance. According to Tomasz, this is the problem to solve first.

And that is why it’s easier to talk about agents than to ship them: no one has yet succeeded beyond a demo. However much Silicon Valley loves to talk about AI agents, that talk has yet to translate into real-world performance.

10. Pipelines are multiplying, but quality is not keeping up (Tomasz)

“At a dinner with AI leaders, we asked how many people were happy with the quality of the results, and no one raised their hand. There is a real problem in ensuring consistent quality of results.”

Every year, data professionals are surveyed about the state of their data quality. This year, we turned our attention to the AI space, and the message was clear.

Data quality risks are changing, but data quality management is not keeping up.

“We see teams building vector databases or embedding models at scale. SQLite at scale. All these 100 million small databases. They are starting to be designed at the CDN level to run all these small models. iPhones will have machine learning models. We will see an explosion in the total number of pipelines, but with much smaller volumes of data.”
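
To ground the quote, here is a hedged sketch of that pattern: a tiny, self-contained vector store in a single SQLite file with brute-force similarity search. The schema and function names are assumptions for illustration, not a reference architecture:

```python
# Minimal "SQLite at scale" sketch: each deployment keeps its own tiny vector
# store and answers similarity queries by brute force, with no server at all.
import sqlite3
import numpy as np

db = sqlite3.connect("local_vectors.db")
db.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)")

def add_doc(text: str, emb: np.ndarray) -> None:
    """Store a document alongside its embedding (float32 bytes)."""
    db.execute("INSERT INTO docs (text, emb) VALUES (?, ?)",
               (text, emb.astype(np.float32).tobytes()))

def search(query: np.ndarray, k: int = 3) -> list[tuple[float, str]]:
    """Brute-force cosine similarity, fine at the 'small data' scale described."""
    results = []
    for text, blob in db.execute("SELECT text, emb FROM docs"):
        emb = np.frombuffer(blob, dtype=np.float32)
        sim = float(emb @ query) / (np.linalg.norm(emb) * np.linalg.norm(query) + 1e-9)
        results.append((sim, text))
    return sorted(results, reverse=True)[:k]
```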

The fine-tuning pattern will create an explosion of data pipelines within organizations. But the more pipelines multiply, the harder it becomes to ensure data quality.

Data quality risk rises in direct proportion to the volume and complexity of your pipelines. The more pipelines you have (and the more complex they become), the more opportunities there are for things to break, and the less likely you are to catch problems in time.
