Elephant in the room: The data cost of AI

Now you will forgive the slightly overdramatic nature of this title, but it’s something that had me practically jumping up and down during the main demo at the latest Salesforce conference, and it’s something that comes to light in every single AI demo I’ve seen.

And that is the use of AI when it comes to data in a demo versus a corporate environment, or doing something serious. Ultimately the problem is data access. As Salesforce themselves have just said, “AI is only as good as the data”, and I’m not talking about bias or AI hallucinations or any of that. I’m simply talking about large-scale AI use in a corporate environment and the expectations of the business.
AI has been sold as basically the solution to all things in life on Earth, but to deliver on that, all of your data has to be available to AI near instantly, and everyone in IT knows that is not the case.

Let’s take the Salesforce demo as a perfect example. They were showing that AI had access to four different data sources, and it was using that access to construct the perfect interactions with customers and other humans. Brilliant. Perfect. But all of that data is already in Salesforce. AWS‘s demo was exactly the same. All of the vendors showing these beautiful AIs using data are effectively doing it from local sources, in modern formats, and with high-quality data. If we move this back to our corporate environment, the sheer volume of legacy data that is scattered all over the place and used in various different systems would initially make this impossible. If you don’t believe me, think of how much of a pain it is merely to run reports across data sources.
But we are not home to impossible, so before selling AI to the business, make sure you set the expectations and work out how you are going to get around the current limitations.

These limitations are grounded in the following:

  1. Interfaces: Lots of legacy data does not have standard interconnects or standard APIs to get it all in one place, particularly for real-time access. Sorting this out adds cost and time to your integrations.
  2. The underlying infrastructure: A lot of the legacy data you will be pulling in [1] will not behave the way you think it does. So if you’re thinking your AI will be able to access all data 24/7, you are in for a shock. Some legacy systems will be down or slow for backups. Some will be running at 80% or 90% utilisation because no one has bought them bigger servers. Some will be suffering under their own heavy loads from quarter-end and year-end reports. Source systems are already under tremendous strain, and you adding additional load is not going to make the business happy.
  3. Data Transfer: There is also the matter of data transference and accountability. When feeding AIs we will often be pulling data from all over the planet. That’s fine most of the time, but you do have to be careful. The EU tends only to be bothered when you’re shovelling data outside of its borders, but there are countries that get really, really hot and sweaty about moving data across borders. Italy and Turkey are two that leap to mind, and German works councils are famously very fighty when it comes to moving people’s data around and exposing it in different ways. So that has to be handled. Some companies, like Salesforce, are aware of these problems. They made a big thing of their recent secure layer, but that secure layer is for data leaving Salesforce, not the other way round. Be careful, people: there be dragons!
  4. Syncing: While we are talking about moving data, don’t think you’re going to get away with merely syncing data to a giant central repository. It can get huge and out of control very quickly, the costs can skyrocket, and sooner or later an accountant is going to ask you to justify maintaining the same data twice. Then you have the constant syncing itself to deal with, which can trigger some of the points I mentioned above. And finally, a lot of legacy data stores do not maintain transactional logging, or even update logging, which is a feature you need for reliable syncing (there is a sketch of what that forces you into after this list).
  5. Cost: All this pumping of data around for integration costs money, and the nearer you want it to real time, the more it costs. Are the new AI features going to give you a good return on investment or a competitive advantage?
  6. Silo Owners: Lastly, there are the owners of the current data sets. These source systems are often those people’s careers and jobs, and they will fight tooth and nail to maintain control of them, introducing a political element into your deliverable that can get very messy.
  7. Metadata & Data Quality: [2] Mark Forster made the following wise addition:

    I would add another issue to those described here, which is metadata management. Are all data used being described in the same way with rigorous and applicable ontologies? In many cases this is not true, and data from different sources are described in different ways. Maybe it’s trivial in that the units differ (kg vs stones). Maybe the same attributes are given different names. As always, it must be cleaned to a high standard to achieve meaningful results.

    I could not agree with this comment more, and it’s why I spend so much time on the Insurance Dictionary, as data definitions and quality are a terrible plague on integrations. The unit-and-naming sketch below shows what this means in practice.
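To make Mark’s point concrete, here is a throwaway Python sketch of the sort of normalisation layer you end up writing before any of this data is fit to feed an AI. The field names (`cust_id`, `customer_ref`, `wgt`) are invented purely for illustration, not taken from any real system:

```python
# Two sources describing the same attribute with different names and different units.
# All field names here are hypothetical, for illustration only.
STONES_TO_KG = 6.35029

def normalise_source_a(record: dict) -> dict:
    # Source A already uses kilograms and a sensible attribute name.
    return {"customer_id": record["cust_id"], "weight_kg": float(record["weight_kg"])}

def normalise_source_b(record: dict) -> dict:
    # Source B calls the same attribute "wgt" and stores it in stones.
    return {"customer_id": record["customer_ref"],
            "weight_kg": round(float(record["wgt"]) * STONES_TO_KG, 2)}

rows = [
    normalise_source_a({"cust_id": "A-1", "weight_kg": "82.5"}),
    normalise_source_b({"customer_ref": "B-7", "wgt": "13"}),
]
print(rows)  # both rows now share one name and one unit
```

Multiply that by every attribute, every source and every unit convention, and you can see why a shared data dictionary matters.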
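And to put some flesh on point 4: when a legacy store keeps no transaction or update log, about the best you can do is pull a full extract and diff it against the last snapshot yourself, which is exactly the kind of extra machinery (and extra load on the source system) I mean. A minimal Python sketch, assuming a hypothetical batch extract that arrives as a list of dicts keyed on `policy_id`:

```python
import hashlib
import json

def row_fingerprint(row: dict) -> str:
    """Stable hash of a row's contents, so changes can be spotted without an update log."""
    return hashlib.sha256(json.dumps(row, sort_keys=True, default=str).encode()).hexdigest()

def diff_snapshots(previous: dict, current_rows: list, key: str = "policy_id"):
    """Diff the latest full extract against fingerprints saved from the previous run.

    'previous' maps key -> fingerprint. Returns (changed_rows, deleted_keys, new_snapshot).
    """
    new_snapshot, changed = {}, []
    for row in current_rows:
        fp = row_fingerprint(row)
        new_snapshot[row[key]] = fp
        if previous.get(row[key]) != fp:  # new or modified since the last run
            changed.append(row)
    deleted = [k for k in previous if k not in new_snapshot]
    return changed, deleted, new_snapshot

# Hypothetical usage: the source system cannot tell us what changed, so every run
# pulls the whole table and we work it out ourselves.
previous_snapshot = {}  # in real life, load this from wherever you persist state
current_rows = [{"policy_id": 1, "premium": 100}, {"policy_id": 2, "premium": 250}]
changed, deleted, previous_snapshot = diff_snapshots(previous_snapshot, current_rows)
print(changed, deleted)
```

Notice that this only works by re-reading the entire table every run, which is precisely the extra load on already strained source systems I complained about in point 2.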

And that’s it; this was merely a short rant. It’s not to say that AI is not going to produce all of the awesomeness for us that has been promised. But you have got to manage the business’s expectations of what it can do with the information you can actually get to it. And the core of that is making sure it has all the data in a timely fashion, costed out, and confirmed on the legal side.

  1. And I say this from decades of real-life experience.
  2. Edit: This entry was added after I cross-posted to LinkedIn.
